parallelizing a loop over file reads

Answered

Permanently deleted user

Created March 15, 2017 04:37

I am using PyCharm 2016.3.2 with Python 3.6 as the interpreter to convert PDF files to .txt files. The code I have (see below) works, but it converts the files sequentially and slowly.

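import os
from tika import parser

data_dir = "C:\\Dropbox\\Data"
out_dir = os.path.join(data_dir, "textoutput")
for filename in os.listdir(data_dir):
    if not filename.lower().endswith(".pdf"):
        continue  # skip the textoutput folder and any non-PDF files
    text = parser.from_file(os.path.join(data_dir, filename))
    with open(os.path.join(out_dir, filename + ".txt"), "w", encoding="utf-8") as outfile:
        # "content" can be None when no text could be extracted
        outfile.write(text["content"] or "")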

I am very new to Python coding, so any help parallelizing this block of code would be appreciated, since I'm dealing with >100,000 files (65 GB+).

1 comment

Permanently deleted user

Created March 15, 2017 08:19

http://chriskiehl.com/article/parallelism-in-one-line/

Here is a simple way to use multithreading to improve speed.
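For what it's worth, here is a rough sketch of how the thread-pool `map` idea could be applied to the loop in the question. It is only an illustration, not tested against a real Tika setup. The helper names (`tika_extract`, `convert_one`, `convert_all`) and the `workers=8` default are made up for this example, and it assumes the time is mostly spent waiting on Tika rather than in Python itself, which is the case where threads tend to help.

```python
import os
from multiprocessing.dummy import Pool as ThreadPool  # thread-backed Pool


def tika_extract(path):
    # Hypothetical helper. tika is imported lazily so the rest of the
    # sketch can be tried out without tika installed.
    from tika import parser
    # "content" can be None when no text was extracted.
    return parser.from_file(path)["content"] or ""


def convert_one(job):
    # Convert a single file; job is a (source, destination, extractor) tuple.
    src, dst, extract = job
    text = extract(src)
    with open(dst, "w", encoding="utf-8") as outfile:
        outfile.write(text)
    return dst


def convert_all(data_dir, out_dir, extract=tika_extract, workers=8):
    # Build the list of jobs up front, then let the pool spread them
    # over the worker threads. workers=8 is an arbitrary starting point.
    os.makedirs(out_dir, exist_ok=True)
    jobs = [
        (os.path.join(data_dir, name), os.path.join(out_dir, name + ".txt"), extract)
        for name in sorted(os.listdir(data_dir))
        if name.lower().endswith(".pdf")
    ]
    with ThreadPool(workers) as pool:
        # map blocks until every job is done and keeps the input order.
        return pool.map(convert_one, jobs)
```

With the paths from the question, the call would look something like `convert_all("C:\\Dropbox\\Data", "C:\\Dropbox\\Data\\textoutput")`. The right number of workers depends on the machine, so it is worth timing a small batch of files with a few different values before running it on all 100,000.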

Since you are trying to learn, why aren't you asking this question on Stack Overflow?
