Well, let's say I have about a thousand files to be processed. I need to
extract text out of them, whatever the file type is (I use the Linux
"strings" command).
I want to do this in a multiprocessed way, so it benefits from multi-core PCs too.
This is my current implementation:
import os
import shlex
import subprocess

def __forcedParsing(fname):
    cmd = 'strings "%s"' % fname
    #print cmd
    args = shlex.split(cmd)
    try:
        sp = subprocess.Popen(args, shell=False,
                              stdout=subprocess.PIPE,
                              stderr=subprocess.PIPE)
        out, err = sp.communicate()
    except OSError as e:
        # catch the exception instance, not the class, to read errno/strerror
        print "Error no %s Message %s" % (e.errno, e.strerror)
        return None
    if sp.returncode == 0:
        #print "Processed %s" % fname
        return out

def parseDocs():
    res = []
    for rowID in range(len(SESSION.all_docs)):
        file_id, fname, ftype, dir = SESSION.all_docs[rowID]
        fp = os.path.join(dir, fname)
        res.append(__forcedParsing(fp))
    return res
Well, the problem is that I need the output from the subprocess, so I have
to read it using sp.communicate(). I need that part to be multiprocessed
(via forking? polling?).

So here are my thoughts:
1) Without using fork(), could I do multiple AJAX POSTs from the client
side, iterating over the huge list of files, so that each request gets its
own thread thanks to Rocket? But might this suffer performance issues on
the client side?
2) Fork the current implementation and read the output via polling
(subprocess.poll())?
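For what it's worth, here is a minimal sketch of what I mean by (2), but using a multiprocessing.Pool instead of raw fork()/poll() (this is just an idea, not tested in my app; it assumes `strings` is on the PATH and that `paths` is a list of file paths built the same way as in parseDocs above):

```python
import multiprocessing
import subprocess

def _run_strings(path):
    # Each worker runs `strings` on one file and returns its stdout,
    # or None if the command failed.
    sp = subprocess.Popen(['strings', path],
                          stdout=subprocess.PIPE,
                          stderr=subprocess.PIPE)
    out, err = sp.communicate()
    return out if sp.returncode == 0 else None

def parse_docs_parallel(paths):
    # One worker process per core by default; map() blocks until
    # every file has been processed and returns results in order.
    pool = multiprocessing.Pool()
    try:
        return pool.map(_run_strings, paths)
    finally:
        pool.close()
        pool.join()
```

Since communicate() is called inside each worker, the blocking read happens in parallel across cores, and the parent just collects the results.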
Any ideas?