On 20/09/2010 16:19, [email protected] wrote:
My Python script needs to process 45,000 files, but it seems to blow
up after about 10,000. Note that I'm outputting
bazillions of rows to a csv, so that may be part of the issue.
Here's the error I get (I'm running it through IDLE on Windows 7):
Microsoft Visual C++ Runtime Library Runtime Error! Program:
C:\Python26\pythonw.exe This application has requested the Runtime to
terminate it in an unusual way.
OK. Just for starters (and to eliminate other possible side-issues)
can you run the same code directly from the command line using
c:\python26\python.exe?
i.e. just open cmd.exe, cd to the directory where your code is, and type:
c:\python26\python.exe <mystuff.py>
I assume that the same thing will happen, but if it doesn't then
that points the finger at IDLE or, possibly, at pythonw.exe. (And
might also give you a workaround).
I think this might be because I don't specifically close the files
I'm reading. Except that I'm not quite sure where to put the close.
On this note, it's worth learning about context managers. Or, rather,
the fact that files can be context-managed. That means that you can
open a file using the with ...: construct and it will automatically
close:
<code>
with open ("blah.csv", "wb") as f:
    f.write ("blah")
# at this point f will have been closed
</code>
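As a small sketch of what "automatically close" means in practice, the snippet below checks the file's `closed` attribute after a normal `with` block and after one that raises. The scratch path and the variable names are made up for the demonstration.

```python
import os
import tempfile

# Hypothetical scratch file, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "blah.csv")

with open(path, "w") as f:
    f.write("blah")
# The with block has ended, so the file object is already closed.
closed_after_block = f.closed

try:
    with open(path) as g:
        raise ValueError("simulated failure while reading")
except ValueError:
    pass
# The file is closed even though the block exited via an exception.
closed_after_error = g.closed

print(closed_after_block, closed_after_error)
```

The second case is the one that matters for a long-running loop over thousands of files: an error part-way through one file does not leave its handle open.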
1) During the self.string here:
class ReviewFile:
    # In our movie corpus, each movie is one text file. That means that
    # each text file has some "info" about the movie (genre, director,
    # name, etc), followed by a bunch of reviews. This class extracts
    # the relevant information about the movie, which is then attached
    # to review-specific information.
    def __init__(self, filename):
        self.filename = filename
        self.string = codecs.open(filename, "r", "utf8").read()
        self.info = self.get_fields(self.get_field(self.string, "info")[0])
        review_strings = self.get_field(self.string, "review")
        review_dicts = map(self.get_fields, review_strings)
        self.reviews = map(Review, review_dicts)
So that could become:
<code>
with codecs.open (filename, "r", "utf8") as f:
    self.string = f.read ()
</code>
2) Maybe here?
def reviewFile ( file, args):
    for file in glob.iglob("*.txt"):
        print " Reviewing...." + file
        rf = ReviewFile(file)
Here, with that many files, I'd strongly recommend using the
FindFilesIterator exposed in the win32file module of the pywin32
extensions. glob.iglob simply does a glob (creating a moderately
large in-memory list) whose iterator it then returns. The
FindFilesIterator actually calls underlying Windows code to
iterate lazily over the files in the directory.
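FindFilesIterator needs pywin32 and Windows, so it is not shown here. As a stand-in that illustrates the same lazy-iteration idea, the sketch below uses `os.scandir` from the standard library. Note the assumptions: `os.scandir` as a context manager needs Python 3.6 or later (it does not exist in Python 2.6), and `iter_matching_files` is a hypothetical helper name, not something from the original script.

```python
import fnmatch
import os
import tempfile


def iter_matching_files(directory, pattern="*.txt"):
    """Yield matching file paths one at a time (hypothetical helper).

    os.scandir hands back directory entries lazily, so no full list of
    names is built before the first path is yielded.
    """
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file() and fnmatch.fnmatch(entry.name, pattern):
                yield entry.path


# Demonstration on a throwaway directory with made-up file names.
demo_dir = tempfile.mkdtemp()
for name in ("a.txt", "b.txt", "notes.csv"):
    with open(os.path.join(demo_dir, name), "w") as f:
        f.write("x")

found = sorted(os.path.basename(p) for p in iter_matching_files(demo_dir))
print(found)
```

Because the helper is a generator, the caller can process and discard each file before the next name is even fetched.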
3) Or maybe here?
def reviewDirectory ( args, dirname, filenames ):
    print 'Directory',dirname
    for fileName in filenames:
        reviewFile( dirname+'/'+fileName, args )

def main(top_level_dir,csv_out_file_name):
    csv_out_file = open(str(csv_out_file_name), "wb")
    writer = csv.writer(csv_out_file, delimiter=',')
    os.path.walk(top_level_dir, reviewDirectory, writer )

main(".","output.csv")
Again, here, you might use:
<code>
with open (str (csv_out_file_name), "wb") as csv_out_file:
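    writer = csv.writer (csv_out_file, delimiter=",")
    # the walk must stay inside the with block, while the file is still open
    os.path.walk (top_level_dir, reviewDirectory, writer)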
</code>
I'm fairly sure that the default delimiter is already "," so you
shouldn't need that, and I'm not sure where csv_out_file_name
is coming from but you almost certainly don't need to convert it
explicitly to a string.
Note, also, that the os.walk (*not* os.path.walk) function is often
an easier fit since it iterates lazily over directories, yielding a
(dirpath, dirnames, filenames) 3-tuple which you can then act upon.
However, you may prefer the callback style of the older os.path.walk.
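A rough sketch of how the os.walk style might fit together with the context-managed csv file. It is written for Python 3, where csv files are opened with `"w", newline=""`; on Python 2.6 you would keep `"wb"`. The function name `write_file_listing` and the row layout are placeholders: the real script would call its own review code where this one just writes the file name.

```python
import csv
import os
import tempfile


def write_file_listing(top_level_dir, csv_out_file_name):
    """Walk top_level_dir and write one (directory, filename) row per .txt file.

    Placeholder for the real per-file processing, which would go where
    writer.writerow is called.
    """
    # Python 3 idiom; on Python 2.6 open the file with mode "wb" instead.
    with open(csv_out_file_name, "w", newline="") as csv_out_file:
        writer = csv.writer(csv_out_file)
        # os.walk yields one (dirpath, dirnames, filenames) tuple per directory.
        for dirpath, dirnames, filenames in os.walk(top_level_dir):
            for filename in filenames:
                if filename.endswith(".txt"):
                    writer.writerow([dirpath, filename])


# Demonstration on a throwaway tree with made-up file names.
top = tempfile.mkdtemp()
os.mkdir(os.path.join(top, "sub"))
for rel in ("a.txt", os.path.join("sub", "b.txt"), "skip.csv"):
    with open(os.path.join(top, rel), "w") as f:
        f.write("review text")

out_path = os.path.join(tempfile.mkdtemp(), "output.csv")
write_file_listing(top, out_path)

with open(out_path, newline="") as f:
    rows = sorted(csv.reader(f))
print(rows)
```

Everything that uses the writer sits inside the `with` block, so the output file is closed exactly once, after the whole tree has been walked.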
TJG