> I want to utilize the power of cores on my server and read big files
> (> 50Gb) simultaneously by seeking to N locations. Process each
> separate chunk and merge the output. Very similar to the MapReduce
> concept.
>
> What I want to know is the best way to read a file concurrently. I
> have read about file-handle.seek() and os.lseek(), but I'm not sure if
> that's the way to go. Any use cases would be of help.
>
> PS: I did find some links on stackoverflow, but it was not clear to me
> whether I found the right solution.
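Since you mention seek(): the pattern you are describing usually looks roughly like the sketch below. This is untested and only for orientation -- the file name, the chunk count, and the line counting are placeholders for your real input and processing, and it ignores the fact that a chunk boundary can split a record in half.

import os
from multiprocessing import Pool

FILENAME = "big_input.txt"   # placeholder -- your real file
NUM_CHUNKS = 8               # e.g. one chunk per core

def process_chunk(chunk):
    """Read one byte range of the file and process it."""
    start, size = chunk
    with open(FILENAME, "rb") as f:
        f.seek(start)
        data = f.read(size)
    # Stand-in for real processing: count newlines in this chunk.
    # NOTE: a real version would realign each chunk to the next
    # newline so records are not split across workers.
    return data.count(b"\n")

def main():
    total = os.path.getsize(FILENAME)
    size = total // NUM_CHUNKS
    # (start, length) pairs; the last chunk picks up the remainder.
    chunks = [(i * size, size if i < NUM_CHUNKS - 1 else total - i * size)
              for i in range(NUM_CHUNKS)]
    pool = Pool(NUM_CHUNKS)
    results = pool.map(process_chunk, chunks)   # the "map" step
    pool.close()
    pool.join()
    print("lines seen:", sum(results))          # the "merge"/reduce step

if __name__ == "__main__":
    main()

But before building any of that: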
Have you done any testing in this space? I would assume you would be
memory/IO bound and not CPU bound, and using multiple cores does not help
non-CPU-bound tasks.

I would try to write an initial program that does what you want without
attempting to optimize, and then do some profiling to see whether you are
waiting on the CPU or whether you are (as I suspect) waiting on the hard
disk / memory.

Actually, if you only need small chunks of the file at a time and you
iterate over the file (for line in file_handle:) instead of using
file_handle.readlines(), you will probably only be IO bound due to the way
Python file handling works. But either way, test first, then optimize. :)

Ramit

Ramit Prasad | JPMorgan Chase Investment Bank | Currencies Technology
712 Main Street | Houston, TX 77002
work phone: 713 - 216 - 5423
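P.S. To make the readlines() point concrete, a rough comparison (the
file name is a placeholder; summing line lengths stands in for your real
processing):

# readlines() builds a list of every line up front -- a 50Gb file
# will exhaust memory long before the CPU becomes the bottleneck:
with open("big_input.txt") as fh:
    total = sum(len(line) for line in fh.readlines())

# Iterating the file object streams one line at a time, so memory use
# stays flat and the job is limited mainly by disk speed:
total = 0
with open("big_input.txt") as fh:
    for line in fh:
        total += len(line)
print(total)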