On Mon, Nov 04, 2013 at 04:54:16PM +0000, Alan Gauld wrote:
> On 04/11/13 16:34, Amal Thomas wrote:
> >@Joel: The code runs for weeks..input file which I have to process in
> >very huge(in 50 gbs). So its not a matter of hours.its matter of days
> >and weeks..
>
> OK, but that's not down to reading the file from disk.
> Reading a 50G file will only take a few minutes if you have enough RAM,
> which seems to be the case.

Not really. There is still some uncertainty (at least in my mind!). For
instance, I assume that Amal doesn't have sole access to the server. So
there could be another dozen users all trying to read 50GB files at
once, in a machine with only 100GB of memory...

Once the server starts paging, performance will plummet.

> If it's taking days/weeks you must be doing
> some incredibly time consuming processing.

Well, yes, it's biology :-)

> It's probably worth putting some more timing statements into your code
> to see where the time is going because it's not the reading from the
> disk that's the problem.

The first thing I would do is run the code on three smaller sample
files:

50MB
100MB
200MB

The time taken should approximately double as you double the size of the
file: say it takes 2 hours to process the 50MB file, 4 hours for the
100MB file and 8 hours for the 200MB file. That's linear performance
and isn't too bad.

But if performance isn't linear, say 2 hours, 4 hours, 16 hours, then
you're in trouble and you *desperately* need to reconsider the algorithm
being used. Either that, or just accept that this is an inherently slow
calculation and it will take a week or two.

Amal, another thing you should try is to use the Python profiler on your
code (again, on a smaller sample file). The profiler will show you where
the time is being spent.

Unfortunately the profiler may slow your code down, so it is important
to use it on manageably sized data. The profiler is explained here:

http://docs.python.org/3/library/profile.html

If you need any help, don't hesitate to ask.

> >trying to optimize my code to get the outputs in less time and memory
> >efficiently.
>
> Memory efficiency is easy, do it line by line off the disk.

This assumes that you can process one line at a time, sequentially. I
expect that is not the case.
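
To make the timing and profiling suggestions concrete, here is a rough
sketch. I don't know what your real code looks like, so process() is
purely a hypothetical stand-in, the sample data is made up, and the
helper names are my own. The idea is only to show the shape of the
experiment: time the same function on inputs that double in size, look
at the ratios between successive timings, then run the smallest case
under cProfile.

```python
import cProfile
import io
import pstats
import time


def process(lines):
    """Hypothetical stand-in for the real analysis; replace with your own."""
    total = 0
    for line in lines:
        total += len(line.split())
    return total


def time_run(func, data):
    """Return the wall-clock seconds that func(data) takes."""
    start = time.perf_counter()
    func(data)
    return time.perf_counter() - start


def scaling_ratios(timings):
    """Given timings for inputs that double in size, return successive ratios.

    Ratios near 2 suggest linear behaviour; ratios near 4 suggest quadratic.
    """
    return [later / earlier for earlier, later in zip(timings, timings[1:])]


def profile_run(func, data, top=10):
    """Run func(data) under cProfile and return the report as a string."""
    profiler = cProfile.Profile()
    profiler.enable()
    func(data)
    profiler.disable()
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats("cumulative").print_stats(top)
    return out.getvalue()


if __name__ == "__main__":
    # Made-up data standing in for the 50MB, 100MB and 200MB sample files.
    base = ["ACGT TTGA CCGA\n"] * 50000
    samples = [base, base * 2, base * 4]

    timings = [time_run(process, sample) for sample in samples]
    print("seconds:", timings)
    print("ratios: ", scaling_ratios(timings))

    # Profile only the smallest sample, since profiling adds overhead.
    print(profile_run(process, base))
```

With the example figures above, timings of 2, 4 and 8 hours give ratios
of 2 and 2, while 2, 4 and 16 hours give ratios of 2 and 4, and that
jump is the warning sign. Real timings will be noisy, so treat the
ratios as a rough guide rather than a precise measurement.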

--
Steven