Re: Implementing file reading in C/Python
John Machin wrote: The factor of 30 indeed does not seem right -- I have done somewhat similar stuff (calculating Levenshtein distance [edit distance] on words read from very large files), coded the same algorithm in pure Python and C++ (using linked lists in C++) and Python version was 2.5 times slower. Levenshtein distance using linked lists? That's novel. Care to divulge? I meant: using linked lists to store words that are compared. I found using vectors was slow. Regards, mk -- http://mail.python.org/mailman/listinfo/python-list
Re: Implementing file reading in C/Python
Johannes Bauer dfnsonfsdu...@gmx.de writes: Yup, I changed the Python code to behave the same way the C code did - however overall it's not much of an improvement: Takes about 15 minutes to execute (still factor 23). Not sure this is completely fair if you're only looking for a pure Python solution, but to be honest, looping through a gazillion individual bytes of information sort of begs for trying to offload that into a library that can execute faster, while maintaining the convenience of Python outside of the pure number crunching. I'd assume numeric/numpy might have applicable functions, but I don't use those libraries much, whereas I've been using OpenCV recently for a lot of image processing work, and it has matrix/histogram support, which seems to be a good match for your needs. For example, assuming the OpenCV library and ctypes-opencv wrapper, add the following before the file I/O loop:

    from opencv import *

    # Histogram for each file chunk
    hist = cvCreateHist([256], CV_HIST_ARRAY, [(0, 256)])

then, replace (using one of your posted methods as a sample):

    datamap = { }
    for i in data:
        datamap[i] = datamap.get(i, 0) + 1
    array = sorted([(b, a) for (a, b) in datamap.items()], reverse=True)
    most = ord(array[0][1])

with:

    matrix = cvMat(1, len(data), CV_8UC1, data)
    cvCalcHist([matrix], hist)
    most = cvGetMinMaxHistValue(hist, min_val=False, max_val=False,
                                min_idx=False, max_idx=True)

should give you your results in a fraction of the time. I didn't run with a full size data file, but for a smaller one using smaller chunks the OpenCV variant ran in about 1/10 of the time, and that was while leaving all the other remaining Python code in place.
Note that it may not give identical results to some of your other methods in the case of multiple values with the same counts, as the OpenCV histogram min/max call will always pick the lower value in such cases, whereas some of your code (such as above) will pick the upper value, and your original code depended on the order of information returned by dict.items. This sort of small dedicated high performance choke point is probably also perfect for something like Pyrex/Cython, although that would require a compiler to build the extension for the histogram code. -- David
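[Editor's note: along the same lines as the OpenCV suggestion, numpy (which David mentions but doesn't use) can do the same per-block histogram. A minimal sketch, assuming numpy is installed; the function name is hypothetical:

```python
import numpy as np

def most_common_byte(data):
    # View the block's bytes as an array of uint8 values, histogram them
    # with bincount, and take the index of the largest count.
    counts = np.bincount(np.frombuffer(data, dtype=np.uint8), minlength=256)
    return int(counts.argmax())
```

Like the OpenCV min/max call, argmax resolves ties toward the lower byte value.]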
Re: Implementing file reading in C/Python
On Mon, 12 Jan 2009 21:26:27 -0500, Steve Holden wrote: The very idea of mapping part of a process's virtual address space onto an area in which low-level system code resides, so writing to this region may corrupt the system, with potentially catastrophic consequences seems to be asking for trouble to me. That's why those regions are usually write protected and no execution allowed from the code in the user area of the virtual address space. Ciao, Marc 'BlackJack' Rintsch -- http://mail.python.org/mailman/listinfo/python-list
Re: Implementing file reading in C/Python
Grant Edwards inva...@invalid wrote: On 2009-01-09, Sion Arrowsmith si...@chiark.greenend.org.uk wrote: Grant Edwards inva...@invalid wrote: If I were you, I'd try mmap()ing the file instead of reading it into string objects one chunk at a time. You've snipped the bit further on in that sentence where the OP says that the file of interest is 2GB. Do you still want to try mmap'ing it? Sure. The larger the file, the more you gain from mmap'ing it. 2GB should easily fit within the process's virtual memory space. Assuming you're in a 64bit world. Me, I've only got 2GB of address space available to play in -- mmap'ing all of it out of the question. But I supposed that mmap'ing it chunk at a time instead of reading chunk at a time might be worth considering. -- \S -- si...@chiark.greenend.org.uk -- http://www.chaos.org.uk/~sion/ Frankly I have no feelings towards penguins one way or the other -- Arthur C. Clarke her nu becomeþ se bera eadward ofdun hlæddre heafdes bæce bump bump bump -- http://mail.python.org/mailman/listinfo/python-list
Re: Implementing file reading in C/Python
On Jan 9, 6:41 pm, Sion Arrowsmith si...@chiark.greenend.org.uk wrote: You've snipped the bit further on in that sentence where the OP says that the file of interest is 2GB. Do you still want to try mmap'ing it? Python's mmap object does not take an offset parameter. If it did, one could mmap smaller portions of the file. -- http://mail.python.org/mailman/listinfo/python-list
Re: Implementing file reading in C/Python
In case the cancel didn't get through: Sion Arrowsmith si...@chiark.greenend.org.uk wrote: Grant Edwards inva...@invalid wrote: 2GB should easily fit within the process's virtual memory space. Assuming you're in a 64bit world. Me, I've only got 2GB of address space available to play in -- mmap'ing all of it is out of the question. And today's moral is: try it before posting. Yeah, I can map a 2GB file no problem, complete with associated 2GB+ allocated VM. The addressing is clearly not working how I was expecting it to. -- \S -- si...@chiark.greenend.org.uk -- http://www.chaos.org.uk/~sion/ Frankly I have no feelings towards penguins one way or the other -- Arthur C. Clarke her nu becomeþ se bera eadward ofdun hlæddre heafdes bæce bump bump bump
Re: Implementing file reading in C/Python
On Jan 12, 1:52 pm, Sion Arrowsmith si...@chiark.greenend.org.uk wrote: And today's moral is: try it before posting. Yeah, I can map a 2GB file no problem, complete with associated 2GB+ allocated VM. The addressing is clearly not working how I was expecting it too. The virtual memory space of a 32 bit process is 4 GB. -- http://mail.python.org/mailman/listinfo/python-list
Re: Implementing file reading in C/Python
sturlamolden sturlamol...@yahoo.no writes: On Jan 9, 6:41 pm, Sion Arrowsmith si...@chiark.greenend.org.uk wrote: You've snipped the bit further on in that sentence where the OP says that the file of interest is 2GB. Do you still want to try mmap'ing it? Python's mmap object does not take an offset parameter. If it did, one could mmap smaller portions of the file. As of 2.6 it does, but that might not be of much use if you're using 2.5.x or earlier. If you speak Python/C and really need offset, you could backport the mmap module from 2.6 and compile it under a different name for 2.5. -- http://mail.python.org/mailman/listinfo/python-list
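[Editor's note: for reference, the offset parameter added in 2.6 lets you map a window of a large file rather than the whole thing, which addresses the 32-bit address-space concern above. A sketch, with a hypothetical helper name; note that offset must be a multiple of mmap.ALLOCATIONGRANULARITY, so the requested offset is rounded down:

```python
import mmap

def map_window(path, offset, length):
    # mmap requires the offset to be a multiple of the allocation
    # granularity, so round down and remember the difference.
    gran = mmap.ALLOCATIONGRANULARITY
    base = (offset // gran) * gran
    delta = offset - base
    f = open(path, 'rb')
    try:
        m = mmap.mmap(f.fileno(), length + delta,
                      access=mmap.ACCESS_READ, offset=base)
    finally:
        f.close()  # the mapping stays valid after the file is closed
    return m, delta  # the requested data starts at m[delta:]
```

The caller slices from `delta` onward and closes the mapping when done.]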
Re: Implementing file reading in C/Python
On 2009-01-12, Sion Arrowsmith si...@chiark.greenend.org.uk wrote: Grant Edwards inva...@invalid wrote: On 2009-01-09, Sion Arrowsmith si...@chiark.greenend.org.uk wrote: Grant Edwards inva...@invalid wrote: If I were you, I'd try mmap()ing the file instead of reading it into string objects one chunk at a time. You've snipped the bit further on in that sentence where the OP says that the file of interest is 2GB. Do you still want to try mmap'ing it? Sure. The larger the file, the more you gain from mmap'ing it. 2GB should easily fit within the process's virtual memory space. Assuming you're in a 64bit world. Me, I've only got 2GB of address space available to play in -- mmap'ing all of it out of the question. Oh. I assumed that decent 32-bit OSes would provide at least 3-4GB of address space to user processes. What OS are you using? But I supposed that mmap'ing it chunk at a time instead of reading chunk at a time might be worth considering. I'd try mmap'ing it in large chunks (512MB maybe). -- Grant Edwards grante Yow! I feel like a wet at parking meter on Darvon! visi.com -- http://mail.python.org/mailman/listinfo/python-list
Re: Implementing file reading in C/Python
On 2009-01-12, Sion Arrowsmith si...@chiark.greenend.org.uk wrote: In case the cancel didn't get through: Sion Arrowsmith si...@chiark.greenend.org.uk wrote: Grant Edwards inva...@invalid wrote: 2GB should easily fit within the process's virtual memory space. Assuming you're in a 64bit world. Me, I've only got 2GB of address space available to play in -- mmap'ing all of it out of the question. And today's moral is: try it before posting. Yeah, I can map a 2GB file no problem, complete with associated 2GB+ allocated VM. The addressing is clearly not working how I was expecting it too. Cool. I'd be very interested to to know how the performance compares to open/read. -- Grant Edwards grante Yow! I think I am an at overnight sensation right visi.comnow!! -- http://mail.python.org/mailman/listinfo/python-list
Re: Implementing file reading in C/Python
sturlamolden wrote: On Jan 12, 1:52 pm, Sion Arrowsmith si...@chiark.greenend.org.uk wrote: And today's moral is: try it before posting. Yeah, I can map a 2GB file no problem, complete with associated 2GB+ allocated VM. The addressing is clearly not working how I was expecting it to. The virtual memory space of a 32 bit process is 4 GB. I believe, though, that in some environments 2GB of that is mapped onto the operating system, to allow system calls to access OS memory structures without any VM remapping being required - see http://blogs.technet.com/markrussinovich/archive/2008/11/17/3155406.aspx. Things have, however, improved if we are to believe what we read in http://www.tenouk.com/WinVirtualAddressSpace.html The very idea of mapping part of a process's virtual address space onto an area in which low-level system code resides, so that writing to this region may corrupt the system, with potentially catastrophic consequences, seems to be asking for trouble to me. It's surprising things didn't go wrong with Windows all the time, really. Oh, wait a minute, they did, didn't they? Still do for that matter ... getting-sicker-of-vista-by-the-minute-ly yr's - steve -- Steve Holden +1 571 484 6266 +1 800 494 3119 Holden Web LLC http://www.holdenweb.com/
Re: Implementing file reading in C/Python
sturlamolden wrote: On Jan 12, 1:52 pm, Sion Arrowsmith si...@chiark.greenend.org.uk wrote: And today's moral is: try it before posting. Yeah, I can map a 2GB file no problem, complete with associated 2GB+ allocated VM. The addressing is clearly not working how I was expecting it to. The virtual memory space of a 32 bit process is 4 GB. After my last post I should also point out a) That was specific to 32-bit processes, and b) http://regions.cmg.org/regions/mcmg/Virtual%20memory%20constraints%20in%2032bit%20Windows.pdf describes the situation better, and outlines some steps you can take to get relief. regards Steve -- Steve Holden +1 571 484 6266 +1 800 494 3119 Holden Web LLC http://www.holdenweb.com/
Re: Implementing file reading in C/Python
On 2009-01-13, Steve Holden st...@holdenweb.com wrote: sturlamolden wrote: On Jan 12, 1:52 pm, Sion Arrowsmith si...@chiark.greenend.org.uk wrote: And today's moral is: try it before posting. Yeah, I can map a 2GB file no problem, complete with associated 2GB+ allocated VM. The addressing is clearly not working how I was expecting it too. The virtual memory space of a 32 bit process is 4 GB. I believe, though, that in some environments 2GB of that is mapped onto the operating system, to allow system calls to access OS memory structures without any VM remapping being required IIRC, in Linux the default for the past several years has been 3GB user, 1GB kernel. But, there are kernel configuration options to enable different configurations. getting-sicker-of-vista-by-the-minute-ly yr's - steve Haven't had to touch Vista yet -- I'm annoyed enough by XP. -- Grant -- http://mail.python.org/mailman/listinfo/python-list
Re: Implementing file reading in C/Python
On Fri, 09 Jan 2009 15:34:17 +0000, MRAB wrote: Marc 'BlackJack' Rintsch wrote: On Fri, 09 Jan 2009 04:04:41 +0100, Johannes Bauer wrote: As this was horribly slow (20 Minutes for a 2GB file) I coded the whole thing in C also: Yours took ~37 minutes for 2 GiB here. This just ~15 minutes:

    #!/usr/bin/env python
    from __future__ import division, with_statement
    import os
    import sys
    from collections import defaultdict
    from functools import partial
    from itertools import imap

    def iter_max_values(blocks, block_count):
        for i, block in enumerate(blocks):
            histogram = defaultdict(int)
            for byte in block:
                histogram[byte] += 1
            yield max((count, value)
                      for value, count in histogram.iteritems())[1]

[snip] Would it be faster if histogram was a list initialised to [0] * 256? I tried it on my computer, also getting character codes with struct.unpack, like this:

    histogram = [0] * 256
    for byte in struct.unpack('%dB' % len(block), block):
        histogram[byte] += 1
    yield max((count, idx) for idx, count in enumerate(histogram))[1]

and I also removed the map(ord, ...) statement in the main program, since iter_max_values now returns character codes directly. The result is 10 minutes against the 13 of the original 'BlackJack's code on my PC (iMac Intel python 2.6.1). Strangely, using

    histogram = array.array('i', [0] * 256)

gives again 13 minutes, even if I create the array outside the loop and then use histogram[:] = zero_array to reset the values. Ciao - FB
Re: Implementing file reading in C/Python
On Fri, 09 Jan 2009 09:15:20 +0000, Marc 'BlackJack' Rintsch wrote, replying to Johannes Bauer: I've first tried Python. Please don't beat me, it's slow as hell and probably a horrible solution:

    #!/usr/bin/python
    import sys
    import os

    f = open(sys.argv[1], "r")

Mode should be 'rb'.

    filesize = os.stat(sys.argv[1])[6]

`os.path.getsize()` is a little bit more readable.

    width = 1024
    height = 1024
    pixels = width * height
    blocksize = filesize / width / height
    print("Filesize       : %d" % (filesize))
    print("Image size     : %dx%d" % (width, height))
    print("Bytes per Pixel: %d" % (blocksize))

Why parentheses around ``print``'s argument? In Python < 3 ``print`` is a statement and not a function.

    picture = { }
    havepixels = 0
    while True:
        data = f.read(blocksize)
        if len(data) == 0:
            break

``if data: break`` is enough.

    datamap = { }
    for i in range(len(data)):
        datamap[ord(data[i])] = datamap.get(data[i], 0) + 1

Here you are creating a list full of integers to use them as index into `data` (twice) instead of iterating directly over the elements in `data`. And you are calling `ord()` for *every* byte in the file although you just need it for one value in each block. If it's possible to write the raw PGM format this conversion wouldn't be necessary at all. For the `datamap` a `collections.defaultdict()` might be faster.

    maxchr = None
    maxcnt = None
    for (char, count) in datamap.items():
        if (maxcnt is None) or (count > maxcnt):
            maxcnt = count
            maxchr = char

Untested:

    maxchr = max((i, c) for c, i in datamap.iteritems())[1]

    most = maxchr

Why?

    posx = havepixels % width
    posy = havepixels / width

posy, posx = divmod(havepixels, width) -- don't know if this is faster though.

    havepixels += 1
    if (havepixels % 1024) == 0:
        print("Progress %s: %.1f%%" % (sys.argv[1], 100.0 * havepixels / pixels))
    picture[(posx, posy)] = most

Why are you using a dictionary as 2d array? In the C code you simply write the values sequentially, why can't you just use a flat list and append here? Ciao, Marc 'BlackJack' Rintsch
Re: Implementing file reading in C/Python
On Fri, Jan 9, 2009 at 7:15 PM, Marc 'BlackJack' Rintsch bj_...@gmx.net wrote:

    print("Filesize       : %d" % (filesize))
    print("Image size     : %dx%d" % (width, height))
    print("Bytes per Pixel: %d" % (blocksize))

Why parentheses around ``print``'s argument? In Python < 3 ``print`` is a statement and not a function. Not true as of 2.6+ and 3.0+: print is now a function. cheers James
Re: Implementing file reading in C/Python
On Fri, 09 Jan 2009 19:33:53 +1000, James Mills wrote: On Fri, Jan 9, 2009 at 7:15 PM, Marc 'BlackJack' Rintsch bj_...@gmx.net wrote: Why parentheses around ``print``'s argument? In Python < 3 ``print`` is a statement and not a function. Not true as of 2.6+ and 3.0+: print is now a function. Please read again what I wrote. Ciao, Marc 'BlackJack' Rintsch
Re: Implementing file reading in C/Python
On Fri, Jan 9, 2009 at 7:41 PM, Marc 'BlackJack' Rintsch bj_...@gmx.net wrote: Please read again what I wrote. Lol - I thought the "< 3" was a smiley! :) Sorry! cheers James
Re: Implementing file reading in C/Python
On Fri, 09 Jan 2009 04:04:41 +0100, Johannes Bauer wrote:

    datamap = { }
    for i in range(len(data)):
        datamap[ord(data[i])] = datamap.get(data[i], 0) + 1

Here is an error by the way: You call `ord()` just on the left side of the ``=``, so all keys in the dictionary are mapped to ones after the loop, which gives a pretty boring PGM. :-) Ciao, Marc 'BlackJack' Rintsch
Re: Implementing file reading in C/Python
On Fri, 09 Jan 2009 19:33:53 +1000, James Mills wrote: On Fri, Jan 9, 2009 at 7:15 PM, Marc 'BlackJack' Rintsch bj_...@gmx.net wrote:

    print("Filesize       : %d" % (filesize))
    print("Image size     : %dx%d" % (width, height))
    print("Bytes per Pixel: %d" % (blocksize))

Why parentheses around ``print``'s argument? In Python < 3 ``print`` is a statement and not a function. Not true as of 2.6+ and 3.0+: print is now a function. Not so. print is still a statement in 2.6.

    $ python2.6
    Python 2.6.1 (r261:67515, Dec 24 2008, 00:33:13)
    [GCC 4.1.2 20070502 (Red Hat 4.1.2-12)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> print 23
    23

-- Steven
Re: Implementing file reading in C/Python
On Fri, 09 Jan 2009 09:15:20 +0000, Marc 'BlackJack' Rintsch wrote:

    picture = { }
    havepixels = 0
    while True:
        data = f.read(blocksize)
        if len(data) == 0:
            break

``if data: break`` is enough.

You've reversed the sense of the test. The OP exits the loop when data is empty, you exit the loop when it *isn't* empty. -- Steven
Re: Implementing file reading in C/Python
On Fri, 09 Jan 2009 04:04:41 +0100, Johannes Bauer wrote: As this was horribly slow (20 Minutes for a 2GB file) I coded the whole thing in C also: Yours took ~37 minutes for 2 GiB here. This just ~15 minutes:

    #!/usr/bin/env python
    from __future__ import division, with_statement
    import os
    import sys
    from collections import defaultdict
    from functools import partial
    from itertools import imap

    def iter_max_values(blocks, block_count):
        for i, block in enumerate(blocks):
            histogram = defaultdict(int)
            for byte in block:
                histogram[byte] += 1
            yield max((count, value)
                      for value, count in histogram.iteritems())[1]
            if i % 1024 == 0:
                print 'Progress: %.1f%%' % (100 * i / block_count)

    def write_pgm(filename, width, height, pixel_values):
        with open(filename, 'w') as pgm_file:
            pgm_file.write('P2\n'
                           '# CREATOR: Crappyass Python Script\n'
                           '%d %d\n'
                           '255\n' % (width, height))
            pgm_file.writelines('%d\n' % value for value in pixel_values)

    def main():
        filename = sys.argv[1]
        filesize = os.path.getsize(filename)
        width = 1024
        height = 1024
        pixels = width * height
        blocksize = filesize // width // height
        print 'Filesize       : %d' % filesize
        print 'Image size     : %dx%d' % (width, height)
        print 'Bytes per Pixel: %d' % blocksize
        with open(filename, 'rb') as data_file:
            blocks = iter(partial(data_file.read, blocksize), '')
            pixel_values = imap(ord, iter_max_values(blocks, pixels))
            write_pgm(filename + '.pgm', width, height, pixel_values)

    if __name__ == '__main__':
        main()

Ciao, Marc 'BlackJack' Rintsch
Re: Implementing file reading in C/Python
Marc 'BlackJack' Rintsch wrote: On Fri, 09 Jan 2009 04:04:41 +0100, Johannes Bauer wrote: [...]

    print("Filesize       : %d" % (filesize))
    print("Image size     : %dx%d" % (width, height))
    print("Bytes per Pixel: %d" % (blocksize))

Why parentheses around ``print``'s argument? In Python < 3 ``print`` is a statement and not a function. Portability? regards Steve -- Steve Holden +1 571 484 6266 +1 800 494 3119 Holden Web LLC http://www.holdenweb.com/
Re: Implementing file reading in C/Python
Steven D'Aprano wrote: On Fri, 09 Jan 2009 19:33:53 +1000, James Mills wrote: On Fri, Jan 9, 2009 at 7:15 PM, Marc 'BlackJack' Rintsch bj_...@gmx.net wrote:

    print("Filesize       : %d" % (filesize))
    print("Image size     : %dx%d" % (width, height))
    print("Bytes per Pixel: %d" % (blocksize))

Why parentheses around ``print``'s argument? In Python < 3 ``print`` is a statement and not a function. Not true as of 2.6+ and 3.0+: print is now a function. Not so. print is still a statement in 2.6.

    $ python2.6
    Python 2.6.1 (r261:67515, Dec 24 2008, 00:33:13)
    [GCC 4.1.2 20070502 (Red Hat 4.1.2-12)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> print 23
    23

    C:\Users\sholden> python26\python
    Python 2.6 (r26:66721, Oct 2 2008, 11:35:03) [MSC v.1500 32 bit (Intel)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> print
    <built-in function print>

OK, I confess I missed out "from __future__ import print_function". regards Steve -- Steve Holden +1 571 484 6266 +1 800 494 3119 Holden Web LLC http://www.holdenweb.com/
Re: Implementing file reading in C/Python
Johannes Bauer wrote: Which takes about 40 seconds. I want the niceness of Python but a little more speed than I'm getting (I'd settle for factor 2 or 3 slower, but factor 30 is just too much). This probably doesn't contribute much, but have you tried using Python profiler? You might have *something* wrong that eats up a lot of time in the code. The factor of 30 indeed does not seem right -- I have done somewhat similar stuff (calculating Levenshtein distance [edit distance] on words read from very large files), coded the same algorithm in pure Python and C++ (using linked lists in C++) and Python version was 2.5 times slower. Regards, mk -- http://mail.python.org/mailman/listinfo/python-list
Re: Implementing file reading in C/Python
Marc 'BlackJack' Rintsch schrieb:

    f = open(sys.argv[1], "r")

Mode should be 'rb'.

Check.

    filesize = os.stat(sys.argv[1])[6]

`os.path.getsize()` is a little bit more readable.

Check.

    print("Filesize       : %d" % (filesize))
    print("Image size     : %dx%d" % (width, height))
    print("Bytes per Pixel: %d" % (blocksize))

Why parentheses around ``print``'s argument? In Python < 3 ``print`` is a statement and not a function.

I write all new code to work under Python 3.0. Actually I develop on Python 3.0 but the code is currently deployed onto 2.6.

    picture = { }
    havepixels = 0
    while True:
        data = f.read(blocksize)
        if len(data) == 0:
            break

``if data: break`` is enough.

    datamap = { }
    for i in range(len(data)):
        datamap[ord(data[i])] = datamap.get(data[i], 0) + 1

Here you are creating a list full of integers to use them as index into `data` (twice) instead of iterating directly over the elements in `data`. And you are calling `ord()` for *every* byte in the file although you just need it for one value in each block. If it's possible to write the raw PGM format this conversion wouldn't be necessary at all.

OK, those two are just stupid, you're right. I changed it to:

    datamap = { }
    for i in data:
        datamap[i] = datamap.get(i, 0) + 1
    array = sorted([(b, a) for (a, b) in datamap.items()], reverse=True)
    most = ord(array[0][1])
    pic.write("%d\n" % (most))

For the `datamap` a `collections.defaultdict()` might be faster.

Tried that, not much of a change.

    maxchr = None
    maxcnt = None
    for (char, count) in datamap.items():
        if (maxcnt is None) or (count > maxcnt):
            maxcnt = count
            maxchr = char

Untested:

    maxchr = max((i, c) for c, i in datamap.iteritems())[1]

This is nice, I use it - the sort thing was a workaround anyways.

    most = maxchr

Why?

I don't really know anymore :-\

    posx = havepixels % width
    posy = havepixels / width

posy, posx = divmod(havepixels, width)

That's a nice one. Why are you using a dictionary as 2d array?
In the C code you simply write the values sequentially, why can't you just use a flat list and append here? Yup, I changed the Python code to behave the same way the C code did - however overall it's not much of an improvement: Takes about 15 minutes to execute (still factor 23). Thanks for all your pointers! Kind regards, Johannes -- Meine Gegenklage gegen dich lautet dann auf bewusste Verlogenheit, verlästerung von Gott, Bibel und mir und bewusster Blasphemie. -- Prophet und Visionär Hans Joss aka HJP in de.sci.physik 48d8bf1d$0$7510$54022...@news.sunrise.ch -- http://mail.python.org/mailman/listinfo/python-list
Re: Implementing file reading in C/Python
James Mills schrieb: What does this little tool do anyway? It's very interesting the images it creates out of files. What is this called? It has no particular name. I was toying around with the Princeton Cold Boot Attack (http://citp.princeton.edu/memory/). In particular I was interested in how much memory is erased when I would (on my system) enable the slow POST (which counts through all RAM three times). I downloaded the provided utilities, dumped my system memory via PXE boot onto another system after resetting it hard in the middle of a running Linux session. I did sync, though. Praise all journaling filesystems. As a 2GB file is not really of much use for telling where something is and where isn't, I thought of that picture coloring. In a 1024x1024 picture a pixel is 2048 bytes with 2GB of RAM, so exactly half a page. This is sufficiently high resolution to detect what's in there. I'm curious :) I haven't had much time to optimize it yet - I'll try to when I get home from work. Thanks for your effort, I appreciate it... hope my work leads to some meaningful results. Currently it looks (*cough* if there aren't bugs in my picture code) as if my PC would reset the whole RAM anyways, although I do not have any ECC. Strange. Kind regards, Johannes -- Meine Gegenklage gegen dich lautet dann auf bewusste Verlogenheit, verlästerung von Gott, Bibel und mir und bewusster Blasphemie. -- Prophet und Visionär Hans Joss aka HJP in de.sci.physik 48d8bf1d$0$7510$54022...@news.sunrise.ch
Re: Implementing file reading in C/Python
Marc 'BlackJack' Rintsch schrieb: On Fri, 09 Jan 2009 04:04:41 +0100, Johannes Bauer wrote: As this was horribly slow (20 Minutes for a 2GB file) I coded the whole thing in C also: Yours took ~37 minutes for 2 GiB here. This just ~15 minutes: Ah, ok... when implementing your suggestions in the other post, I did not get such a drastic performance increase. I really will have a look at it and try to locate where I'm wasting the time. Thanks a lot, Kind regards, Johannes -- Meine Gegenklage gegen dich lautet dann auf bewusste Verlogenheit, verlästerung von Gott, Bibel und mir und bewusster Blasphemie. -- Prophet und Visionär Hans Joss aka HJP in de.sci.physik 48d8bf1d$0$7510$54022...@news.sunrise.ch
Re: Implementing file reading in C/Python
mk schrieb: Johannes Bauer wrote: Which takes about 40 seconds. I want the niceness of Python but a little more speed than I'm getting (I'd settle for factor 2 or 3 slower, but factor 30 is just too much). This probably doesn't contribute much, but have you tried using the Python profiler? You might have *something* wrong that eats up a lot of time in the code. No - I didn't know there was a profiler, nor have I found anything meaningful (there seems to be a profiling C interface, but that won't get me anywhere). Is that a separate tool or something? Could you provide a link? The factor of 30 indeed does not seem right -- I have done somewhat similar stuff (calculating Levenshtein distance [edit distance] on words read from very large files), coded the same algorithm in pure Python and C++ (using linked lists in C++) and the Python version was 2.5 times slower. Yup, that was about what I had expected (and what I could well live with, it's a tradeoff). Thanks, Kind regards, Johannes -- Meine Gegenklage gegen dich lautet dann auf bewusste Verlogenheit, verlästerung von Gott, Bibel und mir und bewusster Blasphemie. -- Prophet und Visionär Hans Joss aka HJP in de.sci.physik 48d8bf1d$0$7510$54022...@news.sunrise.ch
Re: Implementing file reading in C/Python
On Jan 9, 8:48 am, Johannes Bauer dfnsonfsdu...@gmx.de wrote: No - and I've not known there was a profiler yet have found anything meaningful (there seems to be an profiling C interface, but that won't get me anywhere). Is that a seperate tool or something? Could you provide a link? Thanks, Kind regards, Johannes It is part of the python standard library: http://docs.python.org/library/profile.html -- http://mail.python.org/mailman/listinfo/python-list
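[Editor's note: a minimal sketch of pointing the standard-library profiler at a hot loop like the one in this thread; the `histogram` stand-in function is hypothetical, and the syntax shown is the modern (Python 3) form, though cProfile and pstats ship with 2.6 as well:

```python
import cProfile
import io
import pstats

def histogram(data):
    # Stand-in for the per-block byte-counting loop being profiled.
    counts = {}
    for byte in data:
        counts[byte] = counts.get(byte, 0) + 1
    return counts

profiler = cProfile.Profile()
profiler.enable()
histogram(b'spam' * 1000)
profiler.disable()

# Print the five most expensive calls by cumulative time.
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats('cumulative').print_stats(5)
print(out.getvalue())
```

Running a whole script under the profiler is even simpler: `python -m cProfile script.py`.]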
Re: Implementing file reading in C/Python
Marc 'BlackJack' Rintsch wrote: On Fri, 09 Jan 2009 04:04:41 +0100, Johannes Bauer wrote: As this was horribly slow (20 Minutes for a 2GB file) I coded the whole thing in C also: Yours took ~37 minutes for 2 GiB here. This just ~15 minutes:

    #!/usr/bin/env python
    from __future__ import division, with_statement
    import os
    import sys
    from collections import defaultdict
    from functools import partial
    from itertools import imap

    def iter_max_values(blocks, block_count):
        for i, block in enumerate(blocks):
            histogram = defaultdict(int)
            for byte in block:
                histogram[byte] += 1
            yield max((count, value)
                      for value, count in histogram.iteritems())[1]

[snip] Would it be faster if histogram was a list initialised to [0] * 256?
Re: Implementing file reading in C/Python
On Jan 9, 6:48 am, Johannes Bauer dfnsonfsdu...@gmx.de wrote: mk schrieb: The factor of 30 indeed does not seem right -- I have done somewhat similar stuff (calculating Levenshtein distance [edit distance] on words read from very large files), coded the same algorithm in pure Python and C++ (using linked lists in C++) and Python version was 2.5 times slower. Yup, that was about what I had expected (and what I could well live with, it's a tradeoff). The rule-of-thumb I use is that Python is generally 5 to 50 times slower than C. It is considered blasphemy to say it in this group, but Python is slow. It does of course have many compensating advantages that make using it advantageous when runtime speed is not of primary importance. -- http://mail.python.org/mailman/listinfo/python-list
Re: Implementing file reading in C/Python
On 2009-01-09, Johannes Bauer dfnsonfsdu...@gmx.de wrote: I've come from C/C++ and am now trying to code some Python because I absolutely love the language. However I still have trouble getting Python code to run efficiently. Right now I have a easy task: Get a file, If I were you, I'd try mmap()ing the file instead of reading it into string objects one chunk at a time. -- Grant Edwards grante Yow! I'm DESPONDENT ... I at hope there's something visi.comDEEP-FRIED under this miniature DOMED STADIUM ... -- http://mail.python.org/mailman/listinfo/python-list
Re: Implementing file reading in C/Python
Johannes Bauer, I was about to start writing a faster version. I think with some care and Psyco you can get to about 5 times slower than C or something like that. To do that you need to use almost the same code as the C version, with a list of 256 ints for the frequencies, not using max() but a manual loop, not using itertools or generators, maybe splitting the code in two functions to allow Psyco to optimize better, maybe using another array(...) for the frequencies too. The data can be read into an array.array('B'), and so on. But I think all this work is a waste of time. I like Python, but that C code, after some cleaning and polishing, looks fine for this job. Of course there are other languages that may give you a little nicer code for this program, like D, and there may be ways to use numpy too to speed up the computation of the mode, but they don't look so much important this time. Bye, bearophile
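[Editor's note: the layout bearophile describes - a flat 256-slot frequency table, the block read into an array.array('B'), and a manual scan instead of max() - might look roughly like this. A sketch only, with a hypothetical function name, and not tested under Psyco:

```python
from array import array

def mode_of_block(block):
    # Flat list of 256 counters, one slot per possible byte value.
    freq = [0] * 256
    # array('B', block) exposes the block's bytes as small integers.
    for byte in array('B', block):
        freq[byte] += 1
    # Manual scan instead of max(): plain loops are what Psyco's
    # compiler handled best.
    best_value = 0
    best_count = freq[0]
    for value in range(1, 256):
        if freq[value] > best_count:
            best_count = freq[value]
            best_value = value
    return best_value
```

Ties resolve to the lowest byte value, since the scan only replaces on a strictly greater count.]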
Re: Implementing file reading in C/Python
Grant Edwards inva...@invalid wrote: On 2009-01-09, Johannes Bauer dfnsonfsdu...@gmx.de wrote: I've come from C/C++ and am now trying to code some Python because I absolutely love the language. However I still have trouble getting Python code to run efficiently. Right now I have an easy task: Get a file, If I were you, I'd try mmap()ing the file instead of reading it into string objects one chunk at a time. You've snipped the bit further on in that sentence where the OP says that the file of interest is 2GB. Do you still want to try mmap'ing it? -- \S -- si...@chiark.greenend.org.uk -- http://www.chaos.org.uk/~sion/ Frankly I have no feelings towards penguins one way or the other -- Arthur C. Clarke her nu becomeþ se bera eadward ofdun hlæddre heafdes bæce bump bump bump -- http://mail.python.org/mailman/listinfo/python-list
Re: Implementing file reading in C/Python
On 2009-01-09, Sion Arrowsmith si...@chiark.greenend.org.uk wrote: Grant Edwards inva...@invalid wrote: On 2009-01-09, Johannes Bauer dfnsonfsdu...@gmx.de wrote: I've come from C/C++ and am now trying to code some Python because I absolutely love the language. However I still have trouble getting Python code to run efficiently. Right now I have an easy task: Get a file, If I were you, I'd try mmap()ing the file instead of reading it into string objects one chunk at a time. You've snipped the bit further on in that sentence where the OP says that the file of interest is 2GB. Do you still want to try mmap'ing it? Sure. The larger the file, the more you gain from mmap'ing it. 2GB should easily fit within the process's virtual memory space. When you mmap a file, it doesn't take up any physical memory. As you access different parts of it, pages are swapped in/out by the OS's VM system. If you're using a decent OS, the demand-paged VM system will handle things far more efficiently than creating millions of strings and letting the Python garbage collector clean them up. Or does mmap in Python mean something completely different from mmap in the C library? -- Grant Edwards, grante at visi.com -- Yow! Everybody gets free BORSCHT! -- http://mail.python.org/mailman/listinfo/python-list
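(Python's mmap is indeed a thin wrapper over the C library's. A sketch of the chunked-mmap approach follows; the small stand-in file is only there to make the snippet self-contained, and note that on a 32-bit process a full 2 GB mapping may not fit in the address space.)

```python
import mmap

# Stand-in for the real 2 GB input, just so the sketch is runnable.
with open("sample.bin", "wb") as f:
    f.write(b"abc" * 1000)

with open("sample.bin", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    blocksize = 1024
    chunks = 0
    for offset in range(0, len(mm), blocksize):
        block = mm[offset:offset + blocksize]  # slicing copies only this chunk
        # ... histogram `block` here; pages are demand-faulted by the OS ...
        chunks += 1
    mm.close()

print(chunks)  # -> 3 (3000 bytes in 1024-byte chunks)
```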
Re: Implementing file reading in C/Python
On Fri, 09 Jan 2009 15:34:17 +0000, MRAB wrote: Marc 'BlackJack' Rintsch wrote:

def iter_max_values(blocks, block_count):
    for i, block in enumerate(blocks):
        histogram = defaultdict(int)
        for byte in block:
            histogram[byte] += 1
        yield max((count, value) for value, count in histogram.iteritems())[1]

[snip] Would it be faster if histogram was a list initialised to [0] * 256? Don't know. Then for every byte in the 2 GiB we have to call `ord()`. Maybe the speedup from the list compensates for this, maybe not. I think that having to do something with *every* byte of that really large file *at Python level* is the main problem here. In C that's just some primitive numbers. Python has all the object overhead. Ciao, Marc 'BlackJack' Rintsch -- http://mail.python.org/mailman/listinfo/python-list
Re: Implementing file reading in C/Python
On 2009-01-09, Marc 'BlackJack' Rintsch bj_...@gmx.net wrote: On Fri, 09 Jan 2009 15:34:17 +0000, MRAB wrote: Marc 'BlackJack' Rintsch wrote:

def iter_max_values(blocks, block_count):
    for i, block in enumerate(blocks):
        histogram = defaultdict(int)
        for byte in block:
            histogram[byte] += 1
        yield max((count, value) for value, count in histogram.iteritems())[1]

[snip] Would it be faster if histogram was a list initialised to [0] * 256? Don't know. Then for every byte in the 2 GiB we have to call `ord()`. Maybe the speedup from the list compensates for this, maybe not. I think that having to do something with *every* byte of that really large file *at Python level* is the main problem here. In C that's just some primitive numbers. Python has all the object overhead. Using buffers or arrays of bytes instead of strings/lists would probably reduce the overhead quite a bit. -- Grant Edwards, grante at visi.com -- Yow! I've got an IDEA!! Why don't I STARE at you so HARD, you forget your SOCIAL SECURITY NUMBER!! -- http://mail.python.org/mailman/listinfo/python-list
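Grant's suggestion combines naturally with the defaultdict version quoted above: reading each chunk into a bytearray means iteration yields small ints directly, so no per-byte ord() is needed. A sketch (the sample data is mine):

```python
from collections import defaultdict

block = bytearray(b"hello world")   # stand-in for one file chunk
histogram = defaultdict(int)
for byte in block:                  # byte is already an int in 0..255
    histogram[byte] += 1
# Highest count wins; ties break toward the larger byte value.
count, value = max((c, v) for v, c in histogram.items())
print(value, count)                 # -> 108 3 (ord('l') occurs three times)
```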
Re: Implementing file reading in C/Python
On Jan 9, 9:56 pm, mk mrk...@gmail.com wrote: The factor of 30 indeed does not seem right -- I have done somewhat similar stuff (calculating Levenshtein distance [edit distance] on words read from very large files), coded the same algorithm in pure Python and C++ (using linked lists in C++) and Python version was 2.5 times slower. Levenshtein distance using linked lists? That's novel. Care to divulge? And if C++ is using linked lists and Python isn't, it's not really the same algorithm, is it? Cheers, John -- http://mail.python.org/mailman/listinfo/python-list
Re: Implementing file reading in C/Python
On Jan 9, 2:14 pm, Marc 'BlackJack' Rintsch bj_...@gmx.net wrote: On Fri, 09 Jan 2009 15:34:17 +0000, MRAB wrote: Marc 'BlackJack' Rintsch wrote:

def iter_max_values(blocks, block_count):
    for i, block in enumerate(blocks):
        histogram = defaultdict(int)
        for byte in block:
            histogram[byte] += 1
        yield max((count, value) for value, count in histogram.iteritems())[1]

[snip] Would it be faster if histogram was a list initialised to [0] * 256? Don't know. Then for every byte in the 2 GiB we have to call `ord()`. Maybe the speedup from the list compensates for this, maybe not. I think that having to do something with *every* byte of that really large file *at Python level* is the main problem here. In C that's just some primitive numbers. Python has all the object overhead. struct's B format might help here. Also, struct.unpack_from could probably be combined with mmap to avoid copying the input. Not to mention that the 0..256 ints are all cached and won't be allocated/deallocated. -- http://mail.python.org/mailman/listinfo/python-list
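A sketch of the struct idea: the "B" format code unpacks raw bytes as unsigned ints, and unpack_from reads straight out of any buffer object (an mmap'ed file would work the same way) without building an intermediate string per chunk. The tiny sample buffer is mine:

```python
import struct

buf = b"abca"                       # stand-in for one 2048-byte chunk
fmt = "%dB" % len(buf)              # e.g. "4B": four unsigned bytes
values = struct.unpack_from(fmt, buf, 0)
print(values)                       # -> (97, 98, 99, 97)
```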
Re: Implementing file reading in C/Python
Johannes Bauer wrote:

Hello group,

I've come from C/C++ and am now trying to code some Python because I absolutely love the language. However I still have trouble getting Python code to run efficiently. Right now I have an easy task: Get a file, split it up into a million chunks, count the most prominent character in each chunk and output that value into a file - in other words: Say we have a 2 GB file, we evaluate which character is most prominent in filepos [0, 2048[ - say it's an 'A', then put a 65 in there (ord('A')).

I've first tried Python. Please don't beat me, it's slow as hell and probably a horrible solution:

#!/usr/bin/python
import sys
import os

f = open(sys.argv[1], "r")
filesize = os.stat(sys.argv[1])[6]

width = 1024
height = 1024
pixels = width * height
blocksize = filesize / width / height

print("Filesize       : %d" % (filesize))
print("Image size     : %dx%d" % (width, height))
print("Bytes per Pixel: %d" % (blocksize))

picture = { }
havepixels = 0
while True:
    data = f.read(blocksize)
    if len(data) <= 0:
        break

    datamap = { }
    for i in range(len(data)):
        datamap[ord(data[i])] = datamap.get(ord(data[i]), 0) + 1

    maxchr = None
    maxcnt = None
    for (char, count) in datamap.items():
        if (maxcnt is None) or (count > maxcnt):
            maxcnt = count
            maxchr = char
    most = maxchr

    posx = havepixels % width
    posy = havepixels / width
    havepixels += 1
    if (havepixels % 1024) == 0:
        print("Progress %s: %.1f%%" % (sys.argv[1], 100.0 * havepixels / pixels))

    picture[(posx, posy)] = most

pic = open(sys.argv[1] + ".pgm", "w")
pic.write("P2\n")
pic.write("# CREATOR: Crappyass Python Script\n")
pic.write("%d %d\n" % (width, height))
pic.write("255\n")
for y in range(height):
    for x in range(width):
        pos = (x, y)
        most = picture.get(pos, -1)
        pic.write("%d\n" % (most))

As this was horribly slow (20 minutes for a 2GB file) I coded the whole thing in C also:

#include <stdio.h>
#include <errno.h>
#include <string.h>
#include <stdlib.h>

#define BLOCKSIZE 2048

int main(int argc, char **argv) {
	unsigned int count[256];
	int width, height;
	FILE *f;
	FILE *in;
	char temp[2048];

	width = 1024;
	height = 1024;

	if (argc != 2) {
		fprintf(stderr, "Argument?\n");
		exit(2);
	}

	in = fopen(argv[1], "r");
	if (!in) {
		perror("fopen");
		exit(1);
	}

	snprintf(temp, 255, "%s.pgm", argv[1]);
	f = fopen(temp, "w");
	if (!f) {
		perror("fopen");
		exit(1);
	}

	fprintf(f, "P2\n");
	fprintf(f, "# CREATOR: C\n");
	fprintf(f, "%d %d\n", width, height);
	fprintf(f, "255\n");

	while (fread(temp, 1, sizeof(temp), in) == sizeof(temp)) {
		int i;
		int greatest;
		int maxcount;
		memset(count, 0, sizeof(count));
		for (i = 0; i < sizeof(temp); i++) {
			count[(unsigned char)temp[i]]++;
		}
		greatest = 0;
		maxcount = count[0];
		for (i = 1; i < 256; i++) {
			if (count[i] > maxcount) {
				maxcount = count[i];
				greatest = i;
			}
		}
		fprintf(f, "%d\n", greatest);
	}

	fclose(f);
	fclose(in);
	return 0;
}

Which takes about 40 seconds. I want the niceness of Python but a little more speed than I'm getting (I'd settle for factor 2 or 3 slower, but factor 30 is just too much). Can anyone point out how to solve this efficiently in Python?

Have a look at psyco: http://psyco.sourceforge.net/ -- http://mail.python.org/mailman/listinfo/python-list
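One more pure-Python angle on the per-chunk most-common-byte computation, not raised in the thread but in the same spirit as the psyco suggestion (an addition of mine, assuming Python 3 byte semantics, where iterating bytes yields ints): bytes.count loops in C, so a handful of C-speed scans of a 2 KB chunk can beat one Python-level loop over every byte.

```python
def mode_byte(block):
    # block is a bytes object; block.count runs in C, so one scan per
    # distinct byte value is usually far cheaper than a Python-level
    # loop that touches every byte.
    return max(set(block), key=block.count)

print(mode_byte(b"aabbbc"))  # -> 98, i.e. ord('b')
```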
Re: Implementing file reading in C/Python
On Fri, Jan 9, 2009 at 1:04 PM, Johannes Bauer dfnsonfsdu...@gmx.de wrote: Hello group, Hello. (...) Which takes about 40 seconds. I want the niceness of Python but a little more speed than I'm getting (I'd settle for factor 2 or 3 slower, but factor 30 is just too much). Can anyone point out how to solve this efficiently in Python? Johannes, your two programs, one in Python and the other in C, do _not_ produce the same result. I have tested this against a randomly generated file from /dev/urandom (10M). Yes, the Python one is much slower, but I believe it's because the Python implementation is _correct_ where the C one is _wrong_ :) The resulting test.bin.pgm from Python is exactly 3.5M (from 10M). The resulting test.bin.pgm from the C version is 16K. Something is not quite right here :) cheers James -- http://mail.python.org/mailman/listinfo/python-list
Re: Implementing file reading in C/Python
James Mills schrieb: I have tested this against a randomly generated file from /dev/urandom (10M). Yes, the Python one is much slower, but I believe it's because the Python implementation is _correct_ where the C one is _wrong_ :) The resulting test.bin.pgm from Python is exactly 3.5M (from 10M). The resulting test.bin.pgm from the C version is 16K. Something is not quite right here :) Uhh, yes, you're right there... I must admit that I was too lazy to include all the stat headers and do a proper st_size check in the C version (just a quick hack), so it's practically hardcoded. With files of exactly 2GB in size the results should be the same (more or less; +-1 line doesn't matter really), because 2 GB / 2048 (the buffer) = 1 million. Sorry I didn't mention that; it was really kind of sloppy, quick-and-dirty C writing on my part. But you're right, the Python implementation does what is actually supposed to happen. Kind regards, Johannes -- "My countersuit against you will then charge deliberate mendacity, slander of God, the Bible and myself, and deliberate blasphemy." -- Prophet und Visionär Hans Joss aka HJP in de.sci.physik 48d8bf1d$0$7510$54022...@news.sunrise.ch -- http://mail.python.org/mailman/listinfo/python-list
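Part of the size discrepancy comes from the C loop condition `while (fread(...) == sizeof(temp))`, which stops on the first short read and silently drops a short final block. A Python chunk reader that keeps the tail looks like this (a sketch; the generator name is mine):

```python
import io

def read_blocks(f, blocksize):
    # Yield fixed-size chunks until EOF; unlike the C loop's
    # `fread(...) == sizeof(temp)` test, a short final block is
    # still yielded rather than silently dropped.
    while True:
        block = f.read(blocksize)
        if not block:
            break
        yield block

blocks = list(read_blocks(io.BytesIO(b"x" * 5000), 2048))
print([len(b) for b in blocks])  # -> [2048, 2048, 904]
```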
Re: Implementing file reading in C/Python
On Fri, Jan 9, 2009 at 3:13 PM, Johannes Bauer dfnsonfsdu...@gmx.de wrote: Uhh, yes, you're right there... I must admit that I was too lazy to include all the stat headers and to a proper st_size check in the C version (just a quick hack), so it's practically hardcoded. With files of exactly 2GB in size the results should be the same (more or less, +- 1 line doesn't matter really), because 2 GB / 2048 (the buffer) = 1 Million. Sorry I didn't mention that, it was really kind of sloppy, quick-and-dirty C writing on my part. But you're right, the Python implementation does what is actually supposed to happen. I shall attempt to optimize this :) I have a funny feeling you might be caught up with some features of Python - one notable one being that some things in Python are immutable. psyco might help here though ... cheers James -- http://mail.python.org/mailman/listinfo/python-list
Re: Implementing file reading in C/Python
On Fri, Jan 9, 2009 at 2:29 PM, James Mills prolo...@shortcircuit.net.au wrote: I shall attempt to optimize this :) I have a funny feeling you might be caught up with some features of Python - one notable one being that some things in Python are immutable. psyco might help here though ... What does this little tool do anyway? It's very interesting, the images it creates out of files. What is this called? I'm curious :) I haven't had much time to optimize it yet - I'll try to when I get home from work. cheers James -- http://mail.python.org/mailman/listinfo/python-list
Re: Implementing file reading in C/Python
MRAB wrote: Johannes Bauer wrote: Hello group, [and about 200 other lines there was no need to quote] [...] Have a look at psyco: http://psyco.sourceforge.net/ Have a little consideration for others when making a short reply to a long post, please. Trim what isn't necessary. Thanks. regards Steve -- Steve Holden+1 571 484 6266 +1 800 494 3119 Holden Web LLC http://www.holdenweb.com/ -- http://mail.python.org/mailman/listinfo/python-list