Re: very large dictionary
Simon Strobl wrote:
(snip)
> I would prefer to be able to use the same type of scripts with data
> of all sizes, though.

Since computers have a limited RAM, this will remain a wish. You obviously can't expect to deal with terabytes of data like you do with a 1kb text file.
--
http://mail.python.org/mailman/listinfo/python-list
Re: very large dictionary
Bruno Desthuilliers wrote:
> Simon Strobl wrote:
> (snip)
>> I would prefer to be able to use the same type of scripts with data
>> of all sizes, though.
>
> Since computers have a limited RAM, this will remain a wish. You
> obviously can't expect to deal with terabytes of data like you do with
> a 1kb text file.

You can, you just start off handling the multi-GB case and you're set. Databases are really easy; I often use them for manipulating pretty small amounts of data because it's just an easy way to group and join etc.
Re: very large dictionary
> Have you considered that the operating system imposes per-process
> limits on memory usage? You say that your server has 128 GB of memory,
> but that doesn't mean the OS will make anything like that available.

According to our system administrator, I can use all of the 128G.

>> I thought it would be practical not to create the dictionary from a
>> text file each time I needed it. I.e. I thought loading the .pyc-file
>> should be faster. Yet, Python failed to create a .pyc-file.
>
> Probably a good example of premature optimization.

Well, as I was using Python, I did not expect to have to care about the language's internal affairs that much. I thought I could simply do always the same no matter how large my files get. In other words, I thought Python was really scalable.

> Out of curiosity, how long does it take to create it from a text file?

I do not remember this exactly, but I think it was not much more than an hour.
Re: very large dictionary
On Tue, 05 Aug 2008 01:20:08 -0700, Simon Strobl wrote:
>>> I thought it would be practical not to create the dictionary from a
>>> text file each time I needed it. I.e. I thought loading the .pyc-file
>>> should be faster. Yet, Python failed to create a .pyc-file.
>>
>> Probably a good example of premature optimization.
>
> Well, as I was using Python, I did not expect to have to care about the
> language's internal affairs that much. I thought I could simply do
> always the same no matter how large my files get. In other words, I
> thought Python was really scalable.

Yeah, it really is a pain when abstractions leak.
http://www.joelonsoftware.com/articles/LeakyAbstractions.html

>> Out of curiosity, how long does it take to create it from a text file?
>
> I do not remember this exactly. But I think it was not much more than
> an hour.

Hmmm... longer than I expected. Perhaps not as premature as I thought. Have you tried the performance of the pickle and marshal modules?

-- Steven
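[A rough sketch of the comparison Steven suggests, on a toy stand-in for the bigram dictionary (the sample keys and file paths are made up; timings on the real 6.8G data would of course differ greatly):]

```python
import marshal
import os
import pickle
import tempfile

# Toy stand-in for the real bigram dictionary (keys are illustrative)
bigrams = {",djy": 75, ",djz": 57, ",djzoom": 165}

workdir = tempfile.mkdtemp()

# pickle: general-purpose serialization; a binary protocol is much
# faster than the default text protocol
pkl_path = os.path.join(workdir, "bigrams.pkl")
with open(pkl_path, "wb") as f:
    pickle.dump(bigrams, f, protocol=2)
with open(pkl_path, "rb") as f:
    restored_pickle = pickle.load(f)

# marshal: the serializer behind .pyc files; fast, but its format is
# CPython-version-specific and not meant for long-term storage
mar_path = os.path.join(workdir, "bigrams.mar")
with open(mar_path, "wb") as f:
    marshal.dump(bigrams, f)
with open(mar_path, "rb") as f:
    restored_marshal = marshal.load(f)

print(restored_pickle == bigrams and restored_marshal == bigrams)  # True
```

[Either way the dictionary is stored as data rather than as source code, so no compile step is involved on reload.]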
Re: very large dictionary
Simon Strobl [EMAIL PROTECTED] wrote:
> Well, as I was using Python, I did not expect to have to care about the
> language's internal affairs that much. I thought I could simply do
> always the same no matter how large my files get. In other words, I
> thought Python was really scalable.

It's not Python here. It's just how computers work. IMHO having a gargantuan dictionary in memory is not a good idea (unless explicitly proven otherwise): this is the kind of job databases have been created for. Besides, this is not a matter of Python. If you were using C or another language, I would have suggested using databases in order to manipulate GBs of data. Luckily enough, using databases in Python is far easier than doing so in C/C++/Java. *And* there are thin abstractions over databases so you don't even need to know how to use them (though I suggest that you *do* learn something about DBs and especially relational DBs; SQL is not *that* bad).

-- -riko
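[A minimal sketch of the database route riko describes, using the stdlib sqlite3 module; the table and column names are made up for illustration, and a real run would use a file on disk rather than ":memory:":]

```python
import sqlite3

# An in-memory DB keeps the sketch self-contained; pass a filename
# instead to get a real disk-backed store that survives the process
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bigrams (bigram TEXT PRIMARY KEY, freq INTEGER)")
conn.executemany(
    "INSERT INTO bigrams VALUES (?, ?)",
    [(",djy", 75), (",djz", 57), (",dk", 28893)],
)
conn.commit()

# Look up a single bigram without ever holding the whole mapping in RAM
(freq,) = conn.execute(
    "SELECT freq FROM bigrams WHERE bigram = ?", (",dk",)
).fetchone()
print(freq)  # 28893
```

[The point of the design is that lookups touch only the index pages they need, so the working set stays small no matter how large the table grows.]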
Re: very large dictionary
On Mon, 04 Aug 2008 11:02:16 -0300, Simon Strobl [EMAIL PROTECTED] wrote:
> I created a python file that contained the dictionary. The size of this
> file was 6.8GB.
> I thought it would be practical not to create the dictionary from a
> text file each time I needed it. I.e. I thought loading the .pyc-file
> should be faster. Yet, Python failed to create a .pyc-file.

Looks like the marshal format (used to create the .pyc file) can't handle sizes so big - and that limitation will stay for a while:
http://mail.python.org/pipermail/python-dev/2007-May/073161.html

So follow any of the previous suggestions and store your dictionary as data, not code.

-- Gabriel Genellina
Re: very large dictionary
Simon Strobl wrote:
> Well, as I was using Python, I did not expect to have to care about the
> language's internal affairs that much. I thought I could simply do
> always the same no matter how large my files get. In other words, I
> thought Python was really scalable.

Python the language is indefinitely scalable. Finite implementations are not. CPython is a C program compiled to a system executable. Most OSes run executables with a fairly limited call stack space.

CPython programs are, when possible, cached as .pyc files. The existence and format of .pyc's is an internal affair of the CPython implementation. They are most definitely not a language requirement or language feature.

Have you tried feeding multi-gigabyte source code files to other compilers? Most, if not all, could be broken by the 'right' big-enough code.

tjr
Re: very large dictionary
On 4 Aug., 00:51, Avinash Vora [EMAIL PROTECTED] wrote:
> On Aug 4, 2008, at 4:12 AM, Jörgen Grahn wrote:
>> (You might want to post this to comp.lang.python rather than to me --
>> I am just another c.l.p reader. If you already have done so, please
>> disregard this.)
>
> Yeah, I hit reply by mistake and didn't realize it. My bad.
>
>>>> (I assume here that Berkeley DB supports 7GB data sets.)
>>>
>>> If I remember correctly, BerkeleyDB is limited to a single file size
>>> of 2GB.
>>
>> Sounds likely. But with some luck maybe they have increased this in
>> later releases? There seem to be many competing Berkeley releases.
>
> It's worth investigating, but that leads me to: I haven't caught the
> earlier parts of this thread, but do I understand correctly that
> someone wants to load a 7GB dataset into the form of a dictionary?
>
>> Yes, he claimed the dictionary was 6.8 GB. How he measured that, I
>> don't know. To the OP: how did you measure this?

I created a python file that contained the dictionary. The size of this file was 6.8GB. I thought it would be practical not to create the dictionary from a text file each time I needed it. I.e. I thought loading the .pyc-file should be faster. Yet, Python failed to create a .pyc-file.

Simon
Re: very large dictionary
On Mon, 04 Aug 2008 07:02:16 -0700, Simon Strobl wrote:
> I created a python file that contained the dictionary. The size of this
> file was 6.8GB.

Ah, that's what I thought you had done. That's not a dictionary. That's a text file containing the Python code to create a dictionary.

My guess is that a 7GB text file will require significantly more memory once converted to an actual dictionary: in my earlier post, I estimated about 5GB for pointers. Total size of the dictionary is impossible to estimate accurately without more information, but I'd guess that 10GB or 20GB wouldn't be unreasonable.

Have you considered that the operating system imposes per-process limits on memory usage? You say that your server has 128 GB of memory, but that doesn't mean the OS will make anything like that available.

And I don't know how to even start estimating how much temporary memory is required to parse and build such an enormous Python program. Not only is it a 7GB program, but it is 7GB in one statement.

> I thought it would be practical not to create the dictionary from a
> text file each time I needed it. I.e. I thought loading the .pyc-file
> should be faster. Yet, Python failed to create a .pyc-file.

Probably a good example of premature optimization. Out of curiosity, how long does it take to create it from a text file?

-- Steven
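[One rough way to put numbers on Steven's back-of-the-envelope estimate is sys.getsizeof, which appeared in Python 2.6 (so not in the OP's 2.5.2); note it reports only the shallow size of each object, so this sketch still undercounts a real dictionary:]

```python
import sys

# Toy stand-in for the bigram dict; the real one has millions of entries
bigrams = {",djy": 75, ",djz": 57, ",djzoom": 165}

# The dict object itself: its hash table of slots -- roughly the
# "pointers" in the back-of-the-envelope estimate
table_bytes = sys.getsizeof(bigrams)

# Shallow sizes of the key and value objects those slots point to
payload_bytes = sum(
    sys.getsizeof(k) + sys.getsizeof(v) for k, v in bigrams.items()
)

total = table_bytes + payload_bytes
print(total > table_bytes)  # True: the objects cost more than the table alone
```

[Scaling the per-entry cost up to millions of entries is what pushes the in-memory total well past the 6.8G of source text.]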
Re: very large dictionary
On Fri, 1 Aug 2008 01:05:07 -0700 (PDT), Simon Strobl [EMAIL PROTECTED] wrote:
>> What does "load a dictionary" mean?
>
> I had a file bigrams.py with a content like below:
>
> bigrams = {
>     ',djy' : 75,
>     ',djz' : 57,
>     ',djzoom' : 165,
>     ',dk' : 28893,
>     ',dk.au' : 854,
>     ',dk.b.' : 3668,
>     ...
> }
>
> In another file I said:
>
> from bigrams import bigrams
>
>> How about using a database instead of a dictionary?
>
> If there is no other way to do it, I will have to learn how to use
> databases in Python.

If you use Berkeley DB (import bsddb), you don't have to learn much. These databases look very much like dictionaries string:string, only they are disk-backed. (I assume here that Berkeley DB supports 7GB data sets.)

/Jorgen

--
// Jorgen Grahn grahn@Ph'nglui mglw'nafh Cthulhu
\X/ snipabacken.se R'lyeh wgah'nagl fhtagn!
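[Jorgen's disk-backed-dictionary idea in a minimal sketch; bsddb was later removed from the stdlib, so this uses the stdlib dbm module instead, which exposes the same str->str mapping interface (the file path and sample keys are made up):]

```python
import dbm
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "bigrams.db")

# Create a disk-backed str->str mapping; counts must be stored as strings
with dbm.open(path, "c") as db:
    db[",djy"] = "75"
    db[",dk"] = "28893"

# Reopen later (read-only) without loading everything into memory
with dbm.open(path, "r") as db:
    count = int(db[",dk"])
print(count)  # 28893
```

[The lookup syntax really is just dictionary syntax; the only differences are that values come back as bytes and must be converted, and that each access hits the disk-backed store rather than RAM.]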
Re: very large dictionary
On 3 Aug 2008 20:36:33 GMT, Jorgen Grahn [EMAIL PROTECTED] wrote:
> On Fri, 1 Aug 2008 01:05:07 -0700 (PDT), Simon Strobl
> [EMAIL PROTECTED] wrote:
>> ... If there is no other way to do it, I will have to learn how to use
>> databases in Python.
>
> If you use Berkeley DB (import bsddb), you don't have to learn much.
> These databases look very much like dictionaries string:string, only
> they are disk-backed.

... all of which Sean pointed out elsewhere in the thread. Oh well. I guess pointing it out twice doesn't hurt. bsddb has been very pleasant to work with for me. I normally avoid database programming like the plague.

/Jorgen

--
// Jorgen Grahn grahn@Ph'nglui mglw'nafh Cthulhu
\X/ snipabacken.se R'lyeh wgah'nagl fhtagn!
Re: very large dictionary
On 3 Aug 2008 20:40:02 GMT, Jorgen Grahn [EMAIL PROTECTED] wrote:
> On 3 Aug 2008 20:36:33 GMT, Jorgen Grahn [EMAIL PROTECTED] wrote:
>> On Fri, 1 Aug 2008 01:05:07 -0700 (PDT), Simon Strobl
>> [EMAIL PROTECTED] wrote:
>>> ... If there is no other way to do it, I will have to learn how to
>>> use databases in Python.
>>
>> If you use Berkeley DB (import bsddb), you don't have to learn much.
>> These databases look very much like dictionaries string:string, only
>> they are disk-backed.
>
> ... all of which Sean pointed out elsewhere in the thread. Oh well. I
> guess pointing it out twice doesn't hurt. bsddb has been very pleasant
> to work with for me. I normally avoid database programming like the
> plague.

13.4 shelve -- Python object persistence

A "shelf" is a persistent, dictionary-like object. The difference with "dbm" databases is that the values (not the keys!) in a shelf can be essentially arbitrary Python objects -- anything that the pickle module can handle. This includes most class instances, recursive data types, and objects containing lots of shared sub-objects. The keys are ordinary strings [...]
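[The shelve behaviour quoted above in a minimal sketch; unlike plain dbm, the values can be arbitrary picklable Python objects (the sample key and payload here are made up):]

```python
import os
import shelve
import tempfile

path = os.path.join(tempfile.mkdtemp(), "bigrams_shelf")

# Values can be any picklable object, not just strings
with shelve.open(path) as shelf:
    shelf[",djy"] = {"count": 75, "examples": ["of djy"]}

# Reopen later; only the entries you touch are unpickled into memory
with shelve.open(path) as shelf:
    entry = shelf[",djy"]
print(entry["count"])  # 75
```

[For the bigram use case -- string keys, int values -- a shelf trades a little speed for this flexibility compared with a raw dbm file.]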
Re: very large dictionary
On Aug 4, 2008, at 4:12 AM, Jörgen Grahn wrote:
> (You might want to post this to comp.lang.python rather than to me --
> I am just another c.l.p reader. If you already have done so, please
> disregard this.)

Yeah, I hit reply by mistake and didn't realize it. My bad.

>>> (I assume here that Berkeley DB supports 7GB data sets.)
>>
>> If I remember correctly, BerkeleyDB is limited to a single file size
>> of 2GB.
>
> Sounds likely. But with some luck maybe they have increased this in
> later releases? There seem to be many competing Berkeley releases.

It's worth investigating, but that leads me to: I haven't caught the earlier parts of this thread, but do I understand correctly that someone wants to load a 7GB dataset into the form of a dictionary?

> Yes, he claimed the dictionary was 6.8 GB. How he measured that, I
> don't know. To the OP: how did you measure this?

-- Avi
Re: very large dictionary
On Fri, 01 Aug 2008 00:46:09 -0700, Simon Strobl wrote:
> Hello,
> I tried to load a 6.8G large dictionary on a server that has 128G of
> memory. I got a memory error. I used Python 2.5.2. How can I load my
> data?

How do you know the dictionary takes 6.8G? I'm going to guess an answer to my own question. In a later post, Simon wrote:

[quote]
I had a file bigrams.py with a content like below:

bigrams = {
    ',djy' : 75,
    ',djz' : 57,
    ',djzoom' : 165,
    ',dk' : 28893,
    ',dk.au' : 854,
    ',dk.b.' : 3668,
    ...
}
[end quote]

I'm guessing that the file is 6.8G of *text*. How much memory will it take to import that? I don't know, but probably a lot more than 6.8G. The compiler has to read the whole file in one giant piece, analyze it, create all the string and int objects, and only then can it create the dict. By my back-of-the-envelope calculations, the pointers alone will require about 5GB, never mind the objects they point to.

I suggest trying to store your data as data, not as Python code. Create a text file bigrams.txt with one key/value per line, like this:

djy : 75
djz : 57
djzoom : 165
dk : 28893
...

Then import it like such:

bigrams = {}
for line in open('bigrams.txt', 'r'):
    key, value = line.split(':')
    bigrams[key.strip()] = int(value.strip())

This will be slower, but because it only needs to read the data one line at a time, it might succeed where trying to slurp all 6.8G in one piece will fail.

-- Steven
very large dictionary
Hello,

I tried to load a 6.8G large dictionary on a server that has 128G of memory. I got a memory error. I used Python 2.5.2. How can I load my data?

Simon
Re: very large dictionary
On Fri, 01 Aug 2008 00:46:09 -0700, Simon Strobl wrote:
> I tried to load a 6.8G large dictionary on a server that has 128G of
> memory. I got a memory error. I used Python 2.5.2. How can I load my
> data?

What does "load a dictionary" mean? Was it saved with the `pickle` module? How about using a database instead of a dictionary?

Ciao,
Marc 'BlackJack' Rintsch
Re: very large dictionary
> What does "load a dictionary" mean?

I had a file bigrams.py with a content like below:

bigrams = {
    ',djy' : 75,
    ',djz' : 57,
    ',djzoom' : 165,
    ',dk' : 28893,
    ',dk.au' : 854,
    ',dk.b.' : 3668,
    ...
}

In another file I said:

from bigrams import bigrams

> How about using a database instead of a dictionary?

If there is no other way to do it, I will have to learn how to use databases in Python. I would prefer to be able to use the same type of scripts with data of all sizes, though.
Re: very large dictionary
Simon Strobl:
> I had a file bigrams.py with a content like below:
>
> bigrams = {
>     ',djy' : 75,
>     ',djz' : 57,
>     ',djzoom' : 165,
>     ',dk' : 28893,
>     ',dk.au' : 854,
>     ',dk.b.' : 3668,
>     ...
> }
>
> In another file I said:
>
> from bigrams import bigrams

Probably there's a limit in the module size here. You can try to change your data format on disk, creating a text file like this:

',djy' 75
',djz' 57
',djzoom' 165
...

Then in a module you can create an empty dict and read the lines of the data with:

for line in somefile:
    part, n = line.rsplit(' ', 1)
    somedict[part.strip("'")] = int(n)

Otherwise you may have to use a BigTable, a DB, etc.

> If there is no other way to do it, I will have to learn how to use
> databases in Python. I would prefer to be able to use the same type of
> scripts with data of all sizes, though.

I understand. I don't know if there are documented limits for the dicts of the 64-bit Python.

Bye,
bearophile
Re: very large dictionary
Simon Strobl [EMAIL PROTECTED] wrote:
> I tried to load a 6.8G large dictionary on a server that has 128G of
> memory. I got a memory error. I used Python 2.5.2. How can I load my
> data?

Let's just eliminate one thing here: this server is running a 64-bit OS, isn't it? Because if it's a 32-bit OS, the blunt answer is "You can't, no matter how much physical memory you have" and you're going to have to go down the database route (or some approach which stores the mapping on disk and only loads items into memory on demand).

-- \S -- [EMAIL PROTECTED] -- http://www.chaos.org.uk/~sion/
"Frankly I have no feelings towards penguins one way or the other"
-- Arthur C. Clarke
her nu becomeþ se bera eadward ofdun hlæddre heafdes bæce bump bump bump
Re: very large dictionary
On Fri, 01 Aug 2008 14:47:17 +0100, Sion Arrowsmith wrote:
> Simon Strobl [EMAIL PROTECTED] wrote:
>> I tried to load a 6.8G large dictionary on a server that has 128G of
>> memory. I got a memory error. I used Python 2.5.2. How can I load my
>> data?
>
> Let's just eliminate one thing here: this server is running a 64-bit
> OS, isn't it? Because if it's a 32-bit OS, the blunt answer is "You
> can't, no matter how much physical memory you have" and you're going
> to have to go down the database route (or some approach which stores
> the mapping on disk and only loads items into memory on demand).

I very highly doubt he has 128GB of main memory and is running a 32-bit OS.
Re: very large dictionary
Simon Strobl wrote:
> Hello,
> I tried to load a 6.8G large dictionary on a server that has 128G of
> memory. I got a memory error. I used Python 2.5.2. How can I load my
> data?
> Simon

Take a look at the python bsddb module. Using btree tables is fast, and it has the benefit that once the table is open, the programming interface is identical to a normal dictionary.

http://docs.python.org/lib/bsddb-objects.html

Sean