Re: very large dictionary

2008-08-06 Thread Bruno Desthuilliers

Simon Strobl wrote:
(snip)
 I would prefer to be able to use the same type of

scripts with data of all sizes, though.


Since computers have limited RAM, this will remain a wish. You 
obviously can't expect to deal with terabytes of data the way you do 
with a 1kb text file.

--
http://mail.python.org/mailman/listinfo/python-list


Re: very large dictionary

2008-08-06 Thread Jake Anderson

Bruno Desthuilliers wrote:

Simon Strobl wrote:
(snip)
 I would prefer to be able to use the same type of

scripts with data of all sizes, though.


Since computers have limited RAM, this will remain a wish. You 
obviously can't expect to deal with terabytes of data the way you do 
with a 1kb text file.

--
http://mail.python.org/mailman/listinfo/python-list

You can, you just start off handling the multi-GB case and you're set.
Databases are really easy; I often use them for manipulating pretty 
small amounts of data because it's just an easy way to group and join etc.


--
http://mail.python.org/mailman/listinfo/python-list


Re: very large dictionary

2008-08-05 Thread Simon Strobl
 Have you considered that the operating system imposes per-process limits
 on memory usage? You say that your server has 128 GB of memory, but that
 doesn't mean the OS will make anything like that available.

According to our system administrator, I can use all of the 128G.

  I thought it would be practical not to create the
  dictionary from a text file each time I needed it. I.e. I thought
  loading the .pyc-file should be faster. Yet, Python failed to create a
  .pyc-file

 Probably a good example of premature optimization.

Well, as I was using Python, I did not expect to have to care about
the language's internal affairs that much. I thought I could simply
always do the same thing no matter how large my files get. In other
words, I thought Python was really scalable.

 Out of curiosity, how
 long does it take to create it from a text file?

I do not remember this exactly. But I think it was not much more than
an hour.



--
http://mail.python.org/mailman/listinfo/python-list


Re: very large dictionary

2008-08-05 Thread Steven D'Aprano
On Tue, 05 Aug 2008 01:20:08 -0700, Simon Strobl wrote:


  I thought it would be practical not to create the dictionary from a
  text file each time I needed it. I.e. I thought loading the .pyc-file
  should be faster. Yet, Python failed to create a .pyc-file

 Probably a good example of premature optimization.
 
 Well, as I was using Python, I did not expect to have to care about the
 language's internal affairs that much. I thought I could simply always
 do the same thing no matter how large my files get. In other words, I
 thought Python was really scalable.

Yeah, it really is a pain when abstractions leak.

http://www.joelonsoftware.com/articles/LeakyAbstractions.html


 Out of curiosity, how
 long does it take to create it from a text file?
 
 I do not remember this exactly. But I think it was not much more than an
 hour.

Hmmm... longer than I expected. Perhaps not as premature as I thought. 
Have you tried the performance of the pickle and marshal modules?
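
Something like this is what I have in mind -- a minimal sketch (the 
filename is mine, and it assumes the dict fits in memory once built):

import cPickle as pickle

# one-off: build `bigrams` from the text file as usual, then dump it
f = open('bigrams.pkl', 'wb')
pickle.dump(bigrams, f, pickle.HIGHEST_PROTOCOL)
f.close()

# every later run: reload the pickle, skipping the parse step
f = open('bigrams.pkl', 'rb')
bigrams = pickle.load(f)
f.close()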



-- 
Steven
--
http://mail.python.org/mailman/listinfo/python-list


Re: very large dictionary

2008-08-05 Thread Enrico Franchi
Simon Strobl [EMAIL PROTECTED] wrote:

 Well, as I was using Python, I did not expect to have to care about
 the language's internal affairs that much. I thought I could simply
 always do the same thing no matter how large my files get. In other
 words, I thought Python was really scalable.

It's not Python here. It's just how computers work. IMHO having a
gargantuan dictionary in memory is not a good idea (unless explicitly
proven otherwise): this is the kind of job databases have been created
for. 

Besides, this is not a matter of Python. If you were using C or another
language, I would have suggested using databases in order to manipulate
GBs of data. Luckily enough, using databases in Python is far easier
than doing so in C/C++/Java. *And* there are thin abstractions over
databases so you don't even need to know how to use them (though I
suggest that you *do* learn something about DBs and especially
relational DBs; SQL is not *that* bad). 
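
For instance, a minimal sketch using the sqlite3 module bundled with
Python 2.5+ (the table layout and filename here are my illustrative
assumptions, not the OP's format):

import sqlite3

conn = sqlite3.connect('bigrams.db')
conn.execute('CREATE TABLE IF NOT EXISTS bigrams (key TEXT PRIMARY KEY, n INTEGER)')
conn.execute('INSERT INTO bigrams VALUES (?, ?)', (', djy', 75))
conn.commit()

# lookups go to disk instead of a giant in-memory dict
row = conn.execute('SELECT n FROM bigrams WHERE key = ?', (', djy',)).fetchone()
print row[0]   # -> 75
conn.close()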


-- 
-riko
--
http://mail.python.org/mailman/listinfo/python-list


Re: very large dictionary

2008-08-05 Thread Gabriel Genellina
On Mon, 04 Aug 2008 11:02:16 -0300, Simon Strobl [EMAIL PROTECTED]  
wrote:



I created a Python file that contained the dictionary. The size of
this file was 6.8GB. I thought it would be practical not to create the
dictionary from a text file each time I needed it. I.e. I thought
loading the .pyc-file should be faster. Yet, Python failed to create
a .pyc-file


Looks like the marshal format (used to create the .pyc file) can't handle  
sizes that big - and that limitation will stay for a while:

http://mail.python.org/pipermail/python-dev/2007-May/073161.html
So follow any of the previous suggestions and store your dictionary as  
data, not code.


--
Gabriel Genellina

--
http://mail.python.org/mailman/listinfo/python-list


Re: very large dictionary

2008-08-05 Thread Terry Reedy



Simon Strobl wrote:


Well, as I was using Python, I did not expect to have to care about
the language's internal affairs that much. I thought I could simply
always do the same thing no matter how large my files get. In other
words, I thought Python was really scalable.


Python the language is indefinitely scalable.  Finite implementations 
are not.  CPython is a C program compiled to a system executable.  Most 
OSes run executables with a fairly limited call stack space.


CPython programs are, when possible, cached as .pyc files.  The 
existence and format of .pyc's is an internal affair of the CPython 
implementation.  They are most definitely not a language requirement or 
language feature.


Have you tried feeding multigigabytes source code files to other 
compilers?  Most, if not all, could be broken by the 'right' big-enough 
code.


tjr

--
http://mail.python.org/mailman/listinfo/python-list


Re: very large dictionary

2008-08-04 Thread Simon Strobl
On 4 Aug., 00:51, Avinash Vora [EMAIL PROTECTED] wrote:
 On Aug 4, 2008, at 4:12 AM, Jörgen Grahn wrote:

  (You might want to post this to comp.lang.python rather than to me --
  I am just another c.l.p reader.  If you already have done so, please
  disregard this.)

 Yeah, I hit reply by mistake and didn't realize it.  My bad.

  (I assume here that Berkeley DB supports 7GB data sets.)

  If I remember correctly, BerkeleyDB is limited to a single file size
  of 2GB.

  Sounds likely.  But with some luck maybe they have increased this in
  later releases?  There seem to be many competing Berkeley releases.

 It's worth investigating, but that leads me to:

  I haven't caught the earlier parts of this thread, but do I
  understand correctly that someone wants to load a 7GB dataset into
  the form of a dictionary?

  Yes, he claimed the dictionary was 6.8 GB.  How he measured that, I
  don't know.

 To the OP: how did you measure this?

I created a Python file that contained the dictionary. The size of
this file was 6.8GB. I thought it would be practical not to create the
dictionary from a text file each time I needed it. I.e. I thought
loading the .pyc-file should be faster. Yet, Python failed to create
a .pyc-file

Simon
--
http://mail.python.org/mailman/listinfo/python-list


Re: very large dictionary

2008-08-04 Thread Steven D'Aprano
On Mon, 04 Aug 2008 07:02:16 -0700, Simon Strobl wrote:

 I created a Python file that contained the dictionary. The size of this
 file was 6.8GB. 

Ah, that's what I thought you had done. That's not a dictionary. That's a 
text file containing the Python code to create a dictionary.

My guess is that a 7GB text file will require significantly more memory 
once converted to an actual dictionary: in my earlier post, I estimated 
about 5GB for pointers. Total size of the dictionary is impossible to 
estimate accurately without more information, but I'd guess that 10GB or  
20GB wouldn't be unreasonable.
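
(For scale: each entry slot in a 64-bit CPython dict is 24 bytes -- a 
cached hash plus key and value pointers -- so the ~5GB of pointer 
overhead I estimated corresponds to on the order of 200 million 
entries. That count is my extrapolation, not a figure from the OP, and 
it ignores the dict's over-allocation and the string and int objects 
themselves.)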

Have you considered that the operating system imposes per-process limits 
on memory usage? You say that your server has 128 GB of memory, but that 
doesn't mean the OS will make anything like that available.
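
One quick sanity check from inside Python, assuming a Unix-like server 
(RLIMIT_AS is the address-space cap; -1 means unlimited):

import resource

soft, hard = resource.getrlimit(resource.RLIMIT_AS)
print soft, hard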

And I don't know how to even start estimating how much temporary memory 
is required to parse and build such an enormous Python program. Not only 
is it a 7GB program, but it is 7GB in one statement.


 I thought it would be practical not to create the
 dictionary from a text file each time I needed it. I.e. I thought
 loading the .pyc-file should be faster. Yet, Python failed to create a
 .pyc-file

Probably a good example of premature optimization. Out of curiosity, how 
long does it take to create it from a text file?



-- 
Steven
--
http://mail.python.org/mailman/listinfo/python-list


Re: very large dictionary

2008-08-03 Thread Jorgen Grahn
On Fri, 1 Aug 2008 01:05:07 -0700 (PDT), Simon Strobl [EMAIL PROTECTED] wrote:
 What does "load a dictionary" mean?

 I had a file "bigrams.py" with content like below:

 bigrams = {
 ", djy" : 75 ,
 ", djz" : 57 ,
 ", djzoom" : 165 ,
 ", dk" : 28893 ,
 ", dk.au" : 854 ,
 ", dk.b." : 3668 ,
 ...

 }

 In another file I said:

 from bigrams import bigrams

 How about using a database instead of a dictionary?

 If there is no other way to do it, I will have to learn how to use
 databases in Python.

If you use Berkeley DB (import bsddb), you don't have to learn much.
These databases look very much like dictionaries string:string, only
they are disk-backed.

(I assume here that Berkeley DB supports 7GB data sets.)
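
A minimal sketch of what that looks like (Python 2.x; the filename is 
just an example of mine):

import bsddb

db = bsddb.btopen('bigrams.db', 'c')   # btree file; 'c' creates it if missing
db[', djy'] = '75'                     # keys and values must both be strings
print int(db[', djy'])                 # convert back on the way out
db.close()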

/Jorgen

-- 
  // Jorgen Grahn grahn@Ph'nglui mglw'nafh Cthulhu
\X/ snipabacken.se  R'lyeh wgah'nagl fhtagn!
--
http://mail.python.org/mailman/listinfo/python-list


Re: very large dictionary

2008-08-03 Thread Jorgen Grahn
On 3 Aug 2008 20:36:33 GMT, Jorgen Grahn [EMAIL PROTECTED] wrote:
 On Fri, 1 Aug 2008 01:05:07 -0700 (PDT), Simon Strobl [EMAIL PROTECTED] 
 wrote:
...
 If there is no other way to do it, I will have to learn how to use
 databases in Python.

 If you use Berkeley DB (import bsddb), you don't have to learn much.
 These databases look very much like dictionaries string:string, only
 they are disk-backed.

... all of which Sean pointed out elsewhere in the thread.

Oh well. I guess pointing it out twice doesn't hurt.  bsddb has been
very pleasant to work with for me. I normally avoid database
programming like the plague.

/Jorgen

-- 
  // Jorgen Grahn grahn@Ph'nglui mglw'nafh Cthulhu
\X/ snipabacken.se  R'lyeh wgah'nagl fhtagn!
--
http://mail.python.org/mailman/listinfo/python-list


Re: very large dictionary

2008-08-03 Thread member thudfoo
On 3 Aug 2008 20:40:02 GMT, Jorgen Grahn [EMAIL PROTECTED] wrote:
 On 3 Aug 2008 20:36:33 GMT, Jorgen Grahn [EMAIL PROTECTED] wrote:
   On Fri, 1 Aug 2008 01:05:07 -0700 (PDT), Simon Strobl [EMAIL PROTECTED] 
 wrote:

 ...

   If there is no other way to do it, I will have to learn how to use
   databases in Python.

   If you use Berkeley DB (import bsddb), you don't have to learn much.
   These databases look very much like dictionaries string:string, only
   they are disk-backed.


 ... all of which Sean pointed out elsewhere in the thread.

  Oh well. I guess pointing it out twice doesn't hurt.  bsddb has been
  very pleasant to work with for me. I normally avoid database
  programming like the plague.



13.4 shelve -- Python object persistence

 A ``shelf'' is a persistent, dictionary-like object. The difference
with ``dbm'' databases is that the values (not the keys!) in a shelf
can be essentially arbitrary Python objects -- anything that the
pickle module can handle. This includes most class instances,
recursive data types, and objects containing lots of shared
sub-objects. The keys are ordinary strings.

[...]
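
A minimal sketch of shelve in use (the filename is illustrative):

import shelve

shelf = shelve.open('bigrams.shelf')
shelf[', djy'] = 75        # string keys; values can be any picklable object
shelf.close()

shelf = shelve.open('bigrams.shelf')
print shelf[', djy']       # -> 75
shelf.close()
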
--
http://mail.python.org/mailman/listinfo/python-list


Re: very large dictionary

2008-08-03 Thread Avinash Vora


On Aug 4, 2008, at 4:12 AM, Jörgen Grahn wrote:


(You might want to post this to comp.lang.python rather than to me --
I am just another c.l.p reader.  If you already have done so, please
disregard this.)


Yeah, I hit reply by mistake and didn't realize it.  My bad.


(I assume here that Berkeley DB supports 7GB data sets.)


If I remember correctly, BerkeleyDB is limited to a single file size
of 2GB.


Sounds likely.  But with some luck maybe they have increased this in
later releases?  There seem to be many competing Berkeley releases.


It's worth investigating, but that leads me to:


I haven't caught the earlier parts of this thread, but do I
understand correctly that someone wants to load a 7GB dataset into
the form of a dictionary?


Yes, he claimed the dictionary was 6.8 GB.  How he measured that, I
don't know.



To the OP: how did you measure this?

--
Avi

--
http://mail.python.org/mailman/listinfo/python-list


Re: very large dictionary

2008-08-02 Thread Steven D'Aprano
On Fri, 01 Aug 2008 00:46:09 -0700, Simon Strobl wrote:

 Hello,
 
 I tried to load a 6.8G large dictionary on a server that has 128G of
 memory. I got a memory error. I used Python 2.5.2. How can I load my
 data?

How do you know the dictionary takes 6.8G?

I'm going to guess an answer to my own question. In a later post, Simon 
wrote:

[quote]
I had a file "bigrams.py" with content like below:

bigrams = {
", djy" : 75 ,
", djz" : 57 ,
", djzoom" : 165 ,
", dk" : 28893 ,
", dk.au" : 854 ,
", dk.b." : 3668 ,
...

}
[end quote]


I'm guessing that the file is 6.8G of *text*. How much memory will it 
take to import that? I don't know, but probably a lot more than 6.8G. The 
compiler has to read the whole file in one giant piece, analyze it, 
create all the string and int objects, and only then can it create the 
dict. By my back-of-the-envelope calculations, the pointers alone will 
require about 5GB, nevermind the objects they point to.

I suggest trying to store your data as data, not as Python code. Create a 
text file "bigrams.txt" with one key/value per line, like this:

djy : 75
djz : 57
djzoom : 165
dk : 28893
...

Then import it like such:

bigrams = {}
for line in open('bigrams.txt', 'r'):
    key, value = line.split(':')
    bigrams[key.strip()] = int(value.strip())


This will be slower, but because it only needs to read the data one line 
at a time, it might succeed where trying to slurp all 6.8G in one piece 
will fail.



-- 
Steven
--
http://mail.python.org/mailman/listinfo/python-list


very large dictionary

2008-08-01 Thread Simon Strobl
Hello,

I tried to load a 6.8G large dictionary on a server that has 128G of
memory. I got a memory error. I used Python 2.5.2. How can I load my
data?

Simon
--
http://mail.python.org/mailman/listinfo/python-list


Re: very large dictionary

2008-08-01 Thread Marc 'BlackJack' Rintsch
On Fri, 01 Aug 2008 00:46:09 -0700, Simon Strobl wrote:

 I tried to load a 6.8G large dictionary on a server that has 128G of
 memory. I got a memory error. I used Python 2.5.2. How can I load my
 data?

What does "load a dictionary" mean?  Was it saved with the `pickle` 
module?

How about using a database instead of a dictionary?

Ciao,
Marc 'BlackJack' Rintsch
--
http://mail.python.org/mailman/listinfo/python-list


Re: very large dictionary

2008-08-01 Thread Simon Strobl
 What does "load a dictionary" mean?

I had a file "bigrams.py" with content like below:

bigrams = {
", djy" : 75 ,
", djz" : 57 ,
", djzoom" : 165 ,
", dk" : 28893 ,
", dk.au" : 854 ,
", dk.b." : 3668 ,
...

}

In another file I said:

from bigrams import bigrams

 How about using a database instead of a dictionary?

If there is no other way to do it, I will have to learn how to use
databases in Python. I would prefer to be able to use the same type of
scripts with data of all sizes, though.
--
http://mail.python.org/mailman/listinfo/python-list


Re: very large dictionary

2008-08-01 Thread bearophileHUGS
Simon Strobl:
 I had a file "bigrams.py" with content like below:
 bigrams = {
 ", djy" : 75 ,
 ", djz" : 57 ,
 ", djzoom" : 165 ,
 ", dk" : 28893 ,
 ", dk.au" : 854 ,
 ", dk.b." : 3668 ,
 ...
 }
 In another file I said:
 from bigrams import bigrams

Probably there's a limit on module size here. You can try to
change your data format on disk, creating a text file like this:
", djy" 75
", djz" 57
", djzoom" 165
...
Then in a module you can create an empty dict and read the lines of the
data with:
for line in somefile:
    part, n = line.rsplit(" ", 1)
    somedict[part.strip('"')] = int(n)

Otherwise you may have to use a BigTable, a DB, etc.


 If there is no other way to do it, I will have to learn how to use
 databases in Python. I would prefer to be able to use the same type of
 scripts with data of all sizes, though.

I understand; I don't know if there are documented limits for the
dicts of 64-bit Python.

Bye,
bearophile
--
http://mail.python.org/mailman/listinfo/python-list


Re: very large dictionary

2008-08-01 Thread Sion Arrowsmith
Simon Strobl  [EMAIL PROTECTED] wrote:
I tried to load a 6.8G large dictionary on a server that has 128G of
memory. I got a memory error. I used Python 2.5.2. How can I load my
data?

Let's just eliminate one thing here: this server is running a
64-bit OS, isn't it? Because if it's a 32-bit OS, the blunt
answer is "you can't, no matter how much physical memory you
have" (a 32-bit process can address at most 4GB, and in practice
gets 2-3GB of that) and you're going to have to go down the
database route (or some approach which stores the mapping on
disk and only loads items into memory on demand).
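
A quick way to check from within Python (one of several; this one
infers it from the size of a C pointer):

import struct
print struct.calcsize('P') * 8   # prints 32 or 64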

-- 
\S -- [EMAIL PROTECTED] -- http://www.chaos.org.uk/~sion/
   Frankly I have no feelings towards penguins one way or the other
-- Arthur C. Clarke
   her nu becomeþ se bera eadward ofdun hlæddre heafdes bæce bump bump bump
--
http://mail.python.org/mailman/listinfo/python-list

Re: very large dictionary

2008-08-01 Thread Raja Baz
On Fri, 01 Aug 2008 14:47:17 +0100, Sion Arrowsmith wrote:

 Simon Strobl  [EMAIL PROTECTED] wrote:
I tried to load a 6.8G large dictionary on a server that has 128G of
memory. I got a memory error. I used Python 2.5.2. How can I load my
data?
 
 Let's just eliminate one thing here: this server is running a 64-bit OS,
 isn't it? Because if it's a 32-bit OS, the blunt answer is "you can't,
 no matter how much physical memory you have" and you're going to have to
 go down the database route (or some approach which stores the mapping on
 disk and only loads items into memory on demand).

I very highly doubt he has 128GB of main memory and is running a 32-bit OS.
--
http://mail.python.org/mailman/listinfo/python-list


Re: very large dictionary

2008-08-01 Thread Sean

Simon Strobl wrote:

Hello,

I tried to load a 6.8G large dictionary on a server that has 128G of
memory. I got a memory error. I used Python 2.5.2. How can I load my
data?

Simon


Take a look at the Python bsddb module.  Using btree tables is fast, and 
it has the benefit that once the table is open, the programming interface 
is identical to a normal dictionary.


http://docs.python.org/lib/bsddb-objects.html

Sean
--
http://mail.python.org/mailman/listinfo/python-list