Re: splitting a large dictionary into smaller ones

2009-03-23 Thread Steven D'Aprano
On Sun, 22 Mar 2009 23:10:21 -0400, Terry Reedy wrote:

 Searching for a key in, say, 10 dicts will be slower than searching for
 it in just one.  The only reason I would do this would be if the dict
 had to be split, say over several machines.  But then, you could query
 them in parallel.

That surely depends on the number of collisions? With millions of keys, 
the number of collisions could be high (although it almost certainly 
won't be). As I understand it, searching a dict is O(1) on average, but 
if there are collisions it can degrade to O(m) (where m is the number of 
items colliding for some hash). In principle, m could be large.

Hypothetically speaking, if you could predict where collisions occur and 
split your data into multiple smaller dicts with fewer collisions, you 
could delegate to one of the smaller dicts and reduce O(m) lookups to 
2*O(1) lookups, which of course is just O(1).

Of course, this entirely assumes that collisions in Python's hash 
function are both predictable and frequent in the OP's data. If they're 
not both, then it's unlikely that this suggestion will be of practical 
use. Could be an interesting experiment though :)
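
For the curious, here is a minimal sketch of the delegation mechanics only (the hash(key) % n routing below is purely illustrative; the collision-aware split described above would need a smarter routing function, which is exactly the part that would have to be predicted):

  class BucketedDict(object):
      """Delegate each key to one of several smaller dicts."""
      def __init__(self, num_buckets=16):
          self.buckets = [{} for _ in range(num_buckets)]

      def _bucket(self, key):
          # first O(1) step: choose a sub-dict for this key
          return self.buckets[hash(key) % len(self.buckets)]

      def __getitem__(self, key):
          # second O(1) step: ordinary lookup inside the sub-dict
          return self._bucket(key)[key]

      def __setitem__(self, key, value):
          self._bucket(key)[key] = value

  d = BucketedDict()
  d['spam'] = 42
  assert d['spam'] == 42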



-- 
Steven
--
http://mail.python.org/mailman/listinfo/python-list


Re: splitting a large dictionary into smaller ones

2009-03-23 Thread John Machin
On Mar 23, 1:32 pm, per perfr...@gmail.com wrote:
 hi all,

 i have a very large dictionary object that is built from a text file
 that is about 800 MB -- it contains several million keys.  ideally i
 would like to pickle this object so that i wouldn't have to parse this
 large file to compute the dictionary every time i run my program.
 however currently the pickled file is over 300 MB and takes a very
 long time to write to disk - even longer than recomputing the
 dictionary from scratch.

What version of Python are you using? If 2.X, are you using pickle or
cPickle? On what platform? And what *exactly* is the code you are
using to call *ickle.dump() and *ickle.load()?

What datatype(s) are the keys? the values? If str/unicode, what are
average lengths?
--
http://mail.python.org/mailman/listinfo/python-list


Re: splitting a large dictionary into smaller ones

2009-03-23 Thread Dave Angel
As others have said, a database is probably the right answer.  There, 
the data is kept on disk, and only a few records at a time are read for 
each access, with modification transactions usually being synchronous.


However, there are use cases where your approach makes more sense.  And 
it shouldn't be too hard to do what you describe.  So it's probably 
worth implementing.


Use case:  suppose most accesses are to a 'local' set of keys, in some 
sense.  For example, the keys might be serial numbers, with the most 
recently created serial numbers likely to be accessed much more often 
than those a few years old.  Then if you split the serial numbers up in 
the most obvious way (using the high digits of the number, for example, 
to index which external file the data is in), you'd find nearly all 
accesses would go to the file with the highest index.


So you create a wrapper class with an init and two member methods, read 
and write, plus a flush() method.  read takes a key, and returns a 
value, while write takes both key and value, and saves it away.  The 
init method creates a list of dictionaries (initially of length zero) 
and saves some information about the location and naming of the actual 
pickle files (or whatever you're using).


When a read or write method is called, you do the divide (or whatever 
filter you use) to find an index into the list of dictionaries.  If this 
index is beyond the end of the list, extend the list with None till it's 
big enough.  Now, if the particular entry in the list is None, you need 
to read in the corresponding pickle file, and add the created dictionary 
object into this index of the list.


At this point, you read or write your dictionary as usual.  The only 
other tricky thing is to save a 'dirty bit' for each loaded dictionary.  
When your wrapper's flush() is called, you pickle each of the dirty 
dictionaries back to the file system.


Lots of possible variants.  For example, you could use a dictionary of 
dictionary objects instead of a list of dictionaries.  That would be 
useful if the list will be large, and sparse.


The main thing I don't know here is how to overload the [] operator, if 
you need that particular syntax for the read and write methods.
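
A minimal sketch of that wrapper (Python 2.6-style, assuming non-negative integer keys and the divide-by-a-constant filter described above; the directory argument, file names, and the 100,000-keys-per-file figure are all just illustrative).  It also shows one answer to the [] question: __getitem__ and __setitem__ play the roles of read and write:

  import os
  import pickle

  class SplitDict(object):
      """Sub-dicts pickled to separate files, loaded lazily,
      re-pickled only when dirty."""

      def __init__(self, directory, keys_per_file=100000):
          self.directory = directory
          self.keys_per_file = keys_per_file
          self.parts = []    # sub-dicts; None means "not loaded yet"
          self.dirty = []    # parallel list of dirty bits

      def _filename(self, index):
          return os.path.join(self.directory, 'part%04d.pkl' % index)

      def _load(self, key):
          index = key // self.keys_per_file    # the "divide" filter
          while len(self.parts) <= index:      # grow the list as needed
              self.parts.append(None)
              self.dirty.append(False)
          if self.parts[index] is None:
              try:                             # lazily unpickle the sub-dict
                  with open(self._filename(index), 'rb') as f:
                      self.parts[index] = pickle.load(f)
              except IOError:                  # no file yet: start empty
                  self.parts[index] = {}
          return index

      def __getitem__(self, key):              # the "read" method
          return self.parts[self._load(key)][key]

      def __setitem__(self, key, value):       # the "write" method
          index = self._load(key)
          self.parts[index][key] = value
          self.dirty[index] = True

      def flush(self):
          # only the dirty sub-dicts get re-pickled
          for index, part in enumerate(self.parts):
              if part is not None and self.dirty[index]:
                  with open(self._filename(index), 'wb') as f:
                      pickle.dump(part, f, pickle.HIGHEST_PROTOCOL)
                  self.dirty[index] = False

Usage would then look something like sd = SplitDict('some_dir'); sd[1234567] = 'value'; ...; sd.flush().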


per wrote:

hi all,

i have a very large dictionary object that is built from a text file
that is about 800 MB -- it contains several million keys.  ideally i
would like to pickle this object so that i wouldn't have to parse this
large file to compute the dictionary every time i run my program.
however currently the pickled file is over 300 MB and takes a very
long time to write to disk - even longer than recomputing the
dictionary from scratch.

i would like to split the dictionary into smaller ones, containing
only hundreds of thousands of keys, and then try to pickle them. is
there a way to easily do this? i.e. is there an easy way to make a
wrapper for this such that i can access this dictionary as just one
object, but underneath it's split into several? so that i can write
my_dict[k] and get a value, or set my_dict[m] to some value without
knowing which sub dictionary it's in.

if there aren't known ways to do this, i would greatly appreciate any
advice/examples on how to write this data structure from scratch,
reusing as much of the dict() class as possible.

thanks.

large_dict[a]

  

--
http://mail.python.org/mailman/listinfo/python-list


Re: splitting a large dictionary into smaller ones

2009-03-23 Thread Tim Chase

i have a very large dictionary object that is built from a text file
that is about 800 MB -- it contains several million keys.  ideally i
would like to pickle this object so that i wouldn't have to parse this
large file to compute the dictionary every time i run my program.
however currently the pickled file is over 300 MB and takes a very
long time to write to disk - even longer than recomputing the
dictionary from scratch.

i would like to split the dictionary into smaller ones, containing
only hundreds of thousands of keys, and then try to pickle them. is
there a way to easily do this? 


While others have suggested databases, they may be a bit 
overkill, depending on your needs.  Python 2.5+ ships the sqlite3 
module, but even older versions (at least back to 2.0) offer the 
anydbm module (changed to dbm in 3.0), which lets you create an 
on-disk string-to-string dictionary:


  import anydbm
  db = anydbm.open("data.db", "c")

  # populate some data
  # using db as your dictionary
  import csv
  f = file("800megs.txt")
  data = csv.reader(f, delimiter='\t')
  data.next()  # discard a header row
  for key, value in data:
      db[key] = value
  f.close()
  print db["some key"]

  db.close()

The resulting DB object is a little sparsely documented, but for 
the most part it can be treated like a dictionary.  The advantage 
is that, if the source data doesn't change, you can parse once 
and then just use your data.db file from there out.
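
A small sketch of what a later run might look like, once data.db exists (same file and key as above):

  import anydbm
  db = anydbm.open("data.db", "r")   # open the existing file read-only
  print db["some key"]
  db.close()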


-tkc







--
http://mail.python.org/mailman/listinfo/python-list


Re: splitting a large dictionary into smaller ones

2009-03-23 Thread Steve Holden
per wrote:
 hi all,
 
 i have a very large dictionary object that is built from a text file
 that is about 800 MB -- it contains several million keys.  ideally i
 would like to pickle this object so that i wouldn't have to parse this
 large file to compute the dictionary every time i run my program.
 however currently the pickled file is over 300 MB and takes a very
 long time to write to disk - even longer than recomputing the
 dictionary from scratch.
 
 i would like to split the dictionary into smaller ones, containing
 only hundreds of thousands of keys, and then try to pickle them. is
 there a way to easily do this? i.e. is there an easy way to make a
 wrapper for this such that i can access this dictionary as just one
 object, but underneath it's split into several? so that i can write
 my_dict[k] and get a value, or set my_dict[m] to some value without
 knowing which sub dictionary it's in.
 
 if there aren't known ways to do this, i would greatly appreciate any
 advice/examples on how to write this data structure from scratch,
 reusing as much of the dict() class as possible.
 
You aren't by any chance running this on Python 3.0, are you? The I/O
implementation for that release is known to be slow, and that would
affect pickle dump/load performance.

regards
 Steve
-- 
Steve Holden   +1 571 484 6266   +1 800 494 3119
Holden Web LLC http://www.holdenweb.com/
Want to know? Come to PyCon - soon! http://us.pycon.org/

--
http://mail.python.org/mailman/listinfo/python-list


Re: splitting a large dictionary into smaller ones

2009-03-22 Thread Paul Rubin
per perfr...@gmail.com writes:
 i would like to split the dictionary into smaller ones, containing
 only hundreds of thousands of keys, and then try to pickle them. 

That already sounds like the wrong approach.  You want a database.
--
http://mail.python.org/mailman/listinfo/python-list


Re: splitting a large dictionary into smaller ones

2009-03-22 Thread odeits
On Mar 22, 7:32 pm, per perfr...@gmail.com wrote:
 hi all,

 i have a very large dictionary object that is built from a text file
 that is about 800 MB -- it contains several million keys.  ideally i
 would like to pickle this object so that i wouldn't have to parse this
 large file to compute the dictionary every time i run my program.
 however currently the pickled file is over 300 MB and takes a very
 long time to write to disk - even longer than recomputing the
 dictionary from scratch.

 i would like to split the dictionary into smaller ones, containing
 only hundreds of thousands of keys, and then try to pickle them. is
 there a way to easily do this? i.e. is there an easy way to make a
 wrapper for this such that i can access this dictionary as just one
 object, but underneath it's split into several? so that i can write
 my_dict[k] and get a value, or set my_dict[m] to some value without
 knowing which sub dictionary it's in.

 if there aren't known ways to do this, i would greatly appreciate any
 advice/examples on how to write this data structure from scratch,
 reusing as much of the dict() class as possible.

 thanks.

 large_dict[a]

So that I understand the goal, the reason you wish to split the
dictionary into smaller ones is so that you don't have to write all of
the dictionaries to disk if they haven't changed? Or are you trying to
speed up the initial load time?

If you are trying to speed up the initial load time, I don't think this
will help. If the save time is what you are after, maybe you want to
check out memory-mapped files.

--
http://mail.python.org/mailman/listinfo/python-list


Re: splitting a large dictionary into smaller ones

2009-03-22 Thread per
On Mar 22, 10:51 pm, Paul Rubin http://phr...@nospam.invalid wrote:
 per perfr...@gmail.com writes:
  i would like to split the dictionary into smaller ones, containing
  only hundreds of thousands of keys, and then try to pickle them.

 That already sounds like the wrong approach.  You want a database.

fair enough - what native python database would you recommend? i
prefer not to install anything commercial or anything other than
python modules
--
http://mail.python.org/mailman/listinfo/python-list


Re: splitting a large dictionary into smaller ones

2009-03-22 Thread Armin
On Monday 23 March 2009 00:01:40 per wrote:
 On Mar 22, 10:51 pm, Paul Rubin http://phr...@nospam.invalid wrote:
  per perfr...@gmail.com writes:
   i would like to split the dictionary into smaller ones, containing
   only hundreds of thousands of keys, and then try to pickle them.
 
  That already sounds like the wrong approach.  You want a database.

 fair enough - what native python database would you recommend? i
 prefer not to install anything commercial or anything other than
 python modules
 --
 http://mail.python.org/mailman/listinfo/python-list

sqlite3 module -- read more about it in the Python documentation.
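
For example, a minimal sketch of using it as an on-disk key/value store (the file name, table name, and column names below are just illustrative):

  import sqlite3

  conn = sqlite3.connect('data.sqlite')
  conn.execute('CREATE TABLE IF NOT EXISTS kv '
               '(key TEXT PRIMARY KEY, value TEXT)')

  # insert or update one pair
  conn.execute('INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)',
               ('some key', 'some value'))
  conn.commit()

  # look one up
  row = conn.execute('SELECT value FROM kv WHERE key = ?',
                     ('some key',)).fetchone()
  print(row[0] if row else None)

  conn.close()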

-- 
Armin Moradi
--
http://mail.python.org/mailman/listinfo/python-list


Re: splitting a large dictionary into smaller ones

2009-03-22 Thread Paul Rubin
per perfr...@gmail.com writes:
 fair enough - what native python database would you recommend? i
 prefer not to install anything commercial or anything other than
 python modules

I think sqlite is the preferred one these days.
--
http://mail.python.org/mailman/listinfo/python-list


Re: splitting a large dictionary into smaller ones

2009-03-22 Thread Terry Reedy

per wrote:

> hi all,
>
> i have a very large dictionary object that is built from a text file
> that is about 800 MB -- it contains several million keys.  ideally i
> would like to pickle this object so that i wouldn't have to parse this
> large file to compute the dictionary every time i run my program.
> however currently the pickled file is over 300 MB and takes a very
> long time to write to disk - even longer than recomputing the
> dictionary from scratch.

But you only write it once.  How does the read and reconstruct time 
compare to the recompute time?

> i would like to split the dictionary into smaller ones, containing
> only hundreds of thousands of keys, and then try to pickle them.

Do you have any evidence that this would really be faster?

> is there a way to easily do this? i.e. is there an easy way to make a
> wrapper for this such that i can access this dictionary as just one
> object, but underneath it's split into several? so that i can write
> my_dict[k] and get a value, or set my_dict[m] to some value without
> knowing which sub dictionary it's in.

Searching for a key in, say, 10 dicts will be slower than searching for 
it in just one.  The only reason I would do this would be if the dict 
had to be split, say over several machines.  But then, you could query 
them in parallel.

> if there aren't known ways to do this, i would greatly appreciate any
> advice/examples on how to write this data structure from scratch,
> reusing as much of the dict() class as possible.


Terry Jan Reedy

--
http://mail.python.org/mailman/listinfo/python-list