Re: Strategy for determing difference between 2 very large dictionaries
On Dec 24, 7:04 pm, "Malcolm Greene" wrote: > Hi Roger, > > By very large dictionary, I mean about 25M items per dictionary. Each > item is a simple integer whose value will never exceed 2^15. In Python-speak about dictionaries, an "item" is a tuple (key, value). I presume from the gross difference between 25M and 2**15 -- more pedantry: 2^15 is 13 :-) -- that you mean that the values satisfy 0 <= value <= 2**15. > > I populate these dictionaries by parsing very large ASCII text files > containing detailed manufacturing events. From each line in my log file > I construct one or more keys and increment the numeric values associated > with these keys using timing info also extracted from each line. > > Some of our log files are generated by separate monitoring equipment > measuring the same process. In theory, these log files should be > identical, but of course they are not. I'm looking for a way to > determine the differences between the 2 dictionaries I will create from > so-called matching sets of log files. You seem to refer to the dictionaries as part of your requirements, not as part of the solution. Do you *need* the two dictionaries after you have obtained the differences? Let's start from the beginning: You have two log files, each of which can be abstracted as containing lots of (k, v) data, where each k describes an event and each v is a compressed 15-bit timestamp. You want to know for each datum whether it is (a) in both files (b) only in file1 (c) only in file2. Is that a fair summary? If so, there's an alternative that doesn't need to use dictionaries. Each file can be represented by a *list* of 32768 sets. Each set contains the ks that happened at the corresponding time... sets[fileno] [v].add(k). Once you've built that, you trundle through the pairs of sets doing set differences etc. Bear in mind that Python dicts and sets grow as required by doubling (IIRC) in size and rehashing from old to new. Consequently you get periodical disturbances which are not usually noticed but may be noticeable with large amounts of data. HTH, John -- http://mail.python.org/mailman/listinfo/python-list
Re: Strategy for determing difference between 2 very large dictionaries
Marc 'BlackJack' Rintsch wrote: On Wed, 24 Dec 2008 03:23:00 -0500, python wrote: collection, I don't see the advantage of using an iterator or a list. I'm sure I'm missing a subtle point here :) `keys()` creates a list in memory, `iterkeys()` does not. With ``set(dict.keys())`` there is a point in time where the dictionary, the list, and the set co-exist in memory. With ``set(dict.iterkeys())`` only the set and the dictionary exist in memory. If you can, consider using 3.0 in which d.keys() is a set-like view of the keys of d. Same for d.values and d.items. The time and space to create such is O(1), I believe. since they are just alternate read-only interfaces to the internal dict storage. There is not even a separate set until you do something like intersect d1.keys() with d2.keys() or d1.keys() - d2.keys(). I think this is an under-appreciated feature of 3.0 which should be really useful with large dicts. tjr -- http://mail.python.org/mailman/listinfo/python-list
Re: Strategy for determing difference between 2 very large dictionaries
Gabriel Genellina wrote: En Wed, 24 Dec 2008 06:23:00 -0200, escribió: ... k1 = set(dict1.iterkeys()) You've got an excelent explanation from Marc Rintsch. (Note that in Python 3.0 keys() behaves as iterkeys() in previous versions, so the above code is supposed to be written in Python 2.x) And, in fact, a dictionary iterates its keys, so: k1 = set(dict1) works in 2.4, 2.5, 2.6, and 3.0 --Scott David Daniels scott.dani...@acm.org -- http://mail.python.org/mailman/listinfo/python-list
Re: Strategy for determing difference between 2 very large dictionaries
Peter Otten: > >>> a = dict(a=1, b=2, c=3) > >>> b = dict(b=2, c=30, d=4) > >>> dict(set(a.iteritems()) ^ set(b.iteritems())) For larger sets this may be better, may avoid the creation of the second set: dict(set(a.iteritems()).symmetric_difference(b.iteritems())) Bye, bearophile -- http://mail.python.org/mailman/listinfo/python-list
Re: Strategy for determing difference between 2 very large dictionaries
Hi Gabriel, > in Python 3.0 keys() behaves as iterkeys() in previous versions, so the above > code is supposed to be written in Python 2.x) I understand. Thank you. > note that dict comprehensions require Python 3.0 I'm relieved to know that I didn't miss that feature in my reading of Python's 2.5/2.6 documentation :) > You might use instead: > > dict((key,(dict1[key],dict2[key])) for key in ...) Excellent. Thank you. Regards, Malcolm - Original message - From: "Gabriel Genellina" To: python-list@python.org Date: Wed, 24 Dec 2008 07:10:16 -0200 Subject: Re: Strategy for determing difference between 2 very large dictionaries En Wed, 24 Dec 2008 06:23:00 -0200, escribió: > Hi Gabriel, > > Thank you very much for your feedback! > >> k1 = set(dict1.iterkeys()) > > I noticed you suggested .iterkeys() vs. .keys(). Is there any advantage > to using an iterator vs. a list as the basis for creating a set? I You've got an excelent explanation from Marc Rintsch. (Note that in Python 3.0 keys() behaves as iterkeys() in previous versions, so the above code is supposed to be written in Python 2.x) >>> can this last step be done via a simple list comprehension? > >> Yes; but isn't a dict comprehension more adequate? >> >> [key: (dict1[key], dict2[key]) for key in common_keys if >> dict1[key]!=dict2[key]} > > Cool!! I'm relatively new to Python and totally missed the ability to > work with dictionary comprehensions. Yes, your dictionary comprehension > technique is much better than the list comprehension approach I was > struggling with. Your dictionary comprehension statement describes > exactly what I wanted to write. This time, note that dict comprehensions require Python 3.0 -- so the code above won't work in Python 2.x. (It's not a good idea to mix both versions in the same post, sorry!) You might use instead: dict((key,(dict1[key],dict2[key])) for key in ...) but it's not as readable. -- Gabriel Genellina -- http://mail.python.org/mailman/listinfo/python-list -- http://mail.python.org/mailman/listinfo/python-list
Re: Strategy for determing difference between 2 very large dictionaries
Hi Peter, If you are not interested in the intermediate results and the dictionary values are hashable you can get the difference by >>> a = dict(a=1, b=2, c=3) >>> b = dict(b=2, c=30, d=4) >>> dict(set(a.iteritems()) ^ set(b.iteritems())) {'a': 1, 'c': 3, 'd': 4} That's very cool! Thanks for sharing that technique. Regards, Malcolm - Original message - From: "Peter Otten" <__pete...@web.de> To: python-list@python.org Date: Wed, 24 Dec 2008 09:46:51 +0100 Subject: Re: Strategy for determing difference between 2 very large dictionaries Gabriel Genellina wrote: > En Wed, 24 Dec 2008 05:16:36 -0200, escribió: [I didn't see the original post] >> I'm looking for suggestions on the best ('Pythonic') way to >> determine the difference between 2 very large dictionaries >> containing simple key/value pairs. >> By difference, I mean a list of keys that are present in the >> first dictionary, but not the second. And vice versa. And a list >> of keys in common between the 2 dictionaries whose values are >> different. >> The 2 strategies I'm considering are: >> 1. Brute force: Iterate through first dictionary's keys and >> determine which keys it has that are missing from the second >> dictionary. If keys match, then verify that the 2 dictionaries >> have identical values for the same key. Repeat this process for >> the second dictionary. >> 2. Use sets: Create sets from each dictionary's list of keys and >> use Python's set methods to generate a list of keys present in >> one dictionary but not the other (for both dictionaries) as well >> as a set of keys the 2 dictionaries have in common. > > I cannot think of any advantage of the first approach - so I'd use sets. > > k1 = set(dict1.iterkeys()) > k2 = set(dict2.iterkeys()) > k1 - k2 # keys in dict1 not in dict2 > k2 - k1 # keys in dict2 not in dict1 > k1 & k2 # keys in both > >> Using the set >> of keys in common, compare values across dictionaries to >> determine which keys have different values (can this last step be >> done via a simple list comprehension?) If you are not interested in the intermediate results and the dictionary values are hashable you can get the difference by >>> a = dict(a=1, b=2, c=3) >>> b = dict(b=2, c=30, d=4) >>> dict(set(a.iteritems()) ^ set(b.iteritems())) {'a': 1, 'c': 3, 'd': 4} Peter -- http://mail.python.org/mailman/listinfo/python-list -- http://mail.python.org/mailman/listinfo/python-list
Re: Strategy for determing difference between 2 very large dictionaries
En Wed, 24 Dec 2008 06:23:00 -0200, escribió: Hi Gabriel, Thank you very much for your feedback! k1 = set(dict1.iterkeys()) I noticed you suggested .iterkeys() vs. .keys(). Is there any advantage to using an iterator vs. a list as the basis for creating a set? I You've got an excelent explanation from Marc Rintsch. (Note that in Python 3.0 keys() behaves as iterkeys() in previous versions, so the above code is supposed to be written in Python 2.x) can this last step be done via a simple list comprehension? Yes; but isn't a dict comprehension more adequate? [key: (dict1[key], dict2[key]) for key in common_keys if dict1[key]!=dict2[key]} Cool!! I'm relatively new to Python and totally missed the ability to work with dictionary comprehensions. Yes, your dictionary comprehension technique is much better than the list comprehension approach I was struggling with. Your dictionary comprehension statement describes exactly what I wanted to write. This time, note that dict comprehensions require Python 3.0 -- so the code above won't work in Python 2.x. (It's not a good idea to mix both versions in the same post, sorry!) You might use instead: dict((key,(dict1[key],dict2[key])) for key in ...) but it's not as readable. -- Gabriel Genellina -- http://mail.python.org/mailman/listinfo/python-list
Re: Strategy for determing difference between 2 very large dictionaries
Hi James, > For the purpose of perpetuating the annoying pedantry that has made > usenet great: > > http://docs.python.org/dev/3.0/whatsnew/3.0.html#views-and-iterators-instead-of-lists Great tip! Thank you! Malcolm -- http://mail.python.org/mailman/listinfo/python-list
Re: Strategy for determing difference between 2 very large dictionaries
Marc 'BlackJack' Rintsch wrote: On Wed, 24 Dec 2008 03:23:00 -0500, python wrote: Hi Gabriel, Thank you very much for your feedback! k1 = set(dict1.iterkeys()) I noticed you suggested .iterkeys() vs. .keys(). Is there any advantage to using an iterator vs. a list as the basis for creating a set? I understand that an iterator makes sense if you're working with a large set of items one at a time, but if you're creating a non-filtered collection, I don't see the advantage of using an iterator or a list. I'm sure I'm missing a subtle point here :) `keys()` creates a list in memory, `iterkeys()` does not. With ``set(dict.keys())`` there is a point in time where the dictionary, the list, and the set co-exist in memory. With ``set(dict.iterkeys())`` only the set and the dictionary exist in memory. For the purpose of perpetuating the annoying pedantry that has made usenet great: http://docs.python.org/dev/3.0/whatsnew/3.0.html#views-and-iterators-instead-of-lists James -- James Stroud UCLA-DOE Institute for Genomics and Proteomics Box 951570 Los Angeles, CA 90095 http://www.jamesstroud.com -- http://mail.python.org/mailman/listinfo/python-list
Re: Strategy for determing difference between 2 very large dictionaries
Gabriel Genellina wrote: > En Wed, 24 Dec 2008 05:16:36 -0200, escribió: [I didn't see the original post] >> I'm looking for suggestions on the best ('Pythonic') way to >> determine the difference between 2 very large dictionaries >> containing simple key/value pairs. >> By difference, I mean a list of keys that are present in the >> first dictionary, but not the second. And vice versa. And a list >> of keys in common between the 2 dictionaries whose values are >> different. >> The 2 strategies I'm considering are: >> 1. Brute force: Iterate through first dictionary's keys and >> determine which keys it has that are missing from the second >> dictionary. If keys match, then verify that the 2 dictionaries >> have identical values for the same key. Repeat this process for >> the second dictionary. >> 2. Use sets: Create sets from each dictionary's list of keys and >> use Python's set methods to generate a list of keys present in >> one dictionary but not the other (for both dictionaries) as well >> as a set of keys the 2 dictionaries have in common. > > I cannot think of any advantage of the first approach - so I'd use sets. > > k1 = set(dict1.iterkeys()) > k2 = set(dict2.iterkeys()) > k1 - k2 # keys in dict1 not in dict2 > k2 - k1 # keys in dict2 not in dict1 > k1 & k2 # keys in both > >> Using the set >> of keys in common, compare values across dictionaries to >> determine which keys have different values (can this last step be >> done via a simple list comprehension?) If you are not interested in the intermediate results and the dictionary values are hashable you can get the difference by >>> a = dict(a=1, b=2, c=3) >>> b = dict(b=2, c=30, d=4) >>> dict(set(a.iteritems()) ^ set(b.iteritems())) {'a': 1, 'c': 3, 'd': 4} Peter -- http://mail.python.org/mailman/listinfo/python-list
Re: Strategy for determing difference between 2 very large dictionaries
Hi Marc, > `keys()` creates a list in memory, `iterkeys()` does not. With > ``set(dict.keys())`` there is a point in time where the dictionary, the > list, and the set co-exist in memory. With ``set(dict.iterkeys())`` > only the set and the dictionary exist in memory. Perfect explanation. Thank you! Malcolm - Original message - From: "Marc 'BlackJack' Rintsch" To: python-list@python.org Date: 24 Dec 2008 08:30:41 GMT Subject: Re: Strategy for determing difference between 2 very large dictionaries On Wed, 24 Dec 2008 03:23:00 -0500, python wrote: > Hi Gabriel, > > Thank you very much for your feedback! > >> k1 = set(dict1.iterkeys()) > > I noticed you suggested .iterkeys() vs. .keys(). Is there any advantage > to using an iterator vs. a list as the basis for creating a set? I > understand that an iterator makes sense if you're working with a large > set of items one at a time, but if you're creating a non-filtered > collection, I don't see the advantage of using an iterator or a list. > I'm sure I'm missing a subtle point here :) `keys()` creates a list in memory, `iterkeys()` does not. With ``set(dict.keys())`` there is a point in time where the dictionary, the list, and the set co-exist in memory. With ``set(dict.iterkeys())`` only the set and the dictionary exist in memory. Ciao, Marc 'BlackJack' Rintsch -- http://mail.python.org/mailman/listinfo/python-list -- http://mail.python.org/mailman/listinfo/python-list
Re: Strategy for determing difference between 2 very large dictionaries
On Wed, 24 Dec 2008 03:23:00 -0500, python wrote: > Hi Gabriel, > > Thank you very much for your feedback! > >> k1 = set(dict1.iterkeys()) > > I noticed you suggested .iterkeys() vs. .keys(). Is there any advantage > to using an iterator vs. a list as the basis for creating a set? I > understand that an iterator makes sense if you're working with a large > set of items one at a time, but if you're creating a non-filtered > collection, I don't see the advantage of using an iterator or a list. > I'm sure I'm missing a subtle point here :) `keys()` creates a list in memory, `iterkeys()` does not. With ``set(dict.keys())`` there is a point in time where the dictionary, the list, and the set co-exist in memory. With ``set(dict.iterkeys())`` only the set and the dictionary exist in memory. Ciao, Marc 'BlackJack' Rintsch -- http://mail.python.org/mailman/listinfo/python-list
Re: Strategy for determing difference between 2 very large dictionaries
Hi Gabriel, Thank you very much for your feedback! > k1 = set(dict1.iterkeys()) I noticed you suggested .iterkeys() vs. .keys(). Is there any advantage to using an iterator vs. a list as the basis for creating a set? I understand that an iterator makes sense if you're working with a large set of items one at a time, but if you're creating a non-filtered collection, I don't see the advantage of using an iterator or a list. I'm sure I'm missing a subtle point here :) >> can this last step be done via a simple list comprehension? > Yes; but isn't a dict comprehension more adequate? > > [key: (dict1[key], dict2[key]) for key in common_keys if > dict1[key]!=dict2[key]} Cool!! I'm relatively new to Python and totally missed the ability to work with dictionary comprehensions. Yes, your dictionary comprehension technique is much better than the list comprehension approach I was struggling with. Your dictionary comprehension statement describes exactly what I wanted to write. Regards, Malcolm - Original message - From: "Gabriel Genellina" To: python-list@python.org Date: Wed, 24 Dec 2008 05:46:04 -0200 Subject: Re: Strategy for determing difference between 2 very large dictionaries En Wed, 24 Dec 2008 05:16:36 -0200, escribió: > I'm looking for suggestions on the best ('Pythonic') way to > determine the difference between 2 very large dictionaries > containing simple key/value pairs. > By difference, I mean a list of keys that are present in the > first dictionary, but not the second. And vice versa. And a list > of keys in common between the 2 dictionaries whose values are > different. > The 2 strategies I'm considering are: > 1. Brute force: Iterate through first dictionary's keys and > determine which keys it has that are missing from the second > dictionary. If keys match, then verify that the 2 dictionaries > have identical values for the same key. Repeat this process for > the second dictionary. > 2. Use sets: Create sets from each dictionary's list of keys and > use Python's set methods to generate a list of keys present in > one dictionary but not the other (for both dictionaries) as well > as a set of keys the 2 dictionaries have in common. I cannot think of any advantage of the first approach - so I'd use sets. k1 = set(dict1.iterkeys()) k2 = set(dict2.iterkeys()) k1 - k2 # keys in dict1 not in dict2 k2 - k1 # keys in dict2 not in dict1 k1 & k2 # keys in both > Using the set > of keys in common, compare values across dictionaries to > determine which keys have different values (can this last step be > done via a simple list comprehension?) Yes; but isn't a dict comprehension more adequate? [key: (dict1[key], dict2[key]) for key in common_keys if dict1[key]!=dict2[key]} (where common_keys=k1&k2 as above) -- Gabriel Genellina -- http://mail.python.org/mailman/listinfo/python-list -- http://mail.python.org/mailman/listinfo/python-list
Re: Strategy for determing difference between 2 very large dictionaries
Hi Roger, By very large dictionary, I mean about 25M items per dictionary. Each item is a simple integer whose value will never exceed 2^15. I populate these dictionaries by parsing very large ASCII text files containing detailed manufacturing events. From each line in my log file I construct one or more keys and increment the numeric values associated with these keys using timing info also extracted from each line. Some of our log files are generated by separate monitoring equipment measuring the same process. In theory, these log files should be identical, but of course they are not. I'm looking for a way to determine the differences between the 2 dictionaries I will create from so-called matching sets of log files. At this point in time, I don't have concerns about memory as I'm running my scripts on a dedicated 64-bit server with 32Gb of RAM (but with budget approval to raise our total RAM to 64Gb if necessary). My main concern is am I applying a reasonably pythonic approach to my problem, eg. am I using appropriate python techniques and data structures? I am also interested in using reasonable techniques that will provide me with the fastest execution time. Thank you for sharing your thoughts with me. Regards, Malcolm - Original message - From: "Roger Binns" To: python-list@python.org Date: Tue, 23 Dec 2008 23:26:49 -0800 Subject: Re: Strategy for determing difference between 2 very large dictionaries -BEGIN PGP SIGNED MESSAGE- Hash: SHA1 pyt...@bdurham.com wrote: > Feedback on my proposed strategies (or better strategies) would be > greatly appreciated. Both strategies will work but I'd recommend the second approach since it uses already tested code written by other people - the chances of it being wrong are far lower than new code. You also neglected to mention what your concerns are or even what "very large" is. Example concerns are memory consumption, cpu consumption, testability, utility of output (eg as a generator getting each result on demand or a single list with complete results). Some people will think a few hundred entries is large. My idea of large is a working set larger than my workstation's 6GB of memory :-) In general the Pythonic approach is: 1 - Get the correct result 2 - Simple code (developer time is precious) 3 - Optimise for your data and environment Step 3 is usually not needed. Roger -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.9 (GNU/Linux) iEYEARECAAYFAklR5DUACgkQmOOfHg372QSuWACgp0xrdpW+NSB6qqCM3oBY2e/I LIEAn080VgNvmEYj47Mm7BtV69J1GwXN =MKLl -END PGP SIGNATURE- -- http://mail.python.org/mailman/listinfo/python-list -- http://mail.python.org/mailman/listinfo/python-list
Re: Strategy for determing difference between 2 very large dictionaries
On Tue, Dec 23, 2008 at 11:46 PM, Gabriel Genellina wrote: > > Yes; but isn't a dict comprehension more adequate? > > [key: (dict1[key], dict2[key]) for key in common_keys if > dict1[key]!=dict2[key]} That initial [ should be a { of course. Cheers, Chris -- Follow the path of the Iguana... http://rebertia.com -- http://mail.python.org/mailman/listinfo/python-list
Re: Strategy for determing difference between 2 very large dictionaries
En Wed, 24 Dec 2008 05:16:36 -0200, escribió: I'm looking for suggestions on the best ('Pythonic') way to determine the difference between 2 very large dictionaries containing simple key/value pairs. By difference, I mean a list of keys that are present in the first dictionary, but not the second. And vice versa. And a list of keys in common between the 2 dictionaries whose values are different. The 2 strategies I'm considering are: 1. Brute force: Iterate through first dictionary's keys and determine which keys it has that are missing from the second dictionary. If keys match, then verify that the 2 dictionaries have identical values for the same key. Repeat this process for the second dictionary. 2. Use sets: Create sets from each dictionary's list of keys and use Python's set methods to generate a list of keys present in one dictionary but not the other (for both dictionaries) as well as a set of keys the 2 dictionaries have in common. I cannot think of any advantage of the first approach - so I'd use sets. k1 = set(dict1.iterkeys()) k2 = set(dict2.iterkeys()) k1 - k2 # keys in dict1 not in dict2 k2 - k1 # keys in dict2 not in dict1 k1 & k2 # keys in both Using the set of keys in common, compare values across dictionaries to determine which keys have different values (can this last step be done via a simple list comprehension?) Yes; but isn't a dict comprehension more adequate? [key: (dict1[key], dict2[key]) for key in common_keys if dict1[key]!=dict2[key]} (where common_keys=k1&k2 as above) -- Gabriel Genellina -- http://mail.python.org/mailman/listinfo/python-list
Re: Strategy for determing difference between 2 very large dictionaries
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 pyt...@bdurham.com wrote: > Feedback on my proposed strategies (or better strategies) would be > greatly appreciated. Both strategies will work but I'd recommend the second approach since it uses already tested code written by other people - the chances of it being wrong are far lower than new code. You also neglected to mention what your concerns are or even what "very large" is. Example concerns are memory consumption, cpu consumption, testability, utility of output (eg as a generator getting each result on demand or a single list with complete results). Some people will think a few hundred entries is large. My idea of large is a working set larger than my workstation's 6GB of memory :-) In general the Pythonic approach is: 1 - Get the correct result 2 - Simple code (developer time is precious) 3 - Optimise for your data and environment Step 3 is usually not needed. Roger -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.9 (GNU/Linux) iEYEARECAAYFAklR5DUACgkQmOOfHg372QSuWACgp0xrdpW+NSB6qqCM3oBY2e/I LIEAn080VgNvmEYj47Mm7BtV69J1GwXN =MKLl -END PGP SIGNATURE- -- http://mail.python.org/mailman/listinfo/python-list