optimizing large dictionaries
hello i have an optimization questions about python. i am iterating through a file and counting the number of repeated elements. the file has on the order of tens of millions elements... i create a dictionary that maps elements of the file that i want to count to their number of occurs. so i iterate through the file and for each line extract the elements (simple text operation) and see if it has an entry in the dict: for line in file: try: elt = MyClass(line)# extract elt from line... my_dict[elt] += 1 except KeyError: my_dict[elt] = 1 i am using try/except since it is supposedly faster (though i am not sure about this? is this really true in Python 2.5?). the only 'twist' is that my elt is an instance of a class (MyClass) with 3 fields, all numeric. the class is hashable, and so my_dict[elt] works well. the __repr__ and __hash__ methods of my class simply return str() representation of self, while __str__ just makes everything numeric field into a concatenated string: class MyClass def __str__(self): return %s-%s-%s %(self.field1, self.field2, self.field3) def __repr__(self): return str(self) def __hash__(self): return hash(str(self)) is there anything that can be done to speed up this simply code? right now it is taking well over 15 minutes to process, on a 3 Ghz machine with lots of RAM (though this is all taking CPU power, not RAM at this point.) any general advice on how to optimize large dicts would be great too thanks for your help. -- http://mail.python.org/mailman/listinfo/python-list
Re: optimizing large dictionaries
thanks to everyone for the excellent suggestions. a few follow up q's: 1] is Try-Except really slower? my dict actually has two layers, so my_dict[aKey][bKeys]. the aKeys are very small (less than 100) where as the bKeys are the ones that are in the millions. so in that case, doing a Try-Except on aKey should be very efficient, since often it will not fail, where as if I do: if aKey in my_dict, that statement will get executed for each aKey. can someone definitely say whether Try-Except is faster or not? My benchmarks aren't conclusive and i hear it both ways from several people (though majority thinks TryExcept is faster). 2] is there an easy way to have nested defaultdicts? ie i want to say that my_dict = defaultdict(defaultdict(int)) -- to reflect the fact that my_dict is a dictionary, whose values are dictionary that map to ints. but that syntax is not valid. 3] more importantly, is there likely to be a big improvement for splitting up one big dictionary into several smaller ones? if so, is there a straight forward elegant way to implement this? the way i am thinking is to just fix a number of dicts and populate them with elements. then during retrieval, try the first dict, if that fails, try the second, if not the third, etc... but i can imagine how that's more likely to lead to bugs / debugging give the way my code is setup so i am wondering whether it is really worth it. if it can lead to a factor of 2 difference, i will definitely implement it -- does anyone have experience with this? On Jan 15, 5:58 pm, Steven D'Aprano st...@remove-this- cybersource.com.au wrote: On Thu, 15 Jan 2009 23:22:48 +0100, Christian Heimes wrote: is there anything that can be done to speed up this simply code? right now it is taking well over 15 minutes to process, on a 3 Ghz machine with lots of RAM (though this is all taking CPU power, not RAM at this point.) class MyClass(object): # a new style class with slots saves some memory __slots__ = (field1, field2, field2) I was curious whether using slots would speed up attribute access. class Parrot(object): ... def __init__(self, a, b, c): ... self.a = a ... self.b = b ... self.c = c ... class SlottedParrot(object): ... __slots__ = 'a', 'b', 'c' ... def __init__(self, a, b, c): ... self.a = a ... self.b = b ... self.c = c ... p = Parrot(23, something, [1, 2, 3]) sp = SlottedParrot(23, something, [1, 2, 3]) from timeit import Timer setup = from __main__ import p, sp t1 = Timer('p.a, p.b, p.c', setup) t2 = Timer('sp.a, sp.b, sp.c', setup) min(t1.repeat()) 0.83308887481689453 min(t2.repeat()) 0.62758088111877441 That's not a bad improvement. I knew that __slots__ was designed to reduce memory consumption, but I didn't realise they were faster as well. -- Steven -- http://mail.python.org/mailman/listinfo/python-list
efficient interval containment lookup
hello, suppose I have two lists of intervals, one significantly larger than the other. For example listA = [(10, 30), (5, 25), (100, 200), ...] might contain thousands of elements while listB (of the same form) might contain hundreds of thousands or millions of elements. I want to count how many intervals in listB are contained within every listA. For example, if listA = [(10, 30), (600, 800)] and listB = [(20, 25), (12, 18)] is the input, then the output should be that (10, 30) has 2 intervals from listB contained within it, while (600, 800) has 0. (Elements of listB can be contained within many intervals in listA, not just one.) What is an efficient way to this? One simple way is: for a_range in listA: for b_range in listB: is_within(b_range, a_range): # accumulate a counter here where is_within simply checks if the first argument is within the second. I'm not sure if it's more efficient to have the iteration over listA be on the outside or listB. But perhaps there's a way to index this that makes things more efficient? I.e. a smart way of indexing listA such that I can instantly get all of its elements that are within some element of listB, maybe? Something like a hash, where this look up can be close to constant time rather than an iteration over all lists... if there's any built-in library functions that can help in this it would be great. any suggestions on this would be awesome. thank you. -- http://mail.python.org/mailman/listinfo/python-list
Re: efficient interval containment lookup
thanks for your replies -- a few clarifications and questions. the is_within operation is containment, i.e. (a,b) is within (c,d) iff a = c and b = d. Note that I am not looking for intervals that overlap... this is why interval trees seem to me to not be relevant, as the overlapping interval problem is way harder than what I am trying to do. Please correct me if I'm wrong on this... Scott Daniels, I was hoping you could elaborate on your comment about bisect. I am trying to use it as follows: I try to grid my space (since my intervals have an upper and lower bound) into segments (e.g. of 100) and then I take these bins and put them into a bisect list, so that it is sorted. Then when a new interval comes in, I try to place it within one of those bins. But this is getting messy: I don't know if I should place it there by its beginning number or end number. Also, if I have an interval that overlaps my boundaries -- i.e. (900, 1010) when my first interval is (0, 1000), I may miss some items from listB when i make my count. Is there an elegant solution to this? Gridding like you said seemed straight forward but now it seems complicated.. I'd like to add that this is *not* a homework problem, by the way. On Jan 12, 4:05 pm, Robert Kern robert.k...@gmail.com wrote: [Apologies for piggybacking, but I think GMane had a hiccup today and missed the original post] [Somebody wrote]: suppose I have two lists of intervals, one significantly larger than the other. For example listA = [(10, 30), (5, 25), (100, 200), ...] might contain thousands of elements while listB (of the same form) might contain hundreds of thousands or millions of elements. I want to count how many intervals in listB are contained within every listA. For example, if listA = [(10, 30), (600, 800)] and listB = [(20, 25), (12, 18)] is the input, then the output should be that (10, 30) has 2 intervals from listB contained within it, while (600, 800) has 0. (Elements of listB can be contained within many intervals in listA, not just one.) Interval trees. http://en.wikipedia.org/wiki/Interval_tree -- Robert Kern I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth. -- Umberto Eco -- http://mail.python.org/mailman/listinfo/python-list
Re: efficient interval containment lookup
On Jan 12, 10:58 pm, Steven D'Aprano ste...@remove.this.cybersource.com.au wrote: On Mon, 12 Jan 2009 14:49:43 -0800, Per Freem wrote: thanks for your replies -- a few clarifications and questions. the is_within operation is containment, i.e. (a,b) is within (c,d) iff a = c and b = d. Note that I am not looking for intervals that overlap... this is why interval trees seem to me to not be relevant, as the overlapping interval problem is way harder than what I am trying to do. Please correct me if I'm wrong on this... To test for contained intervals: a = c and b = d To test for overlapping intervals: not (b c or a d) Not exactly what I would call way harder. -- Steven hi Steven, i found an implementation (which is exactly how i'd write it based on the description) here: http://hackmap.blogspot.com/2008/11/python-interval-tree.html when i use this however, it comes out either significantly slower or equal to a naive search. my naive search just iterates through a smallish list of intervals and for each one says whether they overlap with each of a large set of intervals. here is the exact code i used to make the comparison, plus the code at the link i have above: class Interval(): def __init__(self, start, stop): self.start = start self.stop = stop import random import time num_ints = 3 init_intervals = [] for n in range(0, num_ints): start = int(round(random.random() *1000)) end = start + int(round(random.random()*500+1)) init_intervals.append(Interval(start, end)) num_ranges = 900 ranges = [] for n in range(0, num_ranges): start = int(round(random.random() *1000)) end = start + int(round(random.random()*500+1)) ranges.append((start, end)) #print init_intervals tree = IntervalTree(init_intervals) t1 = time.time() for r in ranges: tree.find(r[0], r[1]) t2 = time.time() print interval tree: %.3f %((t2-t1)*1000.0) t1 = time.time() for r in ranges: naive_find(init_intervals, r[0], r[1]) t2 = time.time() print brute force: %.3f %((t2-t1)*1000.0) on one run, i get: interval tree: 8584.682 brute force: 8201.644 is there anything wrong with this implementation? it seems very right to me but i am no expert. any help on this would be relly helpful. -- http://mail.python.org/mailman/listinfo/python-list
Re: efficient interval containment lookup
i forgot to add, my naive_find is: def naive_find(intervals, start, stop): results = [] for interval in intervals: if interval.start = start and interval.stop = stop: results.append(interval) return results On Jan 12, 11:55 pm, Per Freem perfr...@yahoo.com wrote: On Jan 12, 10:58 pm, Steven D'Aprano ste...@remove.this.cybersource.com.au wrote: On Mon, 12 Jan 2009 14:49:43 -0800, Per Freem wrote: thanks for your replies -- a few clarifications and questions. the is_within operation is containment, i.e. (a,b) is within (c,d) iff a = c and b = d. Note that I am not looking for intervals that overlap... this is why interval trees seem to me to not be relevant, as the overlapping interval problem is way harder than what I am trying to do. Please correct me if I'm wrong on this... To test for contained intervals: a = c and b = d To test for overlapping intervals: not (b c or a d) Not exactly what I would call way harder. -- Steven hi Steven, i found an implementation (which is exactly how i'd write it based on the description) here:http://hackmap.blogspot.com/2008/11/python-interval-tree.html when i use this however, it comes out either significantly slower or equal to a naive search. my naive search just iterates through a smallish list of intervals and for each one says whether they overlap with each of a large set of intervals. here is the exact code i used to make the comparison, plus the code at the link i have above: class Interval(): def __init__(self, start, stop): self.start = start self.stop = stop import random import time num_ints = 3 init_intervals = [] for n in range(0, num_ints): start = int(round(random.random() *1000)) end = start + int(round(random.random()*500+1)) init_intervals.append(Interval(start, end)) num_ranges = 900 ranges = [] for n in range(0, num_ranges): start = int(round(random.random() *1000)) end = start + int(round(random.random()*500+1)) ranges.append((start, end)) #print init_intervals tree = IntervalTree(init_intervals) t1 = time.time() for r in ranges: tree.find(r[0], r[1]) t2 = time.time() print interval tree: %.3f %((t2-t1)*1000.0) t1 = time.time() for r in ranges: naive_find(init_intervals, r[0], r[1]) t2 = time.time() print brute force: %.3f %((t2-t1)*1000.0) on one run, i get: interval tree: 8584.682 brute force: 8201.644 is there anything wrong with this implementation? it seems very right to me but i am no expert. any help on this would be relly helpful. -- http://mail.python.org/mailman/listinfo/python-list
Re: efficient interval containment lookup
hi brent, thanks very much for your informative reply -- didn't realize this about the size of the interval. thanks for the bx-python link. could you (or someone else) explain why the size of the interval makes such a big difference? i don't understand why it affects efficiency so much... thanks. On Jan 13, 12:24 am, brent bpede...@gmail.com wrote: On Jan 12, 8:55 pm, Per Freem perfr...@yahoo.com wrote: On Jan 12, 10:58 pm, Steven D'Aprano ste...@remove.this.cybersource.com.au wrote: On Mon, 12 Jan 2009 14:49:43 -0800, Per Freem wrote: thanks for your replies -- a few clarifications and questions. the is_within operation is containment, i.e. (a,b) is within (c,d) iff a = c and b = d. Note that I am not looking for intervals that overlap... this is why interval trees seem to me to not be relevant, as the overlapping interval problem is way harder than what I am trying to do. Please correct me if I'm wrong on this... To test for contained intervals: a = c and b = d To test for overlapping intervals: not (b c or a d) Not exactly what I would call way harder. -- Steven hi Steven, i found an implementation (which is exactly how i'd write it based on the description) here:http://hackmap.blogspot.com/2008/11/python-interval-tree.html when i use this however, it comes out either significantly slower or equal to a naive search. my naive search just iterates through a smallish list of intervals and for each one says whether they overlap with each of a large set of intervals. here is the exact code i used to make the comparison, plus the code at the link i have above: class Interval(): def __init__(self, start, stop): self.start = start self.stop = stop import random import time num_ints = 3 init_intervals = [] for n in range(0, num_ints): start = int(round(random.random() *1000)) end = start + int(round(random.random()*500+1)) init_intervals.append(Interval(start, end)) num_ranges = 900 ranges = [] for n in range(0, num_ranges): start = int(round(random.random() *1000)) end = start + int(round(random.random()*500+1)) ranges.append((start, end)) #print init_intervals tree = IntervalTree(init_intervals) t1 = time.time() for r in ranges: tree.find(r[0], r[1]) t2 = time.time() print interval tree: %.3f %((t2-t1)*1000.0) t1 = time.time() for r in ranges: naive_find(init_intervals, r[0], r[1]) t2 = time.time() print brute force: %.3f %((t2-t1)*1000.0) on one run, i get: interval tree: 8584.682 brute force: 8201.644 is there anything wrong with this implementation? it seems very right to me but i am no expert. any help on this would be relly helpful. hi, the tree is inefficient when the interval is large. as the size of the interval shrinks to much less than the expanse of the tree, the tree will be faster. changing 500 to 50 in both cases in your script, i get: interval tree: 3233.404 brute force: 9807.787 so the tree will work for limited cases. but it's quite simple. check the tree in bx-python:http://bx-python.trac.bx.psu.edu/browser/trunk/lib/bx/intervals/opera... for a more robust implementation. -brentp -- http://mail.python.org/mailman/listinfo/python-list