optimizing large dictionaries

2009-01-15 Thread Per Freem
hello

i have an optimization question about python. i am iterating through
a file and counting the number of repeated elements. the file has on
the order of tens of millions of elements...

i create a dictionary that maps the elements of the file that i want
to count to their number of occurrences. so i iterate through the
file, and for each line extract the element (a simple text operation)
and see if it already has an entry in the dict:

for line in file:
    try:
        elt = MyClass(line)  # extract elt from line...
        my_dict[elt] += 1
    except KeyError:
        my_dict[elt] = 1
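
for reference, the same count can be written without any key test at
all using collections.defaultdict (new in Python 2.5); a minimal
sketch, assuming my_dict and MyClass as above:

from collections import defaultdict

my_dict = defaultdict(int)   # missing keys start at 0
for line in file:
    elt = MyClass(line)      # extract elt from line...
    my_dict[elt] += 1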

i am using try/except since it is supposedly faster (though i am not
sure about this; is it really true in Python 2.5?).

the only 'twist' is that my elt is an instance of a class (MyClass)
with 3 fields, all numeric. the class is hashable, and so my_dict[elt]
works well. the __repr__ and __hash__ methods of my class simply
return the str() representation of self, while __str__ joins the
numeric fields into one concatenated string:

class MyClass:

    def __str__(self):
        return "%s-%s-%s" % (self.field1, self.field2, self.field3)

    def __repr__(self):
        return str(self)

    def __hash__(self):
        return hash(str(self))
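
since every dict lookup calls __hash__, building a fresh string per
call is a plausible hot spot. a minimal sketch (not from the thread)
of a cheaper variant that hashes a tuple of the fields instead, and
also defines the __eq__ that dict collision resolution relies on:

class MyClass(object):
    __slots__ = ('field1', 'field2', 'field3')

    def __init__(self, field1, field2, field3):
        self.field1 = field1
        self.field2 = field2
        self.field3 = field3

    def __hash__(self):
        # tuples of numbers hash quickly; no string is built per lookup
        return hash((self.field1, self.field2, self.field3))

    def __eq__(self, other):
        return (self.field1, self.field2, self.field3) == \
               (other.field1, other.field2, other.field3)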


is there anything that can be done to speed up this simple code? right
now it is taking well over 15 minutes to process, on a 3 GHz machine
with lots of RAM (though this is all CPU-bound, not RAM-bound, at this
point).

any general advice on how to optimize large dicts would be great too

thanks for your help.
--
http://mail.python.org/mailman/listinfo/python-list


Re: optimizing large dictionaries

2009-01-15 Thread Per Freem
thanks to everyone for the excellent suggestions. a few follow up q's:

1] is try/except really slower? my dict actually has two layers, so
my_dict[aKey][bKey]. there are very few aKeys (less than 100), whereas
the bKeys number in the millions.  so in that case, doing a try/except
on aKey should be very efficient, since it will rarely fail, whereas
'if aKey in my_dict' gets executed for every single aKey. can someone
say definitively whether try/except is faster or not? my benchmarks
aren't conclusive and i hear it both ways from several people (though
the majority thinks try/except is faster).
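
one way to settle this for your own data is to time both spellings
with timeit. a minimal sketch, assuming a 100-key dict where lookups
almost always hit, like the aKeys:

from timeit import Timer

setup = "d = dict.fromkeys(range(100), 0)"

try_stmt = """
try:
    d[50] += 1
except KeyError:
    d[50] = 1
"""
in_stmt = """
if 50 in d:
    d[50] += 1
else:
    d[50] = 1
"""

print "try/except: %.6f" % min(Timer(try_stmt, setup).repeat())
print "in-test:    %.6f" % min(Timer(in_stmt, setup).repeat())

the usual CPython folklore is that try/except wins when the key is
almost always present (the happy path executes no test) and loses badly
when misses are common (raising an exception is expensive); swapping
the key 50 for a missing key shows the other side of the trade-off.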

2] is there an easy way to have nested defaultdicts? i.e. i want to
say my_dict = defaultdict(defaultdict(int)), to reflect the fact that
my_dict is a dictionary whose values are dictionaries that map to
ints. but that syntax is not valid.
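
the usual workaround is to wrap the inner defaultdict in a
zero-argument callable, since the default factory is called with no
arguments:

from collections import defaultdict

# a dict of dicts of ints: my_dict[aKey][bKey] += 1 just works,
# creating both levels on first touch
my_dict = defaultdict(lambda: defaultdict(int))
my_dict['aKey']['bKey'] += 1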

3] more importantly, is there likely to be a big improvement from
splitting up one big dictionary into several smaller ones? if so, is
there a straightforward, elegant way to implement this? the way i am
thinking is to just fix a number of dicts and populate them with
elements. then during retrieval, try the first dict; if that fails,
try the second; if not, the third, etc... but i can imagine how that's
more likely to lead to bugs / debugging given the way my code is set
up, so i am wondering whether it is really worth it.
if it can lead to a factor of 2 difference, i will definitely
implement it -- does anyone have experience with this?
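
a minimal sketch of the splitting idea, routing each key by its hash
instead of probing the dicts in sequence (the names and shard count
here are made up for illustration); note that a single CPython dict
lookup is already O(1) on average, so a large win from sharding alone
would be surprising:

NUM_SHARDS = 16
shards = [dict() for _ in range(NUM_SHARDS)]

def shard_for(key):
    # the same key always hashes to the same small dict
    return shards[hash(key) % NUM_SHARDS]

def increment(key):
    d = shard_for(key)
    d[key] = d.get(key, 0) + 1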

On Jan 15, 5:58 pm, Steven D'Aprano st...@remove-this-
cybersource.com.au wrote:
 On Thu, 15 Jan 2009 23:22:48 +0100, Christian Heimes wrote:
  is there anything that can be done to speed up this simple code? right
  now it is taking well over 15 minutes to process, on a 3 GHz machine
  with lots of RAM (though this is all CPU-bound, not RAM-bound, at this
  point).

  class MyClass(object):
      # a new style class with slots saves some memory
      __slots__ = ("field1", "field2", "field3")

 I was curious whether using slots would speed up attribute access.

 >>> class Parrot(object):
 ...     def __init__(self, a, b, c):
 ...             self.a = a
 ...             self.b = b
 ...             self.c = c
 ...
 >>> class SlottedParrot(object):
 ...     __slots__ = 'a', 'b', 'c'
 ...     def __init__(self, a, b, c):
 ...             self.a = a
 ...             self.b = b
 ...             self.c = c
 ...

 >>> p = Parrot(23, "something", [1, 2, 3])
 >>> sp = SlottedParrot(23, "something", [1, 2, 3])

 >>> from timeit import Timer
 >>> setup = "from __main__ import p, sp"
 >>> t1 = Timer('p.a, p.b, p.c', setup)
 >>> t2 = Timer('sp.a, sp.b, sp.c', setup)
 >>> min(t1.repeat())
 0.83308887481689453
 >>> min(t2.repeat())
 0.62758088111877441

 That's not a bad improvement. I knew that __slots__ was designed to
 reduce memory consumption, but I didn't realise they were faster as well.

 --
 Steven

--
http://mail.python.org/mailman/listinfo/python-list


efficient interval containment lookup

2009-01-12 Thread Per Freem
hello,

suppose I have two lists of intervals, one significantly larger than
the other. For example, listA = [(10, 30), (5, 25), (100, 200), ...]
might contain thousands of elements while listB (of the same form)
might contain hundreds of thousands or millions of elements.
I want to count how many intervals in listB are contained within each
interval in listA. For example, if listA = [(10, 30), (600, 800)] and
listB = [(20, 25), (12, 18)] is the input, then the output should be
that (10, 30) has 2 intervals from listB contained within it, while
(600, 800) has 0. (Elements of listB can be contained within many
intervals in listA, not just one.)

What is an efficient way to do this?  One simple way is:

for a_range in listA:
    for b_range in listB:
        if is_within(b_range, a_range):
            pass  # accumulate a counter here

where is_within simply checks if the first argument is within the
second.

I'm not sure if it's more efficient to have the iteration over listA
on the outside or listB.  But perhaps there's a way to index this that
makes things more efficient?  I.e. a smart way of indexing listA such
that I can instantly get all of its elements that contain some element
of listB, maybe?  Something like a hash, where this lookup can be
close to constant time rather than an iteration over all lists... if
there are any built-in library functions that can help with this it
would be great.
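
one concrete indexing scheme along these lines (a sketch, not from the
thread): sort listB once, sweep both lists by left endpoint, and keep
a Fenwick (binary indexed) tree over the right endpoints, so each
containment query costs O(log n) instead of a full scan of listB:

from bisect import bisect_left, bisect_right

def count_contained(listA, listB):
    # for each (c, d) in listA, count (s, e) in listB
    # with s >= c and e <= d
    ends = sorted(set(e for _, e in listB))
    tree = [0] * (len(ends) + 1)        # Fenwick tree, 1-based

    def add(i):                         # record one right endpoint
        while i <= len(ends):
            tree[i] += 1
            i += i & -i

    def count_upto(i):                  # recorded ends among ends[:i]
        total = 0
        while i > 0:
            total += tree[i]
            i -= i & -i
        return total

    # sweep largest left endpoint first, so every listB interval with
    # s >= c has been recorded by the time query (c, d) is answered
    order = sorted(range(len(listA)), key=lambda k: listA[k][0],
                   reverse=True)
    pending = sorted(listB, reverse=True)
    result = [0] * len(listA)
    j = 0
    for k in order:
        c, d = listA[k]
        while j < len(pending) and pending[j][0] >= c:
            add(bisect_left(ends, pending[j][1]) + 1)
            j += 1
        result[k] = count_upto(bisect_right(ends, d))
    return result

with the example above, count_contained([(10, 30), (600, 800)],
[(20, 25), (12, 18)]) returns [2, 0].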

any suggestions on this would be awesome. thank you.
--
http://mail.python.org/mailman/listinfo/python-list


Re: efficient interval containment lookup

2009-01-12 Thread Per Freem
thanks for your replies -- a few clarifications and questions. the
is_within operation is containment, i.e. (a,b) is within (c,d) iff a
>= c and b <= d. Note that I am not looking for intervals that
overlap... this is why interval trees seem to me to not be relevant,
as the overlapping interval problem is way harder than what I am
trying to do. Please correct me if I'm wrong on this...

Scott Daniels, I was hoping you could elaborate on your comment about
bisect. I am trying to use it as follows: I grid my space (since my
intervals have an upper and lower bound) into segments (e.g. of width
100), and then I keep these bins in a sorted list that I can search
with bisect. Then when a new interval comes in, I try to place it
within one of those bins.  But this is getting messy: I don't know if
I should place it there by its start or its end.  Also, if I have an
interval that crosses a bin boundary -- i.e. (900, 1010) when my first
bin is (0, 1000) -- I may miss some items from listB when I make my
count.  Is there an elegant solution to this?  Gridding like you said
seemed straightforward but now it seems complicated...
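
one way out of the boundary problem (a sketch, not from the thread) is
to file each listB interval under the bin of its start point only: any
interval contained in a query (c, d) must start inside [c, d], so
scanning just the bins that cover [c, d] sees every candidate, and an
exact test filters out the rest. assuming integer endpoints:

BIN = 100   # grid width; tune to the data

def build_bins(listB):
    bins = {}
    for s, e in listB:
        bins.setdefault(s // BIN, []).append((s, e))
    return bins

def count_in(bins, c, d):
    # only bins covering [c, d] can hold an interval with c <= s <= d
    n = 0
    for b in range(c // BIN, d // BIN + 1):
        for s, e in bins.get(b, []):
            if s >= c and e <= d:
                n += 1
    return n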

I'd like to add that this is *not* a homework problem, by the way.

On Jan 12, 4:05 pm, Robert Kern robert.k...@gmail.com wrote:
 [Apologies for piggybacking, but I think GMane had a hiccup today and
 missed the original post]

 [Somebody wrote]:

  suppose I have two lists of intervals, one significantly larger than
  the other. For example, listA = [(10, 30), (5, 25), (100, 200), ...]
  might contain thousands of elements while listB (of the same form)
  might contain hundreds of thousands or millions of elements.
  I want to count how many intervals in listB are contained within each
  interval in listA. For example, if listA = [(10, 30), (600, 800)] and
  listB = [(20, 25), (12, 18)] is the input, then the output should be
  that (10, 30) has 2 intervals from listB contained within it, while
  (600, 800) has 0. (Elements of listB can be contained within many
  intervals in listA, not just one.)

 Interval trees.

 http://en.wikipedia.org/wiki/Interval_tree

 --
 Robert Kern

 I have come to believe that the whole world is an enigma, a harmless
 enigma that is made terrible by our own mad attempt to interpret it as
 though it had an underlying truth.
    -- Umberto Eco

--
http://mail.python.org/mailman/listinfo/python-list


Re: efficient interval containment lookup

2009-01-12 Thread Per Freem
On Jan 12, 10:58 pm, Steven D'Aprano
ste...@remove.this.cybersource.com.au wrote:
 On Mon, 12 Jan 2009 14:49:43 -0800, Per Freem wrote:
  thanks for your replies -- a few clarifications and questions. the
  is_within operation is containment, i.e. (a,b) is within (c,d) iff a
  >= c and b <= d. Note that I am not looking for intervals that
  overlap... this is why interval trees seem to me to not be relevant, as
  the overlapping interval problem is way harder than what I am trying to
  do. Please correct me if I'm wrong on this...

 To test for contained intervals:

 a >= c and b <= d

 To test for overlapping intervals:

 not (b < c or a > d)

 Not exactly what I would call "way harder".

 --
 Steven

hi Steven,

i found an implementation (which is exactly how i'd write it based on
the description) here:
http://hackmap.blogspot.com/2008/11/python-interval-tree.html

when i use this however, it comes out either significantly slower than
or equal to a naive search. my naive search just iterates through a
smallish list of intervals and for each one says whether it overlaps
with each of a large set of intervals.

here is the exact code i used to make the comparison, plus the code at
the link above:

class Interval():
    def __init__(self, start, stop):
        self.start = start
        self.stop = stop

import random
import time

num_ints = 3
init_intervals = []
for n in range(0, num_ints):
    start = int(round(random.random()*1000))
    end = start + int(round(random.random()*500+1))
    init_intervals.append(Interval(start, end))

num_ranges = 900
ranges = []
for n in range(0, num_ranges):
    start = int(round(random.random()*1000))
    end = start + int(round(random.random()*500+1))
    ranges.append((start, end))

#print init_intervals
tree = IntervalTree(init_intervals)   # IntervalTree from the link above
t1 = time.time()
for r in ranges:
    tree.find(r[0], r[1])
t2 = time.time()
print "interval tree: %.3f" % ((t2-t1)*1000.0)

t1 = time.time()
for r in ranges:
    naive_find(init_intervals, r[0], r[1])
t2 = time.time()
print "brute force: %.3f" % ((t2-t1)*1000.0)

on one run, i get:

interval tree: 8584.682
brute force: 8201.644

is there anything wrong with this implementation? it seems very right
to me but i am no expert. any help on this would be really helpful.
--
http://mail.python.org/mailman/listinfo/python-list


Re: efficient interval containment lookup

2009-01-12 Thread Per Freem
i forgot to add, my naive_find is:

def naive_find(intervals, start, stop):
    results = []
    for interval in intervals:
        # keep intervals contained in the query range [start, stop]
        if interval.start >= start and interval.stop <= stop:
            results.append(interval)
    return results

On Jan 12, 11:55 pm, Per Freem perfr...@yahoo.com wrote:
 [snip]

--
http://mail.python.org/mailman/listinfo/python-list


Re: efficient interval containment lookup

2009-01-12 Thread Per Freem
hi brent, thanks very much for your informative reply -- didn't
realize this about the size of the interval.

thanks for the bx-python link.  could you (or someone else) explain
why the size of the interval makes such a big difference? i don't
understand why it affects efficiency so much...

thanks.

On Jan 13, 12:24 am, brent bpede...@gmail.com wrote:
 On Jan 12, 8:55 pm, Per Freem perfr...@yahoo.com wrote:
 [snip]

 hi, the tree is inefficient when the interval is large: a query must
 visit every node whose range overlaps the query interval, so a query
 that spans most of the tree's expanse touches nearly every node and
 degenerates into a linear scan. as the size of the interval shrinks to
 much less than the expanse of the tree, the tree will be faster.
 changing 500 to 50 in both cases in your script, i get:

 interval tree: 3233.404
 brute force: 9807.787

 so the tree will work for limited cases. but it's quite simple. check
 the tree in bx-python:
 http://bx-python.trac.bx.psu.edu/browser/trunk/lib/bx/intervals/opera...
 for a more robust implementation.
 -brentp


--
http://mail.python.org/mailman/listinfo/python-list