Re: [Numpy-discussion] efficient way to manage a set of floats?

2010-05-12 Thread Robert Kern
On Wed, May 12, 2010 at 21:37,   wrote:
> On Wed, May 12, 2010 at 9:27 PM, Robert Kern  wrote:
>> On Wed, May 12, 2010 at 20:09, Dr. Phillip M. Feldman
>>  wrote:
>>>
>>> Warren Weckesser-3 wrote:

 A couple questions:

 How many floats will you be storing?

 When you test for membership, will you want to allow for a numerical
 tolerance, so that if the value 1 - 0.7 is added to the set, a test for
 the value 0.3 returns True?  (0.3 is actually 0.29999999999999999, while
 1-0.7 is 0.30000000000000004)

 Warren

>>>
>>> Anne- Thanks for that absolutely beautiful explanation!!
>>>
>>> Warren- I had not initially thought about numerical tolerance, but this
>>> could potentially be an issue, in which case the management of the data
>>> would have to be completely different.  Thanks for pointing this out!  I
>>> might have as many as 50,000 values.
>>
>> You may want to explain your higher-level problem. Maintaining sets of
>> floating point numbers is almost never the right approach. With sets,
>> comparison must necessarily be by exact equality because fuzzy
>> equality is not transitive.
>
> with consistent scaling, shouldn't something like rounding to a fixed
> precision be enough?

Then you might as well convert to integers and do integer sets. The
problem is that two floats very close to a bin border (and hence very
close to each other) can round to different bins. They will compare
unequal to each other, yet equal to numbers farther away but in the
same arbitrary bin.

Again, it depends on the higher-level problem.
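The bin-boundary effect is easy to demonstrate; a minimal sketch (the values and the two-decimal bin width are arbitrary choices for illustration):

```python
# Two floats that are nearly equal can still round into different bins.
a = 0.1249999999
b = 0.1250000001   # within 2e-10 of a

# Rounding to 2 places puts the near-equal pair in different bins...
assert round(a, 2) != round(b, 2)    # 0.12 vs 0.13

# ...while a value much farther from a lands in a's bin.
c = 0.116
assert abs(c - a) > abs(b - a)
assert round(c, 2) == round(a, 2)    # both 0.12
```

So "rounded equality" is neither transitive nor even consistent with distance, which is exactly the problem with fuzzy set membership.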

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
  -- Umberto Eco
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] efficient way to manage a set of floats?

2010-05-12 Thread Benjamin Root
On Wed, May 12, 2010 at 8:37 PM,  wrote:

> On Wed, May 12, 2010 at 9:27 PM, Robert Kern 
> wrote:
> > On Wed, May 12, 2010 at 20:09, Dr. Phillip M. Feldman
> >  wrote:
> >>
> >> Warren Weckesser-3 wrote:
> >>>
> >>> A couple questions:
> >>>
> >>> How many floats will you be storing?
> >>>
> >>> When you test for membership, will you want to allow for a numerical
> >>> tolerance, so that if the value 1 - 0.7 is added to the set, a test for
> >>> the value 0.3 returns True?  (0.3 is actually 0.29999999999999999,
> >>> while 1-0.7 is 0.30000000000000004)
> >>>
> >>> Warren
> >>>
> >>
> >> Anne- Thanks for that absolutely beautiful explanation!!
> >>
> >> Warren- I had not initially thought about numerical tolerance, but this
> >> could potentially be an issue, in which case the management of the data
> >> would have to be completely different.  Thanks for pointing this out!  I
> >> might have as many as 50,000 values.
> >
> > You may want to explain your higher-level problem. Maintaining sets of
> > floating point numbers is almost never the right approach. With sets,
> > comparison must necessarily be by exact equality because fuzzy
> > equality is not transitive.
>
> with consistent scaling, shouldn't something like rounding to a fixed
> precision be enough?
>
> >>> round(1 - 0.7,14) == round(0.3, 14)
> True
> >>> 1 - 0.7 == 0.3
> False
>
> or approx_equal instead of almost_equal
>
> Josef
>
I have to agree with Robert.  Whenever a fellow student comes to me
describing an issue where they needed to find a floating point number in an
array, the problem can usually be restated in a way that makes much more
sense.

There are many issues with doing a naive comparison using round()
(largely because it is intransitive, as someone else already stated).  As
a quick and dirty solution to very specific issues, such comparisons work
-- but they are almost never left in place as a final solution.

Ben


Re: [Numpy-discussion] efficient way to manage a set of floats?

2010-05-12 Thread Anne Archibald
On 12 May 2010 20:09, Dr. Phillip M. Feldman  wrote:
>
>
> Warren Weckesser-3 wrote:
>>
>> A couple questions:
>>
>> How many floats will you be storing?
>>
>> When you test for membership, will you want to allow for a numerical
>> tolerance, so that if the value 1 - 0.7 is added to the set, a test for
>> the value 0.3 returns True?  (0.3 is actually 0.29999999999999999, while
>> 1-0.7 is 0.30000000000000004)
>>
>> Warren
>>
>
> Anne- Thanks for that absolutely beautiful explanation!!
>
> Warren- I had not initially thought about numerical tolerance, but this
> could potentially be an issue, in which case the management of the data
> would have to be completely different.  Thanks for pointing this out!  I
> might have as many as 50,000 values.

If you want one-dimensional "sets" with numerical tolerances, then a
sorted-array implementation looks more appealing. A sorted-tree
implementation will be a little awkward, since you will often need to
explore two branches to find the nearest neighbour of a query point. In
fact what you have is a one-dimensional kd-tree,
which is helpfully provided by scipy.spatial, albeit without insertion
or deletion operators.
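A minimal sketch of the sorted-array version of such a tolerance-aware membership test (the helper name and tolerance are arbitrary choices; searchsorted finds the insertion point, and the nearest neighbour must be one of the two adjacent entries):

```python
import numpy as np

def contains(sorted_vals, x, tol):
    """Return True if some element of sorted_vals lies within tol of x."""
    i = np.searchsorted(sorted_vals, x)
    # the nearest neighbour is either just before or just after the
    # insertion point, so check both
    for j in (i - 1, i):
        if 0 <= j < len(sorted_vals) and abs(sorted_vals[j] - x) <= tol:
            return True
    return False

vals = np.sort([1 - 0.7, 0.5, 1.0])
assert contains(vals, 0.3, 1e-12)      # matches 1 - 0.7 within tolerance
assert not contains(vals, 0.3, 0.0)    # exact comparison fails
```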

I should also point out that when you start wanting approximate
matches, which you will as soon as you do any sort of arithmetic on
your floats, your idea of a "set" becomes extremely messy. For
example, suppose you try to insert a float that's one part in a
million different from one that's in the table. Does it get inserted
too or is it "equal" to what's there? When it comes time to remove it,
your query will probably have a value slightly different from either
previous value - which one, or both, do you remove? Or do you raise an
exception? Resolving these questions satisfactorily will probably
require you to know the scales that are relevant in your problem and
implement sensible handling of scales larger or smaller than this (but
beware of the "teapot in a stadium problem", of wildly different
scales in the same data set). Even so, you will want to write
algorithms that are robust to imprecision, duplication, and
disappearance of points in your sets.

(If this sounds like the voice of bitter experience, well, I
discovered while writing a commercial ray-tracer that when you shoot
billions of rays into millions of triangles, all sorts of astonishing
limitations of floating-point turn into graphical artifacts. Which are
always *highly* visible. It was during this period that the
interval-arithmetic camp nearly gained a convert.)

Anne

> Phillip


Re: [Numpy-discussion] efficient way to manage a set of floats?

2010-05-12 Thread josef . pktd
On Wed, May 12, 2010 at 9:27 PM, Robert Kern  wrote:
> On Wed, May 12, 2010 at 20:09, Dr. Phillip M. Feldman
>  wrote:
>>
>> Warren Weckesser-3 wrote:
>>>
>>> A couple questions:
>>>
>>> How many floats will you be storing?
>>>
>>> When you test for membership, will you want to allow for a numerical
>>> tolerance, so that if the value 1 - 0.7 is added to the set, a test for
>>> the value 0.3 returns True?  (0.3 is actually 0.29999999999999999, while
>>> 1-0.7 is 0.30000000000000004)
>>>
>>> Warren
>>>
>>
>> Anne- Thanks for that absolutely beautiful explanation!!
>>
>> Warren- I had not initially thought about numerical tolerance, but this
>> could potentially be an issue, in which case the management of the data
>> would have to be completely different.  Thanks for pointing this out!  I
>> might have as many as 50,000 values.
>
> You may want to explain your higher-level problem. Maintaining sets of
> floating point numbers is almost never the right approach. With sets,
> comparison must necessarily be by exact equality because fuzzy
> equality is not transitive.

with consistent scaling, shouldn't something like rounding to a fixed
precision be enough?

>>> round(1 - 0.7,14) == round(0.3, 14)
True
>>> 1 - 0.7 == 0.3
False

or approx_equal instead of almost_equal

Josef




Re: [Numpy-discussion] efficient way to manage a set of floats?

2010-05-12 Thread Robert Kern
On Wed, May 12, 2010 at 20:09, Dr. Phillip M. Feldman
 wrote:
>
> Warren Weckesser-3 wrote:
>>
>> A couple questions:
>>
>> How many floats will you be storing?
>>
>> When you test for membership, will you want to allow for a numerical
>> tolerance, so that if the value 1 - 0.7 is added to the set, a test for
>> the value 0.3 returns True?  (0.3 is actually 0.29999999999999999, while
>> 1-0.7 is 0.30000000000000004)
>>
>> Warren
>>
>
> Anne- Thanks for that absolutely beautiful explanation!!
>
> Warren- I had not initially thought about numerical tolerance, but this
> could potentially be an issue, in which case the management of the data
> would have to be completely different.  Thanks for pointing this out!  I
> might have as many as 50,000 values.

You may want to explain your higher-level problem. Maintaining sets of
floating point numbers is almost never the right approach. With sets,
comparison must necessarily be by exact equality because fuzzy
equality is not transitive.

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
  -- Umberto Eco
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] efficient way to manage a set of floats?

2010-05-12 Thread Dr. Phillip M. Feldman


Warren Weckesser-3 wrote:
> 
> A couple questions:
> 
> How many floats will you be storing?
> 
> When you test for membership, will you want to allow for a numerical 
> tolerance, so that if the value 1 - 0.7 is added to the set, a test for 
> the value 0.3 returns True?  (0.3 is actually 0.29999999999999999, while
> 1-0.7 is 0.30000000000000004)
> 
> Warren
> 

Anne- Thanks for that absolutely beautiful explanation!!

Warren- I had not initially thought about numerical tolerance, but this
could potentially be an issue, in which case the management of the data
would have to be completely different.  Thanks for pointing this out!  I
might have as many as 50,000 values.

Phillip
-- 
View this message in context: 
http://old.nabble.com/efficient-way-to-manage-a-set-of-floats--tp28518014p28542439.html
Sent from the Numpy-discussion mailing list archive at Nabble.com.



Re: [Numpy-discussion] efficient way to manage a set of floats?

2010-05-10 Thread Anne Archibald
On 10 May 2010 21:56, Dr. Phillip M. Feldman  wrote:
>
>
> Anne Archibald-2 wrote:
>>
>> on a 32-bit machine,
>> the space overhead is roughly a 32-bit object pointer or two for each
>> float, plus about twice the number of floats times 32-bit pointers for
>> the table.
>>
>
> Hello Anne,
>
> I'm a bit confused by the above.  It sounds as though the hash table
> approach might occupy 4 times as much storage as a single array.  Is that
> right?

Probably. Hash tables usually operate at about half-full for
efficiency, so you'd have twice as many entries as you do objects. If
you're using a python hash table, you also have type tagging and
malloc information for each object. If you had a custom
implementation, you'd have to have some way to mark hash cells as
empty. In any case, expect at least a doubling of the space needed for
a simple array.

> Also, I don't understand why hashing would be useful for the set
> application.

The reason hash tables rely on hashing is not just to obtain a number
for a potentially complex object; a good hash function should produce
effectively random numbers for the user's objects. These numbers are
reduced modulo the size of the hash table to determine where the
object should go. If the user supplies a whole bunch of objects that
all happen to hash to the same value, they'll all try to go into the
same bin, and the hash table degenerates to an O(n) object as it has
to search through the whole list each time.

If you are using floating-point objects, well, for example the
exponent may well be all the same, or take only a few values. If,
after reduction modulo the table size, it's what gets used to
determine where your numbers go, they'll all go in the same bin. Or
you could be using all integers, which usually end with lots of zeros
in floating-point representation, so that all your numbers go in
exactly the same bin. You could try using non-power-of-two table
sizes, but that just sort of hides the problem: someone someday will
provide a collection of numbers, let's say a, a+b, a+2b, ... that
reduce to the same value modulo your table size, and suddenly your
hash table is agonizingly slow. There's kind of an art to designing
good general-purpose hash functions; it's very handy that python
provides one.
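The collision problem with naively hashing floats can be illustrated directly: take the raw bit pattern of a double modulo a power-of-two table size. Integer-valued floats have all-zero low mantissa bits, so every one of them lands in the same bucket (the table size of 256 is an arbitrary choice for the sketch):

```python
import struct

def naive_bucket(x, table_size=256):
    """Bucket index from the raw IEEE-754 bit pattern, modulo table size."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    return bits % table_size

# every small integer-valued float collides into bucket 0
buckets = {naive_bucket(float(n)) for n in range(1, 100)}
assert buckets == {0}

# python's own hash scatters the same values across many buckets
py_buckets = {hash(float(n)) % 256 for n in range(1, 100)}
assert len(py_buckets) > 1
```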

> It seems as though a red-black tree might be a good implementation for a set
> of floats if all that one wants to do is add, delete, and test membership.
> (I will also need to do unions).

If you're implementing it from scratch, you could go with a red-black
tree, but a hash table is probably faster. I'd go with a simple hash
table with linked-list buckets. Managing insertions and deletions
should be only a minor pain compared to implementing a whole tree
structure. You can probably find a nice hash function for floats with
a bit of googling the literature.
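A minimal sketch of the hash set with per-bucket lists that Anne suggests (the class name, bucket count, and use of hash() are all arbitrary choices for illustration, not a tuned design):

```python
class FloatHashSet:
    """Toy hash set: fixed table of buckets, each a plain Python list."""

    def __init__(self, nbuckets=64):
        self.buckets = [[] for _ in range(nbuckets)]

    def _bucket(self, x):
        return self.buckets[hash(x) % len(self.buckets)]

    def add(self, x):
        b = self._bucket(x)
        if x not in b:
            b.append(x)

    def discard(self, x):
        b = self._bucket(x)
        if x in b:
            b.remove(x)

    def __contains__(self, x):
        return x in self._bucket(x)

s = FloatHashSet()
s.add(0.3)
s.add(1 - 0.7)          # a distinct float, stored separately
s.discard(0.3)
assert 0.3 not in s
assert (1 - 0.7) in s   # 1 - 0.7 != 0.3, so it survives
```

Note that membership here is exact equality, which is the point of the whole thread: the structure is easy, the fuzzy semantics are not.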


I should say, try just using python sets first, and only go into all
this if they prove to be the slowest part of your program.

> Thanks for the help!

Good luck,
Anne


> Phillip


Re: [Numpy-discussion] efficient way to manage a set of floats?

2010-05-10 Thread Warren Weckesser
Dr. Phillip M. Feldman wrote:
> Anne Archibald-2 wrote:
>   
>> on a 32-bit machine,
>> the space overhead is roughly a 32-bit object pointer or two for each
>> float, plus about twice the number of floats times 32-bit pointers for
>> the table.
>>
>> 
>
> Hello Anne,
>
> I'm a bit confused by the above.  It sounds as though the hash table
> approach might occupy 4 times as much storage as a single array.  Is that
> right?
>
> Also, I don't understand why hashing would be useful for the set
> application.
>
> It seems as though a red-black tree might be a good implementation for a set
> of floats if all that one wants to do is add, delete, and test membership. 
> (I will also need to do unions).
>
>   

A couple questions:

How many floats will you be storing?

When you test for membership, will you want to allow for a numerical 
tolerance, so that if the value 1 - 0.7 is added to the set, a test for 
the value 0.3 returns True?  (0.3 is actually 0.29999999999999999, while
1-0.7 is 0.30000000000000004)


Warren

> Thanks for the help!
>
> Phillip
>   



Re: [Numpy-discussion] efficient way to manage a set of floats?

2010-05-10 Thread Dr. Phillip M. Feldman


Anne Archibald-2 wrote:
> 
> on a 32-bit machine,
> the space overhead is roughly a 32-bit object pointer or two for each
> float, plus about twice the number of floats times 32-bit pointers for
> the table.
> 

Hello Anne,

I'm a bit confused by the above.  It sounds as though the hash table
approach might occupy 4 times as much storage as a single array.  Is that
right?

Also, I don't understand why hashing would be useful for the set
application.

It seems as though a red-black tree might be a good implementation for a set
of floats if all that one wants to do is add, delete, and test membership. 
(I will also need to do unions).

Thanks for the help!

Phillip
-- 
View this message in context: 
http://old.nabble.com/efficient-way-to-manage-a-set-of-floats--tp28518014p28519085.html
Sent from the Numpy-discussion mailing list archive at Nabble.com.



Re: [Numpy-discussion] efficient way to manage a set of floats?

2010-05-10 Thread Anne Archibald
On 10 May 2010 18:53, Dr. Phillip M. Feldman  wrote:
>
> I have an application that involves managing sets of floats.  I can use
> Python's built-in set type, but a data structure that is optimized for
> fixed-size objects that can be compared without hashing should be more
> efficient than a more general set construct.  Is something like this
> available?

You might not find this as useful as you think - on a 32-bit machine,
the space overhead is roughly a 32-bit object pointer or two for each
float, plus about twice the number of floats times 32-bit pointers for
the table. And hashing might be worthwhile anyway - you could easily
have a series of floats with related bit patterns that you'd want to
scatter all over hash space. Plus python's set object has seen a fair
amount of performance tweaking.

That said, there's support in numpy for many operations which use a
sorted 1D array to represent a set of floats. There's searchsorted for
lookup, plus IIRC union and intersection operators; I'm not sure about
set difference. The big thing missing is updates, though if you can
batch them, concatenate followed by sorting should be reasonable.
Removal can be done with fancy indexing, though again batching is
recommended. Maybe these should be regarded as analogous to python's
frozensets.
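A short sketch of those sorted-array "frozenset" operations (np.setdiff1d does exist, covering the set-difference case; the sample values are arbitrary):

```python
import numpy as np

a = np.array([0.1, 0.5, 1.0, 2.5])   # sorted array as a "frozen set"
b = np.array([0.5, 2.5, 3.0])

union = np.union1d(a, b)             # sorted union
inter = np.intersect1d(a, b)         # sorted intersection
diff = np.setdiff1d(a, b)            # elements of a not in b

# exact-membership lookup on the sorted array via searchsorted
i = np.searchsorted(a, 0.5)
assert a[i] == 0.5

assert list(inter) == [0.5, 2.5]
assert list(diff) == [0.1, 1.0]
assert len(union) == 5
```

Updates are indeed the missing piece: insertion means np.sort(np.concatenate(...)), so batching them is essential.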

In terms of speed, the numpy functions are obviously not as
asymptotically efficient as hash tables, though I suspect memory
coherence is more of an issue than O(1) versus O(log(n)). The numpy
functions allow vectorized lookups and vector operations on sets,
which could be handy.


Anne

P.S. if you want sets with fuzzy queries, it occurs to me that scipy's
kdtrees will actually do an okay job, and in compiled code. No updates
there either, though. -A
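For the fuzzy-query idea, a minimal sketch with scipy.spatial (cKDTree expects 2-D point arrays, so the 1-D floats are reshaped to shape (n, 1); the tolerance is an arbitrary choice):

```python
import numpy as np
from scipy.spatial import cKDTree

vals = np.array([1 - 0.7, 0.5, 1.0]).reshape(-1, 1)
tree = cKDTree(vals)

# nearest-neighbour query: distance to the closest stored value
dist, idx = tree.query([[0.3]], k=1)
assert dist[0] < 1e-12   # 0.3 matches 1 - 0.7 within tolerance
```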



[Numpy-discussion] efficient way to manage a set of floats?

2010-05-10 Thread Dr. Phillip M. Feldman

I have an application that involves managing sets of floats.  I can use
Python's built-in set type, but a data structure that is optimized for
fixed-size objects that can be compared without hashing should be more
efficient than a more general set construct.  Is something like this
available?
-- 
View this message in context: 
http://old.nabble.com/efficient-way-to-manage-a-set-of-floats--tp28518014p28518014.html
Sent from the Numpy-discussion mailing list archive at Nabble.com.
