Re: [Numpy-discussion] efficient way to manage a set of floats?
On Wed, May 12, 2010 at 21:37, wrote:
> On Wed, May 12, 2010 at 9:27 PM, Robert Kern wrote:
>> You may want to explain your higher-level problem. Maintaining sets of
>> floating point numbers is almost never the right approach. With sets,
>> comparison must necessarily be by exact equality because fuzzy
>> equality is not transitive.
>
> with consistent scaling, shouldn't something like rounding to a fixed
> precision be enough?

Then you might as well convert to integers and use integer sets. The
problem is that two floats very close to a bin border (and hence very
close to each other) can round into different bins. They will compare
unequal to each other, yet equal to numbers farther away that happen to
land in the same arbitrary bin. Again, it depends on the higher-level
problem.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
  -- Umberto Eco

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
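[Editorial note: Robert's bin-boundary objection is easy to demonstrate. A minimal sketch; the specific values and the 2-decimal precision are illustrative choices only:]

```python
# Two floats that straddle the rounding boundary between 0.24 and 0.25:
a = 0.2449999
b = 0.2450001

# They are far closer to each other than the bin width of 0.01...
assert abs(a - b) < 1e-6

# ...yet rounding sends them into different bins, so a "rounded set"
# would consider them unequal:
assert round(a, 2) != round(b, 2)   # 0.24 vs 0.25
```

Meanwhile a value like 0.2549999 rounds into the same 0.25 bin as b, despite being a thousand times farther away from it.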
Re: [Numpy-discussion] efficient way to manage a set of floats?
On Wed, May 12, 2010 at 8:37 PM, wrote:
> On Wed, May 12, 2010 at 9:27 PM, Robert Kern wrote:
>> You may want to explain your higher-level problem. Maintaining sets of
>> floating point numbers is almost never the right approach. With sets,
>> comparison must necessarily be by exact equality because fuzzy
>> equality is not transitive.
>
> with consistent scaling, shouldn't something like rounding to a fixed
> precision be enough?
>
> >>> round(1 - 0.7, 14) == round(0.3, 14)
> True
> >>> 1 - 0.7 == 0.3
> False
>
> or approx_equal instead of almost_equal
>
> Josef

I have to agree with Robert. Whenever a fellow student comes to me
describing an issue where they needed to find a floating point number in
an array, the problem can usually be restated in a way that makes much
more sense. There are many pitfalls in doing a naive comparison using
round(), largely because it is intransitive, as someone else already
stated. As a quick and dirty fix for a very specific problem it can
work, but it is almost never left as the final solution.

Ben
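[Editorial note: the intransitivity that Robert and Ben mention can be made concrete. A sketch with a hypothetical approx_equal helper; the name and the absolute tolerance are illustrative:]

```python
def approx_equal(x, y, tol=1e-9):
    # fuzzy comparison with an absolute tolerance (illustrative only)
    return abs(x - y) <= tol

a, b, c = 0.0, 0.6e-9, 1.2e-9
assert approx_equal(a, b)       # a ~ b
assert approx_equal(b, c)       # b ~ c
assert not approx_equal(a, c)   # yet a !~ c: fuzzy equality is not transitive
```

A set membership test built on such a predicate gives answers that depend on insertion order, which is why exact equality is the only comparison a set can safely use.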
Re: [Numpy-discussion] efficient way to manage a set of floats?
On 12 May 2010 20:09, Dr. Phillip M. Feldman wrote:
> Warren- I had not initially thought about numerical tolerance, but this
> could potentially be an issue, in which case the management of the data
> would have to be completely different. Thanks for pointing this out! I
> might have as many as 50,000 values.

If you want one-dimensional "sets" with numerical tolerances, then a
sorted-array implementation looks more appealing. A sorted-tree
implementation will be a little awkward, since you will often need to
explore two branches to find the nearest neighbour of a query point. In
fact what you have is a one-dimensional kd-tree, which is helpfully
provided by scipy.spatial, albeit without insertion or deletion
operators.

I should also point out that once you want approximate matches, which
you will as soon as you do any sort of arithmetic on your floats, your
idea of a "set" becomes extremely messy. For example, suppose you try to
insert a float that's one part in a million different from one that's
already in the table. Does it get inserted too, or is it "equal" to
what's there? When it comes time to remove it, your query will probably
have a value slightly different from either previous value - which one,
or both, do you remove? Or do you raise an exception? Resolving these
questions satisfactorily will probably require you to know the scales
that are relevant in your problem and to handle sensibly any scales
larger or smaller than this (but beware of the "teapot in a stadium"
problem: wildly different scales in the same data set).

Even so, you will want to write algorithms that are robust to
imprecision, duplication, and disappearance of points in your sets. (If
this sounds like the voice of bitter experience: I discovered while
writing a commercial ray-tracer that when you shoot billions of rays
into millions of triangles, all sorts of astonishing limitations of
floating-point turn into graphical artifacts. Which are always *highly*
visible. It was during this period that the interval-arithmetic camp
nearly gained a convert.)

Anne
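[Editorial note: Anne's scipy.spatial suggestion might look like this in practice. A sketch assuming scipy is installed; the tolerance and the contains helper are illustrative, not part of the library:]

```python
import numpy as np
from scipy.spatial import cKDTree

values = np.array([0.1, 1 - 0.7, 4.2])
tree = cKDTree(values[:, None])   # kd-trees expect 2-D data of shape (n, 1)

def contains(tree, x, tol=1e-12):
    # nearest-neighbour query: "membership" means within tol of some point
    dist, _ = tree.query([[x]], k=1)
    return bool(dist[0] <= tol)

contains(tree, 0.3)   # True: 1 - 0.7 is within tol of 0.3
contains(tree, 0.5)   # False: nearest point is ~0.2 away
```

Note that, as Anne says, cKDTree has no insertion or deletion, so updates would mean rebuilding the tree.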
Re: [Numpy-discussion] efficient way to manage a set of floats?
On Wed, May 12, 2010 at 9:27 PM, Robert Kern wrote:
> On Wed, May 12, 2010 at 20:09, Dr. Phillip M. Feldman wrote:
>> Warren- I had not initially thought about numerical tolerance, but this
>> could potentially be an issue, in which case the management of the data
>> would have to be completely different. Thanks for pointing this out! I
>> might have as many as 50,000 values.
>
> You may want to explain your higher-level problem. Maintaining sets of
> floating point numbers is almost never the right approach. With sets,
> comparison must necessarily be by exact equality because fuzzy
> equality is not transitive.

with consistent scaling, shouldn't something like rounding to a fixed
precision be enough?

>>> round(1 - 0.7, 14) == round(0.3, 14)
True
>>> 1 - 0.7 == 0.3
False

or approx_equal instead of almost_equal

Josef
Re: [Numpy-discussion] efficient way to manage a set of floats?
On Wed, May 12, 2010 at 20:09, Dr. Phillip M. Feldman wrote:
> Warren Weckesser-3 wrote:
>> A couple questions:
>>
>> How many floats will you be storing?
>>
>> When you test for membership, will you want to allow for a numerical
>> tolerance, so that if the value 1 - 0.7 is added to the set, a test for
>> the value 0.3 returns True? (0.3 is actually 0.29999999999999999, while
>> 1 - 0.7 is 0.30000000000000004)
>>
>> Warren
>
> Anne- Thanks for that absolutely beautiful explanation!!
>
> Warren- I had not initially thought about numerical tolerance, but this
> could potentially be an issue, in which case the management of the data
> would have to be completely different. Thanks for pointing this out! I
> might have as many as 50,000 values.

You may want to explain your higher-level problem. Maintaining sets of
floating point numbers is almost never the right approach. With sets,
comparison must necessarily be by exact equality because fuzzy
equality is not transitive.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
  -- Umberto Eco
Re: [Numpy-discussion] efficient way to manage a set of floats?
Warren Weckesser-3 wrote:
> A couple questions:
>
> How many floats will you be storing?
>
> When you test for membership, will you want to allow for a numerical
> tolerance, so that if the value 1 - 0.7 is added to the set, a test for
> the value 0.3 returns True? (0.3 is actually 0.29999999999999999, while
> 1 - 0.7 is 0.30000000000000004)
>
> Warren

Anne- Thanks for that absolutely beautiful explanation!!

Warren- I had not initially thought about numerical tolerance, but this
could potentially be an issue, in which case the management of the data
would have to be completely different. Thanks for pointing this out! I
might have as many as 50,000 values.

Phillip

--
View this message in context:
http://old.nabble.com/efficient-way-to-manage-a-set-of-floats--tp28518014p28542439.html
Sent from the Numpy-discussion mailing list archive at Nabble.com.
Re: [Numpy-discussion] efficient way to manage a set of floats?
On 10 May 2010 21:56, Dr. Phillip M. Feldman wrote:
> Anne Archibald-2 wrote:
>> on a 32-bit machine, the space overhead is roughly a 32-bit object
>> pointer or two for each float, plus about twice the number of floats
>> times 32-bit pointers for the table.
>
> Hello Anne,
>
> I'm a bit confused by the above. It sounds as though the hash table
> approach might occupy 4 times as much storage as a single array. Is
> that right?

Probably. Hash tables usually operate at about half-full for efficiency,
so you'd have twice as many entries as you do objects. If you're using a
Python hash table, you also have type tagging and malloc information for
each object. If you had a custom implementation, you'd have to have some
way to mark hash cells as empty. In any case, expect at least a doubling
of the space needed for a simple array.

> Also, I don't understand why hashing would be useful for the set
> application.

The reason hash tables rely on hashing is not just to obtain a number
for a potentially complex object; a good hash function should produce
effectively random numbers for the user's objects. These numbers are
reduced modulo the size of the hash table to determine where the object
should go. If the user supplies a whole bunch of objects that all happen
to hash to the same value, they'll all try to go into the same bin, and
the hash table degenerates into an O(n) structure, since it has to
search through the whole list each time. If you are using floating-point
objects, the exponent bits may well all be the same, or take only a few
values; if, after reduction modulo the table size, the exponent is what
gets used to determine where your numbers go, they'll all go in the same
bin. Or you could be using all integers, which usually end with lots of
zeros in floating-point representation, so that all your numbers go in
exactly the same bin.

You could try using non-power-of-two table sizes, but that just sort of
hides the problem: someone someday will provide a collection of numbers,
let's say a, a+b, a+2b, ..., that all reduce to the same value modulo
your table size, and suddenly your hash table is agonizingly slow.
There's kind of an art to designing good general-purpose hash functions;
it's very handy that Python provides one.

> It seems as though a red-black tree might be a good implementation for
> a set of floats if all that one wants to do is add, delete, and test
> membership. (I will also need to do unions.)

If you're implementing it from scratch, you could go with a red-black
tree, but a hash table is probably faster. I'd go with a simple hash
table with linked-list buckets. Managing insertions and deletions should
be only a minor pain compared to implementing a whole tree structure.
You can probably find a nice hash function for floats with a bit of
googling the literature.

I should say, try just using Python sets first, and only go into all
this if they prove to be the slowest part of your program.

> Thanks for the help!

Good luck,
Anne
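[Editorial note: the "simple hash table with linked-list buckets" Anne describes could be sketched as below. The class name and fixed bucket count are illustrative; a real implementation would also resize as the table fills:]

```python
class FloatHashSet:
    """Chained hash table: each bucket is a small list of floats."""

    def __init__(self, nbuckets=1024):
        self.buckets = [[] for _ in range(nbuckets)]

    def _bucket(self, x):
        # Python's built-in hash() already scatters float bit patterns well
        return self.buckets[hash(x) % len(self.buckets)]

    def add(self, x):
        bucket = self._bucket(x)
        if x not in bucket:
            bucket.append(x)

    def discard(self, x):
        bucket = self._bucket(x)
        if x in bucket:
            bucket.remove(x)

    def __contains__(self, x):
        return x in self._bucket(x)

s = FloatHashSet()
s.add(1 - 0.7)
(1 - 0.7) in s   # True
0.3 in s         # False: membership is exact equality, as Robert warns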
Re: [Numpy-discussion] efficient way to manage a set of floats?
Dr. Phillip M. Feldman wrote:
> Anne Archibald-2 wrote:
>> on a 32-bit machine, the space overhead is roughly a 32-bit object
>> pointer or two for each float, plus about twice the number of floats
>> times 32-bit pointers for the table.
>
> Hello Anne,
>
> I'm a bit confused by the above. It sounds as though the hash table
> approach might occupy 4 times as much storage as a single array. Is
> that right?
>
> Also, I don't understand why hashing would be useful for the set
> application.
>
> It seems as though a red-black tree might be a good implementation for
> a set of floats if all that one wants to do is add, delete, and test
> membership. (I will also need to do unions.)

A couple questions:

How many floats will you be storing?

When you test for membership, will you want to allow for a numerical
tolerance, so that if the value 1 - 0.7 is added to the set, a test for
the value 0.3 returns True? (0.3 is actually 0.29999999999999999, while
1 - 0.7 is 0.30000000000000004)

Warren
Re: [Numpy-discussion] efficient way to manage a set of floats?
Anne Archibald-2 wrote:
> on a 32-bit machine, the space overhead is roughly a 32-bit object
> pointer or two for each float, plus about twice the number of floats
> times 32-bit pointers for the table.

Hello Anne,

I'm a bit confused by the above. It sounds as though the hash table
approach might occupy 4 times as much storage as a single array. Is that
right?

Also, I don't understand why hashing would be useful for the set
application.

It seems as though a red-black tree might be a good implementation for a
set of floats if all that one wants to do is add, delete, and test
membership. (I will also need to do unions.)

Thanks for the help!

Phillip

--
View this message in context:
http://old.nabble.com/efficient-way-to-manage-a-set-of-floats--tp28518014p28519085.html
Sent from the Numpy-discussion mailing list archive at Nabble.com.
Re: [Numpy-discussion] efficient way to manage a set of floats?
On 10 May 2010 18:53, Dr. Phillip M. Feldman wrote:
> I have an application that involves managing sets of floats. I can use
> Python's built-in set type, but a data structure that is optimized for
> fixed-size objects that can be compared without hashing should be more
> efficient than a more general set construct. Is something like this
> available?

You might not find this as useful as you think: on a 32-bit machine, the
space overhead is roughly a 32-bit object pointer or two for each float,
plus about twice the number of floats times 32-bit pointers for the
table. And hashing might be worthwhile anyway - you could easily have a
series of floats with related bit patterns that you'd want to scatter
all over hash space. Plus, Python's set object has seen a fair amount of
performance tweaking.

That said, numpy supports many operations that use a sorted 1-D array to
represent a set of floats. There's searchsorted for lookup, plus union
and intersection operators; I'm not sure about set difference. The big
thing missing is updates, though if you can batch them, concatenation
followed by sorting should be reasonable. Removal can be done with fancy
indexing, though again batching is recommended. Maybe these should be
regarded as analogous to Python's frozensets.

In terms of speed, the numpy functions are obviously not as
asymptotically efficient as hash tables, though I suspect memory
coherence is more of an issue than O(1) versus O(log(n)). The numpy
functions allow vectorized lookups and vector operations on sets, which
could be handy.

Anne

P.S. If you want sets with fuzzy queries, it occurs to me that scipy's
kd-trees will actually do an okay job, and in compiled code. No updates
there either, though. -A
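[Editorial note: the sorted-array operations Anne mentions are available as numpy functions, and set difference does in fact exist as np.setdiff1d. A short sketch with illustrative data:]

```python
import numpy as np

a = np.array([0.1, 0.3, 4.2])   # sorted 1-D arrays as "frozensets"
b = np.array([0.3, 2.7])

u = np.union1d(a, b)        # array([0.1, 0.3, 2.7, 4.2])
i = np.intersect1d(a, b)    # array([0.3])
d = np.setdiff1d(a, b)      # array([0.1, 4.2])

# exact membership via binary search on the sorted array
pos = np.searchsorted(a, 0.3)
found = pos < a.size and a[pos] == 0.3   # True
```

As Anne notes, these are batch operations: there is no cheap single-element insert or delete, so updates are best accumulated and applied with concatenate-and-sort or fancy indexing.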
[Numpy-discussion] efficient way to manage a set of floats?
I have an application that involves managing sets of floats. I can use
Python's built-in set type, but a data structure that is optimized for
fixed-size objects that can be compared without hashing should be more
efficient than a more general set construct. Is something like this
available?

--
View this message in context:
http://old.nabble.com/efficient-way-to-manage-a-set-of-floats--tp28518014p28518014.html
Sent from the Numpy-discussion mailing list archive at Nabble.com.