Re: [Numpy-discussion] Histograms of extremely large data sets

2006-12-14 Thread Rick White
On Dec 14, 2006, at 2:56 AM, Cameron Walsh wrote: At some point I might try and test different cache sizes for different data-set sizes and see what the effect is. For now, 65536 seems a good number and I would be happy to see this replace the current numpy.histogram. I experimented a
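The blocked approach under discussion can be sketched as follows: bin the data one cache-sized slice at a time and accumulate the counts. The helper name `chunked_histogram` is my own; the 65536-element default block size is the figure quoted above.

```python
import numpy as np

def chunked_histogram(data, bins, block=65536):
    # Hypothetical helper sketching the blocked idea: each slice of
    # `block` elements is binned separately (staying cache-resident),
    # and the per-block counts are summed into one running total.
    counts = np.zeros(len(bins) - 1, dtype=np.intp)
    for start in range(0, data.size, block):
        c, _ = np.histogram(data[start:start + block], bins=bins)
        counts += c
    return counts
```

Because histogram counts are additive, the blocked result is identical to binning the whole array at once; only the memory-access pattern changes.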

Re: [Numpy-discussion] Histograms of extremely large data sets

2006-12-14 Thread Brian Granger
This same idea could be used to parallelize the histogram computation. Then you could really get into large (many GB/TB/PB) data sets. I might try to find time to do this with ipython1, but someone else could do this as well. Brian On 12/13/06, Rick White [EMAIL PROTECTED] wrote: On Dec 12,
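What makes the parallel version straightforward is that partial histograms computed on slices can simply be summed. A minimal thread-based sketch of that idea (the function name is hypothetical, and ipython1 itself is not used here; threads work because NumPy releases the GIL during the binning):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def parallel_histogram(data, bins, workers=4):
    # Split the data into one slice per worker, bin each slice
    # independently, then sum the partial counts: histogram counts
    # are additive, so the result equals the serial answer.
    chunks = np.array_split(data, workers)
    with ThreadPoolExecutor(workers) as ex:
        partials = ex.map(lambda c: np.histogram(c, bins=bins)[0], chunks)
    return sum(partials)
```

The same decomposition carries over directly to process- or cluster-based parallelism (as suggested with ipython1): only the partial count arrays need to travel back to be reduced.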

Re: [Numpy-discussion] Histograms of extremely large data sets

2006-12-14 Thread eric jones
Rick White wrote: Just so we don't get too smug about the speed, if I do this in IDL on the same machine it is 10 times faster (0.28 seconds instead of 4 seconds). I'm sure the IDL version uses the much faster approach of just sweeping through the array once, incrementing counts in the
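The single-sweep approach attributed to IDL can be illustrated in plain Python; a compiled implementation would run the same loop at C speed, touching each element exactly once. The helper name and the half-open equal-width bins are my own choices for the sketch.

```python
def onepass_histogram(data, lo, hi, nbins):
    # One sweep over the data: map each value to its bin index
    # arithmetically and increment that counter. Bins are half-open,
    # [lo + k*w, lo + (k+1)*w), with w = (hi - lo) / nbins.
    counts = [0] * nbins
    scale = nbins / (hi - lo)
    for x in data:
        i = int((x - lo) * scale)
        if 0 <= i < nbins:  # silently drop out-of-range values
            counts[i] += 1
    return counts
```

This only works as a single multiply per element when the bins are uniform; arbitrary bin edges need a search per element, which is where the sort/searchsorted and digitize approaches come in.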

Re: [Numpy-discussion] Histograms of extremely large data sets

2006-12-14 Thread David Huard
Hi, I spent some time a while ago on a histogram function for numpy. It uses digitize and bincount instead of sorting the data. If I remember right, it was significantly faster than numpy's histogram, but I don't know how it will behave with very large data sets. I attached the file if you
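The digitize/bincount idea can be sketched as below (the wrapper name is my own, and the attached file is not reproduced here): `np.digitize` labels each value with a bin index in one pass, and `np.bincount` tallies the integer labels in another.

```python
import numpy as np

def digitize_histogram(data, bins):
    # np.digitize returns 0 for values below bins[0], len(bins) for
    # values >= bins[-1], and i for values in [bins[i-1], bins[i]).
    idx = np.digitize(data, bins)
    counts = np.bincount(idx, minlength=len(bins) + 1)
    return counts[1:-1]  # drop the two out-of-range slots
```

Note the top-edge convention differs slightly from `np.histogram`, which includes values equal to the last bin edge in the final bin.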

Re: [Numpy-discussion] Histograms of extremely large data sets

2006-12-13 Thread Rick White
On Dec 12, 2006, at 10:27 PM, Cameron Walsh wrote: I'm trying to generate histograms of extremely large datasets. I've tried a few methods, listed below, all with their own shortcomings. Mailing-list archive and google searches have not revealed any solutions. The numpy.histogram function
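For context, the numpy.histogram of that era binned by sorting the data and locating each bin edge in the sorted array with `searchsorted`, as I understand it; a sketch of that technique (the helper name is mine):

```python
import numpy as np

def sorted_histogram(data, bins):
    # Sort once, then find where each bin edge would land in the
    # sorted array; consecutive differences of those positions are
    # the per-bin counts (values >= bins[-1] fall outside).
    s = np.sort(data)
    positions = s.searchsorted(bins)
    return positions[1:] - positions[:-1]
```

The O(n log n) sort is what makes this slow on extremely large arrays compared with a single O(n) counting sweep.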

Re: [Numpy-discussion] Histograms of extremely large data sets

2006-12-13 Thread Cameron Walsh
Hi all, Absolutely gorgeous! I confirm the 1.6x speed-up over the weave version, i.e. a 25x speed-up over the existing version. It would be good if the redefinition of the range function could be changed in the numpy modules, before it goes into subversion, to avoid the need for Rick's line