Re: [Numpy-discussion] categorical distributions
On 2010-11-22, at 2:51 AM, Hagen Fürstenau wrote:
> but this is bound to be inefficient as soon as the vector of probabilities gets large, especially if you want to draw multiple samples. Have I overlooked something, or should this be added?

I think you misunderstand the point of multinomial distributions. A sample from a multinomial is simply a sample of n i.i.d. categoricals, reported as the counts for each category across those n observations. It's very easy to recover the 'categorical' samples from a 'multinomial' sample:

import numpy as np

a = np.random.multinomial(50, [.3, .3, .4])
b = np.zeros(50, dtype=int)
upper = np.cumsum(a)
lower = upper - a
for value in range(len(a)):
    b[lower[value]:upper[value]] = value

# mix up the order, in place, if you care about them not being sorted
np.random.shuffle(b)

Then b is a sample from the corresponding 'categorical' distribution.

David
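A quick sanity check on the recovered samples (an addition, not part of the message above): counting b back into a histogram must reproduce the multinomial draw exactly, provided every category was drawn at least once so the bincount has full length:

np.all(np.bincount(b) == a)  # True: the counts of b match the multinomial draw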
Re: [Numpy-discussion] categorical distributions
ISTM that this elementary functionality deserves an implementation that's as fast as it can be. To substantiate this, I just wrote a simple implementation of categorical in numpy/random/mtrand.pyx. It's more than 8x faster than your version for multiple samples of the same distribution, and more than 3x faster than using multinomial(1, ...) for multiple samples of different distributions (each case tested with 1000 samples drawn from distributions over 1000 categories). I can provide it as a patch if there's any interest.

- Hagen
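Hagen's patch itself isn't shown in the thread. A minimal pure-NumPy sketch of the single-distribution fast path (an assumption about the approach, not his actual Cython code; it is essentially the searchsorted trick quoted next in the thread) is an inverse-CDF lookup:

import numpy as np

def categorical(p, size):
    # build the cumulative distribution once, then binary-search one
    # uniform draw per sample into it
    cdf = np.cumsum(p, dtype=float)
    cdf /= cdf[-1]  # normalize; guards against rounding drift
    return cdf.searchsorted(np.random.random(size))

samples = categorical([0.3, 0.3, 0.4], size=1000)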
Re: [Numpy-discussion] categorical distributions
On Mon, Nov 22, 2010 at 6:05 AM, Hagen Fürstenau <ha...@zhuliguan.net> wrote:
> ISTM that this elementary functionality deserves an implementation that's as fast as it can be. [...] I can provide it as a patch if there's any interest.

Can you compare the speed of your cython solution with the version of Chuck? For instance, weight 0..3 by 1..4, then:

In [14]: w = arange(1,5)
In [15]: p = cumsum(w)/float(w.sum())
In [16]: bincount(p.searchsorted(random(1000000)))/1e6
Out[16]: array([ 0.100336,  0.200382,  0.299132,  0.40015 ])

(from the numpy mailing list thread "Weighted random integers", Sep. 10)

Using searchsorted had roughly a 10x speedup compared to my multinomial version.

Josef
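For Hagen's other benchmark case, one draw each from many different distributions, the same searchsorted idea vectorizes row-wise. A sketch (not code from the thread):

import numpy as np

def categorical_rows(p):
    # p: (n, k) array with one weight vector per row; returns n draws
    cdf = np.cumsum(p, axis=1)
    u = np.random.random(p.shape[0]) * cdf[:, -1]  # scale u to each row's total
    return (u[:, None] > cdf).sum(axis=1)          # index of the bin u falls in

p = np.random.rand(1000, 1000)  # 1000 distributions over 1000 categories
draws = categorical_rows(p)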
Re: [Numpy-discussion] Slow element-by-element access?
On Mon, Nov 22, 2010 at 2:30 AM, Pauli Virtanen <p...@iki.fi> wrote:
> Sun, 21 Nov 2010 23:26:37 -0700, Charles R Harris wrote:
> [clip]
>> Yes, indexing is known to be slow, although I don't recall the precise reason for that. Something to do with the way integers are handled, or some such. There was some discussion on the list many years ago...
>
> It could be useful if someone spent time on profiling the call overhead for integer indexing. From what I remember from that part of the code, I'm fairly sure it can be optimized.

I was thinking the same. Profiling the code could be very useful.

Chuck
Re: [Numpy-discussion] Does np.std() make two passes through the data?
On Sun, Nov 21, 2010 at 5:56 PM, Robert Kern <robert.k...@gmail.com> wrote:
> On Sun, Nov 21, 2010 at 19:49, Keith Goodman <kwgood...@gmail.com> wrote:
>> But this sample gives a difference:
>>
>> a = np.random.rand(100)
>> a.var()
>> 0.080232196646619805
>> var(a)
>> 0.080232196646619791
>>
>> As you know, I'm trying to make a drop-in replacement for scipy.stats.nanstd. Maybe I'll have to add an asterisk to the drop-in part. Either that, or suck it up and store the damn mean.
>
> The difference is less than eps. Quite possibly, the one-pass version is even closer to the true value than the two-pass version.

I wrote 3 cython prototype implementations of nanstd for 1d float64 arrays:

>> a = np.random.rand(1000000)

# numpy; doesn't take care of NaNs
>> a.std()
0.28852169850186793

# cython of np.sqrt(((arr - arr.mean())**2).mean())
>> nanstd_twopass(a, ddof=0)
0.28852169850186798

# cython of np.sqrt((arr*arr).mean() - arr.mean()**2)
>> nanstd_simple(a, ddof=0)
0.28852169850187437

# http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#On-line_algorithm
>> nanstd_online(a, ddof=0)
0.28852169850187243

# My target, the scipy version
>> from scipy.stats import nanstd
>> nanstd(a, bias=True)
0.28852169850186798

Timing:

>> timeit nanstd(a, bias=True)
10 loops, best of 3: 27.8 ms per loop
>> timeit a.std()
100 loops, best of 3: 11.5 ms per loop
>> timeit nanstd_twopass(a, ddof=0)
100 loops, best of 3: 3.24 ms per loop
>> timeit nanstd_simple(a, ddof=0)
1000 loops, best of 3: 1.6 ms per loop
>> timeit nanstd_online(a, ddof=0)
100 loops, best of 3: 10.8 ms per loop

nanstd_simple is the fastest, but I assume the algorithm is no good for general use? I think I'll go with nanstd_twopass: it will most closely match numpy/scipy, is more robust than nanstd_simple, and is the second fastest.

Here's the code. Improvements welcomed.

@cython.boundscheck(False)
@cython.wraparound(False)
def nanstd_simple(np.ndarray[np.float64_t, ndim=1] a, int ddof):
    """nanstd of 1d numpy array with dtype=np.float64 along axis=0."""
    cdef Py_ssize_t i
    cdef int a0 = a.shape[0], count = 0
    cdef np.float64_t asum = 0, a2sum = 0, ai
    for i in range(a0):
        ai = a[i]
        if ai == ai:  # ai is not NaN
            asum += ai
            a2sum += ai * ai
            count += 1
    if count > 0:
        asum = asum * asum
        return sqrt((a2sum - asum / count) / (count - ddof))
    else:
        return np.float64(NAN)

@cython.boundscheck(False)
@cython.wraparound(False)
def nanstd_online(np.ndarray[np.float64_t, ndim=1] a, int ddof):
    """nanstd of 1d numpy array with dtype=np.float64 along axis=0."""
    cdef Py_ssize_t i
    cdef int a0 = a.shape[0], n = 0
    cdef np.float64_t mean = 0, M2 = 0, delta, x
    for i in range(a0):
        x = a[i]
        if x == x:  # x is not NaN
            n += 1
            delta = x - mean
            mean = mean + delta / n
            M2 = M2 + delta * (x - mean)
    if n > 0:
        return np.float64(sqrt(M2 / (n - ddof)))
    else:
        return np.float64(NAN)

@cython.boundscheck(False)
@cython.wraparound(False)
def nanstd_twopass(np.ndarray[np.float64_t, ndim=1] a, int ddof):
    """nanstd of 1d numpy array with dtype=np.float64 along axis=0."""
    cdef Py_ssize_t i
    cdef int a0 = a.shape[0], count = 0
    cdef np.float64_t asum = 0, a2sum = 0, amean, ai, da
    for i in range(a0):
        ai = a[i]
        if ai == ai:  # ai is not NaN
            asum += ai
            count += 1
    if count > 0:
        amean = asum / count
        asum = 0
        for i in range(a0):
            ai = a[i]
            if ai == ai:
                da = ai - amean
                asum += da
                a2sum += (da * da)
        asum = asum * asum
        return sqrt((a2sum - asum / count) / (count - ddof))
    else:
        return np.float64(NAN)
Re: [Numpy-discussion] Does np.std() make two passes through the data?
On Mon, Nov 22, 2010 at 11:03 AM, Keith Goodman <kwgood...@gmail.com> wrote:
> [...]
> nanstd_simple is the fastest, but I assume the algorithm is no good for general use? I think I'll go with nanstd_twopass: it will most closely match numpy/scipy, is more robust than nanstd_simple, and is the second fastest.

I wonder how the results would change if the size of the array was larger than the processor cache?

I still can't seem to wrap my head around the idea that a two-pass algorithm would be faster than a single-pass. Is this just a big-O thing where sometimes one algorithm will be faster than the other based on the size of the problem?

Ben Root
Re: [Numpy-discussion] Does np.std() make two passes through the data?
On Mon, Nov 22, 2010 at 12:07 PM, Benjamin Root <ben.r...@ou.edu> wrote:
> [...]
> I wonder how the results would change if the size of the array was larger than the processor cache?
>
> I still can't seem to wrap my head around the idea that a two-pass algorithm would be faster than a single-pass. Is this just a big-O thing where sometimes one algorithm will be faster than the other based on the size of the problem?

nanstd_online requires too many divisions according to the Wikipedia article, and is useful mainly if the array doesn't fit in memory.

Two pass would provide the precision that we would expect in numpy, but I don't know if anyone ever tested the NIST problems for basic statistics.

Josef
Re: [Numpy-discussion] Does np.std() make two passes through the data?
On Mon, Nov 22, 2010 at 9:13 AM, <josef.p...@gmail.com> wrote:
> Two pass would provide the precision that we would expect in numpy, but I don't know if anyone ever tested the NIST problems for basic statistics.

Here are the results for their most difficult dataset. But I guess running one test doesn't mean anything:

http://www.itl.nist.gov/div898/strd/univ/addinfo/numacc4.html

>> np.absolute(a.std(ddof=1) - 0.1)
5.5884095961911129e-10
>> np.absolute(nanstd_online(a, ddof=1) - 0.1)
5.5890501948763216e-10
>> np.absolute(nanstd_simple(a, ddof=1) - 0.1)
nan  # Ha!
>> np.absolute(nanstd_twopass(a, ddof=1) - 0.1)
5.5879308125117433e-10
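For anyone who wants to reproduce this, the NumAcc4 data are simple to build in NumPy. This is a reconstruction from the NIST StRD description (certified mean 10000000.2, certified standard deviation 0.1), so double-check it against the page above before relying on it:

import numpy as np

# NumAcc4: 10000000.2 followed by 500 alternating pairs of
# 10000000.1 and 10000000.3 (1001 observations in total)
a = np.empty(1001)
a[0] = 10000000.2
a[1::2] = 10000000.1
a[2::2] = 10000000.3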
Re: [Numpy-discussion] Does np.std() make two passes through the data?
On Mon, Nov 22, 2010 at 10:51 AM, <josef.p...@gmail.com> wrote:
> [Keith Goodman, with the nanstd_twopass posted above trimmed:] This is 5% faster:
>
> @cython.boundscheck(False)
> @cython.wraparound(False)
> def nanstd_1d_float64_axis0_2(np.ndarray[np.float64_t, ndim=1] a, int ddof):
>     """nanstd of 1d numpy array with dtype=np.float64 along axis=0."""
>     cdef Py_ssize_t i
>     cdef int a0 = a.shape[0], count = 0
>     cdef np.float64_t asum = 0, amean, ai
>     for i in range(a0):
>         ai = a[i]
>         if ai == ai:  # ai is not NaN
>             asum += ai
>             count += 1
>     if count > 0:
>         amean = asum / count
>         asum = 0
>         for i in range(a0):
>             ai = a[i]
>             if ai == ai:
>                 ai -= amean
>                 asum += (ai * ai)
>         return sqrt(asum / (count - ddof))
>     else:
>         return np.float64(NAN)
>
> [josef:] I think it would be better to write nanvar instead of nanstd and take the square root only in a delegating nanstd, instead of the other way around. (Also a change that should be made in scipy.stats.)
>
> [Keith:] Yeah, I noticed that numpy does that. I was planning to have separate var and std functions. Here's why (from the readme file, but maybe I should template it; the sqrt automatically converts large ddof to NaN):
>
> [josef:] I'm not sure what you are saying; dropping the square root in the function doesn't require nan handling in the inner loop. If you want to return nan when count - ddof <= 0, then you could just replace "if count > 0: ..." by "if count - ddof > 0: ...". Or am I missing the point?

Yes, sorry. Ignore the sqrt/nan comment. The point is that I want to be able to return the underlying, low-level cython function (see below). So I need separate var and std versions to do that (unless I can modify default kwargs on the fly).

From the readme:

Under the hood Nanny uses a separate Cython function for each combination of ndim, dtype, and axis. A lot of the overhead in ny.nanmax, for example, is in checking that your axis is within range, converting non-array data to an array, and selecting the function to use to calculate nanmax. You can get rid of the overhead by doing all this before you, say, enter an inner loop:

>> arr = np.random.rand(10, 10)
>> axis = 0
>> func, a = ny.func.nanmax_selector(arr, axis)
>> func.__name__
'nanmax_2d_float64_axis0'

Let's see how much faster that runs:

>> timeit np.nanmax(arr, axis=0)
10000 loops, best of 3: 25.7 us per loop
>> timeit ny.nanmax(arr, axis=0)
100000 loops, best of 3: 5.25 us per loop
>> timeit func(a)
100000 loops, best of 3: 2.5 us per loop

Note that func is faster than Numpy's non-nan version of max:

>> timeit arr.max(axis=0)
100000 loops, best of 3: 3.28 us per loop

So adding NaN protection to your inner loops has a negative cost!
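The selector pattern is worth spelling out: all of the argument checking and function dispatch happens once, up front, and the hot loop then calls the specialized kernel directly. A minimal sketch of the idea (a reconstruction of the pattern, not Nanny's actual code; the nanmax_2d_float64_axis0 stand-in here is plain NumPy rather than a compiled Cython kernel):

import numpy as np

def nanmax_2d_float64_axis0(a):
    # stand-in for the specialized Cython kernel
    return np.nanmax(a, axis=0)

# dispatch table keyed on (ndim, dtype, axis), built once at import time
func_dict = {(2, np.float64, 0): nanmax_2d_float64_axis0}

def nanmax_selector(arr, axis):
    # do the conversion, checks, and lookup once, outside any hot loop
    a = np.asarray(arr)
    return func_dict[(a.ndim, a.dtype.type, axis)], a

func, a = nanmax_selector(np.random.rand(10, 10), axis=0)
result = func(a)  # later calls pay no dispatch overhead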
Re: [Numpy-discussion] Does np.std() make two passes through the data?
Keith Goodman wrote:
> [...]
> Yeah, I noticed that numpy does that. I was planning to have separate var and std functions. Here's why (from the readme file, but maybe I should template it; the sqrt automatically converts large ddof to NaN):

I don't think that works for complex numbers. (statsmodels has now a preference that calculations work also for complex numbers.)

Josef
Re: [Numpy-discussion] Does np.std() make two passes through the data?
On Mon, Nov 22, 2010 at 11:00 AM, <josef.p...@gmail.com> wrote:
> I don't think that works for complex numbers. (statsmodels has now a preference that calculations work also for complex numbers.)

I'm only supporting int32, int64, and float64 for now. Getting the other ints and floats should be easy. I don't have plans for complex numbers. If it's not a big change, then someone can add support later.
Re: [Numpy-discussion] Does np.std() make two passes through the data?
On Mon, Nov 22, 2010 at 1:59 PM, Keith Goodman <kwgood...@gmail.com> wrote:
> [...]
> Yes, sorry. Ignore the sqrt/nan comment. The point is that I want to be able to return the underlying, low-level cython function (see below). So I need separate var and std versions to do that (unless I can modify default kwargs on the fly).

I think you could still delegate at the cython level, but given the amount of code duplication that is required, I don't expect it to make much difference for staying DRY.

Josef
Re: [Numpy-discussion] indexing question
On 11/21/10 11:37 AM, Ernest Adrogué wrote:
>> so you want t[:,x,y]
>
> I tried that, but it's not the same:
>
> In [307]: t[[0,1],x,y]
> Out[307]: array([1, 7])
>
> In [308]: t[:,x,y]
> Out[308]:
> array([[1, 3],
>        [5, 7]])

What is your t? Here's my example, which I think matches what you asked for:

In [1]: import numpy as np

In [2]: a = np.arange(12)

In [3]: a.shape = (3,2,2)

In [4]: a
Out[4]:
array([[[ 0,  1],
        [ 2,  3]],

       [[ 4,  5],
        [ 6,  7]],

       [[ 8,  9],
        [10, 11]]])

In [5]: a[:,1,0]
Out[5]: array([ 2,  6, 10])

--
Christopher Barker, Ph.D.
Oceanographer, Emergency Response Division, NOAA/NOS/ORR
chris.bar...@noaa.gov
Re: [Numpy-discussion] Does np.std() make two passes through the data?
On Mon, Nov 22, 2010 at 2:04 PM, Keith Goodman <kwgood...@gmail.com> wrote:
> On Mon, Nov 22, 2010 at 11:00 AM, <josef.p...@gmail.com> wrote:
>> I don't think that works for complex numbers. (statsmodels has now a preference that calculations work also for complex numbers.)
>
> I'm only supporting int32, int64, and float64 for now. Getting the other ints and floats should be easy. I don't have plans for complex numbers. If it's not a big change, then someone can add support later.

Fair enough, but if you need numerical derivatives, then complex support looks useful.

Josef
Re: [Numpy-discussion] indexing question
I didn't realize the x's and y's were varying the first time around. There's probably a way to omit it, but I think the conceptually simplest way is probably what you had to begin with. Build an index by saying

i = numpy.arange(0, t.shape[0])

then you can do

t[i, x, y]

On Mon, Nov 22, 2010 at 11:08 AM, Christopher Barker <chris.bar...@noaa.gov> wrote:
> [...]
Re: [Numpy-discussion] ANN: NumPy 1.5.1
On 11/20/10 11:04 PM, Ralf Gommers wrote:
> I am pleased to announce the availability of NumPy 1.5.1. Binaries, sources and release notes can be found at https://sourceforge.net/projects/numpy/files/.
>
> Thank you to everyone who contributed to this release.

Yes, thanks so much -- in particular, thanks to the team that built the OS-X binaries -- looks like a complete set!

-Chris

--
Christopher Barker, Ph.D.
Oceanographer, Emergency Response Division, NOAA/NOS/ORR
chris.bar...@noaa.gov
Re: [Numpy-discussion] ANN: NumPy 1.5.1
Hi,

On Mon, Nov 22, 2010 at 11:35 AM, Christopher Barker <chris.bar...@noaa.gov> wrote:
> [...]
> Yes, thanks so much -- in particular, thanks to the team that built the OS-X binaries -- looks like a complete set!

Many thanks from me too - particularly for clearing up that annoying numpy-distutils scipy build problem.

Cheers,

Matthew
[Numpy-discussion] N dimensional dichotomy optimization
Hi list,

Does anybody have, or know where I can find, some N-dimensional dichotomy optimization code in Python (BSD licensed, or equivalent)?

Worst case, it does not look too bad to code, but I am interested in any advice. I haven't done my reading yet, and I don't know how ill-posed a problem it is. I had in mind starting from a set of points and iterating the computation of the objective function's value at the barycenters of these points, and updating this list of points. This does raise a few questions on what the best possible updates are.

Cheers,

Gael
Re: [Numpy-discussion] indexing question
22/11/10 @ 11:08 (-0800), thus spake Christopher Barker:
> [...]
> In [5]: a[:,1,0]
> Out[5]: array([ 2,  6, 10])

This works with scalar indices, but not with arrays. The problem is that I don't always want the same element from each subarray, but an arbitrary element, say the (1,0) from the first, the (0,0) from the second, and so on, so I have to use arrays.

--
Ernest
Re: [Numpy-discussion] indexing question
2010/11/21 Ernest Adrogué <eadro...@gmx.net>:
> Hi,
>
> Suppose an array of shape (N,2,2), that is, N arrays of shape (2,2). I want to select an element (x,y) from each one of the subarrays, so I get a 1-dimensional array of length N. For instance:
>
> In [228]: t = np.arange(8).reshape(2,2,2)
>
> In [229]: t
> Out[229]:
> array([[[0, 1],
>         [2, 3]],
>
>        [[4, 5],
>         [6, 7]]])
>
> In [230]: x = [0,1]
>
> In [231]: y = [1,1]
>
> In [232]: t[[0,1],x,y]
> Out[232]: array([1, 7])
>
> This way, I get the elements (0,1) and (1,1), which is what I wanted. The question is: is it possible to omit the [0,1] in the index?

No, but you can write generic code for it:

t[np.arange(t.shape[0]), x, y]

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth."
-- Umberto Eco
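To make the pattern concrete, here is a self-contained restatement of the example above (variable values taken from Ernest's session):

import numpy as np

t = np.arange(8).reshape(2, 2, 2)
x = np.array([0, 1])
y = np.array([1, 1])

picked = t[np.arange(t.shape[0]), x, y]  # element (x[k], y[k]) of subarray k
# picked is array([1, 7]), the same as t[[0, 1], x, y]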
Re: [Numpy-discussion] indexing question
22/11/10 @ 11:20 (-0800), thus spake John Salvatier:
> I didn't realize the x's and y's were varying the first time around. There's probably a way to omit it, but I think the conceptually simplest way is probably what you had to begin with. Build an index by saying
>
> i = numpy.arange(0, t.shape[0])
>
> then you can do
>
> t[i, x, y]

Exactly. I was just wondering if I can speed this up by omitting building the arange array. This is inside a function that gets called a lot, so I suppose it would make a difference if I can get rid of it.

--
Ernest
Re: [Numpy-discussion] N dimensional dichotomy optimization
2010/11/22 Gael Varoquaux <gael.varoqu...@normalesup.org>:
> Hi list,

Hi ;)

> Does anybody have, or know where I can find, some N-dimensional dichotomy optimization code in Python (BSD licensed, or equivalent)?

I don't know any code, but it shouldn't be too difficult, e.g. by going through a KdTree.

> Worst case, it does not look too bad to code, but I am interested in any advice. I haven't done my reading yet, and I don't know how ill-posed a problem it is. I had in mind starting from a set of points and iterating the computation of the objective function's value at the barycenters of these points, and updating this list of points. This does raise a few questions on what the best possible updates are.

In this case, you may want to check the Nelder-Mead algorithm (also known as downhill simplex or polytope), which is available in scikits.optimization, but there are other implementations out there.

Cheers ;)

Matthieu
--
Information System Engineer, Ph.D.
Blog: http://matt.eifelle.com
LinkedIn: http://www.linkedin.com/in/matthieubrucher
Re: [Numpy-discussion] indexing question
22/11/10 @ 14:04 (-0600), thus spake Robert Kern:
>> This way, I get the elements (0,1) and (1,1), which is what I wanted. The question is: is it possible to omit the [0,1] in the index?
>
> No, but you can write generic code for it:
>
> t[np.arange(t.shape[0]), x, y]

Thank you. This is what I wanted to know.

--
Ernest
Re: [Numpy-discussion] indexing question
I think that the only speedup you will get is defining an index only once and reusing it.

2010/11/22 Ernest Adrogué <eadro...@gmx.net>:
> [...]
> Thank you. This is what I wanted to know.
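Concretely, hoisting the index out of the hot function looks like this (a sketch with names chosen for illustration):

import numpy as np

t = np.arange(8).reshape(2, 2, 2)
i = np.arange(t.shape[0])  # built once, up front

def pick(t, x, y, i=i):
    # reuses the precomputed arange on every call
    return t[i, x, y]

pick(t, [0, 1], [1, 1])  # array([1, 7])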
Re: [Numpy-discussion] N dimensional dichotomy optimization
On Mon, Nov 22, 2010 at 09:12:45PM +0100, Matthieu Brucher wrote:
> Hi ;)

Hi bro

> I don't know any code, but it shouldn't be too difficult, e.g. by going through a KdTree.

I am not in terribly high-dimensional spaces, so I don't really need to use a KdTree (but we do happen to have a nice BallTree available in the scikit-learn, so I could use it just to play :).

> In this case, you may want to check the Nelder-Mead algorithm (also known as downhill simplex or polytope), which is available in scikits.optimization, but there are other implementations out there.

Interesting reference. I had never looked at the Nelder-Mead algorithm. I am wondering if it does what I want, though. The reason I am looking at dichotomy optimization is that the objective function that I want to optimize has local roughness, but is globally pretty much a bell-shaped curve. Dichotomy looks like it will get quite close to the top of the curve (and I have been told that it works well on such problems). One thing that is nice with dichotomy for my needs is that it is not based on a gradient, and it works in a convex of the parameter space.

Will Nelder-Mead display such properties? It seems so to me, but I don't trust my quick read-through of Wikipedia.

I realize that maybe I should rephrase my question to try and draw more out of the common wealth of knowledge on this mailing list: what do people suggest to tackle this problem? Guided by Matthieu's suggestion, I have started looking at Powell's algorithm, and given the introduction of www.damtp.cam.ac.uk/user/na/NA_papers/NA2007_03.pdf I am wondering whether I should not investigate it. Can people provide any insights on these problems?

Many thanks,

Gael

PS: The reason I am looking at this optimization problem is that I got tired of watching grid searches optimize the cross-validation score on my 3-parameter estimator (not in the scikit-learn, because it is way too specific to my problems).
Re: [Numpy-discussion] N dimensional dichotomy optimization
2010/11/22 Gael Varoquaux <gael.varoqu...@normalesup.org>:
> I am not in terribly high-dimensional spaces, so I don't really need to use a KdTree (but we do happen to have a nice BallTree available in the scikit-learn, so I could use it just to play :).

:D

> [...] One thing that is nice with dichotomy for my needs is that it is not based on a gradient, and it works in a convex of the parameter space.

It seems that a simplex is what you need. It uses the barycenter (more or less) to find a new point in the simplex. And it works well only on convex functions (but in fact almost all functions have an issue with this :D).

> Will Nelder-Mead display such properties? It seems so to me, but I don't trust my quick read-through of Wikipedia.

Yes, it doesn't need a gradient, and if the function is convex, it works in a convex subset of the parameter space.

> I realize that maybe I should rephrase my question [...] Guided by Matthieu's suggestion, I have started looking at Powell's algorithm, and given the introduction of www.damtp.cam.ac.uk/user/na/NA_papers/NA2007_03.pdf I am wondering whether I should not investigate it. Can people provide any insights on these problems?

Indeed, Powell may also be a solution. A simplex is just what is closest to what you hinted at as an optimization algorithm.

> Many thanks,

You're welcome ;)

> PS: The reason I am looking at this optimization problem is that I got tired of watching grid searches optimize the cross-validation score on my 3-parameter estimator (not in the scikit-learn, because it is way too specific to my problems).

Perhaps you may want to combine it with genetic algorithms. We also kind of combine grid search with a simplex-based optimizer and simulated annealing in some of our global optimization problems, and I think I'll try at one point to introduce genetic algorithms instead of the grid search. Your problem is simpler, though, if it displays some convexity.

Matthieu
--
Information System Engineer, Ph.D.
Blog: http://matt.eifelle.com
LinkedIn: http://www.linkedin.com/in/matthieubrucher
Re: [Numpy-discussion] N dimensional dichotomy optimization
On Mon, Nov 22, 2010 at 11:12:26PM +0100, Matthieu Brucher wrote:
> It seems that a simplex is what you need.

Ha! I am learning new fancy words. Now I can start looking clever.

> Indeed, Powell may also be a solution. A simplex is just what is closest to what you hinted at as an optimization algorithm.

I'll do a bit more reading.

> Perhaps you may want to combine it with genetic algorithms. We also kind of combine grid search with a simplex-based optimizer and simulated annealing in some of our global optimization problems, and I think I'll try at one point to introduce genetic algorithms instead of the grid search.

Well, in the scikit, in the long run (it will take a little while) I'd like to expose other optimization methods than the GridSearchCV, so if you have code or advice to give us, we'd certainly be interested.

Gael
Re: [Numpy-discussion] N dimensional dichotomy optimization
2010/11/22 Gael Varoquaux <gael.varoqu...@normalesup.org>:
> [...]
> Well, in the scikit, in the long run (it will take a little while) I'd like to expose other optimization methods than the GridSearchCV, so if you have code or advice to give us, we'd certainly be interested.

There is scikits.optimization partly in the externals :D But I don't think they should be in scikits.learn directly. Of course, the scikit may need access to some global optimization methods, but the most used one is already there (the grid search). Then for genetic algorithms, pyevolve is pretty much all you want (I still have to check the multiprocessing part).

Matthieu
--
Information System Engineer, Ph.D.
Blog: http://matt.eifelle.com
LinkedIn: http://www.linkedin.com/in/matthieubrucher
Re: [Numpy-discussion] N dimensional dichotomy optimization
On Mon, Nov 22, 2010 at 11:12:26PM +0100, Matthieu Brucher wrote:
> It seems that a simplex is what you need. It uses the barycenter (more or less) to find a new point in the simplex. And it works well only on convex functions (but in fact almost all functions have an issue with this :D).

One last question, now that I know that what I am looking for is a simplex algorithm (it indeed corresponds to what I was after): is there a reason not to use optimize.fmin? It implements a Nelder-Mead. I must admit that I don't see how I can use it to specify the convex hull of the parameters in which it operates, or restrict it to work only on integers, which are two things that I may want to do.

Gaël
Re: [Numpy-discussion] N dimensional dichotomy optimization
On Mon, Nov 22, 2010 at 5:27 PM, Matthieu Brucher <matthieu.bruc...@gmail.com> wrote:
> [...]
> Then for genetic algorithms, pyevolve is pretty much all you want (I still have to check the multiprocessing part).

Is that license, http://pyevolve.sourceforge.net/license.html , BSD compatible?

Josef
Re: [Numpy-discussion] Matlab IO Warning in mio5.py
> It's not an error but a harmless (although confusing) warning message. You should be able to filter it by adding the following to scipy/__init__.py:
>
> import warnings
> warnings.filterwarnings(action='ignore', message='.*__builtin__.file size changed.*')
>
> Can you check if that works for you?

Worked on my system! OS X 10.6.5, Python 2.6, NumPy 1.5.1, SciPy 0.8.
Re: [Numpy-discussion] N dimensional dichotomy optimization
2010/11/22 Gael Varoquaux <gael.varoqu...@normalesup.org>:
> One last question, now that I know that what I am looking for is a simplex algorithm (it indeed corresponds to what I was after): is there a reason not to use optimize.fmin? It implements a Nelder-Mead. I must admit that I don't see how I can use it to specify the convex hull of the parameters in which it operates, or restrict it to work only on integers, which are two things that I may want to do.

optimize.fmin can be enough; I don't know it well enough. Nelder-Mead is not a constrained optimization algorithm, so you can't specify an outer hull. As for the integer part, I don't know if optimize.fmin is type consistent; I don't know if scikits.optimization is either, but I can check it. It should be, as there is nothing impeding it.

Matthieu
--
Information System Engineer, Ph.D.
Blog: http://matt.eifelle.com
LinkedIn: http://www.linkedin.com/in/matthieubrucher
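For the two restrictions Gaël asked about, a common workaround is the penalty-plus-rounding trick (a generic sketch, not a feature of optimize.fmin and not code from the thread): round inside the objective and return a large value outside the feasible region:

import numpy as np
from scipy import optimize

def objective(params):
    p = np.round(params)                 # evaluate only at integer points
    if np.any(p < 0) or np.any(p > 10):  # hypothetical box-shaped hull
        return 1e12                      # heavy penalty outside the region
    return np.sum((p - np.array([3, 5, 7])) ** 2)  # toy objective, minimum at (3, 5, 7)

xopt = optimize.fmin(objective, x0=[5.0, 5.0, 5.0])

Note that the rounding creates flat plateaus that can stall a simplex, so this is a crude device rather than a true integer or constrained optimizer.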
[Numpy-discussion] numpy.test() errors
Hi all,

There are some new test errors:

======================================================================
ERROR: Test with missing and filling values
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/data/home/nwagner/local/lib/python2.5/site-packages/numpy/lib/tests/test_io.py", line 947, in test_user_filling_values
    test = np.genfromtxt(StringIO(data), **kwargs)
  File "/data/home/nwagner/local/lib/python2.5/site-packages/numpy/lib/npyio.py", line 1285, in genfromtxt
    key = names.index(key)
AttributeError: 'tuple' object has no attribute 'index'

======================================================================
ERROR: test_user_missing_values (test_io.TestFromTxt)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/data/home/nwagner/local/lib/python2.5/site-packages/numpy/lib/tests/test_io.py", line 931, in test_user_missing_values
    **basekwargs)
  File "/data/home/nwagner/local/lib/python2.5/site-packages/numpy/lib/npyio.py", line 1657, in mafromtxt
    return genfromtxt(fname, **kwargs)
  File "/data/home/nwagner/local/lib/python2.5/site-packages/numpy/lib/npyio.py", line 1285, in genfromtxt
    key = names.index(key)
AttributeError: 'tuple' object has no attribute 'index'

======================================================================
ERROR: test_io.test_gzip_load
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/data/home/nwagner/local/lib/python2.5/site-packages/nose-0.11.1-py2.5.egg/nose/case.py", line 183, in runTest
    self.test(*self.arg)
  File "/data/home/nwagner/local/lib/python2.5/site-packages/numpy/lib/tests/test_io.py", line 1255, in test_gzip_load
    assert_array_equal(np.load(f), a)
  File "/data/home/nwagner/local/lib/python2.5/site-packages/numpy/lib/npyio.py", line 327, in load
    fid = seek_gzip_factory(file)
  File "/data/home/nwagner/local/lib/python2.5/site-packages/numpy/lib/npyio.py", line 66, in seek_gzip_factory
    g.name = f.name
AttributeError: GzipFile instance has no attribute 'name'

----------------------------------------------------------------------
Ran 3066 tests in 12.458s

FAILED (KNOWNFAIL=4, errors=3)
<nose.result.TextTestResult run=3066 errors=3 failures=0>

>> numpy.__version__
'2.0.0.dev-12d0200'

Nils