Re: [Numpy-discussion] categorical distributions

2010-11-22 Thread David Warde-Farley
On 2010-11-22, at 2:51 AM, Hagen Fürstenau wrote:

 but this is bound to be inefficient as soon as the vector of
 probabilities gets large, especially if you want to draw multiple samples.
 
 Have I overlooked something or should this be added?

I think you misunderstand the point of multinomial distributions. A sample from 
a multinomial is simply a sample from N i.i.d. categoricals, reported as the 
counts for each category in those N observations. It's very easy to recover the 
'categorical' samples from a 'multinomial' sample.

import numpy as np
a = np.random.multinomial(50, [.3, .3, .4])
b = np.zeros(50, dtype=int)
upper = np.cumsum(a); lower = upper - a

for value in range(len(a)):
    b[lower[value]:upper[value]] = value
# mix up the order, in-place, if you care about them not being sorted
np.random.shuffle(b)

then b is a sample from the corresponding 'categorical' distribution.
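
As a quick sanity check on the snippet above (assuming a and b as defined
there), the per-category counts in b reproduce the multinomial draw:

# bincount(b) has length max(b)+1, so this check assumes every category
# was drawn at least once (nearly certain with these probabilities)
assert (np.bincount(b) == a).all()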

David
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] categorical distributions

2010-11-22 Thread Hagen Fürstenau
 ISTM that this elementary functionality deserves an implementation
 that's as fast as it can be.

To substantiate this, I just wrote a simple implementation of
categorical in numpy/random/mtrand.pyx and it's more than 8x faster
than your version for multiple samples of the same distribution and more
than 3x faster than using multinomial(1, ...) for multiple samples of
different distributions (each time tested with 1000 samples drawn from
distributions over 1000 categories).

I can provide it as a patch if there's any interest.

- Hagen

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] categorical distributions

2010-11-22 Thread josef . pktd
On Mon, Nov 22, 2010 at 6:05 AM, Hagen Fürstenau ha...@zhuliguan.net wrote:
 ISTM that this elementary functionality deserves an implementation
 that's as fast as it can be.

 To substantiate this, I just wrote a simple implementation of
 categorical in numpy/random/mtrand.pyx and it's more than 8x faster
 than your version for multiple samples of the same distribution and more
 than 3x faster than using multinomial(1, ...) for multiple samples of
 different distributions (each time tested with 1000 samples drawn from
 distributions over 1000 categories).

 I can provide it as a patch if there's any interest.

Can you compare the speed of your cython solution with the version of Chuck?

--
For instance, weight 0..3 by 1..4, then

In [14]: w = arange(1,5)

In [15]: p = cumsum(w)/float(w.sum())

In [16]: bincount(p.searchsorted(random(1000000)))/1e6
Out[16]: array([ 0.100336,  0.200382,  0.299132,  0.40015 ])

-
from the numpy mailing list thread "Weighted random integers", Sep. 2010.
Using searchsorted had roughly a 10 times speedup compared to my
multinomial version
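
Spelled out as a self-contained function (a sketch along the lines of
Chuck's snippet; the function name and the normalization step are mine):

import numpy as np

def sample_categorical(p, size):
    # inverse-CDF sampling: for each uniform draw, find the first
    # index whose cumulative probability exceeds it
    cdf = np.cumsum(p, dtype=float)
    cdf /= cdf[-1]  # normalize, so unnormalized weights work too
    return cdf.searchsorted(np.random.random(size))

np.bincount on the result recovers the empirical frequencies, as in the
transcript above.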

Josef



 - Hagen

 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Slow element-by-element access?

2010-11-22 Thread Charles R Harris
On Mon, Nov 22, 2010 at 2:30 AM, Pauli Virtanen p...@iki.fi wrote:

 Sun, 21 Nov 2010 23:26:37 -0700, Charles R Harris wrote:
 [clip]
  Yes, indexing is known to be slow, although I don't recall the precise
  reason for that. Something to do with the way integers are handled or
  some such. There was some discussion on the list many years ago...

 It could be useful if someone spent time on profiling the call overhead
 for integer indexing. From what I remember from that part of the code,
 I'm fairly sure it can be optimized.


I was thinking the same. Profiling the code could be very useful.

Chuck
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Does np.std() make two passes through the data?

2010-11-22 Thread Keith Goodman
On Sun, Nov 21, 2010 at 5:56 PM, Robert Kern robert.k...@gmail.com wrote:
 On Sun, Nov 21, 2010 at 19:49, Keith Goodman kwgood...@gmail.com wrote:

 But this sample gives a difference:

 a = np.random.rand(100)
 a.var()
   0.080232196646619805
 var(a)
   0.080232196646619791

 As you know, I'm trying to make a drop-in replacement for
 scipy.stats.nanstd. Maybe I'll have to add an asterisk to the drop-in
 part. Either that, or suck it up and store the damn mean.

 The difference is less than eps. Quite possibly, the one-pass version
 is even closer to the true value than the two-pass version.

I wrote 3 cython prototype implementations of nanstd for 1d float64 arrays:

>> a = np.random.rand(1000000)

# numpy; doesn't take care of NaNs
>> a.std()
   0.28852169850186793

# cython of np.sqrt(((arr - arr.mean())**2).mean())
>> nanstd_twopass(a, ddof=0)
   0.28852169850186798

# cython of np.sqrt((arr*arr).mean() - arr.mean()**2)
>> nanstd_simple(a, ddof=0)
   0.28852169850187437

# http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#On-line_algorithm
>> nanstd_online(a, ddof=0)
   0.28852169850187243

# My target, scipy version
>> from scipy.stats import nanstd
>> nanstd(a, bias=True)
   0.28852169850186798

Timing:

>> timeit nanstd(a, bias=True)
10 loops, best of 3: 27.8 ms per loop
>> timeit a.std()
100 loops, best of 3: 11.5 ms per loop
>> timeit nanstd_twopass(a, ddof=0)
100 loops, best of 3: 3.24 ms per loop
>> timeit nanstd_simple(a, ddof=0)
1000 loops, best of 3: 1.6 ms per loop
>> timeit nanstd_online(a, ddof=0)
100 loops, best of 3: 10.8 ms per loop

nanstd_simple is the fastest but I assume the algorithm is no good for
general use?
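
For what it's worth, the usual worry is catastrophic cancellation: the
one-pass formula subtracts two large, nearly equal numbers when the data
have a mean far from zero. A minimal illustration (mine, not from the
benchmark above):

import numpy as np

a = np.random.rand(1000) + 1e8  # tiny variance riding on a huge mean

# one-pass: E[x^2] - E[x]^2 subtracts two ~1e16 values that agree to
# nearly all 16 digits, so the result is garbage and can even come out
# negative (which turns into nan after the sqrt)
var_simple = (a * a).mean() - a.mean() ** 2

# two-pass: subtract the mean first, then square; no cancellation
var_twopass = ((a - a.mean()) ** 2).mean()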

I think I'll go with nanstd_twopass. It will most closely match
numpy/scipy, is more robust than nanstd_simple, and is the second
fastest.

Here's the code. Improvements welcomed.

@cython.boundscheck(False)
@cython.wraparound(False)
def nanstd_simple(np.ndarray[np.float64_t, ndim=1] a, int ddof):
    "nanstd of 1d numpy array with dtype=np.float64 along axis=0."
    # (sqrt and NAN come from module-level C math cimports, not shown)
    cdef Py_ssize_t i
    cdef int a0 = a.shape[0], count = 0
    cdef np.float64_t asum = 0, a2sum=0, ai
    for i in range(a0):
        ai = a[i]
        if ai == ai:
            asum += ai
            a2sum += ai * ai
            count += 1
    if count > 0:
        asum = asum * asum
        return sqrt((a2sum - asum / count) / (count - ddof))
    else:
        return np.float64(NAN)

@cython.boundscheck(False)
@cython.wraparound(False)
def nanstd_online(np.ndarray[np.float64_t, ndim=1] a, int ddof):
    "nanstd of 1d numpy array with dtype=np.float64 along axis=0."
    cdef Py_ssize_t i
    cdef int a0 = a.shape[0], n = 0
    cdef np.float64_t mean = 0, M2 = 0, delta, x
    for i in range(a0):
        x = a[i]
        if x == x:
            n += 1
            delta = x - mean
            mean = mean + delta / n
            M2 = M2 + delta * (x - mean)
    if n > 0:
        return np.float64(sqrt(M2 / (n - ddof)))
    else:
        return np.float64(NAN)

@cython.boundscheck(False)
@cython.wraparound(False)
def nanstd_twopass(np.ndarray[np.float64_t, ndim=1] a, int ddof):
    "nanstd of 1d numpy array with dtype=np.float64 along axis=0."
    cdef Py_ssize_t i
    cdef int a0 = a.shape[0], count = 0
    cdef np.float64_t asum = 0, a2sum=0, amean, ai, da
    for i in range(a0):
        ai = a[i]
        if ai == ai:
            asum += ai
            count += 1
    if count > 0:
        amean = asum / count
        asum = 0
        for i in range(a0):
            ai = a[i]
            if ai == ai:
                da = ai - amean
                asum += da
                a2sum += (da * da)
        asum = asum * asum
        return sqrt((a2sum - asum / count) / (count - ddof))
    else:
        return np.float64(NAN)
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Does np.std() make two passes through the data?

2010-11-22 Thread Benjamin Root
On Mon, Nov 22, 2010 at 11:03 AM, Keith Goodman kwgood...@gmail.com wrote:

 On Sun, Nov 21, 2010 at 5:56 PM, Robert Kern robert.k...@gmail.com
 wrote:
  On Sun, Nov 21, 2010 at 19:49, Keith Goodman kwgood...@gmail.com
 wrote:
 
  But this sample gives a difference:
 
  a = np.random.rand(100)
  a.var()
0.080232196646619805
  var(a)
0.080232196646619791
 
  As you know, I'm trying to make a drop-in replacement for
  scipy.stats.nanstd. Maybe I'll have to add an asterisk to the drop-in
  part. Either that, or suck it up and store the damn mean.
 
  The difference is less than eps. Quite possibly, the one-pass version
  is even closer to the true value than the two-pass version.

 I wrote 3 cython prototype implementations of nanstd for 1d float64 arrays:

  a = np.random.rand(1000000)

 # numpy; doesn't take care of NaNs
  a.std()
   0.28852169850186793

 # cython of np.sqrt(((arr - arr.mean())**2).mean())
  nanstd_twopass(a, ddof=0)
   0.28852169850186798

 # cython of np.sqrt((arr*arr).mean() - arr.mean()**2)
  nanstd_simple(a, ddof=0)
   0.28852169850187437

 #
 http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#On-line_algorithm
  nanstd_online(a, ddof=0)
   0.28852169850187243

 # My target, scipy version
  from scipy.stats import nanstd
  nanstd(a, bias=True)
   0.28852169850186798

 Timing:

  timeit nanstd(a, bias=True)
 10 loops, best of 3: 27.8 ms per loop
  timeit a.std()
 100 loops, best of 3: 11.5 ms per loop
  timeit nanstd_twopass(a, ddof=0)
 100 loops, best of 3: 3.24 ms per loop
  timeit nanstd_simple(a, ddof=0)
 1000 loops, best of 3: 1.6 ms per loop
  timeit nanstd_online(a, ddof=0)
 100 loops, best of 3: 10.8 ms per loop

 nanstd_simple is the fastest but I assume the algorithm is no good for
 general use?

 I think I'll go with nanstd_twopass. It will most closely match
 numpy/scipy, is more robust than nanstd_simple, and is the second
 fastest.

 Here's the code. Improvements welcomed.

 @cython.boundscheck(False)
 @cython.wraparound(False)
 def nanstd_simple(np.ndarray[np.float64_t, ndim=1] a, int ddof):
     "nanstd of 1d numpy array with dtype=np.float64 along axis=0."
     cdef Py_ssize_t i
     cdef int a0 = a.shape[0], count = 0
     cdef np.float64_t asum = 0, a2sum=0, ai
     for i in range(a0):
         ai = a[i]
         if ai == ai:
             asum += ai
             a2sum += ai * ai
             count += 1
     if count > 0:
         asum = asum * asum
         return sqrt((a2sum - asum / count) / (count - ddof))
     else:
         return np.float64(NAN)

 @cython.boundscheck(False)
 @cython.wraparound(False)
 def nanstd_online(np.ndarray[np.float64_t, ndim=1] a, int ddof):
     "nanstd of 1d numpy array with dtype=np.float64 along axis=0."
     cdef Py_ssize_t i
     cdef int a0 = a.shape[0], n = 0
     cdef np.float64_t mean = 0, M2 = 0, delta, x
     for i in range(a0):
         x = a[i]
         if x == x:
             n += 1
             delta = x - mean
             mean = mean + delta / n
             M2 = M2 + delta * (x - mean)
     if n > 0:
         return np.float64(sqrt(M2 / (n - ddof)))
     else:
         return np.float64(NAN)

 @cython.boundscheck(False)
 @cython.wraparound(False)
 def nanstd_twopass(np.ndarray[np.float64_t, ndim=1] a, int ddof):
     "nanstd of 1d numpy array with dtype=np.float64 along axis=0."
     cdef Py_ssize_t i
     cdef int a0 = a.shape[0], count = 0
     cdef np.float64_t asum = 0, a2sum=0, amean, ai, da
     for i in range(a0):
         ai = a[i]
         if ai == ai:
             asum += ai
             count += 1
     if count > 0:
         amean = asum / count
         asum = 0
         for i in range(a0):
             ai = a[i]
             if ai == ai:
                 da = ai - amean
                 asum += da
                 a2sum += (da * da)
         asum = asum * asum
         return sqrt((a2sum - asum / count) / (count - ddof))
     else:
         return np.float64(NAN)


I wonder how the results would change if the size of the array was larger
than the processor cache?  I still can't seem to wrap my head around the
idea that a two-pass algorithm would be faster than a single-pass.  Is this
just a big-O thing where sometimes one algorithm will be faster than the
other based on the size of the problem?

Ben Root
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Does np.std() make two passes through the data?

2010-11-22 Thread josef . pktd
On Mon, Nov 22, 2010 at 12:07 PM, Benjamin Root ben.r...@ou.edu wrote:
 On Mon, Nov 22, 2010 at 11:03 AM, Keith Goodman kwgood...@gmail.com wrote:

 On Sun, Nov 21, 2010 at 5:56 PM, Robert Kern robert.k...@gmail.com
 wrote:
  On Sun, Nov 21, 2010 at 19:49, Keith Goodman kwgood...@gmail.com
  wrote:
 
  But this sample gives a difference:
 
  a = np.random.rand(100)
  a.var()
    0.080232196646619805
  var(a)
    0.080232196646619791
 
  As you know, I'm trying to make a drop-in replacement for
  scipy.stats.nanstd. Maybe I'll have to add an asterisk to the drop-in
  part. Either that, or suck it up and store the damn mean.
 
  The difference is less than eps. Quite possibly, the one-pass version
  is even closer to the true value than the two-pass version.

 I wrote 3 cython prototype implementations of nanstd for 1d float64
 arrays:

  a = np.random.rand(1000000)

 # numpy; doesn't take care of NaNs
  a.std()
   0.28852169850186793

 # cython of np.sqrt(((arr - arr.mean())**2).mean())
  nanstd_twopass(a, ddof=0)
   0.28852169850186798

 # cython of np.sqrt((arr*arr).mean() - arr.mean()**2)
  nanstd_simple(a, ddof=0)
   0.28852169850187437

 #
 http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#On-line_algorithm
  nanstd_online(a, ddof=0)
   0.28852169850187243

 # My target, scipy version
  from scipy.stats import nanstd
  nanstd(a, bias=True)
   0.28852169850186798

 Timing:

  timeit nanstd(a, bias=True)
 10 loops, best of 3: 27.8 ms per loop
  timeit a.std()
 100 loops, best of 3: 11.5 ms per loop
  timeit nanstd_twopass(a, ddof=0)
 100 loops, best of 3: 3.24 ms per loop
  timeit nanstd_simple(a, ddof=0)
 1000 loops, best of 3: 1.6 ms per loop
  timeit nanstd_online(a, ddof=0)
 100 loops, best of 3: 10.8 ms per loop

 nanstd_simple is the fastest but I assume the algorithm is no good for
 general use?

 I think I'll go with nanstd_twopass. It will most closely match
 numpy/scipy, is more robust than nanstd_simple, and is the second
 fastest.

 Here's the code. Improvements welcomed.

 @cython.boundscheck(False)
 @cython.wraparound(False)
 def nanstd_simple(np.ndarray[np.float64_t, ndim=1] a, int ddof):
    "nanstd of 1d numpy array with dtype=np.float64 along axis=0."
    cdef Py_ssize_t i
    cdef int a0 = a.shape[0], count = 0
    cdef np.float64_t asum = 0, a2sum=0, ai
    for i in range(a0):
        ai = a[i]
        if ai == ai:
            asum += ai
            a2sum += ai * ai
            count += 1
    if count > 0:
        asum = asum * asum
        return sqrt((a2sum - asum / count) / (count - ddof))
    else:
        return np.float64(NAN)

 @cython.boundscheck(False)
 @cython.wraparound(False)
 def nanstd_online(np.ndarray[np.float64_t, ndim=1] a, int ddof):
    "nanstd of 1d numpy array with dtype=np.float64 along axis=0."
    cdef Py_ssize_t i
    cdef int a0 = a.shape[0], n = 0
    cdef np.float64_t mean = 0, M2 = 0, delta, x
    for i in range(a0):
        x = a[i]
        if x == x:
            n += 1
            delta = x - mean
            mean = mean + delta / n
            M2 = M2 + delta * (x - mean)
    if n > 0:
        return np.float64(sqrt(M2 / (n - ddof)))
    else:
        return np.float64(NAN)

 @cython.boundscheck(False)
 @cython.wraparound(False)
 def nanstd_twopass(np.ndarray[np.float64_t, ndim=1] a, int ddof):
    "nanstd of 1d numpy array with dtype=np.float64 along axis=0."
    cdef Py_ssize_t i
    cdef int a0 = a.shape[0], count = 0
    cdef np.float64_t asum = 0, a2sum=0, amean, ai, da
    for i in range(a0):
        ai = a[i]
        if ai == ai:
            asum += ai
            count += 1
    if count > 0:
        amean = asum / count
        asum = 0
        for i in range(a0):
            ai = a[i]
            if ai == ai:
                da = ai - amean
                asum += da
                a2sum += (da * da)
        asum = asum * asum
        return sqrt((a2sum - asum / count) / (count - ddof))
    else:
        return np.float64(NAN)

 I wonder how the results would change if the size of the array was larger
 than the processor cache?  I still can't seem to wrap my head around the
 idea that a two-pass algorithm would be faster than a single-pass.  Is this
 just a big-O thing where sometimes one algorithm will be faster than the
 other based on the size of the problem?

nanstd_online requires too many divisions according to the Wikipedia
article, and is useful mainly if the array
doesn't fit in memory.

Two pass would provide precision that we would expect in numpy, but I
don't know if anyone ever tested the NIST problems for basic
statistics.

Josef


 Ben Root


 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Does np.std() make two passes through the data?

2010-11-22 Thread Keith Goodman
On Mon, Nov 22, 2010 at 9:13 AM,  josef.p...@gmail.com wrote:

 Two pass would provide precision that we would expect in numpy, but I
 don't know if anyone ever tested the NIST problems for basic
 statistics.

Here are the results for their most difficult dataset. But I guess
running one test doesn't mean anything.

http://www.itl.nist.gov/div898/strd/univ/addinfo/numacc4.html

>> np.absolute(a.std(ddof=1) - 0.1)
   5.5884095961911129e-10
>> np.absolute(nanstd_online(a, ddof=1) - 0.1)
   5.5890501948763216e-10
>> np.absolute(nanstd_simple(a, ddof=1) - 0.1)
   nan  # Ha!
>> np.absolute(nanstd_twopass(a, ddof=1) - 0.1)
   5.5879308125117433e-10
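
For anyone wanting to reproduce this: per the dataset description at the
URL above, NumAcc4 is 1001 observations, 10000000.2 followed by 500 pairs
of 10000000.1 and 10000000.3, with certified mean 10000000.2 and standard
deviation 0.1. A sketch of generating it:

import numpy as np

a = np.empty(1001)
a[0] = 10000000.2
a[1::2] = 10000000.1  # 500 values
a[2::2] = 10000000.3  # 500 values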
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Does np.std() make two passes through the data?

2010-11-22 Thread Keith Goodman
On Mon, Nov 22, 2010 at 10:51 AM,  josef.p...@gmail.com wrote:
 On Mon, Nov 22, 2010 at 1:39 PM, Keith Goodman kwgood...@gmail.com wrote:
 On Mon, Nov 22, 2010 at 10:32 AM,  josef.p...@gmail.com wrote:
 On Mon, Nov 22, 2010 at 1:26 PM, Keith Goodman kwgood...@gmail.com wrote:
 On Mon, Nov 22, 2010 at 9:03 AM, Keith Goodman kwgood...@gmail.com wrote:

 @cython.boundscheck(False)
 @cython.wraparound(False)
 def nanstd_twopass(np.ndarray[np.float64_t, ndim=1] a, int ddof):
    "nanstd of 1d numpy array with dtype=np.float64 along axis=0."
    cdef Py_ssize_t i
    cdef int a0 = a.shape[0], count = 0
    cdef np.float64_t asum = 0, a2sum=0, amean, ai, da
    for i in range(a0):
        ai = a[i]
        if ai == ai:
            asum += ai
            count += 1
    if count > 0:
        amean = asum / count
        asum = 0
        for i in range(a0):
            ai = a[i]
            if ai == ai:
                da = ai - amean
                asum += da
                a2sum += (da * da)
        asum = asum * asum
        return sqrt((a2sum - asum / count) / (count - ddof))
    else:
        return np.float64(NAN)

 This is 5% faster:

 @cython.boundscheck(False)
 @cython.wraparound(False)
 def nanstd_1d_float64_axis0_2(np.ndarray[np.float64_t, ndim=1] a, int ddof):
    "nanstd of 1d numpy array with dtype=np.float64 along axis=0."
    cdef Py_ssize_t i
    cdef int a0 = a.shape[0], count = 0
    cdef np.float64_t asum = 0, amean, ai
    for i in range(a0):
        ai = a[i]
        if ai == ai:
            asum += ai
            count += 1
    if count > 0:
        amean = asum / count
        asum = 0
        for i in range(a0):
            ai = a[i]
            if ai == ai:
                ai -= amean
                asum += (ai * ai)
        return sqrt(asum / (count - ddof))
    else:
        return np.float64(NAN)

 I think it would be better to write nanvar instead of nanstd and take
 the square root only in a delegating nanstd, instead of the other way
 around. (Also a change that should be made in scipy.stats)

 Yeah, I noticed that numpy does that. I was planning to have separate
 var and std functions. Here's why (from the readme file, but maybe I
 should template it, the sqrt automatically converts large ddof to
 NaN):

 I'm not sure what you are saying, dropping the square root in the
 function doesn't require nan handling in the inner loop. If you want
 to return nan when count - ddof <= 0, then you could just replace

 if count > 0:
 ...

 by
 if count - ddof > 0:
 ...

 Or am I missing the point?

Yes, sorry. Ignore the sqrt/nan comment. The point is that I want to
be able to return the underlying, low-level cython function (see
below). So I need separate var and std versions to do that (unless I
can modify default kwargs on the fly).

 Under the hood Nanny uses a separate Cython function for each
 combination of ndim, dtype, and axis. A lot of the overhead in
 ny.nanmax, for example, is in checking that your axis is within range,
 converting non-array data to an array, and selecting the function to
 use to calculate nanmax.

 You can get rid of the overhead by doing all this before you, say,
 enter an inner loop:

 arr = np.random.rand(10,10)
 axis = 0
 func, a = ny.func.nanmax_selector(arr, axis)
 func.__name__
 'nanmax_2d_float64_axis0'

 Let's see how much faster that runs:

 timeit np.nanmax(arr, axis=0)
 10000 loops, best of 3: 25.7 us per loop
 timeit ny.nanmax(arr, axis=0)
 100000 loops, best of 3: 5.25 us per loop
 timeit func(a)
 100000 loops, best of 3: 2.5 us per loop

 Note that func is faster than Numpy's non-nan version of max:

 timeit arr.max(axis=0)
 100000 loops, best of 3: 3.28 us per loop

 So adding NaN protection to your inner loops has a negative cost!
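
The selector is essentially a one-time dict lookup keyed on (ndim, dtype,
axis). A sketch of the pattern (names are mine, and this is independent of
Nanny's actual internals):

import numpy as np

def _nanmax_1d_float64_axis0(a):
    # stand-in for the specialized cython kernel
    return np.nanmax(a)

# one specialized kernel per (ndim, dtype, axis) combination
_funcs = {(1, np.dtype(np.float64), 0): _nanmax_1d_float64_axis0}

def nanmax_selector(arr, axis):
    # pay for conversion, axis checking and dispatch once, up front
    a = np.asarray(arr)
    return _funcs[(a.ndim, a.dtype, axis)], a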
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Does np.std() make two passes through the data?

2010-11-22 Thread josef . pktd
On Mon, Nov 22, 2010 at 1:51 PM,  josef.p...@gmail.com wrote:
 On Mon, Nov 22, 2010 at 1:39 PM, Keith Goodman kwgood...@gmail.com wrote:
 On Mon, Nov 22, 2010 at 10:32 AM,  josef.p...@gmail.com wrote:
 On Mon, Nov 22, 2010 at 1:26 PM, Keith Goodman kwgood...@gmail.com wrote:
 On Mon, Nov 22, 2010 at 9:03 AM, Keith Goodman kwgood...@gmail.com wrote:

 @cython.boundscheck(False)
 @cython.wraparound(False)
 def nanstd_twopass(np.ndarray[np.float64_t, ndim=1] a, int ddof):
    "nanstd of 1d numpy array with dtype=np.float64 along axis=0."
    cdef Py_ssize_t i
    cdef int a0 = a.shape[0], count = 0
    cdef np.float64_t asum = 0, a2sum=0, amean, ai, da
    for i in range(a0):
        ai = a[i]
        if ai == ai:
            asum += ai
            count += 1
    if count > 0:
        amean = asum / count
        asum = 0
        for i in range(a0):
            ai = a[i]
            if ai == ai:
                da = ai - amean
                asum += da
                a2sum += (da * da)
        asum = asum * asum
        return sqrt((a2sum - asum / count) / (count - ddof))
    else:
        return np.float64(NAN)

 This is 5% faster:

 @cython.boundscheck(False)
 @cython.wraparound(False)
 def nanstd_1d_float64_axis0_2(np.ndarray[np.float64_t, ndim=1] a, int ddof):
    "nanstd of 1d numpy array with dtype=np.float64 along axis=0."
    cdef Py_ssize_t i
    cdef int a0 = a.shape[0], count = 0
    cdef np.float64_t asum = 0, amean, ai
    for i in range(a0):
        ai = a[i]
        if ai == ai:
            asum += ai
            count += 1
    if count > 0:
        amean = asum / count
        asum = 0
        for i in range(a0):
            ai = a[i]
            if ai == ai:
                ai -= amean
                asum += (ai * ai)
        return sqrt(asum / (count - ddof))
    else:
        return np.float64(NAN)

 I think it would be better to write nanvar instead of nanstd and take
 the square root only in a delegating nanstd, instead of the other way
 around. (Also a change that should be made in scipy.stats)

 Yeah, I noticed that numpy does that. I was planning to have separate
 var and std functions. Here's why (from the readme file, but maybe I
 should template it,

the sqrt automatically converts large ddof to
 NaN):

I don't think that works for complex numbers.
(statsmodels now has a preference that calculations also work for
complex numbers)

Josef


 I'm not sure what you are saying, dropping the square root in the
 function doesn't require nan handling in the inner loop. If you want
 to return nan when count - ddof <= 0, then you could just replace

 if count > 0:
 ...

 by
 if count - ddof > 0:
 ...

 Or am I missing the point?

 Josef



 Under the hood Nanny uses a separate Cython function for each
 combination of ndim, dtype, and axis. A lot of the overhead in
 ny.nanmax, for example, is in checking that your axis is within range,
 converting non-array data to an array, and selecting the function to
 use to calculate nanmax.

 You can get rid of the overhead by doing all this before you, say,
 enter an inner loop:

 arr = np.random.rand(10,10)
 axis = 0
 func, a = ny.func.nanmax_selector(arr, axis)
 func.__name__
 'nanmax_2d_float64_axis0'

 Let's see how much faster that runs:

 timeit np.nanmax(arr, axis=0)
 10000 loops, best of 3: 25.7 us per loop
 timeit ny.nanmax(arr, axis=0)
 100000 loops, best of 3: 5.25 us per loop
 timeit func(a)
 100000 loops, best of 3: 2.5 us per loop

 Note that func is faster than Numpy's non-nan version of max:

 timeit arr.max(axis=0)
 100000 loops, best of 3: 3.28 us per loop

 So adding NaN protection to your inner loops has a negative cost!
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Does np.std() make two passes through the data?

2010-11-22 Thread Keith Goodman
On Mon, Nov 22, 2010 at 11:00 AM,  josef.p...@gmail.com wrote:

 I don't think that works for complex numbers.
 (statsmodels now has a preference that calculations also work for
 complex numbers)

I'm only supporting int32, int64, float64 for now. Getting the other
ints and floats should be easy. I don't have plans for complex
numbers. If it's not a big change then someone can add support later.
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Does np.std() make two passes through the data?

2010-11-22 Thread josef . pktd
On Mon, Nov 22, 2010 at 1:59 PM, Keith Goodman kwgood...@gmail.com wrote:
 On Mon, Nov 22, 2010 at 10:51 AM,  josef.p...@gmail.com wrote:
 On Mon, Nov 22, 2010 at 1:39 PM, Keith Goodman kwgood...@gmail.com wrote:
 On Mon, Nov 22, 2010 at 10:32 AM,  josef.p...@gmail.com wrote:
 On Mon, Nov 22, 2010 at 1:26 PM, Keith Goodman kwgood...@gmail.com wrote:
 On Mon, Nov 22, 2010 at 9:03 AM, Keith Goodman kwgood...@gmail.com 
 wrote:

 @cython.boundscheck(False)
 @cython.wraparound(False)
 def nanstd_twopass(np.ndarray[np.float64_t, ndim=1] a, int ddof):
    "nanstd of 1d numpy array with dtype=np.float64 along axis=0."
    cdef Py_ssize_t i
    cdef int a0 = a.shape[0], count = 0
    cdef np.float64_t asum = 0, a2sum=0, amean, ai, da
    for i in range(a0):
        ai = a[i]
        if ai == ai:
            asum += ai
            count += 1
    if count > 0:
        amean = asum / count
        asum = 0
        for i in range(a0):
            ai = a[i]
            if ai == ai:
                da = ai - amean
                asum += da
                a2sum += (da * da)
        asum = asum * asum
        return sqrt((a2sum - asum / count) / (count - ddof))
    else:
        return np.float64(NAN)

 This is 5% faster:

 @cython.boundscheck(False)
 @cython.wraparound(False)
 def nanstd_1d_float64_axis0_2(np.ndarray[np.float64_t, ndim=1] a, int ddof):
    "nanstd of 1d numpy array with dtype=np.float64 along axis=0."
    cdef Py_ssize_t i
    cdef int a0 = a.shape[0], count = 0
    cdef np.float64_t asum = 0, amean, ai
    for i in range(a0):
        ai = a[i]
        if ai == ai:
            asum += ai
            count += 1
    if count > 0:
        amean = asum / count
        asum = 0
        for i in range(a0):
            ai = a[i]
            if ai == ai:
                ai -= amean
                asum += (ai * ai)
        return sqrt(asum / (count - ddof))
    else:
        return np.float64(NAN)

 I think it would be better to write nanvar instead of nanstd and take
 the square root only in a delegating nanstd, instead of the other way
 around. (Also a change that should be made in scipy.stats)

 Yeah, I noticed that numpy does that. I was planning to have separate
 var and std functions. Here's why (from the readme file, but maybe I
 should template it, the sqrt automatically converts large ddof to
 NaN):

 I'm not sure what you are saying, dropping the square root in the
 function doesn't require nan handling in the inner loop. If you want
 to return nan when count - ddof <= 0, then you could just replace

 if count > 0:
 ...

 by
 if count - ddof > 0:
 ...

 Or am I missing the point?

 Yes, sorry. Ignore the sqrt/nan comment. The point is that I want to
 be able to return the underlying, low-level cython function (see
 below). So I need separate var and std versions to do that (unless I
 can modify default kwargs on the fly).

I think you could still delegate at the cython level, but given the
amount of code duplication that is required, I don't expect it to make
much difference for staying DRY.

Josef


 Under the hood Nanny uses a separate Cython function for each
 combination of ndim, dtype, and axis. A lot of the overhead in
 ny.nanmax, for example, is in checking that your axis is within range,
 converting non-array data to an array, and selecting the function to
 use to calculate nanmax.

 You can get rid of the overhead by doing all this before you, say,
 enter an inner loop:

 arr = np.random.rand(10,10)
 axis = 0
 func, a = ny.func.nanmax_selector(arr, axis)
 func.__name__
 'nanmax_2d_float64_axis0'

 Let's see how much faster that runs:

 timeit np.nanmax(arr, axis=0)
 10000 loops, best of 3: 25.7 us per loop
 timeit ny.nanmax(arr, axis=0)
 100000 loops, best of 3: 5.25 us per loop
 timeit func(a)
 100000 loops, best of 3: 2.5 us per loop

 Note that func is faster than Numpy's non-nan version of max:

 timeit arr.max(axis=0)
 100000 loops, best of 3: 3.28 us per loop

 So adding NaN protection to your inner loops has a negative cost!
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] indexing question

2010-11-22 Thread Christopher Barker
On 11/21/10 11:37 AM, Ernest Adrogué wrote:
 so you want

 t[:,x,y]

 I tried that, but it's not the same:

 In [307]: t[[0,1],x,y]
 Out[307]: array([1, 7])

 In [308]: t[:,x,y]
 Out[308]:
 array([[1, 3],
        [5, 7]])

what is your t? Here's my example, which I think matches what you asked for:

In [1]: import numpy as np

In [2]: a = np.arange(12)

In [3]: a.shape = (3,2,2)

In [4]: a
Out[4]:
array([[[ 0,  1],
        [ 2,  3]],

       [[ 4,  5],
        [ 6,  7]],

       [[ 8,  9],
        [10, 11]]])

In [5]: a[:,1,0]
Out[5]: array([ 2,  6, 10])




-- 
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/ORR            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Does np.std() make two passes through the data?

2010-11-22 Thread josef . pktd
On Mon, Nov 22, 2010 at 2:04 PM, Keith Goodman kwgood...@gmail.com wrote:
 On Mon, Nov 22, 2010 at 11:00 AM,  josef.p...@gmail.com wrote:

 I don't think that works for complex numbers.
 (statsmodels now has a preference that calculations also work for
 complex numbers)

 I'm only supporting int32, int64, float64 for now. Getting the other
 ints and floats should be easy. I don't have plans for complex
 numbers. If it's not a big change then someone can add support later.

Fair enough, but if you need numerical derivatives, then complex
support looks useful.

Josef

 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] indexing question

2010-11-22 Thread John Salvatier
I didn't realize the x's and y's were varying the first time around.
There's probably a way to omit it, but I think the conceptually
simplest way is probably what you had to begin with. Build an index by
saying i = numpy.arange(0, t.shape[0])

then you can do t[i, x,y]

On Mon, Nov 22, 2010 at 11:08 AM, Christopher Barker
chris.bar...@noaa.gov wrote:
 On 11/21/10 11:37 AM, Ernest Adrogué wrote:
 so you want

 t[:,x,y]

 I tried that, but it's not the same:

 In [307]: t[[0,1],x,y]
 Out[307]: array([1, 7])

 In [308]: t[:,x,y]
 Out[308]:
 array([[1, 3],
         [5, 7]])

 what is your t? Here's my example, which I think matches what you asked for:

 In [1]: import numpy as np

 In [2]: a = np.arange(12)

 In [3]: a.shape = (3,2,2)

 In [4]: a
 Out[4]:
 array([[[ 0,  1],
         [ 2,  3]],

        [[ 4,  5],
         [ 6,  7]],

        [[ 8,  9],
         [10, 11]]])

 In [5]: a[:,1,0]
 Out[5]: array([ 2,  6, 10])




 --
 Christopher Barker, Ph.D.
 Oceanographer

 Emergency Response Division
 NOAA/NOS/ORR            (206) 526-6959   voice
 7600 Sand Point Way NE   (206) 526-6329   fax
 Seattle, WA  98115       (206) 526-6317   main reception

 chris.bar...@noaa.gov
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] ANN: NumPy 1.5.1

2010-11-22 Thread Christopher Barker
On 11/20/10 11:04 PM, Ralf Gommers wrote:
 I am pleased to announce the availability of NumPy 1.5.1.

 Binaries, sources and release notes can be found at
 https://sourceforge.net/projects/numpy/files/.

 Thank you to everyone who contributed to this release.

Yes, thanks so much -- in particular thanks to the team that built the 
OS-X binaries -- looks like a complete set!

-Chris




-- 
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/ORR            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] ANN: NumPy 1.5.1

2010-11-22 Thread Matthew Brett
Hi,

On Mon, Nov 22, 2010 at 11:35 AM, Christopher Barker
chris.bar...@noaa.gov wrote:
 On 11/20/10 11:04 PM, Ralf Gommers wrote:
 I am pleased to announce the availability of NumPy 1.5.1.

 Binaries, sources and release notes can be found at
 https://sourceforge.net/projects/numpy/files/.

 Thank you to everyone who contributed to this release.

 Yes, thanks so much -- in particular thanks to the team that built the
 OS-X binaries -- looks like a complete set!

Many thanks from me too - particularly for clearing up that annoying
numpy-distutils scipy build problem.

Cheers,

Matthew
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] N dimensional dichotomy optimization

2010-11-22 Thread Gael Varoquaux
Hi list,

does anybody have, or knows where I can find some N dimensional dichotomy 
optimization code in Python (BSD licensed, or equivalent)?

Worst case, it does not look too bad to code, but I am interested by any 
advice. I haven't done my reading yet, and I don't know how ill-posed a problem 
it is. I had in mind starting from a set of points and iterating the 
computation of the objective function's value at the barycenters of these 
points, and updating this list of points. This does raise a few questions on 
what are the best possible updates.

Cheers,

Gael

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] indexing question

2010-11-22 Thread Ernest Adrogué
22/11/10 @ 11:08 (-0800), thus spake Christopher Barker:
 On 11/21/10 11:37 AM, Ernest Adrogué wrote:
 so you want
 
 t[:,x,y]
 
 I tried that, but it's not the same:
 
 In [307]: t[[0,1],x,y]
 Out[307]: array([1, 7])
 
 In [308]: t[:,x,y]
 Out[308]:
 array([[1, 3],
 [5, 7]])
 
 what is your t? Here's my example, which I think matches what you asked for:
 
 In [1]: import numpy as np
 
 In [2]: a = np.arange(12)
 
 In [3]: a.shape = (3,2,2)
 
 In [4]: a
 Out[4]:
 array([[[ 0,  1],
 [ 2,  3]],
 
[[ 4,  5],
 [ 6,  7]],
 
[[ 8,  9],
 [10, 11]]])
 
 In [5]: a[:,1,0]
 Out[5]: array([ 2,  6, 10])

This works with scalar indices, but not with arrays.
The problem is that I don't always want the same element
from each subarray, but an arbitrary element, say the (1,0)
from the first, the (0,0) from the second, and so on,
so I have to use arrays.

-- 
Ernest
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] indexing question

2010-11-22 Thread Robert Kern
2010/11/21 Ernest Adrogué eadro...@gmx.net:
 Hi,

 Suppose an array of shape (N,2,2), that is N arrays of
 shape (2,2). I want to select an element (x,y) from each one
 of the subarrays, so I get a 1-dimensional array of length
 N. For instance:

 In [228]: t=np.arange(8).reshape(2,2,2)

 In [229]: t
 Out[229]:
 array([[[0, 1],
        [2, 3]],

       [[4, 5],
        [6, 7]]])

 In [230]: x=[0,1]

 In [231]: y=[1,1]

 In [232]: t[[0,1],x,y]
 Out[232]: array([1, 7])

 This way, I get the elements (0,1) and (1,1) which is what
 I wanted. The question is: is it possible to omit the [0,1]
 in the index?

No, but you can write generic code for it:

  t[np.arange(t.shape[0]), x, y]

-- 
Robert Kern

I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth.
  -- Umberto Eco
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] indexing question

2010-11-22 Thread Ernest Adrogué
22/11/10 @ 11:20 (-0800), thus spake John Salvatier:
 I didn't realize the x's and y's were varying the first time around.
 There's probably a way to omit it, but I think the conceptually
 simplest way is probably what you had to begin with. Build an index by
 saying i = numpy.arange(0, t.shape[0])
 
 then you can do t[i, x,y]

Exactly. I was just wondering if I can speed this up by omitting
building the arange array. This is inside a function that gets
called a lot, so I suppose it would make a difference if I can get
rid of it.

-- 
Ernest
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] N dimensional dichotomy optimization

2010-11-22 Thread Matthieu Brucher
2010/11/22 Gael Varoquaux gael.varoqu...@normalesup.org:
 Hi list,

Hi ;)

 does anybody have, or knows where I can find some N dimensional dichotomy 
 optimization code in Python (BSD licensed, or equivalent)?

I don't know any code, but it shouldn't be too difficult going
through a KdTree.

 Worst case, it does not look too bad to code, but I am interested by any 
 advice. I haven't done my reading yet, and I don't know how ill-posed a 
 problem it is. I had in mind starting from a set of points and iterating the 
 computation of the objective function's value at the barycenters of these 
 points, and updating this list of points. This does raise a few questions on 
 what are the best possible updates.

In this case, you may want to check the Nelder-Mead algorithm (also known
as down-hill simplex or polytope), which is available in
scikits.optimization, but there are other implementations out there.

Cheers ;)

Matthieu
-- 
Information System Engineer, Ph.D.
Blog: http://matt.eifelle.com
LinkedIn: http://www.linkedin.com/in/matthieubrucher
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] indexing question

2010-11-22 Thread Ernest Adrogué
22/11/10 @ 14:04 (-0600), thus spake Robert Kern:
  This way, I get the elements (0,1) and (1,1) which is what
  I wanted. The question is: is it possible to omit the [0,1]
  in the index?
 
 No, but you can write generic code for it:
 
   t[np.arange(t.shape[0]), x, y]

Thank you. This is what I wanted to know.

-- 
Ernest
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] indexing question

2010-11-22 Thread John Salvatier
I think that the only speedup you will get is defining an index only
once and reusing it.
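
A sketch of what that could look like, caching one index per leading
dimension (names are mine):

import numpy as np

_index_cache = {}

def pick(t, x, y):
    # reuse one arange per first-dimension length instead of
    # rebuilding it on every call
    n = t.shape[0]
    i = _index_cache.get(n)
    if i is None:
        i = _index_cache[n] = np.arange(n)
    return t[i, x, y]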

2010/11/22 Ernest Adrogué eadro...@gmx.net:
 22/11/10 @ 14:04 (-0600), thus spake Robert Kern:
  This way, I get the elements (0,1) and (1,1) which is what
  I wanted. The question is: is it possible to omit the [0,1]
  in the index?

 No, but you can write generic code for it:

   t[np.arange(t.shape[0]), x, y]

 Thank you. This is what I wanted to know.

 --
 Ernest
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] N dimensional dichotomy optimization

2010-11-22 Thread Gael Varoquaux
On Mon, Nov 22, 2010 at 09:12:45PM +0100, Matthieu Brucher wrote:
 Hi ;)

Hi bro

  does anybody have, or knows where I can find some N dimensional
  dichotomy optimization code in Python (BSD licensed, or equivalent)?

 I don't know any code, but it shouldn't be too difficult going
 through a KdTree.

I am not in terribly high-dimensional spaces, so I don't really need to
use a KdTree (but we do happen to have a nice BallTree available in the
scikit-learn, so I could use it just to  play :). 

  Worst case, it does not look too bad to code, but I am interested by
  any advice. I haven't done my reading yet, and I don't know how
  ill-posed a problem it is. I had in mind starting from a set of
  points and iterating the computation of the objective function's
  value at the barycenters of these points, and updating this list of
  points. This does raise a few questions on what are the best possible
  updates.

 In this case, you may want to check the Nelder-Mead algorithm (also known
 as down-hill simplex or polytope), which is available in
 scikits.optimization, but there are other implementations out there.

Interesting reference. I had never looked at the Nelder-Mead algorithm.
I am wondering if it does what I want, though. 

The reason I am looking at dichotomy optimization is that the objective
function that I want to optimize has local roughness, but is globally
pretty much a bell-shaped curve. Dichotomy looks like it will get quite
close to the top of the curve (and I have been told that it works well on
such problems). One thing that is nice with dichotomy for my needs is
that it is not based on a gradient, and it works in a convex region of
the parameter space.

Will the Nelder-Mead display such properties? It seems so to me, but I
don't trust my quick read through of Wikipedia.

I realize that maybe I should rephrase my question to try and draw more
out of the common wealth of knowledge on this mailing list: what do
people suggest to tackle this problem? Guided by Matthieu's suggestion, I
have started looking at Powell's algorithm, and given the introduction of
www.damtp.cam.ac.uk/user/na/NA_papers/NA2007_03.pdf I am wondering
whether I should not investigate it. Can people provide any insights on
these problems.

Many thanks,

Gael

PS: The reason I am looking at this optimization problem is that I got
tired of looking at grid searches optimize the cross-validation score on 
my 3-parameter estimator (not in the scikit-learn, because it is way too
specific to my problems).
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] N dimensional dichotomy optimization

2010-11-22 Thread Matthieu Brucher
2010/11/22 Gael Varoquaux gael.varoqu...@normalesup.org:
 On Mon, Nov 22, 2010 at 09:12:45PM +0100, Matthieu Brucher wrote:
 Hi ;)

 Hi bro

  does anybody have, or knows where I can find some N dimensional
  dichotomy optimization code in Python (BSD licensed, or equivalent)?

 I don't know any code, but it shouldn't be too difficult going
 through a KdTree.

 I am not in terribly high-dimensional spaces, so I don't really need to
 use a KdTree (but we do happen to have a nice BallTree available in the
 scikit-learn, so I could use it just to  play :).

:D

 In this case, you may want to check the Nelder-Mead algorithm (also known
 as down-hill simplex or polytope), which is available in
 scikits.optimization, but there are other implementations out there.

 Interesting reference. I had never looked at the Nelder-Mead algorithm.
 I am wondering if it does what I want, though.

 The reason I am looking at dichotomy optimization is that the objective
 function that I want to optimize has local roughness, but is globally
 pretty much a bell-shaped curve. Dichotomy looks like it will get quite
 close to the top of the curve (and I have been told that it works well on
 such problems). One thing that is nice with dichotomy for my needs is
 that it is not based on a gradient, and it works in a convex region of
 the parameter space.

It seems that a simplex is what you need. It uses the barycenter (more
or less) to find a new point in the simplex. And it works well only in
convex functions (but in fact almost all functions have an issue with
this :D)

 Will the Nelder-Mead display such properties? It seems so to me, but I
 don't trust my quick read through of Wikipedia.

Yes, it doesn't need a gradient, and if the function is convex, it works
in a convex region of the parameter space.

 I realize that maybe I should rephrase my question to try and draw more
 out of the common wealth of knowledge on this mailing list: what do
 people suggest to tackle this problem? Guided by Matthieu's suggestion, I
 have started looking at Powell's algorithm, and given the introduction of
 www.damtp.cam.ac.uk/user/na/NA_papers/NA2007_03.pdf I am wondering
 whether I should not investigate it. Can people provide any insights on
 these problems.

Indeed, Powell may also be a solution. A simplex is just closer
to what you hinted at as an optimization algorithm.

 Many thanks,

You're welcome ;)

 Gael

 PS: The reason I am looking at this optimization problem is that I got
 tired of looking at grid searches optimize the cross-validation score on
 my 3-parameter estimator (not in the scikit-learn, because it is way too
 specific to my problems).

Perhaps you may want to combine it with genetic algorithms. We also
kind of combine grid search with simplex-based optimizer with
simulated annealing in some of our global optimization problems, and I
think I'll try at one point to introduce genetic algorithms instead of
the grid search. Your problem is simpler though if it displays some
convexity.

Matthieu
-- 
Information System Engineer, Ph.D.
Blog: http://matt.eifelle.com
LinkedIn: http://www.linkedin.com/in/matthieubrucher
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] N dimensional dichotomy optimization

2010-11-22 Thread Gael Varoquaux
On Mon, Nov 22, 2010 at 11:12:26PM +0100, Matthieu Brucher wrote:
 It seems that a simplex is what you need.

Ha! I am learning new fancy words. Now I can start looking clever.

  I realize that maybe I should rephrase my question to try and draw more
  out of the common wealth of knowledge on this mailing list: what do
  people suggest to tackle this problem? Guided by Matthieu's suggestion, I
  have started looking at Powell's algorithm, and given the introduction of
  www.damtp.cam.ac.uk/user/na/NA_papers/NA2007_03.pdf I am wondering
  whether I should not investigate it. Can people provide any insights on
  these problems.

 Indeed, Powell may also be a solution. A simplex is just closer
 to what you hinted at as an optimization algorithm.

I'll do a bit more reading.

  PS: The reason I am looking at this optimization problem is that I got
  tired of looking at grid searches optimize the cross-validation score on
  my 3-parameter estimator (not in the scikit-learn, because it is way too
  specific to my problems).

 Perhaps you may want to combine it with genetic algorithms. We also
 kind of combine grid search with simplex-based optimizer with
 simulated annealing in some of our global optimization problems, and I
 think I'll try at one point to introduce genetic algorithms instead of
 the grid search.

Well, in the scikit, in the long run (it will take a little while) I'd
like to expose other optimization methods than the GridSearchCV, so if
you have code or advice to give us, we'd certainly be interested.

Gael
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] N dimensional dichotomy optimization

2010-11-22 Thread Matthieu Brucher
2010/11/22 Gael Varoquaux gael.varoqu...@normalesup.org:
 On Mon, Nov 22, 2010 at 11:12:26PM +0100, Matthieu Brucher wrote:
 It seems that a simplex is what you need.

 Ha! I am learning new fancy words. Now I can start looking clever.

  I realize that maybe I should rephrase my question to try and draw more
  out of the common wealth of knowledge on this mailing list: what do
  people suggest to tackle this problem? Guided by Matthieu's suggestion, I
  have started looking at Powell's algorithm, and given the introduction of
  www.damtp.cam.ac.uk/user/na/NA_papers/NA2007_03.pdf I am wondering
  whether I should not investigate it. Can people provide any insights on
  these problems.

 Indeed, Powell may also be a solution. A simplex is just closer
 to what you hinted at as an optimization algorithm.

 I'll do a bit more reading.

  PS: The reason I am looking at this optimization problem is that I got
  tired of looking at grid searches optimize the cross-validation score on
  my 3-parameter estimator (not in the scikit-learn, because it is way too
  specific to my problems).

 Perhaps you may want to combine it with genetic algorithms. We also
 kind of combine grid search with simplex-based optimizer with
 simulated annealing in some of our global optimization problems, and I
 think I'll try at one point to introduce genetic algorithms instead of
 the grid search.

 Well, in the scikit, in the long run (it will take a little while) I'd
 like to expose other optimization methods than the GridSearchCV, so if
 you have code or advice to give us, we'd certainly be interested.

 Gael

There is scikits.optimization partly in the externals :D But I don't
think they should be in scikits.learn directly. Of course, the scikit
may need access to some global optimization methods, but the most used
one is already there (the grid search).
Then for genetic algorithms, pyevolve is pretty much all you want (I
still have to check the multiprocessing part)

Matthieu
-- 
Information System Engineer, Ph.D.
Blog: http://matt.eifelle.com
LinkedIn: http://www.linkedin.com/in/matthieubrucher
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] N dimensional dichotomy optimization

2010-11-22 Thread Gael Varoquaux
On Mon, Nov 22, 2010 at 11:12:26PM +0100, Matthieu Brucher wrote:
 It seems that a simplex is what you need. It uses the barycenter (more
 or less) to find a new point in the simplex. And it works well only in
 convex functions (but in fact almost all functions have an issue with
 this :D)

One last question, now that I know that what I am looking for is a
simplex algorithm (it indeed corresponds to what I was after), is there a
reason not to use optimize.fmin? It implements a Nelder-Mead. I must
admit that I don't see how I can use it to specify the convex hull of the
parameters in which it operates, or restrict it to work only on integers,
which are two things that I may want to do.

Gaël
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] N dimensional dichotomy optimization

2010-11-22 Thread josef . pktd
On Mon, Nov 22, 2010 at 5:27 PM, Matthieu Brucher
matthieu.bruc...@gmail.com wrote:
 2010/11/22 Gael Varoquaux gael.varoqu...@normalesup.org:
 On Mon, Nov 22, 2010 at 11:12:26PM +0100, Matthieu Brucher wrote:
 It seems that a simplex is what you need.

 Ha! I am learning new fancy words. Now I can start looking clever.

  I realize that maybe I should rephrase my question to try and draw more
  out of the common wealth of knowledge on this mailing list: what do
  people suggest to tackle this problem? Guided by Matthieu's suggestion, I
  have started looking at Powell's algorithm, and given the introduction of
  www.damtp.cam.ac.uk/user/na/NA_papers/NA2007_03.pdf I am wondering
  whether I should not investigate it. Can people provide any insights on
  these problems.

 Indeed, Powell may also be a solution. A simplex is just closer
 to what you hinted at as an optimization algorithm.

 I'll do a bit more reading.

  PS: The reason I am looking at this optimization problem is that I got
  tired of looking at grid searches optimize the cross-validation score on
  my 3-parameter estimator (not in the scikit-learn, because it is way too
  specific to my problems).

 Perhaps you may want to combine it with genetic algorithms. We also
 kind of combine grid search with simplex-based optimizer with
 simulated annealing in some of our global optimization problems, and I
 think I'll try at one point to introduce genetic algorithms instead of
 the grid search.

 Well, in the scikit, in the long run (it will take a little while) I'd
 like to expose other optimization methods than the GridSearchCV, so if
 you have code or advice to give us, we'd certainly be interested.

 Gael

 There is scikits.optimization partly in the externals :D But I don't
 think they should be in scikits.learn directly. Of course, the scikit
 may need access to some global optimization methods, but the most used
 one is already there (the grid search).
 Then for genetic algorithms, pyevolve is pretty much all you want (I
 still have to check the multiprocessing part)

Is that license http://pyevolve.sourceforge.net/license.html BSD compatible?

Josef


 Matthieu
 --
 Information System Engineer, Ph.D.
 Blog: http://matt.eifelle.com
 LinkedIn: http://www.linkedin.com/in/matthieubrucher
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Matlab IO Warning in mio5.py

2010-11-22 Thread Alexander Kain
 It's not an error but a harmless (although confusing) warning message.
 You should be able to filter it by adding the following to
 scipy/__init__.py:
 
 import warnings
 warnings.filterwarnings(action='ignore',
     message='.*__builtin__.file size changed.*')
 
 Can you check if that works for you?

Worked on my system! OS X 10.6.5, python 2.6, numpy 1.5.1, scipy 0.8!

Need to add useless stuff to satisfy the condition that
there be more newtext than quoted. Argh.
And more:
Your email address below has to be valid.
An authorization email will be sent to you after posting. You 
have to reply to that message to confirm that you exist.
After you've confirmed your existence, the message will
then be forwarded to the mailing list.

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] N dimensional dichotomy optimization

2010-11-22 Thread Matthieu Brucher
2010/11/22 Gael Varoquaux gael.varoqu...@normalesup.org:
 On Mon, Nov 22, 2010 at 11:12:26PM +0100, Matthieu Brucher wrote:
 It seems that a simplex is what you need. It uses the barycenter (more
 or less) to find a new point in the simplex. And it works well only in
 convex functions (but in fact almost all functions have an issue with
 this :D)

 One last question, now that I know that what I am looking for is a
 simplex algorithm (it indeed corresponds to what I was after), is there a
 reason not to use optimize.fmin? It implements a Nelder-Mead. I must
 admit that I don't see how I can use it to specify the convex hull of the
 parameters in which it operates, or restrict it to work only on integers,
 which are two things that I may want to do.

optimize.fmin can be enough, I don't know it well enough. Nelder-Mead
is not a constrained optimization algorithm, so you can't specify an
outer hull. As for the integer part, I don't know if optimize.fmin is
type consistent, I don't know if scikits.optimization is either, but I
can check it. It should, as there is nothing impeding it.
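
A common workaround (a sketch of mine, not from the thread) is to wrap the
objective so that Nelder-Mead only ever sees feasible, integer-rounded
points, with a penalty outside the box and rounding inside it:

import numpy as np
from scipy.optimize import fmin

def boxed_integer(f, lower, upper):
    # wrap objective f so fmin stays inside [lower, upper] and
    # effectively searches over the integer lattice
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    def wrapped(x):
        if np.any(x < lower) or np.any(x > upper):
            return 1e12  # crude penalty outside the box
        return f(np.round(x))
    return wrapped

# usage, with a hypothetical objective `score`:
# xopt = np.round(fmin(boxed_integer(score, [0, 0, 0], [10, 10, 10]),
#                      x0=[5.0, 5.0, 5.0]))

The rounding does create plateaus that can stall the simplex, so this is a
heuristic rather than a proper integer optimization method.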

Matthieu
-- 
Information System Engineer, Ph.D.
Blog: http://matt.eifelle.com
LinkedIn: http://www.linkedin.com/in/matthieubrucher
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] numpy.test() errors

2010-11-22 Thread Nils Wagner
Hi all,

There are some new test errors

======================================================================
ERROR: Test with missing and filling values
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/data/home/nwagner/local/lib/python2.5/site-packages/numpy/lib/tests/test_io.py", line 947, in test_user_filling_values
    test = np.genfromtxt(StringIO(data), **kwargs)
  File "/data/home/nwagner/local/lib/python2.5/site-packages/numpy/lib/npyio.py", line 1285, in genfromtxt
    key = names.index(key)
AttributeError: 'tuple' object has no attribute 'index'

======================================================================
ERROR: test_user_missing_values (test_io.TestFromTxt)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/data/home/nwagner/local/lib/python2.5/site-packages/numpy/lib/tests/test_io.py", line 931, in test_user_missing_values
    **basekwargs)
  File "/data/home/nwagner/local/lib/python2.5/site-packages/numpy/lib/npyio.py", line 1657, in mafromtxt
    return genfromtxt(fname, **kwargs)
  File "/data/home/nwagner/local/lib/python2.5/site-packages/numpy/lib/npyio.py", line 1285, in genfromtxt
    key = names.index(key)
AttributeError: 'tuple' object has no attribute 'index'

======================================================================
ERROR: test_io.test_gzip_load
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/data/home/nwagner/local/lib/python2.5/site-packages/nose-0.11.1-py2.5.egg/nose/case.py", line 183, in runTest
    self.test(*self.arg)
  File "/data/home/nwagner/local/lib/python2.5/site-packages/numpy/lib/tests/test_io.py", line 1255, in test_gzip_load
    assert_array_equal(np.load(f), a)
  File "/data/home/nwagner/local/lib/python2.5/site-packages/numpy/lib/npyio.py", line 327, in load
    fid = seek_gzip_factory(file)
  File "/data/home/nwagner/local/lib/python2.5/site-packages/numpy/lib/npyio.py", line 66, in seek_gzip_factory
    g.name = f.name
AttributeError: GzipFile instance has no attribute 'name'

----------------------------------------------------------------------
Ran 3066 tests in 12.458s

FAILED (KNOWNFAIL=4, errors=3)
<nose.result.TextTestResult run=3066 errors=3 failures=0>


>>> numpy.__version__
'2.0.0.dev-12d0200'

Nils
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion