Re: [Numpy-discussion] Does np.std() make two passes through the data?

2010-11-22 Thread Keith Goodman
On Sun, Nov 21, 2010 at 5:56 PM, Robert Kern robert.k...@gmail.com wrote:
 On Sun, Nov 21, 2010 at 19:49, Keith Goodman kwgood...@gmail.com wrote:

 But this sample gives a difference:

 >>> a = np.random.rand(100)
 >>> a.var()
 0.080232196646619805
 >>> var(a)
 0.080232196646619791

 As you know, I'm trying to make a drop-in replacement for
 scipy.stats.nanstd. Maybe I'll have to add an asterisk to the drop-in
 part. Either that, or suck it up and store the damn mean.

 The difference is less than eps. Quite possibly, the one-pass version
 is even closer to the true value than the two-pass version.
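
(Robert's point in numbers — a quick check, using the two values quoted
above, that the gap is far below float64 machine epsilon; a minimal sketch:)

import numpy as np

diff = abs(0.080232196646619805 - 0.080232196646619791)
eps = np.finfo(np.float64).eps   # 2.220446049250313e-16
print diff, diff < eps           # ~1.4e-17, True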

I wrote 3 cython prototype implementations of nanstd for 1d float64 arrays:

>>> a = np.random.rand(100)

# numpy; doesn't take care of NaNs
>>> a.std()
0.28852169850186793

# cython of np.sqrt(((arr - arr.mean())**2).mean())
>>> nanstd_twopass(a, ddof=0)
0.28852169850186798

# cython of np.sqrt((arr*arr).mean() - arr.mean()**2)
>>> nanstd_simple(a, ddof=0)
0.28852169850187437

# http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#On-line_algorithm
>>> nanstd_online(a, ddof=0)
0.28852169850187243

# My target, scipy version
>>> from scipy.stats import nanstd
>>> nanstd(a, bias=True)
0.28852169850186798

Timing:

>>> timeit nanstd(a, bias=True)
10 loops, best of 3: 27.8 ms per loop
>>> timeit a.std()
100 loops, best of 3: 11.5 ms per loop
>>> timeit nanstd_twopass(a, ddof=0)
100 loops, best of 3: 3.24 ms per loop
>>> timeit nanstd_simple(a, ddof=0)
1000 loops, best of 3: 1.6 ms per loop
>>> timeit nanstd_online(a, ddof=0)
100 loops, best of 3: 10.8 ms per loop

nanstd_simple is the fastest but I assume the algorithm is no good for
general use?
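
(A quick demonstration of why: the one-pass formula subtracts two nearly
equal large numbers. Shift the same data by a large constant — 1e8 is just
an illustrative choice — and it loses the variance to cancellation, while
the two-pass formula is unaffected:)

import numpy as np

a = np.random.rand(100) + 1e8  # same spread, large offset

# one-pass textbook formula: (a*a).mean() and a.mean()**2 are both
# ~1e16, so their O(0.1) difference drowns in rounding error and can
# even come out negative, making sqrt return nan
naive = np.sqrt((a * a).mean() - a.mean() ** 2)

# two-pass formula: subtract the mean first; no cancellation problem
twopass = np.sqrt(((a - a.mean()) ** 2).mean())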

I think I'll go with nanstd_twopass. It will most closely match
numpy/scipy, is more robust than nanstd_simple, and is the second
fastest.

Here's the code. Improvements welcomed.

# imports assumed at module level (not shown in the post):
#   cimport cython; cimport numpy as np; import numpy as np
#   plus C-level sqrt and NAN

@cython.boundscheck(False)
@cython.wraparound(False)
def nanstd_simple(np.ndarray[np.float64_t, ndim=1] a, int ddof):
    """nanstd of 1d numpy array with dtype=np.float64 along axis=0."""
    cdef Py_ssize_t i
    cdef int a0 = a.shape[0], count = 0
    cdef np.float64_t asum = 0, a2sum = 0, ai
    for i in range(a0):
        ai = a[i]
        if ai == ai:  # False only for NaN
            asum += ai
            a2sum += ai * ai
            count += 1
    if count > 0:
        asum = asum * asum
        return sqrt((a2sum - asum / count) / (count - ddof))
    else:
        return np.float64(NAN)

@cython.boundscheck(False)
@cython.wraparound(False)
def nanstd_online(np.ndarray[np.float64_t, ndim=1] a, int ddof):
    """nanstd of 1d numpy array with dtype=np.float64 along axis=0."""
    cdef Py_ssize_t i
    cdef int a0 = a.shape[0], n = 0
    cdef np.float64_t mean = 0, M2 = 0, delta, x
    for i in range(a0):
        x = a[i]
        if x == x:
            n += 1
            delta = x - mean
            mean = mean + delta / n
            M2 = M2 + delta * (x - mean)
    if n > 0:
        return np.float64(sqrt(M2 / (n - ddof)))
    else:
        return np.float64(NAN)

@cython.boundscheck(False)
@cython.wraparound(False)
def nanstd_twopass(np.ndarray[np.float64_t, ndim=1] a, int ddof):
    """nanstd of 1d numpy array with dtype=np.float64 along axis=0."""
    cdef Py_ssize_t i
    cdef int a0 = a.shape[0], count = 0
    cdef np.float64_t asum = 0, a2sum = 0, amean, ai, da
    for i in range(a0):
        ai = a[i]
        if ai == ai:
            asum += ai
            count += 1
    if count > 0:
        amean = asum / count
        asum = 0
        for i in range(a0):
            ai = a[i]
            if ai == ai:
                da = ai - amean
                asum += da
                a2sum += (da * da)
        asum = asum * asum  # correction term for rounding in amean
        return sqrt((a2sum - asum / count) / (count - ddof))
    else:
        return np.float64(NAN)


Re: [Numpy-discussion] Does np.std() make two passes through the data?

2010-11-22 Thread Benjamin Root
On Mon, Nov 22, 2010 at 11:03 AM, Keith Goodman kwgood...@gmail.com wrote:

 <snip>


I wonder how the results would change if the size of the array was larger
than the processor cache?  I still can't seem to wrap my head around the
idea that a two-pass algorithm would be faster than a single-pass.  Is this
just a big-O thing where sometimes one algorithm will be faster than the
other based on the size of the problem?

Ben Root


Re: [Numpy-discussion] Does np.std() make two passes through the data?

2010-11-22 Thread josef . pktd
On Mon, Nov 22, 2010 at 12:07 PM, Benjamin Root ben.r...@ou.edu wrote:
 <snip>

 I wonder how the results would change if the size of the array was larger
 than the processor cache?  I still can't seem to wrap my head around the
 idea that a two-pass algorithm would be faster than a single-pass.  Is this
 just a big-O thing where sometimes one algorithm will be faster than the
 other based on the size of the problem?

nanstd_online requires too many divisions, as the Wikipedia article
notes, and is useful mainly if the array doesn't fit in memory.
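
(Both algorithms are O(n), so it isn't a big-O effect; a pure-Python
sketch of where the constant factors differ — the online update does a
divide per element and is serialized through the running mean, while each
pass of the two-pass method is a plain streaming reduction:)

import numpy as np

def var_online(a):
    # one pass, but one divide per element and a serial
    # dependency through the running mean
    n = 0
    mean = 0.0
    M2 = 0.0
    for x in a:
        n += 1
        delta = x - mean
        mean += delta / n
        M2 += delta * (x - mean)
    return M2 / n

def var_twopass(a):
    # two passes, each a simple vectorized reduction
    m = a.mean()
    return ((a - m) ** 2).mean()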

Two-pass would provide the precision we expect in numpy, but I don't
know whether anyone has ever tested the NIST problems for basic
statistics.

Josef


 Ben Root






Re: [Numpy-discussion] Does np.std() make two passes through the data?

2010-11-22 Thread Keith Goodman
On Mon, Nov 22, 2010 at 9:13 AM,  josef.p...@gmail.com wrote:

 Two pass would provide precision that we would expect in numpy, but I
 don't know if anyone ever tested the NIST problems for basic
 statistics.

Here are the results for their most difficult dataset. But I guess
running one test doesn't mean anything.

http://www.itl.nist.gov/div898/strd/univ/addinfo/numacc4.html
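
(For reproducibility, the NumAcc4 series can be reconstructed as below —
this layout is recalled from the NIST page, so double-check against it:
1001 observations, 10000000.2 followed by alternating 10000000.1 and
10000000.3, certified sample standard deviation 0.1:)

import numpy as np

a = np.empty(1001)
a[0] = 10000000.2
a[1::2] = 10000000.1
a[2::2] = 10000000.3
# certified values: mean = 10000000.2, std (ddof=1) = 0.1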

>>> np.absolute(a.std(ddof=1) - 0.1)
5.5884095961911129e-10
>>> np.absolute(nanstd_online(a, ddof=1) - 0.1)
5.5890501948763216e-10
>>> np.absolute(nanstd_simple(a, ddof=1) - 0.1)
nan  # Ha!
>>> np.absolute(nanstd_twopass(a, ddof=1) - 0.1)
5.5879308125117433e-10


Re: [Numpy-discussion] Does np.std() make two passes through the data?

2010-11-22 Thread Keith Goodman
On Mon, Nov 22, 2010 at 10:51 AM,  josef.p...@gmail.com wrote:
 On Mon, Nov 22, 2010 at 1:39 PM, Keith Goodman kwgood...@gmail.com wrote:
 On Mon, Nov 22, 2010 at 10:32 AM,  josef.p...@gmail.com wrote:
 On Mon, Nov 22, 2010 at 1:26 PM, Keith Goodman kwgood...@gmail.com wrote:
 On Mon, Nov 22, 2010 at 9:03 AM, Keith Goodman kwgood...@gmail.com wrote:

 <snip>

 This is 5% faster:

 @cython.boundscheck(False)
 @cython.wraparound(False)
 def nanstd_1d_float64_axis0_2(np.ndarray[np.float64_t, ndim=1] a, int ddof):
     """nanstd of 1d numpy array with dtype=np.float64 along axis=0."""
     cdef Py_ssize_t i
     cdef int a0 = a.shape[0], count = 0
     cdef np.float64_t asum = 0, amean, ai
     for i in range(a0):
         ai = a[i]
         if ai == ai:
             asum += ai
             count += 1
     if count > 0:
         amean = asum / count
         asum = 0
         for i in range(a0):
             ai = a[i]
             if ai == ai:
                 ai -= amean
                 asum += (ai * ai)
         return sqrt(asum / (count - ddof))
     else:
         return np.float64(NAN)

 I think it would be better to write nanvar instead of nanstd and take
 the square root only in a delegating nanstd, instead of the other way
 around. (Also a change that should be made in scipy.stats)

 Yeah, I noticed that numpy does that. I was planning to have separate
 var and std functions. Here's why (from the readme file, but maybe I
 should template it, the sqrt automatically converts large ddof to
 NaN):

 I'm not sure what you are saying; dropping the square root in the
 function doesn't require nan handling in the inner loop. If you want
 to return nan when count - ddof <= 0, then you could just replace

 if count > 0:
 ...

 by
 if count - ddof > 0:
 ...

 Or am I missing the point?

Yes, sorry. Ignore the sqrt/nan comment. The point is that I want to
be able to return the underlying, low-level cython function (see
below). So I need separate var and std versions to do that (unless I
can modify default kwargs on the fly).

 Under the hood Nanny uses a separate Cython function for each
 combination of ndim, dtype, and axis. A lot of the overhead in
 ny.nanmax, for example, is in checking that your axis is within range,
 converting non-array data to an array, and selecting the function to
 use to calculate nanmax.

 You can get rid of the overhead by doing all this before you, say,
 enter an inner loop:

 >>> arr = np.random.rand(10, 10)
 >>> axis = 0
 >>> func, a = ny.func.nanmax_selector(arr, axis)
 >>> func.__name__
 'nanmax_2d_float64_axis0'

 Let's see how much faster that runs:

 >>> timeit np.nanmax(arr, axis=0)
 1 loops, best of 3: 25.7 us per loop
 >>> timeit ny.nanmax(arr, axis=0)
 10 loops, best of 3: 5.25 us per loop
 >>> timeit func(a)
 10 loops, best of 3: 2.5 us per loop

 Note that func is faster than NumPy's non-nan version of max:

 >>> timeit arr.max(axis=0)
 10 loops, best of 3: 3.28 us per loop

 So adding NaN protection to your inner loops has a negative cost!


Re: [Numpy-discussion] Does np.std() make two passes through the data?

2010-11-22 Thread josef . pktd
On Mon, Nov 22, 2010 at 1:51 PM,  josef.p...@gmail.com wrote:
 <snip>

 Yeah, I noticed that numpy does that. I was planning to have separate
 var and std functions. Here's why (from the readme file, but maybe I
 should template it, the sqrt automatically converts large ddof to
 NaN):

I don't think that works for complex numbers.
(statsmodels now has a preference that calculations also work for
complex numbers.)
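
(The point in one line, shown with numpy's sqrt — for float input a
negative argument gives nan, which is what the large-ddof trick relies on,
but for complex input it returns a valid number; warning output omitted:)

>>> np.sqrt(-1.0)
nan
>>> np.sqrt(-1 + 0j)
1j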

Josef




Re: [Numpy-discussion] Does np.std() make two passes through the data?

2010-11-22 Thread Keith Goodman
On Mon, Nov 22, 2010 at 11:00 AM,  josef.p...@gmail.com wrote:

 I don't think that works for complex numbers.
 (statsmodels has now a preference that calculations work also for
 complex numbers)

I'm only supporting int32, int64, float64 for now. Getting the other
ints and floats should be easy. I don't have plans for complex
numbers. If it's not a big change then someone can add support later.


Re: [Numpy-discussion] Does np.std() make two passes through the data?

2010-11-22 Thread josef . pktd
On Mon, Nov 22, 2010 at 1:59 PM, Keith Goodman kwgood...@gmail.com wrote:
 On Mon, Nov 22, 2010 at 10:51 AM,  josef.p...@gmail.com wrote:
 On Mon, Nov 22, 2010 at 1:39 PM, Keith Goodman kwgood...@gmail.com wrote:
 On Mon, Nov 22, 2010 at 10:32 AM,  josef.p...@gmail.com wrote:
 On Mon, Nov 22, 2010 at 1:26 PM, Keith Goodman kwgood...@gmail.com wrote:
 On Mon, Nov 22, 2010 at 9:03 AM, Keith Goodman kwgood...@gmail.com 
 wrote:

 <snip>

 Yes, sorry. Ignore the sqrt/nan comment. The point is that I want to
 be able to return the underlying, low-level cython function (see
 below). So I need separate var and std versions to do that (unless I
 can modify default kwargs on the fly).

I think you could still delegate at the cython level, but given the
amount of code duplication that is required, I don't expect it to make
much difference for staying DRY.
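
(A sketch of delegating at the cython level — the function names here are
hypothetical, the kernel body is just the two-pass loop from earlier in
the thread, and the same module-level cimports as Keith's code are assumed:)

cdef np.float64_t _nanvar_1d(np.ndarray[np.float64_t, ndim=1] a, int ddof):
    # shared kernel: two-pass, nan-aware variance
    cdef Py_ssize_t i
    cdef int n = a.shape[0], count = 0
    cdef np.float64_t asum = 0, d2sum = 0, amean, ai
    for i in range(n):
        ai = a[i]
        if ai == ai:
            asum += ai
            count += 1
    if count - ddof <= 0:
        return NAN
    amean = asum / count
    for i in range(n):
        ai = a[i]
        if ai == ai:
            ai -= amean
            d2sum += ai * ai
    return d2sum / (count - ddof)

def nanvar_1d(np.ndarray[np.float64_t, ndim=1] a, int ddof=0):
    return np.float64(_nanvar_1d(a, ddof))

def nanstd_1d(np.ndarray[np.float64_t, ndim=1] a, int ddof=0):
    # std delegates: just the square root of the shared variance kernel
    return np.float64(sqrt(_nanvar_1d(a, ddof)))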

Josef




Re: [Numpy-discussion] Does np.std() make two passes through the data?

2010-11-22 Thread josef . pktd
On Mon, Nov 22, 2010 at 2:04 PM, Keith Goodman kwgood...@gmail.com wrote:
 On Mon, Nov 22, 2010 at 11:00 AM,  josef.p...@gmail.com wrote:

 I don't think that works for complex numbers.
 (statsmodels has now a preference that calculations work also for
 complex numbers)

 I'm only supporting int32, int64, float64 for now. Getting the other
 ints and floats should be easy. I don't have plans for complex
 numbers. If it's not a big change then someone can add support later.

Fair enough, but if you need numerical derivatives, then complex
support looks useful.

Josef



[Numpy-discussion] Does np.std() make two passes through the data?

2010-11-21 Thread Keith Goodman
Does np.std() make two passes through the data?

Numpy:

>>> arr = np.random.rand(10)
>>> arr.std()
0.3008736260967052

Looks like an algorithm that makes one pass through the data (one for
loop) wouldn't match arr.std():

>>> np.sqrt((arr*arr).mean() - arr.mean()**2)
0.30087362609670526

But a slower two-pass algorithm would match arr.std():

>>> np.sqrt(((arr - arr.mean())**2).mean())
0.3008736260967052

Is there a way to get the same result as arr.std() in one pass (cython
for loop) of the data?


Re: [Numpy-discussion] Does np.std() make two passes through the data?

2010-11-21 Thread josef . pktd
On Sun, Nov 21, 2010 at 6:43 PM, Keith Goodman kwgood...@gmail.com wrote:
 Does np.std() make two passes through the data?

 <snip>

 Is there a way to get the same result as arr.std() in one pass (cython
 for loop) of the data?

The reference pointed to several times on the list is the Wikipedia page, e.g.
http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#On-line_algorithm

I don't know about the actual implementation.

Josef



Re: [Numpy-discussion] Does np.std() make two passes through the data?

2010-11-21 Thread Robert Kern
On Sun, Nov 21, 2010 at 17:43, Keith Goodman kwgood...@gmail.com wrote:
 Does np.std() make two passes through the data?

Yes. See PyArray_Std() in numpy/core/src/calculation.c

-- 
Robert Kern

I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth.
  -- Umberto Eco


Re: [Numpy-discussion] Does np.std() make two passes through the data?

2010-11-21 Thread Keith Goodman
On Sun, Nov 21, 2010 at 4:18 PM,  josef.p...@gmail.com wrote:
 On Sun, Nov 21, 2010 at 6:43 PM, Keith Goodman kwgood...@gmail.com wrote:
 <snip>
 The reference pointed to several times on the list is the Wikipedia page, e.g.
 http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#On-line_algorithm

Unfortunately it doesn't give the same result as numpy's two-pass algorithm.

From Wikipedia:

def var(arr):
    # one-pass (Welford-style) online algorithm
    n = 0
    mean = 0
    M2 = 0
    for x in arr:
        n = n + 1
        delta = x - mean
        mean = mean + delta / n        # update the running mean
        M2 = M2 + delta * (x - mean)   # update the sum of squared deviations
    return M2 / n

This set of random samples gives matching variance:

>>> a = np.random.rand(100)
>>> a.var()
0.07867478939716277
>>> var(a)
0.07867478939716277

But this sample gives a difference:

>>> a = np.random.rand(100)
>>> a.var()
0.080232196646619805
>>> var(a)
0.080232196646619791

As you know, I'm trying to make a drop-in replacement for
scipy.stats.nanstd. Maybe I'll have to add an asterisk to the drop-in
part. Either that, or suck it up and store the damn mean.


Re: [Numpy-discussion] Does np.std() make two passes through the data?

2010-11-21 Thread Keith Goodman
On Sun, Nov 21, 2010 at 5:56 PM, Robert Kern robert.k...@gmail.com wrote:
 On Sun, Nov 21, 2010 at 19:49, Keith Goodman kwgood...@gmail.com wrote:

 But this sample gives a difference:

 a = np.random.rand(100)
 a.var()
   0.080232196646619805
 var(a)
   0.080232196646619791

 As you know, I'm trying to make a drop-in replacement for
 scipy.stats.nanstd. Maybe I'll have to add an asterisk to the drop-in
 part. Either that, or suck it up and store the damn mean.

 The difference is less than eps. Quite possibly, the one-pass version
 is even closer to the true value than the two-pass version.

Good, it passes the Kern test.

Here's an even more robust estimate:

>>> var(a - a.mean())
0.080232196646619819

Which is better, numpy's two-pass or the one-pass on-line method?

>>> test()
NumPy error: 9.31135e-18
Nanny error: 6.5745e-18  <-- One pass wins!

def test(n=10):
    numpy = 0
    nanny = 0
    for i in range(n):
        a = np.random.rand(10)
        truth = var(a - a.mean())
        numpy += np.absolute(truth - a.var())
        nanny += np.absolute(truth - var(a))
    print 'NumPy error: %g' % (numpy / n)
    print 'Nanny error: %g' % (nanny / n)
    print