Re: [Numpy-discussion] [AstroPy] Rotating and Transforming Vectors--Flight Path of a Celestial Body

2009-12-17 Thread Anne Archibald
2009/12/18 Wayne Watson sierra_mtnv...@sbcglobal.net:
 It's starting to come back to me. I found a few old graphics books that
 get into transformation matrices and such. Yes, earth centered. I ground
 out some code with geometry and trig that at least gets the first point
 on the path right. I think I can probably apply a rotation around the
 x-axis several times with a 1/2 degree rotation for each step towards
 the north to get the job done.

 I'm still fairly fresh with Python, and found a little bit of info in a
 Tentative numpy tutorial. I found this on getting started with matrices:

 from numpy import matrix

 Apparently matrix is a class in numpy, and there are many others, linalg
 I think is one. How
 does one find all the classes, and are there rules that keep them apart.
 It was tempting to add
 import numpy in addition to the above, but how is what is in numpy
 different than the classes?
 Is numpy solely class driven? That is, if one does a from as above to
 get all the classes, does
 it give all the capabilities that just using import numpy does?

Many things in python are classes; a class is a way of attaching
relevant functions to a collection of data (more precisely, a class is
a *type*, defining the interpretation of data; usually they also carry
a collection of functions to operate on that data). So the central
feature of numpy is a class, ndarray, that represents a collection of
values of homogeneous type. This may be one, two, or many-dimensional,
and there are various operations, including linear algebra, on them
available in numpy.

The matrix class is a special kind of ndarray in which a number of
modifications have been made. In particular, the * operator no longer
does element-wise operations, it does matrix multiplication. There are
also various rules designed to ensure that matrix objects are always
two-dimensional. I avoid matrix objects like the plague, but some
people find them useful.
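
For example, the difference shows up immediately with the * operator (a
quick sketch, not from the original message):

import numpy as np

a = np.array([[1., 2.],
              [3., 4.]])
m = np.matrix(a)

a * a           # ndarray: element-wise product
np.dot(a, a)    # ndarray: matrix product
m * m           # matrix subclass: * means matrix product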

numpy.linalg is an entirely different beast. It is a *module*, a
collection of functions (and potentially objects and classes). It is
like sys or os: you import it and the functions, objects and classes
it contains become available. This is a basic feature of python. What
is unusual (but not unique) is that rather than having to explicitly
import it like:

import numpy
import numpy.linalg

numpy.linalg.svd(numpy.ones((3,2)))

numpy automatically imports it for you, every time. This is done for
historical reasons and won't change, but is a wart.

For your purposes, I recommend simply using numpy arrays -
three-element arrays for vectors, three-by-three for matrices - and
using the linear algebra functions numpy provides to act on them. For
example, dot does matrix-matrix, matrix-vector, and vector-vector
multiplication.
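
For instance, the half-degree steps about the x-axis described above might
look like this (an illustrative sketch; the starting vector and the sign
convention of the rotation are assumptions):

import numpy as np

def rot_x(t):
    # rotation by angle t (radians) about the x-axis
    c, s = np.cos(t), np.sin(t)
    return np.array([[1, 0,  0],
                     [0, c, -s],
                     [0, s,  c]])

step = np.radians(0.5)            # half-degree increment
v = np.array([0.0, 1.0, 0.0])     # starting direction (example value)
path = [v]
for _ in range(180):              # 90 degrees in half-degree steps
    path.append(np.dot(rot_x(step), path[-1]))
path = np.array(path)             # shape (181, 3)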


Anne

P.S. you can usually write out a rotation explicitly, e.g. as
[[cos(t), sin(t), 0],
 [-sin(t), cos(t), 0],
 [0, 0, 1]]
but if you want a more general one I believe there's a clever way to
make it using two reflections. -A
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] no ordinary Bessel functions?

2009-12-15 Thread Anne Archibald
2009/12/14 Dr. Phillip M. Feldman pfeld...@verizon.net:

 When I issue the command

 np.lookfor('bessel')

 I get the following:

 Search results for 'bessel'
 ---
 numpy.i0
    Modified Bessel function of the first kind, order 0.
 numpy.kaiser
    Return the Kaiser window.
 numpy.random.vonmises
    Draw samples from a von Mises distribution.

 I assume that there is an ordinary (unmodified) Bessel function in NumPy,
 but have not been able to figure out how to access it. Also, I need to
 operate sometimes on scalars, and sometimes on arrays. For operations on
 scalars, are the NumPy Bessel functions significantly slower than the SciPy
 Bessel functions?

I am afraid this is one of the historical warts in numpy. The proper
place for special functions is in scipy.special, which does indeed
have various flavours of Bessel functions.
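
For example (these accept scalars and arrays alike):

import numpy as np
from scipy import special

special.j0(2.5)                        # ordinary Bessel J_0 at a scalar
special.jv(1, np.linspace(0, 10, 5))   # J_1 evaluated on an array (jv takes arbitrary real order)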

Anne
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Repeated dot products

2009-12-12 Thread Anne Archibald
2009/12/12 T J tjhn...@gmail.com:
 Hi,

 Suppose I have an array of shape:  (n, k, k).  In this case, I have n
 k-by-k matrices.  My goal is to compute the product of a (potentially
 large) user-specified selection (with replacement) of these matrices.
 For example,

   x = [0,1,2,1,3,3,2,1,3,2,1,5,3,2,3,5,2,5,3,2,1,3,5,6]

 says that I want to take the 0th matrix and dot it with the 1st matrix
 and dot that product with the 2nd matrix and dot that product with the
 1st matrix again and so on...

 Essentially, I am looking for efficient ways of doing this.  It seems
 like what I *need* is for dot to be a ufunc with a reduce() operator.
 Then I would construct an array of the matrices, as specified by the
 input x.  For now, I am using a python loop and this is unbearable.

 prod = np.eye(k)
 for i in x:
 ...  prod = dot(prod, matrices[i])
 ...

 Is there a better way?

You are right that numpy/scipy should have a matrix product ufunc. In
fact it should have ufunc versions of all the linear algebra tools. It
even has the infrastructure (generalized ufuncs) to support such a
thing in a very general way. Sadly this infrastructure is basically
unused.

Your best workaround depends on the sizes of the various objects
involved. If k is large enough, then a k by k by k matrix multiply
will be slow enough that it doesn't really matter what else you do. If
x is much longer than n, you will want to avoid constructing a len(x)
by k by k array. If n is really small and x really large, you might
even win by some massively clever Knuthian operation that found
repeated substrings in x and evaluated each corresponding matrix
product only once. If on the other hand len(x) is much shorter than n,
you could perhaps benefit from forming an intermediate len(x) by k by
k array. This array could then be used in a vectorized reduce
operation, if we had one. Normally, one can work around the lack of a
vectorized matrix multiplication by forming an (n*k) by k matrix and
multiplying it, but I don't think this will help any with reduce. If k
is small, you can multiply two k by k matrices by producing the k by k
by k elementwise product and then adding along the middle axis, but I
don't think this will work well with reduction either.

So I have just two practical solutions, really:
(a) use python reduce() and the matrix product, possibly taking
advantage of the lack of need to produce an intermediate len(x) by k
by k matrix, and live with python in that part of the loop, or
(b) use the fact that in recent versions of numpy there's a
quasi-undocumented ufuncized matrix multiply built as a test of the
generalized ufunc mechanism. I have no idea whether it supports
reduce() (and if it doesn't adding it may be an unpleasant
experience).
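
A minimal sketch of option (a), with made-up sizes:

import numpy as np
from functools import reduce

n, k = 10, 4
matrices = np.random.randn(n, k, k)
x = [0, 1, 2, 1, 3, 3, 2, 1]

# fold np.dot over the selected matrices, starting from the identity
prod = reduce(np.dot, (matrices[i] for i in x), np.eye(k))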

Good luck,
Anne

P.S. if you come up with a good implementation of the repeated
substring approach I'd love to hear it! -A
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Assigning complex values to a real array

2009-12-09 Thread Anne Archibald
2009/12/9 Dr. Phillip M. Feldman pfeld...@verizon.net:


 Pauli Virtanen-3 wrote:

 Nevertheless, I can't really regard dropping the imaginary part a
 significant issue.


 I am amazed that anyone could say this.  For anyone who works with Fourier
 transforms, or with electrical circuits, or with electromagnetic waves,
 dropping the imaginary part is a huge issue because we get answers that are
 totally wrong.

I agree that dropping the imaginary part is a wart. But it is one that
is not very hard to learn to live with. I say this as someone who has
been burned by it while using Fourier analysis to work with
astronomical data.

 When I recently tried to validate a code, the answers were wrong, and it
 took two full days to track down the cause.  I am now forced to reconsider
 carefully whether Python/NumPy is a suitable platform for serious scientific
 computing.

While I find the current numpy complex-real conversion annoying, I
have to say, this kind of rhetoric does not benefit your cause. It
sounds childish and manipulative, and makes even people who agree in
principle want to tell you to go ahead and use MATLAB and stop
pestering us. We are not here to sell you on numpy; if you hate it,
don't use it. We are here because *we* use it, warts and all, and we
want to discuss interesting topics related to numpy. That you would
have implemented it differently is not very interesting if you are not
even willing to understand why it is the way it is and what a change
would cost, let alone propose a workable way to improve.

Anne
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Find the N maximum values and corresponding indexes in an array

2009-12-02 Thread Anne Archibald
2009/12/2 Keith Goodman kwgood...@gmail.com:
 On Wed, Dec 2, 2009 at 5:09 PM, Neal Becker ndbeck...@gmail.com wrote:
 David Warde-Farley wrote:

 On 2-Dec-09, at 6:55 PM, Howard Chong wrote:

 def myFindMaxA(myList):
    """implement finding maximum value with for loop iteration"""
    maxIndex=0
    maxVal=myList[0]
    for index, item in enumerate(myList):
        if item[0] > maxVal:
            maxVal=item[0]
            maxIndex=index
    return [maxIndex, maxVal]



 My question is: how can I make the latter version run faster? I
 think the
 answer is that I have to do the iteration in C.


 def find_biggest_n(myarray, n):
 ind = np.argsort(myarray)
 return ind[-n:], myarray[ind[-n:]]

 David
 Not bad, although I wonder whether a partial sort could be faster.

 I'm doing a lot of sorting right now. I only need to sort the lowest
 30% of values in a 1d array (about 250k elements), the rest I don't
 need to sort. How do I do a partial sort?

Algorithmically, if you're doing a quicksort, you just don't sort one
side of a partition when it's outside the range you want sorted (which
could even be data-dependent, I suppose, as well as
number-of-items-dependent). This is useful for sorting out only the
extreme elements, as well as finding quantiles (and those elements
above/below them). Unfortunately I'm not aware of any implementation,
useful though it might be.
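
A rough pure-Python/numpy sketch of that idea (not an existing library
routine, and it copies rather than partitioning in place; it's only meant
to show which recursions get skipped):

import numpy as np

def partial_sort_smallest(a, k):
    # Return the k smallest elements of a, sorted; the rest are never sorted.
    a = np.asarray(a)
    if len(a) <= 1 or k <= 0:
        return np.sort(a)[:max(k, 0)]
    pivot = a[len(a) // 2]
    lower, equal, higher = a[a < pivot], a[a == pivot], a[a > pivot]
    if k <= len(lower):
        # everything we want is below the pivot: skip the other side entirely
        return partial_sort_smallest(lower, k)
    head = np.concatenate([np.sort(lower), equal])[:k]
    if len(head) >= k:
        return head
    # only now do we need (part of) the high side
    return np.concatenate([head, partial_sort_smallest(higher, k - len(head))])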

Anne
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Find the N maximum values and corresponding indexes in an array

2009-12-02 Thread Anne Archibald
2009/12/2 David Warde-Farley d...@cs.toronto.edu:
 On 2-Dec-09, at 8:09 PM, Neal Becker wrote:

 Not bad, although I wonder whether a partial sort could be faster.

 Probably (if the array is large) but depending on n, not if it's in
 Python. Ideal problem for Cython, though.

How is Cython support for generic types these days? One wouldn't want
to have to write separate versions for each dtype...

Anne

 David
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Find the N maximum values and corresponding indexes in an array

2009-12-02 Thread Anne Archibald
2009/12/2 Keith Goodman kwgood...@gmail.com:
 On Wed, Dec 2, 2009 at 5:27 PM, Anne Archibald
 peridot.face...@gmail.com wrote:
 2009/12/2 Keith Goodman kwgood...@gmail.com:
 On Wed, Dec 2, 2009 at 5:09 PM, Neal Becker ndbeck...@gmail.com wrote:
 David Warde-Farley wrote:

 On 2-Dec-09, at 6:55 PM, Howard Chong wrote:

 def myFindMaxA(myList):
    """implement finding maximum value with for loop iteration"""
    maxIndex=0
    maxVal=myList[0]
    for index, item in enumerate(myList):
        if item[0] > maxVal:
            maxVal=item[0]
            maxIndex=index
    return [maxIndex, maxVal]



 My question is: how can I make the latter version run faster? I
 think the
 answer is that I have to do the iteration in C.


 def find_biggest_n(myarray, n):
 ind = np.argsort(myarray)
 return ind[-n:], myarray[ind[-n:]]

 David
 Not bad, although I wonder whether a partial sort could be faster.

 I'm doing a lot of sorting right now. I only need to sort the lowest
 30% of values in a 1d array (about 250k elements), the rest I don't
 need to sort. How do I do a partial sort?

 Algorithmically, if you're doing a quicksort, you just don't sort one
 side of a partition when it's outside the range you want sorted (which
 could even be data-dependent, I suppose, as well as
 number-of-items-dependent). This is useful for sorting out only the
 extreme elements, as well as finding quantiles (and those elements
 above/below them). Unfortunately I'm not aware of any implementation,
 useful though it might be.

 Oh, I thought he meant there was a numpy function for partial sorting.

 What kind of speed up for my problem (sort lower 30% of a 1d array
 with 250k elements) could I expect if I paid someone to write it in
 cython? Twice as fast? I'm actually doing x.argsort(), if that
 matters.

Hard to say exactly, but I'd say at best a factor of two. You can get
a rough best-case value by doing an argmin followed by an argsort of
the first 30%.

In reality things will be a bit worse because you won't immediately be
able to discard the whole rest of the array, and because your argsort
will be accessing data spread over a larger swath of memory (so worse
cache performance). But if this doesn't provide a useful speedup, it's
very unlikely partial sorting will be of any use to you.

Medians and other quantiles, or partitioning tasks, are where it really shines.
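
Something like this gives a rough number for that best case (sizes taken
from the thread; purely a timing sketch):

import numpy as np
import timeit

x = np.random.rand(250000)
k = int(0.3 * x.size)

t_full = timeit.timeit(lambda: x.argsort(), number=10)
# crude best case: one linear pass plus an argsort of 30% of the data
t_best = timeit.timeit(lambda: (x.argmin(), x[:k].argsort()), number=10)
print(t_full / t_best)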

Anne
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] convert strides/shape/offset into nd index?

2009-11-30 Thread Anne Archibald
2009/11/30 James Bergstra bergs...@iro.umontreal.ca:
 Your question involves a few concepts:

 - an integer vector describing the position of an element

 - the logical shape (another int vector)

 - the physical strides (another int vector)

 Ignoring the case of negative offsets, a physical offset is the inner
 product of the physical strides with the position vector.

 In these terms, you are asking how to solve the inner-product equation
 for the position vector.  There can be many possible solutions (like,
 if there is a stride of 1, then you can make that dimension account
 for the entire offset.  This is often not the solution you want.).
 For valid ndarrays, though, there is at most one solution with the
 property that every position element is less than the shape.

 You will also need to take into account that for certain stride
 vectors, there is no way to get certain offsets.  Imagine all the
 strides were even, and you needed to get at an odd offset... it would
 be impossible.  It would even be impossible if there were a dimension
 with stride 1 but it had shape of 1 too.

 I can't think of an algorithm off the top of my head that would do
 this in a quick and elegant way.

Not to be a downer, but this problem is technically NP-complete. The
so-called knapsack problem is to find a subset of a collection of
numbers that adds up to the specified number, and it is NP-complete.
Unfortunately, it is exactly what you need to do to find the indices
to a particular memory location in an array of shape (2,2,...,2).

What that means in practice is that either you have to allow
potentially very slow algorithms (though you know that there will
never be more than 32 different values in the knapsack, which might or
might not be enough to keep things tractable) or use heuristic
algorithms that don't always work. There are probably fairly good
heuristics, particularly if the array elements are all at distinct
memory locations (arrays with overlapping elements can arise from
broadcasting and other slightly more arcane operations).
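
For what it's worth, the sort-the-strides-and-divmod heuristic from the
quoted question below might look like this (a sketch; it is only
trustworthy for ordinary non-overlapping layouts, for the reasons above):

import numpy as np

def offset_to_index(offset, shape, strides):
    # Greedy heuristic: peel off axes in order of decreasing |stride|.
    index = [0] * len(shape)
    for axis in sorted(range(len(shape)), key=lambda i: -abs(strides[i])):
        if strides[axis] != 0:
            index[axis], offset = divmod(offset, strides[axis])
    if offset != 0 or any(not (0 <= i < n) for i, n in zip(index, shape)):
        raise ValueError("offset does not correspond to a valid index")
    return tuple(index)

# quick check on a C-contiguous array (strides measured in elements, not bytes)
a = np.arange(24).reshape(2, 3, 4)
strides = [s // a.itemsize for s in a.strides]
assert offset_to_index(17, a.shape, strides) == (1, 1, 1)   # a[1, 1, 1] == 17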

Anne


 James

 On Sun, Nov 29, 2009 at 10:36 AM, Zachary Pincus
 zachary.pin...@yale.edu wrote:
 Hi all,

 I'm curious as to what the most straightforward way is to convert an
 offset into a memory buffer representing an arbitrarily strided array
 into the nd index into that array. (Let's assume for simplicity that
 each element is one byte...)

 Does sorting the strides from largest to smallest and then using
 integer division and mod (in the obvious way) always work? (I can't
 seem to find a counterexample, but I am not used to thinking too
 deeply about bizarrely-strided configurations.) Is there a method that
 doesn't involve sorting?

 Thanks,
 Zach
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion




 --
 http://www-etud.iro.umontreal.ca/~bergstrj
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] non-standard standard deviation

2009-11-29 Thread Anne Archibald
2009/11/29 Dr. Phillip M. Feldman pfeld...@verizon.net:

 All of the statistical packages that I am currently using and have used in
 the past (Matlab, Minitab, R, S-plus) calculate standard deviation using the
 sqrt(1/(n-1)) normalization, which gives a result that is unbiased when
 sampling from a normally-distributed population.  NumPy uses the sqrt(1/n)
 normalization.  I'm currently using the following code to calculate standard
 deviations, but would much prefer if this could be fixed in NumPy itself:

This issue was the subject of lengthy discussions on the mailing list,
the upshot of which is that in current versions of scipy, std and var
take an optional argument ddof, into which you can supply 1 to get
the normalization you want.
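
For example:

import numpy as np

x = np.random.randn(100)
np.std(x)           # default: divides by n
np.std(x, ddof=1)   # divides by n - 1
np.var(x, ddof=1)   # same option exists for the variance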

Anne

 def mystd(x=numpy.array([]), axis=None):
    """This function calculates the standard deviation of the input using the
    definition of standard deviation that gives an unbiased result for samples
    from a normally-distributed population."""

 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Computing Simple Statistics When Only the Frequency Distribution is Known

2009-11-29 Thread Anne Archibald
2009/11/28 Wayne Watson sierra_mtnv...@sbcglobal.net:
 Anne Archibald wrote:
 2009/11/28 Wayne Watson sierra_mtnv...@sbcglobal.net:

 I was only illustrating a way that I would not consider, since the
 hardware has already created the pdf. I've already coded it pretty much
  as you have suggested. As I think I mentioned above, I'm a bit
  surprised numpy doesn't provide the code you suggest as part of some
  function. CalcSimplefromPDF(xvalues=mydatarray, avg=true, minmax=true,
 ...).


 Feel free to submit an implementation to numpy's issue tracker. I
 suggest modifying mean, std, and var (at least) so that, like average,
 they take an array of weights.

 How would I do that?


Obtain a copy of recent numpy source code - a .tar file from the
website, or using SVN, or git. Then add the feature plus some tests,
confirm that the tests pass, and post a request and your patch to the
bug tracker.

Anne
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Computing Simple Statistics When Only the Frequency Distribution is Known

2009-11-28 Thread Anne Archibald
2009/11/28 Wayne Watson sierra_mtnv...@sbcglobal.net:

 I was only illustrating a way that I would not consider, since the
 hardware has already created the pdf. I've already coded it pretty much
 as you have suggested. As I think I mentioned above, I'm a bit
 surprised numpy doesn't provide the code you suggest as part of some
 function. CalcSimplefromPDF(xvalues=mydatarray, avg=true, minmax=true,
 ...).

Feel free to submit an implementation to numpy's issue tracker. I
suggest modifying mean, std, and var (at least) so that, like average,
they take an array of weights.
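
In the meantime, the weighted statistics are easy to compute by hand with
np.average (a sketch, with made-up bin values and counts):

import numpy as np

values = np.arange(256)                  # bin values reported by the hardware
counts = np.random.randint(0, 50, 256)   # the measured frequency distribution

mean = np.average(values, weights=counts)
var = np.average((values - mean) ** 2, weights=counts)
std = np.sqrt(var)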

Anne
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Bytes vs. Unicode in Python3

2009-11-27 Thread Anne Archibald
2009/11/27 Christopher Barker chris.bar...@noaa.gov:

 The point is that I don't think we can just decide to use Unicode or
 Bytes in all places where PyString was used earlier.

 Agreed.

I only half agree. It seems to me that for almost all situations where
PyString was used, the right data type is a python3 string (which is
unicode). I realize there may be some few cases where it is
appropriate to use bytes, but I think there needs to be a compelling
reason for each one.

 In a way, unicode strings are a bit like arrays: they have an encoding
 associated with them (like a dtype in numpy). You can represent a given
 bit of text in multiple different arrangements of bytes, but they are all
 supposed to mean the same thing and, if you know the encoding, you can
 convert between them. This is kind of like how one can represent 5 in
 any of many dtypes: uint8, int16, int32, float32, float64, etc. Not any
 value represented by one dtype can be converted to all other dtypes, but
 many can. Just like encodings.

This is incorrect. Unicode objects do not have default encodings or
multiple internal representations (within a single python interpreter,
at least). Unicode objects use a 2- or 4-byte representation
internally, but this is almost invisible to the user. Encodings only
become relevant when you want to convert a unicode object to a byte
stream. It is usually an error to store text in a byte stream (for it
to make sense you must provide some mechanism to specify the
encoding).
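
For instance, in Python 3:

s = "naïve"                 # str: a sequence of unicode code points
b = s.encode("utf-8")       # bytes: the encoding only enters at this boundary
assert b.decode("utf-8") == s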

 Anyway, all this brings me to think about the use of strings in numpy in
 this way: if it is meant to be a human-readable piece of text, it should
 be a unicode object. If not, then it is bytes.

 So: fromstring and the like should, of course, work with bytes (though
 maybe buffers really...)

I think if you're going to call it fromstring, it should convert from
strings (i.e. unicode strings). But really, I think it makes more
sense to rename it frombytes() and have it convert bytes objects. One
could then have

def fromstring(s, encoding="utf-8"):
    return frombytes(s.encode(encoding))

as a shortcut. Maybe ASCII makes more sense as a default encoding. But
really, think about where the user's going to get the string: most of
the time it's coming from a disk file or a network stream, so it will
be a byte string already, so they should use frombytes.

 To summarize the use cases I've ran across so far:

 1) For 'S' dtype, I believe we use Bytes for the raw data and the
    interface.

 I don't think so here. 'S' is usually used to store human-readable
 strings, I'd certainly expect to be able to do:

 s_array = np.array(['this', 'that'], dtype='S10')

 And I'd expect it to work with non-literals that were unicode strings,
 i.e. human readable text. In fact, it's pretty rare that I'd ever want
 bytes here. So I'd see 'S' mapped to 'U' here.

+1

 Francesc Alted wrote:
 the next  should still work:

 In [2]: s = np.array(['asa'], dtype='S10')

 In [3]: s[0]
 Out[3]: 'asa'  # will become b'asa' in Python 3

 I don't like that -- I put in a string, and get a bytes object back?

I agree.

 In [4]: s.dtype.itemsize
 Out[4]: 10     # still 1-byte per element

 But what if the strings passed in aren't representable in one byte
 per character? Do we define 'S' as supporting ANSI-only strings?
 what encoding?

Itemsize will change. That's fine.

 3) Format strings

       a = array([], dtype=b'i4')

 I don't think it makes sense to handle format strings in Unicode
 internally -- they should always be coerced to bytes.

 This should be fine -- we control what is a valid format string, and
 thus they can always be ASCII-safe.

I have to disagree. Why should we force the user to use bytes? The
format strings are just that, strings, and we should be able to supply
python strings to them. Keep in mind that coercing strings to bytes
requires extra information, namely the encoding. If you want to
emulate python2's value-dependent coercion - raise an exception only
if non-ASCII is present - keep in mind that python3 is specifically
removing that behaviour because of the problems it caused.


Anne
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Adding the new polynomial/chebyshev modules.

2009-11-16 Thread Anne Archibald
2009/11/16 Christopher Barker chris.bar...@noaa.gov:
 Charles R Harris wrote:
 I would like some advise on the best way to add the new functions. I've
 added a new package polynomial, and that package contains four new
 modules: chebyshev, polynomial, polytemplate, polyutils.

 This seems to belong more in scipy than numpy, but I'll leave that to
 others to decide.

 whether or not to include all of the functions in these packages in the
 __init__.py, or to just import the modules.

 Are any of them compiled code? I've been very frustrated when I can't
 use some pure python stuff in scipy because broken compiled fortran
 extensions are getting imported that I don't even need.

 If that isn't an issue, and the polynomial package would end up with
 only a handful of names, then I say import them all. Another way to ask
 this: would there by ANY names in the polynomial package if you don't
 import the modules?

 If there is compiled code, the import could fail gracefully, and then
 you could still pull it all in.

 OTOH, what this does is bring stuff into memory unnecessarily, and also
 brings it into stand-alone bundles (py2exe, py2app, etc). So if these
 modules are not small, then it's probably better to have to import them
 explicitly.

 Also -- do you foresee many more polynomial types in the future? I know
 I'd like to see Hermite.

I have kind of gone silent on this issue, in part because I have been
busy with other things. I think that Charles Harris' framework for
working with polynomials is perfectly sufficient for adding Chebyshev
polynomials, and since they're finished and working and fill a
concrete need, they should probably go into numpy or scipy as is. But
if you want to start introducing other types of polynomials - Hermite,
Lagrange interpolation based, Bernstein based, or other - I think we
would need to revive the discussion about how to unify all these
different types.

Anne

 -Chris


 --
 Christopher Barker, Ph.D.
 Oceanographer

 Emergency Response Division
 NOAA/NOS/ORR            (206) 526-6959   voice
 7600 Sand Point Way NE   (206) 526-6329   fax
 Seattle, WA  98115       (206) 526-6317   main reception

 chris.bar...@noaa.gov
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] finding close together points.

2009-11-16 Thread Anne Archibald
2009/11/16 Christopher Barker chris.bar...@noaa.gov:
 Anne Archibald wrote:
 2009/11/13 Christopher Barker chris.bar...@noaa.gov:
 Wow! great -- you sounded interested, but I had no idea you'd run out
 and do it! thanks! we'll check it out.

 well, it turns out the Python version is unacceptably slow. However, we
 found we could use ckdtree, and use:

 tree.query(points,
             2, # Number of points to return per point given.
             distance_upper_bound = distance_tolerance,
            )

 This gives us a set of pairs of points closer than our distance
 tolerance -- it includes duplicates, of course, but even when we filter
 those in pure python, it's is nice and speedy.

 It looks like it would be pretty easy to add this to the Cython code,
 but it's working now for us, so...

You probably noticed this, but I should remind you that if a point has
more than one nearby neighbor, this may not give you what you wanted.
In particular, if there are three points all within the fixed distance
from each other, you may not find all three pairs this way.

Anne

 Thanks again,

 -Chris



 --
 Christopher Barker, Ph.D.
 Oceanographer

 Emergency Response Division
 NOAA/NOS/ORR            (206) 526-6959   voice
 7600 Sand Point Way NE   (206) 526-6329   fax
 Seattle, WA  98115       (206) 526-6317   main reception

 chris.bar...@noaa.gov
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Adding the new polynomial/chebyshev modules.

2009-11-16 Thread Anne Archibald
2009/11/16 Robert Kern robert.k...@gmail.com:
 On Mon, Nov 16, 2009 at 18:05, Christopher Barker chris.bar...@noaa.gov 
 wrote:
 Charles R Harris wrote:
 That's what I ended up doing. You still need to do import
 numpy.polynomial to get to them, they aren't automatically imported
 into the numpy namespace.

 good start. This brings up a semi-off-topic question:

 Is there a way to avoid importing everything when importing a module
 deep in a big package?

 The package authors need to keep the __init__.py files clear. There is
 nothing you can do as a user.

The reason numpy and scipy don't do this is largely historical -
Numeric had a nearly flat namespace, and imported all the submodules
in any case, and numpy is trying to remain somewhat compatible; scipy
originally tried to reexport all the numpy symbols, for some reason,
and there too it is maintaining compatibility.

Since spatial is new, though, it should be pretty good about not
inflating your imports unnecessarily. It does, of course, import
various things itself, which may balloon out the imports.

Anne

 --
 Robert Kern

 I have come to believe that the whole world is an enigma, a harmless
 enigma that is made terrible by our own mad attempt to interpret it as
 though it had an underlying truth.
  -- Umberto Eco
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] finding close together points.

2009-11-13 Thread Anne Archibald
2009/11/13 Christopher Barker chris.bar...@noaa.gov:
 Anne Archibald wrote:
 2009/11/10 Christopher Barker chris.bar...@noaa.gov:
 I have a bunch of points in 2-d space, and I need to find out which
 pairs of points are within a certain distance of one-another (regular
 old Euclidean norm).

 This is now implemented in SVN.

 Wow! great -- you sounded interested, but I had no idea you'd run out
 and do it! thanks! we'll check it out.

No problem, it was a fairly minor modification of the two-tree code.

 I (tentatively?) used a set to store
 the collection of pairs, because my tree traversal is not smart enough
 to reliably uniquify the pairs without using sets. With more time and
 energy, I'm sure the algorithm could be improved to avoid using sets
 (both internally and on return), but I think that's something to save
 for the Cython version.

 I agree -- what's wrong with using a set?

It should be possible to arrange the tree traversal algorithm so that
each pair of subtrees is automatically traversed only once - akin to
for i in range(n):
for j in range(i):
This would avoid having to build and modify a set every time you
traverse the tree (sets are fairly cheap hash tables, so the cost is
probably minor compared to other python overhead), and it would also
naturally allow the result to be a list/inflatable array (which would
save memory, if nothing else). But it would also require carefully
thinking through the tree traversal algorithm.
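
For comparison, the brute-force version of that traversal pattern on raw
points (illustrative only, not the tree code):

import numpy as np

points = np.random.rand(20, 2)
r = 0.2
pairs = []
for i in range(len(points)):
    for j in range(i):                   # each unordered pair visited exactly once
        if np.hypot(*(points[i] - points[j])) <= r:
            pairs.append((j, i))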

 Thanks, we'll let you know how it works for us.

Good luck!

Anne

 -Chris


 --
 Christopher Barker, Ph.D.
 Oceanographer

 Emergency Response Division
 NOAA/NOS/ORR            (206) 526-6959   voice
 7600 Sand Point Way NE   (206) 526-6329   fax
 Seattle, WA  98115       (206) 526-6317   main reception

 chris.bar...@noaa.gov
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] finding close together points.

2009-11-12 Thread Anne Archibald
2009/11/12 Lou Pecora lou_boog2...@yahoo.com:


 - Original Message 
 From: Christopher Barker chris.bar...@noaa.gov
 To: Discussion of Numerical Python numpy-discussion@scipy.org
 Sent: Thu, November 12, 2009 12:37:37 PM
 Subject: Re: [Numpy-discussion] finding close together points.

 Lou Pecora wrote:
 a KD tree for 2D nearest neighbor seems like over kill.  You
 might want to try the simple approach of using boxes of points to
 narrow things down by sorting on the first component.

 yeah, we'll probably do something like that if we have to write the code
 ourselves. At the moment, we're using geohash:

 http://pypi.python.org/pypi/Geohash/

 (this is for points on the earth)

 and it's working OK. I was just hoping kdtree would work out of the box!

 where for a
 static data set it can match KD trees in speed

 Why does it have to be static -- it doesn't look hard to insert/remove
 points.

 --- Lou answers --


 It doesn't have to be static, but if I remember correctly, when you add 
 points you have to resort or do a smart insert (which may require a lot of 
 data shifting).  Not hard, but the original paper claimed that if there are 
 a lot of changes to the data set this becomes more expensive than the 
 kd-tree.  If you can get the original paper I mentioned, it will have more 
 info.  It's pretty clearly written and contains some nice comparisons to 
 kd-trees for finding near neighbors.

 Bottom line is that you can still try it with changing data.  See what the 
 run times are like.  I've used the approach for up to eight dimensions with 
 about 10^5 data points.  It worked nicely.  I remember looking for a kd-tree 
 library that would work out of the box (C or C++), but everything I found was 
 not as simple as I would have liked.  The slice searching was a nice 
 solution.  But maybe I just didn't know where to look or I'm not 
 understanding some of the libraries I did find.

 Let us know how it goes.

Just a quick comment: none of the KDTrees in scipy.spatial support any
kind of modification right now, so if your data set changes you will
need to rebuild them completely. Construction is fairly cheap, but
it's very unlikely to compete with a data structure that supports
modification algorithms.
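
In other words the pattern is simply (a sketch):

import numpy as np
from scipy.spatial import cKDTree

points = np.random.rand(1000, 2)
tree = cKDTree(points)

# the data changed: rebuild from scratch, since there is no insert/remove API
points = np.vstack([points, np.random.rand(10, 2)])
tree = cKDTree(points)
dists, idx = tree.query(points[:5], k=2)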

Anne
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] finding close together points.

2009-11-12 Thread Anne Archibald
2009/11/11 Christopher Barker chris.bar...@noaa.gov:
 Anne Archibald wrote:
 2009/11/10 Christopher Barker chris.bar...@noaa.gov:

 I have a bunch of points in 2-d space, and I need to find out which
 pairs of points are within a certain distance of one-another (regular
 old Euclidean norm).

 This is an eminently reasonable thing to want, and KDTree should
 support it. Unfortunately it doesn't.

 darn.

 It's slow both because you're using the python implementation rather
 than the C,

 I noticed that.

 The one good thing about using the python implementation rather than
 the Cython one is that you can subclass it, providing a new method.
 There's still a certain amount of boilerplate code to write, but it
 shouldn't be too bad.

 Can you give me a hint as to where to start?  I have no idea how kd
 trees work.

 If this is still too slow, I have no problem incorporating additional
 code into cKDTree; the only reason none of the ball queries are in
 there is because ball queries must return variable-size results, so
 you lose a certain amount of speed because you're forced to manipulate
 python objects. But if there are relatively few result points, this
 need not be much of a slowdown.

 And certainly in this case, there would not be very many results, and
 lists are pretty fast.

This is now implemented in SVN. I (tentatively?) used a set to store
the collection of pairs, because my tree traversal is not smart enough
to reliably uniquify the pairs without using sets. With more time and
energy, I'm sure the algorithm could be improved to avoid using sets
(both internally and on return), but I think that's something to save
for the Cython version.

Anne
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] finding close together points.

2009-11-10 Thread Anne Archibald
2009/11/10 Christopher Barker chris.bar...@noaa.gov:
 Hi all,

 I have a bunch of points in 2-d space, and I need to find out which
 pairs of points are within a certain distance of one-another (regular
 old Euclidean norm).

This is an eminently reasonable thing to want, and KDTree should
support it. Unfortunately it doesn't.

 scipy.spatial.KDTree.query_ball_tree() seems like it's built for this.

 However, I'm a bit confused. The first argument is a kdtree, but I'm
 calling it as a method of a kdtree -- I want to know which points in the
 tree I already have are closer that some r from each-other.

 If I call it as:

 tree.query_ball_tree(tree, r)

 I get a big list, that has all the points in it (some of them paired up
 with close neighbors.) It appears I'm getting the distances between all
 the points in the tree and itself, as though they were different trees.

 This is slow, takes a bunch of memory, and I then have to parse out the
 list to find the ones that are paired up.

 Is there a way to get just the close ones from the single tree?

Unfortunately not at the moment.

It's slow both because you're using the python implementation rather
than the C, and because you're getting all "pairs", where a pair
includes pairing a point with itself (and also each distinct pair in
both orders). The tree really should allow self-queries that don't
return the point and itself.

The one good thing about using the python implementation rather than
the Cython one is that you can subclass it, providing a new method.
There's still a certain amount of boilerplate code to write, but it
shouldn't be too bad.

If this is still too slow, I have no problem incorporating additional
code into cKDTree; the only reason none of the ball queries are in
there is because ball queries must return variable-size results, so
you lose a certain amount of speed because you're forced to manipulate
python objects. But if there are relatively few result points, this
need not be much of a slowdown.
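
In the meantime, the self-query workaround described in the quoted message
can be de-duplicated along these lines (a sketch):

import numpy as np
from scipy.spatial import KDTree

points = np.random.rand(100, 2)
r = 0.05
tree = KDTree(points)

# each entry lists the points within r of point i, including i itself
neighbors = tree.query_ball_tree(tree, r)
pairs = {(i, j) for i, js in enumerate(neighbors) for j in js if i < j}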

Anne

 thanks,

 -Chris






 --
 Christopher Barker, Ph.D.
 Oceanographer

 Emergency Response Division
 NOAA/NOS/ORR(206) 526-6959   voice
 7600 Sand Point Way NE   (206) 526-6329   fax
 Seattle, WA  98115   (206) 526-6317   main reception

 chris.bar...@noaa.gov
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Use-case for np.choose

2009-11-08 Thread Anne Archibald
2009/11/8  josef.p...@gmail.com:
 On Sat, Nov 7, 2009 at 7:53 PM, David Goldsmith d.l.goldsm...@gmail.com 
 wrote:
 Thanks, Anne.

 On Sat, Nov 7, 2009 at 1:32 PM, Anne Archibald peridot.face...@gmail.com
 wrote:

 2009/11/7 David Goldsmith d.l.goldsm...@gmail.com:

 snip


  Also, my experimenting suggests that the index array ('a', the first
  argument in the func. sig.) *must* have shape (choices.shape[-1],) -
  someone
  please let me know ASAP if this is not the case, and please furnish me
  w/ a
  counterexample because I was unable to generate one myself.

 It seems like a and each of the choices must have the same shape

 So in essence, at least as it presently functions, the shape of 'a'
 *defines* what the individual choices are within 'choices`, and if 'choices'
 can't be parsed into an integer number of such individual choices, that's
 when an exception is raised?


 (with

 the exception that choices acn be scalars), but I would consider this
 a bug.

 OK, then we definitely need more people to opine on this, because, if the
 the two don't match, our established policy is to document *desired*
 behavior, not extant behavior (and file a bug ticket).


 Really, a and all the choices should be broadcast to the same
 shape. Or maybe it doesn't make sense to broadcast a - it could be

 Thus begging the question: does anyone actually have an extant, specific
 use-case?


 valuable to know that the result is always exactly the same shape as a
 - but broadcasting all the choice arrays presents an important
 improvement of choose over fancy indexing.

 Then perhaps we need either another function, or a flag specifying which
 behavior this one should exhibit.


 There's a reason choose
 accepts a sequence of arrays as its second argument, rather than a
 higher-dimensional array.

 And that reason is probably supposed to be transparent above, but I've
 confused it by this point, so can you please reiterate it here, in so many
 words. :-)

 From looking at a few special cases, I think that full broadcasting rules 
 apply.
 First a and all choice array are broadcast to the same shape,
 then the selection is done according to the elements of (the broadcasted) a.

 For broadcasting it doesn't matter whether they are scalars or 1d or 2d or
 a 2d single column array. (I haven't tried more than 2 dimensions)

This must have changed since 1.2.1, since I get "ValueError: too many
dimensions" for all those examples. (Though the 1.2.1 docs claim
broadcasting.)

 The examples look a bit messy, but broadcasting is relatively straightforward.

 (I think, np.where is a bit easier to use because `a` is just a
 condition and doesn't
 require an index array)

Well, a condition *is* an index array, it is just restricted to the
values 0 and 1. But I agree it's not so natural to construct an index
array with more values. I guess you could do something like:

A = (C > -1) + (C > 0) + (C > 1)
B = np.choose(A, (1, -C, C, 1))

But that's one of those magical numpy expressions that make the reader
stop and ask "how on earth does that work?" For further bewilderment
you'd make it computed:

A = np.sum(C[...,None] > transition_values, axis=-1)
B = np.choose(A, [piece(C) for piece in piecewise_pieces])

Anne

 Josef

 >>> np.choose(1, (3,4))
 4
 >>> np.choose(0, (3,4))
 3
 >>> np.choose(0, (np.arange(3)[:,None],np.arange(4),0))
 array([[0, 0, 0, 0],
        [1, 1, 1, 1],
        [2, 2, 2, 2]])
 >>> np.choose(2, (np.arange(3)[:,None],np.arange(4),0))
 array([[0, 0, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0]])
 >>> np.choose(1, (np.arange(3)[:,None],np.arange(4),0))
 array([[0, 1, 2, 3],
        [0, 1, 2, 3],
        [0, 1, 2, 3]])
 >>> np.choose([1,2,0,0], (np.arange(3)[:,None],np.arange(4),0))
 array([[0, 0, 0, 0],
        [0, 0, 1, 1],
        [0, 0, 2, 2]])
 >>> np.choose(np.array([[1,2,0,0]]), (np.arange(3)[:,None],np.arange(4),0))
 array([[0, 0, 0, 0],
        [0, 0, 1, 1],
        [0, 0, 2, 2]])
 >>> np.choose(np.array([[1,2,0]]).T, (np.arange(3)[:,None],np.arange(4),0))
 array([[0, 1, 2, 3],
        [0, 0, 0, 0],
        [2, 2, 2, 2]])





 Thanks again,

 DG

 Anne

  Thanks,
 
  DG
 
  ___
  NumPy-Discussion mailing list
  NumPy-Discussion@scipy.org
  http://mail.scipy.org/mailman/listinfo/numpy-discussion
 
 
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Use-case for np.choose

2009-11-08 Thread Anne Archibald
2009/11/8 David Goldsmith d.l.goldsm...@gmail.com:
 On Sat, Nov 7, 2009 at 11:59 PM, Anne Archibald peridot.face...@gmail.com
 wrote:

 2009/11/7 David Goldsmith d.l.goldsm...@gmail.com:
  So in essence, at least as it presently functions, the shape of 'a'
  *defines* what the individual choices are within 'choices`, and if
  'choices'
  can't be parsed into an integer number of such individual choices,
  that's
  when an exception is raised?

 Um, I don't think so.

 Think of it this way: you provide np.choose with a selector array, a,
 and a list (not array!) [c0, c1, ..., cM] of choices. You construct an
 output array, say r, the same shape as a (no matter how many
 dimensions it has).

 Except that I haven't yet seen a working example with 'a' greater than 1-D,
 Josef's last two examples notwithstanding; or is that what you're saying is
 the bug.

There's nothing magic about A being one-dimensional.

C = np.random.randn(2,3,5)
A = (C > -1).astype(int) + (C > 0).astype(int) + (C > 1).astype(int)

R = np.choose(A, (-1, -C, C, 1))
Requv = np.minimum(np.abs(C),1)

or:

def wedge(*functions):
    """Return a function whose value is the minimum of those of functions"""
    def wedgef(X):
        fXs = [f(X) for f in functions]
        A = np.argmin(fXs, axis=0)
        return np.choose(A, fXs)
    return wedgef

so e.g. np.abs is -wedge(lambda X: X, lambda X: -X)

This works no matter what shape of X the user supplies - so a wedged
function can be somewhat ufunclike - by making A the same shape.

 The (i0, i1, ..., iN) element of the output array
 is obtained by looking at the (i0, i1, ..., iN) element of a, which
 should be an integer no larger than M; say j. Then r[i0, i1, ..., iN]
 = cj[i0, i1, ..., iN]. That is, each element of the selector array
 determines which of the choice arrays to pull the corresponding
 element from.

 That's pretty clear (thanks for doing my work for me). ;-), Yet, see above.

 For example, suppose that you are processing an array C, and have
 constructed a selector array A the same shape as C in which a value is
 0, 1, or 2 depending on whether the C value is too small, okay, or too
 big respectively. Then you might do something like:

 C = np.choose(A, [-inf, C, inf])

 This is something you might want to do no matter what shape A and C
 have. It's important not to require that the choices be an array of
 choices, because they often have quite different shapes (here, two are
 scalars) and it would be wasteful to broadcast them up to the same
 shape as C, just to stack them.

 OK, that's a pretty generic use-case, thanks; let me see if I understand it
 correctly: A is some how created independently with a 0 everywhere C is too
 small, a 1 everywhere C is OK, and a 2 everywhere C is too big; then
 np.choose(A, [-inf, C, inf]) creates an array that is -inf everywhere C is
 too small, inf everywhere C is too large, and C otherwise (and since -inf
 and inf are scalars, this implies broadcasting of these is taking place).
 This is what you're asserting *should* be the behavior.  So, unless there is
 disagreement about this (you yourself said the opposite viewpoint might
 rationally be held) np.choose definitely presently has a bug, namely, the
 index array can't be of arbitrary shape.

There seems to be some disagreement between versions, but both Josef
and I find that the index array *can* be arbitrary shape. In numpy
1.2.1 I find that all the choose items must be the same shape as it,
which I think is a bug.

What I suggested might be okay was if the index array was not
broadcasted, so that the outputs always had exactly the same shape as
the index array. But upon reflection it's useful to be able to use a
1-d array to select rows from a set of matrices, so I now think that
all of A and the elements of choose should be broadcast to the same
shape. This seems to be what Josef observes in his version of numpy,
so maybe there's nothing to do.

Anne

 DG


 Anne
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Use-case for np.choose

2009-11-08 Thread Anne Archibald
As Josef said, this is not correct. I think the key point of confusion is this:

Do not pass choose two arrays.

Pass it one array and a *list* of arrays. The fact that choices can be
an array is a quirk we can't change, but you should think of the
second argument as a list of arrays, possibly of different shapes.

np.choose(np.ones((2,1,1)), [ np.ones((1,3,1)), np.ones((1,1,5)) ] )

Here the first argument is an array of choices. The second argument is
a *list* - if you cast it to an array you'll get an error or nonsense
- of arrays to choose from. The broadcasting ensures the first
argument and *each element of the list* are the same shape. The only
constraint on the number of arrays in the list is that it be larger
than the largest value in a.

If you try to make the second argument into a single array, for one
thing you are throwing away useful generality by forcing each choice
to be the same shape (and real shape, not zero-strided fake shape),
and for another the broadcasting becomes very hard to understand.
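
Running that example with an integer selector (choose needs integer
indices) shows the broadcasting at work:

import numpy as np

a = np.ones((2, 1, 1), dtype=int)
r = np.choose(a, [np.ones((1, 3, 1)), np.ones((1, 1, 5))])
r.shape    # (2, 3, 5): the selector and each list element broadcast together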

Anne


2009/11/8 David Goldsmith d.l.goldsm...@gmail.com:
 OK, now I'm trying to wrap my brain around broadcasting in choose when both
 `a` *and* `choices` need to be (non-trivially) broadcast in order to arrive
 at a common shape, e.g.:

 >>> c = np.arange(4).reshape((2,1,2))  # shape is (2,1,2)
 >>> a = np.eye(2, dtype=int)  # shape is (2,2)
 >>> np.choose(a, c)
 array([[2, 1],
        [0, 3]])

 (Unfortunately, the implementation is in C, so I can't easily insert print
 statements to see intermediate results.)

 First, let me confirm that the above is indeed an example of what I think it
 is, i.e., both `a` and `choices` are broadcast in order for this to work,
 correct?  (And if incorrect, how is one broadcast to the shape of the
 other?)  Second, both are broadcast to shape (2,2,2), correct?  But how,
 precisely, i.e., does c become

 [[[0, 1], [2, 3]],    [[[0, 1], [0, 1]],
  [[0, 1], [2, 3]]]   or   [[2, 3], [2, 3]]]

 and same question for a?  Then, once a is broadcast to a (2,2,2) shape,
 precisely how does it pick and choose from c to create a (2,2) result?
 For example, suppose a is broadcast to:

 [[[1, 0], [0, 1]],
  [[1, 0], [0, 1]]]

 (as indicated above, I'm uncertain at this point if this is indeed what a is
 broadcast to); how does this create the (2,2) result obtained above?
 (Obviously this depends in part on precisely how c is broadcast, I do
 recognize that much.)

 Finally, a seemingly relevant comment in the C source is:

 /* Broadcast all arrays to each other, index array at the end.*/

 This would appear to confirm that co-broadcasting is performed if
 necessary, but what does the "index array at the end" phrase mean?

 Thanks for your continued patience and tutelage.

 DG

 On Sun, Nov 8, 2009 at 5:36 AM, josef.p...@gmail.com wrote:

 On Sun, Nov 8, 2009 at 5:00 AM, David Goldsmith d.l.goldsm...@gmail.com
 wrote:
  On Sun, Nov 8, 2009 at 12:57 AM, Anne Archibald
  peridot.face...@gmail.com
  wrote:
 
  2009/11/8 David Goldsmith d.l.goldsm...@gmail.com:
   On Sat, Nov 7, 2009 at 11:59 PM, Anne Archibald
   peridot.face...@gmail.com
   wrote:
  
   2009/11/7 David Goldsmith d.l.goldsm...@gmail.com:
So in essence, at least as it presently functions, the shape of
'a'
*defines* what the individual choices are within 'choices`, and if
'choices'
can't be parsed into an integer number of such individual choices,
that's
when an exception is raised?
  
   Um, I don't think so.
  
   Think of it this way: you provide np.choose with a selector array,
   a,
   and a list (not array!) [c0, c1, ..., cM] of choices. You construct
   an
   output array, say r, the same shape as a (no matter how many
   dimensions it has).
  
   Except that I haven't yet seen a working example with 'a' greater
   than
   1-D,
   Josef's last two examples notwithstanding; or is that what you're
   saying
   is
   the bug.
 
  There's nothing magic about A being one-dimensional.
 
  C = np.random.randn(2,3,5)
  A = (C > -1).astype(int) + (C > 0).astype(int) + (C > 1).astype(int)
 
  R = np.choose(A, (-1, -C, C, 1))
 
  OK, now I get it: np.choose(A[0,:,:], (-1,-C,C,-1)) and
  np.choose(A[0,:,0].reshape((3,1)), (-1,-C,C,1)), e.g., also work, but
  np.choose(A[0,:,0], (-1,-C,C,-1)) doesn't - what's necessary for
  choose's
  arguments is that both can be broadcast to a common shape (as you state
  below), but choose won't reshape the arguments for you to make this
  possible, you have to do so yourself first, if necessary.  That does
  appear
  to be what's happening now; but do we want choose to be smarter than
  that
  (e.g., for np.choose(A[0,:,0], (-1,-C,C,-1)) to work, so that the user
  doesn't need to include the .reshape((3,1)))?

 No, I don't think we want to be that smart.

 If standard broadcasting rules apply, as I think they do, then I wouldn't
 want
 any special newaxis or reshapes done automatically. It will be confusing,
 the function wouldn't know what to do if there are, e.g., as many

Re: [Numpy-discussion] Use-case for np.choose

2009-11-08 Thread Anne Archibald
2009/11/8 David Goldsmith d.l.goldsm...@gmail.com:
 On Sun, Nov 8, 2009 at 7:40 PM, Anne Archibald peridot.face...@gmail.com
 wrote:

 As Josef said, this is not correct. I think the key point of confusion is
 this:

 Do not pass choose two arrays.

 Pass it one array and a *list* of arrays. The fact that choices can be
 an array is a quirk we can't change, but you should think of the
 second argument as a list of arrays,

 Fine, but as you say, one *can* pass choose an array as the second argument
 and it doesn't raise an exception, so if someone is stupid/careless enough
 to pass an array for `choices`, how is choose interpreting it as a list?  Is
 the first dimension list converted (so that, e.g., my (2,1,2) example is
 interpreted as a two element list, each of whose elements is a (1,2) array)?

It seems to me that this is the only reasonable interpretation, yes.
After all, arrays behave like sequences along the first axis, whose
elements are arrays of one less dimension. Thus if you pass an array,
any broadcasting happens ignoring the first axis, which is a rather
abnormal pattern for numpy broadcasting, but necessary here.

As a bonus, I think this is what is implemented in current versions of
numpy. (In 1.2.1 it raises an exception if broadcasting is necessary.)

Anne
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] How to get rid of the loop?

2009-11-07 Thread Anne Archibald
2009/11/7 Stas K stanc...@gmail.com:
 Thank you, Josef
 It is exactly what I want:

 ar[:,None]**2 + ar**2

 Do you know something about performance of this? In my real program ar  have
 ~ 10k elements, and expression for v more complicated (it has some
 trigonometric functions)

The construction of ar[:,None] (which I prefer to spell
ar[:,np.newaxis]) is cheap, since it just constructs a view into ar.
Computing ar**2 and ar[:,None]**2 requires squaring each element of ar,
but this is done in a C loop. (It's done twice in this expression, so
this isn't as efficient as it might be). Only when it comes time to do
the addition do you do n**2 operations. For very large n, this can be
a crushing memory burden, and if you only need these values for an
intermediate calculation it sometimes turns out that it's better to
loop over one of the dimensions. But this expression is probably about
as efficient as you can hope for, given what it does. If you need to
do some more complicated calculation, the main thing to be careful of
is that you do as much calculation as possible on the separate arrays,
while they're still only n elements and not n**2.
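
For instance, a minimal sketch of that advice for a trig-heavy expression:

import numpy as np

ar = np.linspace(0.0, 2.0 * np.pi, 10000)

# do the expensive elementwise work while the arrays are still length n ...
s = np.sin(ar)
c = np.cos(ar)

# ... and only broadcast up to the n-by-n result at the very last step
v = s[:, np.newaxis]**2 + c[np.newaxis, :]**2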


Anne

 On 07.11.2009, at 21:57, josef.p...@gmail.com wrote:

 On Sat, Nov 7, 2009 at 1:51 PM, Stas K stanc...@gmail.com wrote:

 Can I get rid of the loop in this example? And what is the fastest way

 to get v in the example?

 ar = array([1,2,3])

 for a in ar:

    for b in ar:

        v = a**2+b**2

 ar[:,None]**2 + ar**2

 array([[ 2,  5, 10],
   [ 5,  8, 13],
   [10, 13, 18]])

 I think for this case there is also a function directly in numpy,
 hypot, which should also work.

 Josef

 ___

 NumPy-Discussion mailing list

 NumPy-Discussion@scipy.org

 http://mail.scipy.org/mailman/listinfo/numpy-discussion

 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Use-case for np.choose

2009-11-07 Thread Anne Archibald
2009/11/7 David Goldsmith d.l.goldsm...@gmail.com:
 Hi, all!  I'm working to clarify the docstring for np.choose (Stefan pointed
 out to me that it is pretty unclear, and I agreed), and, now that (I'm
 pretty sure that) I've figured out what it does in its full generality
 (e.g., when the 'choices' array is greater than 2-D), I'm curious as to
 use-cases.  (Not that I'm suggesting that we deprecate it, but I am curious
 as to whether or not anyone is presently using it and if so, how they're
 using it, esp. if anyone *is* using it with 'choices' arrays 3-D or
 greater.)

It's a generalized np.where(), allowing more than two options:

In [10]: a = np.arange(2*3*5).reshape(2,3,5) % 3

In [11]: o = np.ones((2,3,5))

In [12]: np.choose(a,(o,0.5*o,0.1*o))
Out[12]:
array([[[ 1. ,  0.5,  0.1,  1. ,  0.5],
[ 0.1,  1. ,  0.5,  0.1,  1. ],
[ 0.5,  0.1,  1. ,  0.5,  0.1]],

   [[ 1. ,  0.5,  0.1,  1. ,  0.5],
[ 0.1,  1. ,  0.5,  0.1,  1. ],
[ 0.5,  0.1,  1. ,  0.5,  0.1]]])

 Also, my experimenting suggests that the index array ('a', the first
 argument in the func. sig.) *must* have shape (choices.shape[-1],) - someone
 please let me know ASAP if this is not the case, and please furnish me w/ a
 counterexample because I was unable to generate one myself.

It seems like a and each of the choices must have the same shape (with
the exception that choices can be scalars), but I would consider this
a bug. Really, a and all the choices should be broadcast to the same
shape. Or maybe it doesn't make sense to broadcast a - it could be
valuable to know that the result is always exactly the same shape as a
- but broadcasting all the choice arrays presents an important
improvement of choose over fancy indexing. There's a reason choose
accepts a sequence of arrays as its second argument, rather than a
higher-dimensional array.

Anne

 Thanks,

 DG

 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Use-case for np.choose

2009-11-07 Thread Anne Archibald
2009/11/7 David Goldsmith d.l.goldsm...@gmail.com:
 Thanks, Anne.

 On Sat, Nov 7, 2009 at 1:32 PM, Anne Archibald peridot.face...@gmail.com
 wrote:

 2009/11/7 David Goldsmith d.l.goldsm...@gmail.com:

 snip


  Also, my experimenting suggests that the index array ('a', the first
  argument in the func. sig.) *must* have shape (choices.shape[-1],) -
  someone
  please let me know ASAP if this is not the case, and please furnish me
  w/ a
  counterexample because I was unable to generate one myself.

 It seems like a and each of the choices must have the same shape

 So in essence, at least as it presently functions, the shape of 'a'
 *defines* what the individual choices are within 'choices', and if 'choices'
 can't be parsed into an integer number of such individual choices, that's
 when an exception is raised?

Um, I don't think so.

Think of it this way: you provide np.choose with a selector array, a,
and a list (not array!) [c0, c1, ..., cM] of choices. You construct an
output array, say r, the same shape as a (no matter how many
dimensions it has). The (i0, i1, ..., iN) element of the output array
is obtained by looking at the (i0, i1, ..., iN) element of a, which
should be an integer no larger than M; say j. Then r[i0, i1, ..., iN]
= cj[i0, i1, ..., iN]. That is, each element of the selector array
determines which of the choice arrays to pull the corresponding
element from.

For example, suppose that you are processing an array C, and have
constructed a selector array A the same shape as C in which a value is
0, 1, or 2 depending on whether the C value is too small, okay, or too
big respectively. Then you might do something like:

C = np.choose(A, [-inf, C, inf])

This is something you might want to do no matter what shape A and C
have. It's important not to require that the choices be an array of
choices, because they often have quite different shapes (here, two are
scalars) and it would be wasteful to broadcast them up to the same
shape as C, just to stack them.
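
A minimal sketch of that clipping pattern (the thresholds here are only
illustrative):

import numpy as np

C = np.random.randn(3, 5)
lo, hi = -1.0, 1.0

# 0 where C is too small, 1 where it is in range, 2 where it is too big
A = (C >= lo).astype(int) + (C > hi).astype(int)

C = np.choose(A, [-np.inf, C, np.inf])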

Anne
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Random int64 and float64 numbers

2009-11-05 Thread Anne Archibald
2009/11/5 David Goldsmith d.l.goldsm...@gmail.com:
 On Thu, Nov 5, 2009 at 3:26 PM, David Warde-Farley d...@cs.toronto.edu
 wrote:

 On 5-Nov-09, at 4:54 PM, David Goldsmith wrote:

  Interesting thread, which leaves me wondering two things: is it
  documented
  somewhere (e.g., at the IEEE site) precisely how many *decimal*
  mantissae
  are representable using the 64-bit IEEE standard for float
  representation
  (if that makes sense);

 IEEE-754 says nothing about decimal representations aside from how to
 round when converting to and from strings. You have to provide/accept
 *at least* 9 decimal digits in the significand for single-precision
 and 17 for double-precision (section 5.6). AFAIK implementations will
 vary in how they handle cases where a binary significand would yield
 more digits than that.

 I was actually more interested in the opposite situation, where the decimal
 representation (which is what a user would most likely provide) doesn't have
 a finite binary expansion: what happens then, something analogous to the
 decimal rule of fives?

If you interpret 0.1 as 1/10, then this is a very general
floating-point issue: how you you round off numbers you can't
represent exactly? The usual answer (leaving aside internal
extended-precision shenanigans) is to round, with the rule that when
you're exactly between two floating-point numbers you round to the one
that's even, rather than always up or always down (the numerical
analysis wizards tell us that this is more numerically stable).
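
For instance, the double that the decimal input 0.1 actually rounds to can be
displayed exactly:

from decimal import Decimal

# 0.1 has no finite binary expansion, so the stored double is the nearest
# representable value under round-to-nearest-even
print(Decimal(0.1))
# 0.1000000000000000055511151231257827021181583404541015625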

Anne

 DG

 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Random int64 and float64 numbers

2009-11-05 Thread Anne Archibald
2009/11/5  josef.p...@gmail.com:
 On Thu, Nov 5, 2009 at 10:42 PM, David Goldsmith
 d.l.goldsm...@gmail.com wrote:
 On Thu, Nov 5, 2009 at 3:26 PM, David Warde-Farley d...@cs.toronto.edu
 wrote:

 On 5-Nov-09, at 4:54 PM, David Goldsmith wrote:

  Interesting thread, which leaves me wondering two things: is it
  documented
  somewhere (e.g., at the IEEE site) precisely how many *decimal*
  mantissae
  are representable using the 64-bit IEEE standard for float
  representation
  (if that makes sense);

 IEEE-754 says nothing about decimal representations aside from how to
 round when converting to and from strings. You have to provide/accept
 *at least* 9 decimal digits in the significand for single-precision
 and 17 for double-precision (section 5.6). AFAIK implementations will
 vary in how they handle cases where a binary significand would yield
 more digits than that.

 I was actually more interested in the opposite situation, where the decimal
 representation (which is what a user would most likely provide) doesn't have
 a finite binary expansion: what happens then, something analogous to the
 decimal rule of fives?

 Since according to my calculations there are only about

 4 * 10**17 * 308
 123200000000000000000L

More straightforwardly, it's not too far below 2**64.

 double-precision floats, there are huge gaps in the floating point
 representation of the real line.
 Any user input or calculation result just gets converted to the
 closest float.

Yes. But the huge gaps are only huge in absolute size; in fractional
error they're always about the same size. They are usually only an
issue if you're representing numbers in a small range with a huge
offset. To take a not-so-random example, you could be representing
times in days since November 17 1858, when what you care about are the
microsecond-scale differences in photon arrival times. Even then
you're probably okay as long as you compute directly (t[i] =
i*dt/86400 + start_t) rather than having some sort of running
accumulator (t[i]=t; t += dt/86400).
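
A small sketch of the difference between the two (the time step is only
illustrative):

import numpy as np

dt = 1e-6 / 86400.0                     # one microsecond, in days
n = 1000000

t_direct = np.arange(1, n + 1) * dt     # direct: one rounding per element
t_accum = np.cumsum(np.full(n, dt))     # running accumulator: errors pile up

print(np.abs(t_accum - t_direct).max())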

Professor Kahan's reasoning for using doubles for most everything is
that they generally have so much more precision than you actually need
that you can get away with being sloppy.


Anne

 Josef



 DG

 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Random int64 and float64 numbers

2009-11-02 Thread Anne Archibald
2009/11/1 Thomas Robitaille thomas.robitai...@gmail.com:
 Hi,

 I'm trying to generate random 64-bit integer values for integers and
 floats using Numpy, within the entire range of valid values for that
 type. To generate random 32-bit floats, I can use:

Others have addressed why this is giving bogus results. But it's worth
thinking about what you're actually trying to do. If it's fuzz
tests, that is, producing unrestricted random floats to feed to a
function, then even if this worked it wouldn't produce what you want:
it will never produce floats of order unity, for example.

If you want do what you described, you could produce floats uniformly
from -1 to 1 and then multiply the results by the largest
representable float.

If you want random floats, I suggest generating an array of random
bytes and reinterpreting them as floats. You'll get a fair number of
NaNs and infinities, so you may want to take only the finite values
and regenerate the rest. This will give you some numbers from all over
the range of floats, including tiny values (and denormalized values,
which are a potentially important special case).
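
A minimal sketch of both suggestions:

import numpy as np

# uniform over (almost) the whole representable range
big = np.finfo(np.float64).max
spread = np.random.uniform(-1.0, 1.0, size=1000) * big

# floats from all over the range: random bytes reinterpreted as float64,
# keeping only the finite values (NaNs and infinities can be redrawn)
raw = np.frombuffer(np.random.bytes(8 * 1000), dtype=np.float64)
fuzz = raw[np.isfinite(raw)]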

Anne

 np.random.uniform(low=np.finfo(np.float32).min,high=np.finfo
 (np.float32).max,size=10)

 which gives for example

 array([  1.47351436e+37,   9.93620693e+37,   2.22893053e+38,
         -3.33828977e+38,   1.08247781e+37,  -8.37481260e+37,
          2.64176554e+38,  -2.72207226e+37,   2.54790459e+38,
         -2.47883866e+38])

 but if I try and use this for 64-bit numbers, i.e.

 np.random.uniform(low=np.finfo(np.float64).min,high=np.finfo
 (np.float64).max,size=10)

 I get

 array([ Inf,  Inf,  Inf,  Inf,  Inf,  Inf,  Inf,  Inf,  Inf,  Inf])

 Similarly, for integers, I can successfully generate random 32-bit
 integers:

 np.random.random_integers(np.iinfo(np.int32).min,high=np.iinfo
 (np.int32).max,size=10)

 which gives

 array([-1506183689,   662982379, -1616890435, -1519456789,  1489753527,
         -604311122,  2034533014,   449680073,  -444302414,
 -1924170329])

 but am unsuccessful for 64-bit integers, i.e.

 np.random.random_integers(np.iinfo(np.int64).min,high=np.iinfo
 (np.int64).max,size=10)

 which produces the following error:

 OverflowError: long int too large to convert to int

 Is this expected behavior, or are these bugs?

 Thanks for any help,

 Thomas
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Single view on multiple arrays

2009-11-01 Thread Anne Archibald
2009/11/1 Bill Blinn bbl...@gmail.com:
 What is the best way to create a view that is composed of sections of many
 different arrays?

The short answer is, you can't. Numpy arrays must be located in
contiguous blocks of memory, and the elements along any dimension must
be equally spaced. A view is simply another array that references
(some of) the same underlying memory, possibly with different strides.

 For example, imagine I had
 a = np.array(range(0, 12)).reshape(3, 4)
 b = np.array(range(12, 24)).reshape(3, 4)
 c = np.array(range(24, 36)).reshape(3, 4)

 v = multiview((3, 4))
 #the idea of the following lines is that the 0th row of v is
 #a view on the first row of a. the same would hold true for
 #the 1st and 2nd row of v and the 0th rows of b and c, respectively
 v[0] = a[0]
 v[1] = b[0]
 v[2] = c[0]

 #change the underlying arrays
 a[0, 0] = 50
 b[0, 0] = 51
 c[0, 0] = 52

 #I would want all these assertions to pass because the view
 #refers to the rows of a, b and c
 assert v[0, 0] == 50
 assert v[1, 0] == 51
 assert v[2, 0] == 52

 Is there any way to do this?

If you need to be able to do this, you're going to have to rearrange
your code somewhat, so that your original arrays are views of parts of
an initial array.
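
A minimal sketch of that rearrangement, for the three-array example above:

import numpy as np

# allocate one block up front ...
block = np.empty((3, 3, 4))
a, b, c = block[0], block[1], block[2]     # each a (3, 4) view into block

# ... so the "multiview" of their first rows is itself just a slice
v = block[:, 0, :]                         # shape (3, 4), shares memory

a[0, 0] = 50
b[0, 0] = 51
c[0, 0] = 52
assert v[0, 0] == 50 and v[1, 0] == 51 and v[2, 0] == 52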

It's worth noting that if what you're worried about is the time it
takes to copy data, you might well be surprised at how cheap data
copying and memory allocation really are. Given that you are driving
numpy from Python, only for really enormous arrays will copying data be
expensive, and allocating memory is really a very cheap operation
(modern malloc()s average something like a few cycles). What's more,
since modern CPUs are so heavily cache-bound, using strided views can
be quite slow, since you end up loading whole 64-byte cache lines for
each 8-byte double you need.

In short, you probably need to rethink your design, but while you're
doing it, don't worry about copying data.

Anne
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Designing a new storage format for numpy recarrays

2009-10-30 Thread Anne Archibald
2009/10/30 Stephen Simmons m...@stevesimmons.com:
 I should clarify what I meant..

 Suppose I have a recarray with 50 fields and want to read just one of
 those fields. PyTables/HDF will read in the compressed data for chunks
 of complete rows, decompress the full 50 fields, and then give me back
 the data for just one field.

 I'm after a solution where asking for a single field reads in the bytes
 for just that field from disk and decompresses it.

 This is similar to the difference between databases storing their data
 as rows or columns. See for example Mike Stonebraker's C-store
 column-oriented database (http://db.lcs.mit.edu/projects/cstore/vldb.pdf).

Is there any reason not to simply store the data as a collection of
separate arrays, one per column? It shouldn't be too hard to write a
wrapper to give this nicer syntax, while implementing it under the
hood with HDF5...
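
One way to get that column-wise layout without a new format, sketched with
h5py (the file and field names here are only illustrative):

import numpy as np
import h5py

rec = np.zeros(1000, dtype=[('x', 'f8'), ('y', 'f8'), ('flag', 'i4')])

# one compressed dataset per field, so a single column can be read
# (and decompressed) on its own
with h5py.File('columns.h5', 'w') as f:
    for name in rec.dtype.names:
        f.create_dataset(name, data=rec[name], compression='gzip')

with h5py.File('columns.h5', 'r') as f:
    x = f['x'][:]          # touches only the 'x' dataset on disk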

Anne

 Stephen



 Francesc Alted wrote:
 On Friday 30 October 2009 14:18:05, Stephen Simmons wrote:

  - Pytables (HDF using chunked storage for recarrays with LZO
 compression and shuffle filter)
     - can't extract individual field from a recarray


 Er... Have you tried the ``cols`` accessor?

 http://www.pytables.org/docs/manual/ch04.html#ColsClassDescr

 Cheers,



 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Another suggestion for making numpy's functions generic

2009-10-20 Thread Anne Archibald
2009/10/20 Sebastian Walter sebastian.wal...@gmail.com:
 On Tue, Oct 20, 2009 at 5:45 AM, Anne Archibald
 peridot.face...@gmail.com wrote:
 2009/10/19 Sebastian Walter sebastian.wal...@gmail.com:

 I'm all for generic (u)funcs since they might come handy for me since
 I'm doing lots of operation on arrays of polynomials.

 Just as a side note, if you don't mind my asking, what sorts of
 operations do you do on arrays of polynomials? In a thread on
 scipy-dev we're discussing improving scipy's polynomial support, and
 we'd be happy to get some more feedback on what they need to be able
 to do.

 I've been reading (and commenting) that thread ;)
  I'm doing algorithmic differentiation by computing on truncated
 Taylor polynomials in the Powerbasis,
  i.e. always truncating all operation at degree D
 z(t) = \sum_{d=0}^{D-1} z_d t^d = x(t) * y(t) = \sum_{d=0}^{D-1} ( \sum_{k=0}^{d} x_k * y_{d-k} ) t^d + O(t^D)

 Using other bases does not make sense in my case since the truncation
 of all terms of higher degree than t^D
 has afaik no good counterpart for bases like Chebyshev.
 On the other hand, I need to be generic in the coefficients, e.g.
 z_d from above could be a tensor of any shape,  e.g.  a matrix.

In fact, truncating at degree D for Chebyshev polynomials works
exactly the same way as it does for power polynomials, and if what you
care about is function approximation, it has much nicer behaviour. But
if what you care about is really truncated Taylor polynomials, there's
no beating the power basis.

I realize that arrays of polynomial objects are handy from a
bookkeeping point of view, but how does an array of polynomials, say
of shape (N,), differ from a single polynomial with coefficients of
shape (N,)? I think we need to provide the latter in our polynomial
classes, but as you point out, getting ufunc support for the former is
nontrivial.

Anne
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Optimized sum of squares

2009-10-20 Thread Anne Archibald
2009/10/20  josef.p...@gmail.com:
 On Sun, Oct 18, 2009 at 6:06 AM, Gary Ruben gru...@bigpond.net.au wrote:
 Hi Gaël,

 If you've got a 1D array/vector called a, I think the normal idiom is

 np.dot(a,a)

 For the more general case, I think
 np.tensordot(a, a, axes=something_else)
 should do it, where you should be able to figure out something_else for
 your particular case.

 Is it really possible to get the same as np.sum(a*a, axis)  with
 tensordot if a.ndim >= 2 ?
 Any way I try the something_else, I get extra terms as in np.dot(a.T, a)

It seems like this would be a good place to apply numpy's
higher-dimensional ufuncs: what you want seems to just be the vector
inner product, broadcast over all other dimensions. In fact I believe
this is implemented in numpy as a demo: numpy.umath_tests.inner1d
should do the job.


Anne
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Optimized sum of squares

2009-10-20 Thread Anne Archibald
2009/10/20  josef.p...@gmail.com:
 On Tue, Oct 20, 2009 at 3:09 PM, Anne Archibald
 peridot.face...@gmail.com wrote:
 2009/10/20  josef.p...@gmail.com:
 On Sun, Oct 18, 2009 at 6:06 AM, Gary Ruben gru...@bigpond.net.au wrote:
 Hi Gaël,

 If you've got a 1D array/vector called a, I think the normal idiom is

 np.dot(a,a)

 For the more general case, I think
 np.tensordot(a, a, axes=something_else)
 should do it, where you should be able to figure out something_else for
 your particular case.

 Is it really possible to get the same as np.sum(a*a, axis)  with
  tensordot if a.ndim >= 2 ?
 Any way I try the something_else, I get extra terms as in np.dot(a.T, a)

 It seems like this would be a good place to apply numpy's
 higher-dimensional ufuncs: what you want seems to just be the vector
 inner product, broadcast over all other dimensions. In fact I believe
 this is implemented in numpy as a demo: numpy.umath_tests.inner1d
 should do the job.

 Thanks, this works well, needs core in name
 (I might have to learn how to swap or roll axis to use this for more than 2d.)

 np.core.umath_tests.inner1d(a.T, b.T)
 array([12,  8, 16])
 (a*b).sum(0)
 array([12,  8, 16])
 np.core.umath_tests.inner1d(a.T, b.T)
 array([12,  8, 16])
 (a*a).sum(0)
 array([126, 166, 214])
 np.core.umath_tests.inner1d(a.T, a.T)
 array([126, 166, 214])


 What's the status on these functions? They don't show up in the docs
 or help, except for
 a brief mention in the c-api:

 http://docs.scipy.org/numpy/docs/numpy-docs/reference/c-api.generalized-ufuncs.rst/

 Are they for public consumption and should go into the docs?
 Or do they remain a hidden secret, to force users to read the mailing lists?

I think the long-term goal is to have a completely ufuncized linear
algebra library, and I think these functions are just tests of the
gufunc features. In principle, at least, it wouldn't actually be too
hard to fill out a full linear algebra library, since the per
element linear algebra operations already exist. Unfortunately the
code should exist for many data types, and the code generator scheme
currently used to do this for ordinary ufuncs is a barrier to
contributions. It might be worth banging out a doubles-only generic
ufunc linear algebra library (in addition to
numpy.linalg/scipy.linalg), just as a proof of concept.

Anne
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Another suggestion for making numpy's functions generic

2009-10-19 Thread Anne Archibald
2009/10/19 Sebastian Walter sebastian.wal...@gmail.com:

 I'm all for generic (u)funcs since they might come handy for me since
 I'm doing lots of operation on arrays of polynomials.

Just as a side note, if you don't mind my asking, what sorts of
operations do you do on arrays of polynomials? In a thread on
scipy-dev we're discussing improving scipy's polynomial support, and
we'd be happy to get some more feedback on what they need to be able
to do.

Thanks!
Anne
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] double-precision sqrt?

2009-10-17 Thread Anne Archibald
2009/10/17 Adam Ginsburg adam.ginsb...@colorado.edu:
 My code is actually wrong but I still have the problem I've
 identified that sqrt is leading to precision errors.  Sorry about the
 earlier mistake.

I think you'll find that numpy's sqrt is as good as it gets for double
precision. You can try using numpy's float96 type, which at least on
my machine, does give a few more significant figures. If you really,
really need accuracy, there are arbitrary-precision packages for
python, which you could try.

But I think you may find that your problem is not solved by higher
precision. Something about ray-tracing just leads it to ferret out
every little limitation of floating-point computation. For example,
you can easily get surface acne when shooting shadow rays, where a
ray shot from a surface to a light source accidentally intersects that
same surface for some pixels but not for others. You can try to fix it
by requiring some minimum intersection distance, but then you'll find
lots of weird little quirks where your minimum distance causes
problems. A better solution is one which takes into account the
awkwardness of floating-point; for this particular case one trick is
to mark the object you're shooting rays from as not a candidate for
intersection. (This doesn't work, of course, if the object can cast
shadows on itself...) I have even seen people advocate for using
interval analysis inside ray tracers, to avoid this kind of problem.

Anne
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] loading data

2009-06-25 Thread Anne Archibald
2009/6/25 Mag Gam magaw...@gmail.com:
 Hello.

 I am very new to NumPy and Python. We are doing some research in our
 Physics lab and we need to store massive amounts of data (100GB
 daily). I therefore, am going to use hdf5 and h5py. The problem is I
 am using np.loadtxt() to create my array and create a dataset
 according to that. np.loadtxt() is reading a file which is about 50GB.
 This takes a very long time! I was wondering if there was a much
 easier and better way of doing this.

If you are stuck with the text array, you probably can't beat
numpy.loadtxt(); reading a 50 GB text file is going to be slow no
matter how you cut it. So I would take a look at the code that
generates the text file, and see if there's any way you can make it
generate a format that is faster to read. (I assume the code is in C
or FORTRAN and you'd rather not mess with it more than necessary).

Of course, generating hdf5 directly is probably fastest; you might
look at the C and FORTRAN hdf5 libraries and see how hard it would be
to integrate them into the code that currently generates a text file.
Even if you need to have a python script to gather the data and add
metadata, hdf5 will be much much more efficient than text files as an
intermediate format.

If integrating HDF5 into the generating application is too difficult,
you can try simply generating a binary format. Using numpy's
structured data types, it is possible to read in binary files
extremely efficiently. If you're using the same architecture to
generate the files as read them, you can just write out raw binary
arrays of floats or doubles and then read them into numpy. I think
FORTRAN also has a semi-standard padded binary format which isn't too
difficult to read either. You could even use numpy's native file
format, which for a single array should be pretty straightforward, and
should yield portable results.
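
A minimal sketch of the binary route (the record layout and file names are
only illustrative):

import numpy as np

rec_dtype = np.dtype([('time', '<f8'), ('value', '<f8'), ('flag', '<i4')])

# writing side: a stand-in for whatever the C/FORTRAN code would produce
out = np.zeros(1000, dtype=rec_dtype)
out.tofile('run001.dat')

# reading side: one fast, loop-free call
data = np.fromfile('run001.dat', dtype=rec_dtype)

# numpy's own portable .npy format is just as easy
np.save('run001.npy', data)
data2 = np.load('run001.npy')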

If you really can't modify the code that generates the text files,
your code is going to be slow. But you might be able to make it
slightly less slow. If, for example, the text files are a very
specific format, especially if they're made up of columns of fixed
width, it would be possible to write compiled code to read them
slightly more quickly. (The very easiest way to do this is to write a
little C program that reads the text files and writes out a slightly
friendlier format, as above.) But you may well find that simply
reading a 50 GB file dominates your run time, which would mean that
you're stuck with slowness.


In short: avoid text files if at all possible.


Good luck,
Anne

 TIA
 ___
 Numpy-discussion mailing list
 Numpy-discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Interleaved Arrays and

2009-06-16 Thread Anne Archibald
I'm not sure it's worth having a function to replace a one-liner
(column_stack followed by reshape). But if you're going to implement
this with slice assignment, you should take advantage of the
flexibility this method allows and offer the possibility of
interleaving raggedly, that is, where the size of the arrays drops
at some point, so that you could interleave arrays of size 4, 4, and 3
to get one array of size 11. This allows split and join operations,
for example for multiprocessing. On the other hand you should also
include a documentation warning that this can be slow when
interleaving large numbers of small arrays.
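
For reference, the one-liner for the simple (equal-length, 1-d) case:

import numpy as np

a = np.arange(4)
b = np.arange(4) + 10

# stack as columns, then flatten row by row to interleave
interleaved = np.column_stack((a, b)).reshape(-1)
# array([ 0, 10,  1, 11,  2, 12,  3, 13])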

Anne

2009/6/16 Neil Martinsen-Burrell n...@wartburg.edu:
 On 2009-06-16 16:05 , Robert wrote:
 Neil Martinsen-Burrell wrote:
 On 06/16/2009 02:18 PM, Robert wrote:
       n = 10
       xx = np.ones(n)
       yy = np.arange(n)
       aa = np.column_stack((xx,yy))
       bb = np.column_stack((xx+1,yy))
       aa
 array([[ 1.,  0.],
           [ 1.,  1.],
           [ 1.,  2.],
           [ 1.,  3.],
           [ 1.,  4.],
           [ 1.,  5.],
           [ 1.,  6.],
           [ 1.,  7.],
           [ 1.,  8.],
           [ 1.,  9.]])
       bb
 array([[ 2.,  0.],
           [ 2.,  1.],
           [ 2.,  2.],
           [ 2.,  3.],
           [ 2.,  4.],
           [ 2.,  5.],
           [ 2.,  6.],
           [ 2.,  7.],
           [ 2.,  8.],
           [ 2.,  9.]])
       np.column_stack((aa,bb))
 array([[ 1.,  0.,  2.,  0.],
           [ 1.,  1.,  2.,  1.],
           [ 1.,  2.,  2.,  2.],
           [ 1.,  3.,  2.,  3.],
           [ 1.,  4.,  2.,  4.],
           [ 1.,  5.,  2.,  5.],
           [ 1.,  6.,  2.,  6.],
           [ 1.,  7.,  2.,  7.],
           [ 1.,  8.,  2.,  8.],
           [ 1.,  9.,  2.,  9.]])
       cc = _
       cc.reshape((n*2,2))
 array([[ 1.,  0.],
           [ 2.,  0.],
           [ 1.,  1.],
           [ 2.,  1.],
           [ 1.,  2.],
           [ 2.,  2.],
           [ 1.,  3.],
           [ 2.,  3.],
           [ 1.,  4.],
           [ 2.,  4.],
           [ 1.,  5.],
           [ 2.,  5.],
           [ 1.,  6.],
           [ 2.,  6.],
           [ 1.,  7.],
           [ 2.,  7.],
           [ 1.,  8.],
           [ 2.,  8.],
           [ 1.,  9.],
           [ 2.,  9.]])
    


 However I feel too, there is an intuitive abbrev function like
 'interleave' or so missing in numpy shape_base or so.

 Using fancy indexing, you can set strided portions of an array equal to
 another array.  So::

 In [2]: aa = np.empty((10,2))

 In [3]: aa[:, 0] = 1

 In [4]: aa[:,1] = np.arange(10)

 In [5]: bb = np.empty((10,2))

 In [6]: bb[:,0] = 2

 In [7]: bb[:,1] = aa[:,1] # this works

 In [8]: cc = np.empty((20,2))

 In [9]: cc[::2,:] = aa

 In [10]: cc[1::2,:] = bb

 In [11]: cc
 Out[11]:
 array([[ 1.,  0.],
          [ 2.,  0.],
          [ 1.,  1.],
          [ 2.,  1.],
          [ 1.,  2.],
          [ 2.,  2.],
          [ 1.,  3.],
          [ 2.,  3.],
          [ 1.,  4.],
          [ 2.,  4.],
          [ 1.,  5.],
          [ 2.,  5.],
          [ 1.,  6.],
          [ 2.,  6.],
          [ 1.,  7.],
          [ 2.,  7.],
          [ 1.,  8.],
          [ 2.,  8.],
          [ 1.,  9.],
          [ 2.,  9.]])

 Using this syntax, interleave could be a one-liner.

 -Neil

 that method of 'filling an empty with a pattern' was mentioned in
 the other (general) interleaving question. It requires however a
 lot of particular numbers and :'s in the code, and requires even
 more statements which can hardly be written in functional style -
 in one line?. The other approach is more jount, free of fancy
 indexing assignments.

 jount?  I think that assigning to a strided index is very clear, but
 that is a difference of opinion.  All of the calls to np.empty are the
 equivalent of the column_stack's in your example.  I think that
 operations on segments of arrays are fundamental to an array-processing
 language such as NumPy.  Using ; you can put as many of those
 statements as you would like on one line. :)

 The general interleaving should work efficiently in one like this:

 np.column_stack/concatenate((r,g,b,), axis=...).reshape(..)

 But as all this is not intuitive, something like this should be in
 numpy perhaps? :

 def interleave( tup_arrays, axis = None )

 Here is a minimally tested implementation.  If anyone really wants this
 for numpy, I'll gladly add comments and tests.  I couldn't figure out
 how to automatically find the greatest dtype, so I added an argument to
 specify, otherwise it uses the type of the first array.

 def interleave(arrays, axis=0, dtype=None):
     assert len(arrays) > 0
     first = arrays[0]
     assert all([arr.shape == first.shape for arr in arrays])
     new_shape = list(first.shape)
     new_shape[axis] *= len(arrays)
     if dtype is None:
         new_dtype = first.dtype
     else:
         new_dtype = dtype
     interleaved = np.empty(new_shape, new_dtype)
     axis_slice = [slice(None, None, None)]*axis + \
        
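
A completed sketch of the same strided-assignment idea (a sketch only, not the
missing tail of the quoted function):

import numpy as np

def interleave(arrays, axis=0, dtype=None):
    """Interleave equally shaped arrays along `axis` via strided assignment."""
    assert len(arrays) > 0
    first = arrays[0]
    assert all(arr.shape == first.shape for arr in arrays)
    new_shape = list(first.shape)
    new_shape[axis] *= len(arrays)
    out = np.empty(new_shape, dtype if dtype is not None else first.dtype)
    for i, arr in enumerate(arrays):
        index = [slice(None)] * out.ndim
        index[axis] = slice(i, None, len(arrays))   # every len(arrays)-th slot
        out[tuple(index)] = arr
    return out

print(interleave((np.zeros((2, 3)), np.ones((2, 3))), axis=0))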

Re: [Numpy-discussion] How to remove fortran-like loops with numpy?

2009-06-08 Thread Anne Archibald
2009/6/8 Robert Kern robert.k...@gmail.com:
 On Mon, Jun 8, 2009 at 17:04, David Goldsmithd_l_goldsm...@yahoo.com wrote:

 I look forward to an instructive reply: the Pythonic way to do it would be 
 to take advantage of the facts that Numpy is pre-vectorized and uses 
 broadcasting, but so far I haven't been able to figure out (though I haven't 
 yet really buckled down and tried real hard) how to broadcast a 
 conditionally-terminated iteration where the number of iterations will vary 
 among the array elements.  Hopefully someone else already has. :-)

 You can't, really. What you can do is just keep iterating with the
 whole data set and ignore the parts that have already converged. Here
 is an example:

Well, yes and no. This is only worth doing if the number of problem
points that require many iterations is small - not the case here
without some sort of periodicity detection - but you can keep an array
of not-yet-converged points, which you iterate. When some converge,
you store them in a results array (with fancy indexing) and remove
them from your still-converging array.

It's also worth remembering that the overhead of for loops is large
but not enormous, so you can often remove only the inner for loop, in
this case perhaps iterating over the image a line at a time.
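
A minimal sketch of that keep-only-the-unconverged pattern for this kind of
escape-time iteration:

import numpy as np

c = np.random.uniform(-1.5, 0.5, 40000) + 1j * np.random.uniform(-1.0, 1.0, 40000)
z = np.zeros_like(c)
counts = np.zeros(c.shape, dtype=int)      # escape iteration, 0 = never escaped
active = np.arange(c.size)                 # indices of points still iterating

for n in range(1, 31):
    z[active] = z[active]**2 + c[active]
    escaped = np.abs(z[active]) > 2
    counts[active[escaped]] = n            # store results via fancy indexing
    active = active[~escaped]              # drop the finished points
    if active.size == 0:
        break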

Anne


 z = np.zeros((201,201), dtype=complex)
 Y, X = np.mgrid[1:-1:-201j, -1.5:0.5:201j]
 c = np.empty_like(z)
 c.real = X
 c.imag = Y
 N = np.zeros(z.shape, dtype=int)

 while ((N<30) | (abs(z)>2)).all():
    N += abs(z) < 2
    z = z ** 2 + c

 N[N>=30] = 0

 --
 Robert Kern

 I have come to believe that the whole world is an enigma, a harmless
 enigma that is made terrible by our own mad attempt to interpret it as
 though it had an underlying truth.
  -- Umberto Eco
 ___
 Numpy-discussion mailing list
 Numpy-discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] extract elements of an array that are contained in another array?

2009-06-04 Thread Anne Archibald
2009/6/4  josef.p...@gmail.com:

 intersect1d should throw a domain error if you give it arrays with
 non-unique elements, which is not done for speed reasons

It seems to me that this is the basic source of the problem. Perhaps
this can be addressed? I realize maintaining compatibility with the
current behaviour is necessary, so how about a multistage deprecation:

1. add a keyword argument to intersect1d assume_unique; if it is not
present, check for uniqueness and emit a warning if not unique
2. change the warning to an exception
Optionally:
3. change the meaning of the function to that of intersect1d_nu if the
keyword argument is not present

One could do something similar with setmember1d.

This would remove the pitfall of the 1d assumption and the wart of the
_nu names without hampering performance for people who know they have
unique arrays and are in a hurry.

Anne
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] performance matrix multiplication vs. matlab

2009-06-04 Thread Anne Archibald
2009/6/4 David Paul Reichert d.p.reich...@sms.ed.ac.uk:
 Hi all,

 I would be glad if someone could help me with
 the following issue:

  From what I've read on the web it appears to me
 that numpy should be about as fast as matlab. However,
 when I do simple matrix multiplication, it consistently
 appears to be about 5 times slower. I tested this using

 A = 0.9 * numpy.matlib.ones((500,100))
 B = 0.8 * numpy.matlib.ones((500,100))

 def test():
     for i in range(1000):
         A*B.T

 I also used ten times larger matrices with ten times less
 iterations, used xrange instead of range, arrays instead
 of matrices, and tested it on two different machines,
 and the result always seems to be the same.

 Any idea what could go wrong? I'm using ipython and
 matlab R2008b.

Apart from the implementation issues people have chimed in about
already, it's worth noting that the speed of matrix multiplication
depends on the memory layout of the matrices. So generating B instead
directly as a 100 by 500 matrix might affect the speed substantially
(I'm not sure in which direction). If MATLAB's matrices have a
different memory order, that might be a factor as well.
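
A quick sketch of how to test the memory-order effect:

import numpy as np

A = 0.9 * np.ones((500, 100))
B = 0.8 * np.ones((500, 100))

Bt = np.ascontiguousarray(B.T)   # same values as B.T, stored as a (100, 500) block

C1 = np.dot(A, B.T)              # multiplies a strided view
C2 = np.dot(A, Bt)               # multiplies a contiguous array; compare the timings
assert np.allclose(C1, C2)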

Anne

 Thanks,

 David


 --
 The University of Edinburgh is a charitable body, registered in
 Scotland, with registration number SC005336.


 ___
 Numpy-discussion mailing list
 Numpy-discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Vectorizing array updates

2009-04-29 Thread Anne Archibald
2009/4/29 Dan Goodman dg.gm...@thesamovar.net:
 Robert Kern wrote:
 On Wed, Apr 29, 2009 at 16:19, Dan Goodman dg.gm...@thesamovar.net wrote:
 Robert Kern wrote:
 On Wed, Apr 29, 2009 at 08:03, Daniel Yarlett daniel.yarl...@gmail.com 
 wrote:

 As you can see, Current is different in the two cases. Any ideas how I
 can recreate the behavior of the iterative process in a more numpy-
 friendly, vectorized (and hopefully quicker) way?
 Use bincount().
 Neat. Is there a memory efficient way of doing it if the indices are
 very large but there aren't many of them? e.g. if the indices were
 I=[1, 2] then bincount would create a gigantic array of 2
 elements for just two addition operations!

 indices -= indices.min()


 Ah OK, but bincount is still going to make an array of 1 elements in
 that case...

 I came up with this trick, but I wonder whether it's overkill:

 Supposing you want to do y+=x[I] where x is big and indices in I are
 large but sparse (although potentially including repeats).

 J=unique(I)
 K=digitize(I,J)-1
 b=bincount(K)
 y+=b*x[J]

 Well it seems to work, but surely there must be a nicer way?

Well, a few lines serve to rearrange the indices array so it contains
each number only once, and then to add together (with bincount) all
values with the same index:

ix = np.unique1d(indices)
reverse_index = np.searchsorted(ix, indices)
r = np.bincount(reverse_index, weights=values)

You can then do:

y[ix] += r
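
Put together as a runnable sketch (np.unique plays the role of the old
unique1d in current numpy):

import numpy as np

y = np.zeros(10)
indices = np.array([7, 2, 7, 7, 2])
values = np.array([1.0, 1.0, 1.0, 1.0, 1.0])

ix = np.unique(indices)
reverse_index = np.searchsorted(ix, indices)
r = np.bincount(reverse_index, weights=values)

y[ix] += r
print(y)    # 2.0 at index 2, 3.0 at index 7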

All this could be done away with if bincount supported input and
output arguments. Then

y[indices] += values

could become

np.bincount(indices, weights=values, input=y, output=y)

implemented in the same nice tight loop one would use in C. (I'm not
sure bincount is the right name for such a general function, though.)

This question comes up fairly often on the mailing list, so it would
be nice to have a single function to point to.


Anne
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Another Array

2009-04-10 Thread Anne Archibald
2009/4/10 Ian Mallett geometr...@gmail.com:

 The vectors are used to jitter each particle's initial speed, so that the
 particles go in different directions instead of moving all as one.  Using
 the unit vector causes the particles to make the smooth parabolic shape.
 The jitter vectors much then be of a random length, so that the particles go
 in all different directions at all different speeds, instead of just all in
 different directions.

Why not just skip the normalization? Then you'll get vectors with
random direction and a natural distribution of lengths. And it'll be
faster than the unit vectors...
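
A minimal sketch, assuming a common base velocity for the particles (the
numbers are only illustrative):

import numpy as np

n = 1000
base = np.tile([0.0, 1.0, 0.0], (n, 1))   # hypothetical shared launch velocity

# Gaussian components give uniformly distributed directions, and skipping
# the normalization leaves a natural spread of speeds as well
jitter = 0.1 * np.random.randn(n, 3)
velocities = base + jitter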

Anne
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] using reducing functions without eliminating dimensions?

2009-04-09 Thread Anne Archibald
2009/4/9 Charles R Harris charlesr.har...@gmail.com:


 On Tue, Apr 7, 2009 at 12:44 PM, Dan Lenski dlen...@gmail.com wrote:

 Hi all,

 I often want to use some kind of dimension-reducing function (like min(),
 max(), sum(), mean()) on an array without actually removing the last
 dimension, so that I can then do operations broadcasting the reduced
 array back to the size of the full array.  Full example:

   table.shape
  (47, 1814)

   table.min(axis=1).shape
  (47,)

   table - table.min(axis=1)
  ValueError: shape mismatch: objects cannot be broadcast to a single
 shape

   table - table.min(axis=1)[:, newaxis]

 I have to resort to ugly code with lots of stuff like ... axis=1)[:,
 newaxis].

 Is there any way to get the reducing functions to leave a size-1 dummy
 dimension in place, to make this easier?

 Not at the moment. There was talk a while back of adding a keyword for this
 option, it would certainly make things easier for some common uses. It might
 be worth starting that conversation up again.

What's wrong with np.amin(a,axis=-1)[...,np.newaxis]?
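
In current numpy the reducing methods also take a keepdims argument, which
leaves the size-1 axis in place:

import numpy as np

table = np.random.rand(47, 1814)
centered = table - table.min(axis=1, keepdims=True)   # broadcasts back cleanly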

Anne

 Chuck



 ___
 Numpy-discussion mailing list
 Numpy-discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Optical autocorrelation calculated with numpy is slow

2009-03-30 Thread Anne Archibald
2009/3/30 João Luís Silva jsi...@fc.up.pt:
 Hi,

 I wrote a script to calculate the *optical* autocorrelation of an
 electric field. It's like the autocorrelation, but sums the fields
 instead of multiplying them. I'm calculating

 I(tau) = integral( abs(E(t)+E(t-tau))**2,t=-inf..inf)

You may be in trouble if there's cancellation, but can't you just
rewrite this as abs(E(t))**2 + abs(E(t-tau))**2 + 2*Re(E(t)*conj(E(t-tau)))? Then you have two
O(n) integrals and one standard autocorrelation...
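
A minimal sketch of that rewrite, assuming the field is zero outside the
sampled window (which is what the zero-padding in the original script does):

import numpy as np

n = 2**12
t = np.linspace(-50.0, 50.0, n)
dt = t[1] - t[0]
w0 = 2.0 * np.pi * 3e2 / 800.0
E = np.exp(-(t / 10.0)**2) * np.exp(1j * w0 * t)

energy = np.sum(np.abs(E)**2)                # integral of |E(t)|**2, independent of tau
cross = np.correlate(E, E, mode='full')      # sum over t of E(t) * conj(E(t - tau))
Ac = (2.0 * energy + 2.0 * cross.real) * dt  # |E(t) + E(t - tau)|**2 summed over t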

Anne

 with script appended at the end. It's too slow for my purposes (takes ~5
 seconds, and scales ~O(N**2)). numpy's correlate is fast enough, but
 isn't what I need as it multiplies instead of add the fields. Could you
 help me get this script to run faster (without having to write it in
 another programming language) ?

 Thanks,
 João Silva

 #

 import numpy as np
 #import matplotlib.pyplot as plt

 n = 2**12
 n_autocorr = 3*n-2

 c = 3E2
 w0 = 2.0*np.pi*c/800.0
 t_max = 100.0
 t = np.linspace(-t_max/2.0,t_max/2.0,n)

 E = np.exp(-(t/10.0)**2)*np.exp(1j*w0*t)    #Electric field

 dt = t[1]-t[0]
 t_autocorr=np.linspace(-dt*n_autocorr/2.0,dt*n_autocorr/2.0,n_autocorr)
 E1 = np.zeros(n_autocorr,dtype=E.dtype)
 E2 = np.zeros(n_autocorr,dtype=E.dtype)
 Ac = np.zeros(n_autocorr,dtype=np.float64)

 E2[n-1:n-1+n] = E[:]

 for i in range(2*n-2):
     E1[:] = 0.0
     E1[i:i+n] = E[:]

     Ac[i] = np.sum(np.abs(E1+E2)**2)

 Ac *= dt

 #plt.plot(t_autocorr,Ac)
 #plt.show()

 #

 ___
 Numpy-discussion mailing list
 Numpy-discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] array of matrices

2009-03-28 Thread Anne Archibald
2009/3/28 Geoffrey Irving irv...@naml.us:
 On Sat, Mar 28, 2009 at 12:47 AM, Robert Kern robert.k...@gmail.com wrote:
 2009/3/27 Charles R Harris charlesr.har...@gmail.com:

 On Fri, Mar 27, 2009 at 4:43 PM, Robert Kern robert.k...@gmail.com wrote:

 On Fri, Mar 27, 2009 at 17:38, Bryan Cole br...@cole.uklinux.net wrote:
  I have a number of arrays of shape (N,4,4). I need to perform a
  vectorised matrix-multiplication between pairs of them I.e.
  matrix-multiplication rules for the last two dimensions, usual
  element-wise rule for the 1st dimension (of length N).
 
  (How) is this possible with numpy?

 dot(a,b) was specifically designed for this use case.

 I think maybe he wants to treat them as stacked matrices.

 Oh, right. Sorry. dot(a, b) works when a is (N, 4, 4) and b is just
 (4, 4). Never mind.

 It'd be great if this operation existed as a primitive.  What do you
 think would be the best way in which to add it?  One option would be
 to add a keyword argument to dot giving a set of axes to map over.
 E.g.,

    dot(a, b, map=0) = array([dot(u,v) for u,v in zip(a,b)]) # but in C

 map isn't a very good name for the argument, though.

I think the right long-term solution is to make dot (and some other
linear algebra functions) into generalized ufuncs, so that when you
dot two multidimensional objects together, they are treated as arrays
of two-dimensional arrays, broadcasting is done on all but the last
two dimensions, and then the linear algebra is applied elementwise.
This covers basically all stacked matrices uses in a very general
way, but would require some redesigning of the linear algebra system -
for example, dot() currently works on both two- and one-dimensional
arrays, which can't work in such a setting.

The infrastructure to support such generalized ufuncs has been added
to numpy, but as far as I know no functions yet make use of it.
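
In more recent numpy, np.matmul (and the @ operator) covers the stacked-matrix
case by broadcasting over the leading dimensions:

import numpy as np

a = np.random.randn(10, 4, 4)
b = np.random.randn(10, 4, 4)

c = np.matmul(a, b)                          # shape (10, 4, 4), stacked products
assert np.allclose(c[3], np.dot(a[3], b[3]))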

Anne


 Geoffrey
 ___
 Numpy-discussion mailing list
 Numpy-discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Cython numerical syntax revisited

2009-03-06 Thread Anne Archibald
2009/3/5 Francesc Alted fal...@pytables.org:
 On Thursday 05 March 2009, Francesc Alted wrote:
 Well, I suppose that, provided that Cython could perform the for-loop
 transformation, giving support for strided arrays would be relatively
 trivial, and the performance would be similar than numexpr in this
 case.

 Mmh, perhaps not so trivial, because that implies that the stride of an
 array should be known in compilation time, and that would require a new
 qualifier when declaring the array.  Tricky...

Not necessarily. You can transform

a[1,2,3]

into

*(a.data + 1*a.strides[0] + 2*a.strides[1] + 3*a.strides[2])

without any need for static information beyond that a is
3-dimensional. This would already be valuable, though perhaps you'd
want to be able to declare that a particular dimension had stride 1 to
simplify things. You could then use this same implementation to add
automatic iteration.

Anne


 Cheers,

 --
 Francesc Alted
 ___
 Numpy-discussion mailing list
 Numpy-discussion@scipy.org
 http://projects.scipy.org/mailman/listinfo/numpy-discussion

___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Interpolation via Fourier transform

2009-03-05 Thread Anne Archibald
2009/3/5 M Trumpis mtrum...@berkeley.edu:
 Hi Nadav.. if you want a lower resolution 2d function with the same
 field of view (or whatever term is appropriate to your case), then in
 principle you can truncate your higher frequencies and do this:

 sig = ifft2_func(sig[N/2 - M/2:N/2 + M/2, N/2 - M/2:N/2+M/2])

 I like to use an fft that transforms from an array indexing
 negative-to-positive freqs to an array that indexes
 negative-to-positive spatial points, so in both spaces, the origin is
 at (N/2,N/2). Then the expression works as-is.

 The problem is if you've got different indexing in one or both spaces
 (typically positive frequencies followed by negative) you can play
 around with a change of variables in your DFT in one or both spaces.
 If the DFT is defined as a computing frequencies from 0,N, then
 putting in n' = n-N/2  leads to a term like exp(1j*pi*q) that
 multiplies f[q]. Here's a toy example:

 a = np.cos(2*np.pi*5*np.arange(64)/64.)

 P.plot(np.fft.fft(a).real)

 P.plot(np.fft.fft(np.power(-1,np.arange(64))*a).real)

 The second one is centered about index N/2

 Similarly, if you need to change the limits of the summation of the
 DFT from 0,N to -N/2,N/2, then you can multiply exp(1j*pi*n) to the
 outside of the summation.

 Like I said, easy enough in principle!

There's also the hit-it-with-a-hammer approach: Just downsample in x
then in y, using the one-dimensional transforms.

Anne
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Floating point question

2009-03-02 Thread Anne Archibald
On 02/03/2009, Gideon Simpson simp...@math.toronto.edu wrote:
 I recently discovered that for 8 byte floating point numbers, my
  fortran compilers (gfortran 4.2 and ifort 11.0) on an OS X core 2 duo
  machine believe the  smallest number 2.220507...E-308.  I presume that
  my C compilers have similar results.

  I then discovered that the smallest floating point number in python
  2.5 is 4.9065...E-324.  I have been using numpy to generate data,
  saving it with savetxt, and then reading it in as ASCII into my
  fortran code.  Recently, it crapped out on something because it didn't
  like reading it a number that small, though it is apparently perfectly
  acceptable to python.

  My two questions are:

  1.  What is the best way to handle this?  Is it just to add a filter
  of the form

 u = u * ( np.abs(u) > 2.3e-308)

  2.  What gives?  What's the origin of this (perceived) inconsistency
  in floating points across languages within the same platform?

  I recognize that this isn't specific to Scipy/Numpy, but thought
  someone here might have the answer.

What's happening is that numbers like 1e-310 are represented by
denormals. If we use base 10 to explain, suppose that floating point
numbers could only have five digits of mantissa and two digits of
exponent:

1.3000 e 00
2.1000 e-35
1.0000 e-99

Now what to do if you divide that last number in half? You can write it as:

0.5000 e-99

But this is a bit of an anomaly: unlike all normal floating-point
numbers, it has a leading zero, and there are only four digits of
information in the mantissa.  In binary it's even more of an anomaly,
since in binary you can take advantage of the fact that all normal
mantissas start with one by not bothering to store the one. But it
turns out that implementing denormals is useful to provide graceful
degradation as numbers underflow, so it's in IEEE. It appears that
some FORTRAN implementations cannot handle denormals, which is giving
you trouble. It's usually fairly safe to simply replace all denormals
by zero.
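
A sketch of that flush-to-zero step:

import numpy as np

u = np.array([1.3, 1e-310, -2.5e-320, 0.0, 4.2])

tiny = np.finfo(np.float64).tiny     # smallest *normal* double, about 2.2e-308
u[np.abs(u) < tiny] = 0.0            # replaces the denormals with ordinary zeros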

Anne
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Concatenating Arrays to make Views

2008-12-15 Thread Anne Archibald
2008/12/15 Benjamin Haynor bhay...@hotmail.com:

 I was wondering if I can concatenate 3 arrays, where the result will be a
 view of the original three arrays, instead of a copy of the data.  For
 example, suppose I write the following
 import numpy as n
 a = n.array([[1,2],[3,4]])
 b = n.array([[5,6],[7,8]])
 c = n.array([[9,10],[11,12]])
 d = n.r_[a,b,c]
 Now d = :
 [[1,2],
 [3,4],
 [5,6],
 [7,8],
 [9,10],
 [11,12]]
 I was hoping to get an array, such that, when I change d, a, b, and c will
 also change appropriately.
 Any ideas?

An array must be a contiguous piece of memory, so this is impossible
unless you allocate d first and make a b and c views of it.

Anne
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Shape (z,0)

2008-11-28 Thread Anne Archibald
2008/11/28 T J [EMAIL PROTECTED]:
 import numpy as np
 x = np.ones((3,0))
 x
 array([], shape=(3, 0), dtype=float64)

 To preempt, I'm not really concerned with the answer to:  Why would
 anyone want to do this?

 I just want to know what is happening.  Especially, with

 x[0,:] = 5

 (which works).  It seems that nothing is really happening here...given
 that, why is it allowed? Ie, are there reasons for not requiring the
 shape dimensions to be greater than 0?

So that scripts can work transparently with arrays of all sizes:

In [1]: import numpy as np

In [3]: a = np.random.randn(5); b = a[a>1]; print b.shape
(1,)

In [4]: a = np.random.randn(5); b = a[a>1]; print b.shape
(1,)

In [5]: a = np.random.randn(5); b = a[a>1]; print b.shape
(0,)

In [10]: b[:]=b[:]-1

The : just means all, so it's fine to use it if there aren't any
(all of none).

Basically, if this were not permitted there would be a zillion little
corner cases that code trying to be generic would have to deal with.

There is a similar issue with zero-dimensional arrays for code that is
trying to be generic for number of dimensions. That is, you want to be
able to do something like:

In [12]: a = a-np.mean(a,axis=-1)[...,np.newaxis]

and have it work whether a is an n-dimensional array, in which case
np.mean(a,axis=-1) is (n-1)-dimensional, or a is a one-dimensional
array, in which case np.mean(a,axis=-1) is a zero-dimensional array,
or maybe a scalar:

In [13]: np.mean(a)
Out[13]: 0.0

In [14]: 0.0[...,np.newaxis]
---
TypeError Traceback (most recent call last)

/homes/janeway/aarchiba/<ipython console> in <module>()

TypeError: 'float' object is unsubscriptable

In [15]: np.mean(a)[...,np.newaxis]
Out[15]: array([ 0.])

Normalizing this stuff is still ongoing; it's tricky, because you
often want something like np.mean(a) to be just a number, but generic
code wants it to behave like a zero-dimensional array. Currently numpy
supports both array scalars, that is, numbers of array dtypes, and
zero-dimensional arrays; they behave mostly alike, but there are a few
inconsistencies (and it's arguably redundant to have both).

That said, it is often far easier to write generic code by flattening
the input array, dealing with it as a guaranteed-one-dimensional
array, then reconstructing the correct shape at the end, but such code
is kind of a pain to write.

Anne
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] New ufuncs

2008-11-04 Thread Anne Archibald
On 05/11/2008, Charles R Harris [EMAIL PROTECTED] wrote:


 On Tue, Nov 4, 2008 at 11:05 PM, T J [EMAIL PROTECTED] wrote:
  On Tue, Nov 4, 2008 at 9:37 PM, Anne Archibald
 
  [EMAIL PROTECTED] wrote:
 
   2008/11/5 Charles R Harris [EMAIL PROTECTED]:
   Hi All,
  
   I'm thinking of adding some new ufuncs. Some possibilities are
  
   expadd(a,b) = exp(a) + exp(b) -- For numbers stored as logs:
  
   Surely this should be log(exp(a)+exp(b))? That would be extremely
 useful, yes.
  
 
  +1
 
  But shouldn't it be called 'logadd', for adding values which are stored as
 logs?
 

 Hmm... but I'm thinking one has to be clever here because the main reason I
 heard for using logs was that normal floating point numbers had insufficient
 range. So maybe something like

 logadd(a,b) = a + log(1 + exp(b - a))

 where a > b ?

That's the usual way to do it, yes. I'd use log1p(exp(b-a)) for a
little extra accuracy, though it probably doesn't matter.  And yes,
using logadd.reduce() is not the most efficient way to get a logsum();
no reason it can't be a separate function. As T J says, a logdot()
would come in handy too. A python implementation is a decent first
pass, but logdot() in particular would benefit from a C
implementation.
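
A python-level sketch of such a logadd:

import numpy as np

def logadd(a, b):
    """log(exp(a) + exp(b)) computed without overflowing, elementwise."""
    hi = np.maximum(a, b)
    lo = np.minimum(a, b)
    return hi + np.log1p(np.exp(lo - hi))

print(logadd(1000.0, 1000.0))   # about 1000.6931, where exp(1000) alone overflows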


Anne
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] How to do: y[y<T] = y+T

2008-10-27 Thread Anne Archibald
If what you are trying to do is actually ensure all data is within the
range [a,b], you may be interested to know that python's % operator
works on floating-point numbers:

In [1]: -0.1 % 1
Out[1]: 0.90000000000000002

So if you want all samples in the range (0,1) you can just do y%=1.

Anne

2008/10/27 Nicolas ROUX [EMAIL PROTECTED]:
 Thanks for all of you,
 for this fast and good reply ;-)


 Nicolas.

 -Original Message-
 From: [EMAIL PROTECTED]
 [mailto:[EMAIL PROTECTED] On Behalf Of David Douard
 Sent: Monday, October 27, 2008 12:51 PM
 To: numpy-discussion@scipy.org
 Subject: Re: [Numpy-discussion] How to do: y[y<T] = y+T

 On Monday 27 October 2008 12:41:06, Nicolas ROUX wrote:
 Hello,

 I hope this is not a silly question ;-)
 I have a Numpy array, and I want to process it with :
   if the value is lower than Threshold, then increase by Threshold


 I would like to translate it as:
   y[y<Treshold] = y + Treshold

 let's see :

 y[y<T] += T

 Is it what you want ?


 To benefit from the Numpy speed.
 But this doesn't work, any idea ?

 Thanks,
 Cheers,
 Nicolas.

 ___
 Numpy-discussion mailing list
 Numpy-discussion@scipy.org
 http://projects.scipy.org/mailman/listinfo/numpy-discussion



 --
 David Douard                        LOGILAB, Paris (France), +33 1 45 32 03 12
 Python, Zope, Debian training :     http://www.logilab.fr/formations
 Custom software development :       http://www.logilab.fr/services
 Scientific computing :              http://www.logilab.fr/science

 ___
 Numpy-discussion mailing list
 Numpy-discussion@scipy.org
 http://projects.scipy.org/mailman/listinfo/numpy-discussion

___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Proposal: scipy.spatial

2008-10-12 Thread Anne Archibald
2008/10/9 David Bolme [EMAIL PROTECTED]:
 I have written up basic nearest neighbor algorithm.  It does a brute
 force search so it will be slower than kdtrees as the number of points
 gets large.  It should however work well for high dimensional data.  I
 have also added the option for user defined distance measures.  The
 user can set a default p.  p has the same functionality if it is a
 float.  p can also be a function that computes a distance matrix or
 the measure can be selected using the strings: Manhattan,
 Euclidean, or Correlation.

 https://pyvision.svn.sourceforge.net/svnroot/pyvision/trunk/src/pyvision/vector/knn.py

This is interesting. I would point out, though, that if you want a
Minkowski norm, it may be more efficient (that is, faster) to use the
new compiled kd-tree implementation with leafsize set to the size of
your data. This is written in compiled code and uses short-circuit
distance evaluation, and may be much faster for high-dimensional
problems.
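
Something like the following (untested, and assuming the compiled tree
ends up exposed as scipy.spatial.cKDTree with a leafsize argument):

import numpy as np
from scipy.spatial import cKDTree

data = np.random.randn(1000, 50)          # high-dimensional points
# leafsize >= number of points means a single leaf, i.e. brute force,
# but using the compiled short-circuit Minkowski distance code
tree = cKDTree(data, leafsize=len(data))
dist, idx = tree.query(np.random.randn(50), k=5, p=2)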

Given that, this should perhaps go in with other generic metric space
code. I have a functional implementation of ball trees (though I don't
know how efficient they are), and am looking into implementing cover
trees.

 The interface is similar to Anne's code and in many cases can be used
 as a drop in replacement.  I have posted the code to my own project
 because I have a short term need and I do not have access to the scipy
 repository.  Feel free to include the code with scipy under the scipy
 license.

 I did find a typo in your documentation.
 typo: trie -> tree - "... kd-tree is a binary trie, each of whose ..."

That's not actually a typo: a trie is a tree in which all the data is
stored at leaf nodes. Basic kd-trees use the nodes themselves to
define splitting planes; you can actually construct one with no extra
storage at all, just by choosing an appropriate order for your array
of points. This implementation chooses splitting planes that may not
pass through any point, so all the points get stored in leaves.

 Also I found the use of k in the documentation somewhat confusing as
 it is the dimensionality of the data points in the kd-tree and the
 number of neighbors for k-nearest neighbors.

That's a good point. I changed the dimension of the kd-tree to m throughout.

Anne
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Data types in Numerical Python

2008-10-12 Thread Anne Archibald
2008/10/12 Linda Seltzer [EMAIL PROTECTED]:
 Here is an example that works for any working numpy installation:

 import numpy as npy
 npy.zeros((256, 256))
 This suggestion from David did work so far, and removing the other import
 line enabled the program to run.
 However, the data types the program used as defaults for variables has
 changed, and now I am getting error messages about data types.  It seems
 that some variables are getting a default designation as floats.  Before I
 installed numpy and needed 2-D arrays, the program was working with the
 default types, and I did not have to specify types.
 Is there a clear tutorial that describes a means to assign data types for
 each variable as in C, so that I don't obtain error messages about data
 types?

Python is a dynamically-typed language (unlike C), so variables do not
have type. That is, a variable can refer to an object of any type; if
you need to know what type of object a variable currently refers to
you must inspect the object. You may want to go through one of the
brief introduction-to-python tutorials that are on the python website
just to get comfortable with the language. (For example, understanding
the meaning and syntax of import statements.)

When you create a numpy array, you can specify its type. You can also
explicitly or implicitly convert the types of numpy arrays. I
recommend you select a data type, let's say np.uint32, and make sure
various arrays are created containing that type:

np.zeros((m,n),dtype=np.uint32)
np.arange(10,dtype=np.uint32)
x.astype(np.uint32)
np.array([1,2,3,4.5], dtype=np.uint32)

et cetera. Most operations (addition, multiplication, maximum) will
preserve the data type of arrays they are given (but if you supply two
different data types numpy will attempt to choose the larger).

 Because I am simulating code for a DSP processor, the data types I need
 are unsigned bytes, unsigned 32-bit ints, and signed 32-bit ints.  In some
 cases I can use unsigned and signed 16-bit ints.
 Also, what data types are valid for use with local operations such as
 exclusive or?

The numpy data types you want are described by dtype objects. These
can in principle become quite complicated, but the ones you need are
given names already:

np.uint8
np.uint32
np.int32
np.uint16
np.int16

You can specify any of these as dtype= arguments to the various
numpy functions. If you need really rigid typing, python may not be
the ideal language for you.
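
As for the exclusive-or question, the bitwise operators work
elementwise on arrays of any of the integer dtypes above; a quick
(untested) example:

import numpy as np

a = np.array([0x0F, 0xF0, 0xFF], dtype=np.uint8)
b = np.array([0xFF, 0xFF, 0x00], dtype=np.uint8)

c = a ^ b      # same as np.bitwise_xor(a, b); c is [240, 15, 255], dtype uint8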

Good luck,
Anne
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Vectorizing dot on the two last axis

2008-10-10 Thread Anne Archibald
2008/10/10 Gael Varoquaux [EMAIL PROTECTED]:

 I have been unable to vectorize the following operation::

window_size = 10
nb_windows = 819
nb_clusters = 501
restricted_series = np.random.random(size=(window_size, nb_clusters,
                                            nb_windows))
this_cov = np.zeros((nb_windows, nb_clusters, nb_clusters))
print >> sys.stderr, "alive"
for index, window in enumerate(restricted_series):
    this_cov[index, ...] = np.dot(window, window.T)


 The last for loop is the one I am unhappy with.

 Note that it is fairly easy to get huge arrays when trying the vectorize
 this through np.dot or np.tensordot.

 I am unhappy with myself: I feel that it should be possible to vectorize
 this operation, but I cannot figure it out.

I am pretty sure that, unfortunately, there is no way to vectorize
this without an intermediate array substantially larger than either
the inputs or the outputs. (Add one to the tally of people wishing for
ufunc-like linear algebra.) From a computational point of view, this
isn't particularly a problem: the intermediate array cannot be large
enough to be a problem without the slices along at least one axis
being large enough that for loop overhead is pretty minor. So you can
just loop over this axis. Coding-wise, of course, this is a royal
pain: unless you know ahead of time which axis is the smallest, you
need to code several versions of your routine, and each version must
include a for loop (just what numpy is supposed to help avoid).

So: ufunc-like linear algebra would be great, but in the meantime, I
guess we all learn to live with for loops.
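
A minimal, untested sketch of the loop-over-one-axis approach, for the
case where the first axis happens to be the short one:

import numpy as np

def batch_outer(a):
    # a has shape (n, p, q); returns c with shape (n, p, p), where
    # c[i] = dot(a[i], a[i].T), looping over the (hopefully short) first axis
    n, p, q = a.shape
    c = np.empty((n, p, p), dtype=a.dtype)
    for i in range(n):
        c[i] = np.dot(a[i], a[i].T)
    return c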

Anne
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] collecting the bluest pixels

2008-10-07 Thread Anne Archibald
2008/10/7 paul taney [EMAIL PROTECTED]:
 Hi,

 I have this silly color filter that Stefan gave me:


 def vanderwalt(image, f):
     """colorfilter, thanks to Stefan van der Walt"""
     RED, GRN, BLU = 0, 1, 2
     bluemask = (image[...,BLU] > f*image[...,GRN]) & \
                (image[...,BLU] > f*image[...,RED])

     return bluemask


 To collect the right number of the bluest pixels I am calling it from this 
 arduous successive approximation routine.  It occurred to me that someone on 
 this list knows how to do this in a couple of lines...

Well, I can see several approaches. The most direct way to do what
you're asking is to use scipy.optimize.bisect to implement the
successive approximations. That'll be almost as slow as your current
approach, though. Instead, I'd first write a function that measures
the blueness of each pixel:

def blueness(image):
    A = np.empty(image.shape[:-1], dtype=np.float32)
    # use three-argument divide to reduce the number of float temporaries
    np.divide(image[...,BLU], image[...,RED], A)
    B = np.empty(image.shape[:-1], dtype=np.float32)
    np.divide(image[...,BLU], image[...,GRN], B)
    return np.minimum(A, B)

Now, once you have the bluenesses, you can sort them and pull out the
blueness that gives you the percent you want:

bluenesses = np.sort(blueness(image),axis=None) # axis=None flattens the array
factor = bluenesses[int((1-wanted_fraction)*len(bluenesses))]

If you want a smarter blueness filter, you could look into using HSV:
you could then specify a distance in hue and a distance in saturation,
based on how relatively important you think they are.

Anne
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Merge of generalised ufuncs branch

2008-10-07 Thread Anne Archibald
2008/10/7 Stéfan van der Walt [EMAIL PROTECTED]:

 The generalised ufuncs branch was made available before SciPy'08.  We
 solicited comments on its implementation and structuring, but received
 very little feedback.  Unless there are any further comments from the
 community, I propose that we merge it.

Sounds good to me - I've counted at least three or four threads on the
mailing lists wishing for the ufuncized linear algebra this would
allow since it was put forward. (Of course, these won't appear until
someone implements them - perhaps a start would be for someone to
write a tutorial on using the new generalized ufunc code...)

 It is unfortunate that we have so many patches waiting for review
 (SciPy suffers worst, I'm afraid); clearly there are too few hours in
 a day.  Nothing discourages contributions as much as a stale project,
 and I hope we can avoid that.

The problem may perhaps be that not many people feel they are in a
position to actually do the reviews, so that everyone is waiting on an
imagined few real developers to place the Official Stamp of
Approval. Perhaps an informal rule of thumb for acceptance of patches?
How about: posted to list, reviewed by someone not the author who's
had at least one patch accepted before, and no objections on the list?
Anything that receives objections can fall through to the usual
decision by discussion on the mailing list, this is just intended for
those patches that everyone just kind of shrugs and says "looks all
right to me".

Anne
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] collecting the bluest pixels

2008-10-07 Thread Anne Archibald
2008/10/7 Christopher Barker [EMAIL PROTECTED]:

 I wonder if the euclidian norm would make sense for this application:

 HowFarFromBlue = np.sqrt((255-image[...,BLU])**2 +
  image[...,GRN]**2 +
  image[...,RED]**2)

 smaller numbers would be bluest -- pure blue would be 0, pure red 360,
 etc...

 One thing I like about this is that your blue may not exactly be an
 RBG blue -- so you could see how far a given pixel was from any given
 color -- in your case, whatever your blue is. Then it would be something
 like:

 r, g, b = ref_color

 HowFarFromRefColor = np.sqrt((r - image[...,RED])**2 +
  (g - image[...,GRN])**2 +
  (b - image[...,BLU])**2
  )


 NOTE: I know nothing of image processing -- I'll bet there are some
 standard ways to determine how close two colors are.

It's a tricky problem, but if you're serious about it you can use
Euclidean distance in the CIELUV colour space, or if you're really
serious Euclidean distance in CIELAB. Both probably overkill for this
project.

Anne
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Proposal: scipy.spatial

2008-10-03 Thread Anne Archibald
2008/10/3 David Bolme [EMAIL PROTECTED]:
 I remember reading a paper or book that stated that for data that has
 been normalized correlation and Euclidean are equivalent and will
 produce the same knn results.  To this end I spent a couple hours this
 afternoon doing the math.  This document is the result.

 http://www.cs.colostate.edu/~bolme/bolme08euclidean.pdf

Yes, you're right, even without the mean subtraction they all lie on a
hypersphere in Euclidean space. It's a little more awkward if you want
to identify x and -x.

 I believe that with mean subtracted and unit length vectors, a
 Euclidean knn algorithm will produces the same result as if the
 vectors were compared using correlation.  I am not sure if kd-trees
 will perform well on the normalized vectors as they have a very
 specific geometry.  If my math checks out it may be worth adding
 Pearson's correlation as a default option or as a separate class.

Actually it's probably easier if the user just does the prenormalization.
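
The prenormalization itself is only a couple of lines; an untested
sketch:

import numpy as np

def prenormalize(x):
    # subtract each row's mean and rescale it to unit length, so that
    # Euclidean nearest neighbors coincide with highest-correlation neighbors
    x = np.asarray(x, dtype=float)
    x = x - x.mean(axis=-1)[..., np.newaxis]
    return x / np.sqrt((x**2).sum(axis=-1))[..., np.newaxis]

After this the squared Euclidean distance between two rows is 2*(1-r),
with r the correlation, so the nearest neighbor in Euclidean distance
is also the most correlated one.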

 I have also spent a little time looking at kd-trees and the kdtree
 code.  It looks good.  As I understand it kd-trees only work well when
 the number of datapoints (N) is larger than 2^D, where D is the
 dimensionality of those points.  This will not work well for many of
 my computer vision problems because often D is large.  As Anne
 suggested I will probably look at cover trees because often times the
 data are low-dimensional data in high-dimensional spaces.  I have
 been told though that with a large D there is no known fast
 algorithm for knn.

Pretty much true. Though if the intrinsic dimensionality is low, cover
trees should be all right.

 Another problem is that the distances and similarity measures used in
 biometrics and computer vision are often very specialized and may or
 may not conform to the underlying assumptions of fast algorithms.  I
 think for this reason I will need an exhaustive search algorithm.  I
 will code it up modeled after Anne's interface and hopefully it will
 make it into the spatial module.

Metric spaces are quite general - edit distance for strings, for
example, is an adequate distance measure. But brute-force is
definitely worth having too.

If I get the test suite cleaned up, it should be possible to just drop
an arbitrary k-nearest-neighbors class into it and get a reasonably
thorough collection of unit tests.

Anne
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Proposal: scipy.spatial

2008-10-02 Thread Anne Archibald
2008/10/2 David Bolme [EMAIL PROTECTED]:
 It may be useful to have an interface that handles both cases:
 similarity and dissimilarity.  Often I have seen Nearest Neighbor
 algorithms that look for maximum similarity instead of minimum
 distance.  In my field (biometrics) we often deal with very
 specialized distance or similarity measures.  I would like to see
 support for user defined distance and similarity functions.  It should
 be easy to implement by passing a function object to the KNN class.  I
 am not sure if kd-trees or other fast algorithms are compatible with
 similarities or non-euclidian norms, however I would be willing to
 implement an exhaustive search KNN that would support user defined
 functions.

kd-trees can only work for distance measures which have certain
special properties (in particular, you have to be able to bound them
based on coordinate differences). This is just fine for all the
Minkowski p-norms (so in particular, Euclidean distance,
maximum-coordinate-difference, and Manhattan distance) and in fact the
current implementation already supports all of these. I don't think
that correlation can be made into such a distance measure - the
neighborhoods are the wrong shape. In fact the basic space is
projective n-1 space rather than affine n-space, so I think you're
going to need some very different algorithm. If you make a metric
space out of it - define d(A,B) to be the angle between A and B - then
cover trees can serve as a spatial data structure for nearest-neighbor
search. Cover trees may be worth implementing, as they're a very
generic data structure, suitable for (among other things)
low-dimensional data in high-dimensional spaces.

Anne
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Proposal: scipy.spatial

2008-10-01 Thread Anne Archibald
2008/10/1 Gael Varoquaux [EMAIL PROTECTED]:
 On Tue, Sep 30, 2008 at 06:10:46PM -0400, Anne Archibald wrote:
  k=None in the third call to T.query seems redundant. It should be
  possible to put some logic so that the call is simply

  distances, indices = T.query(xs, distance_upper_bound=1.0)

 Well, the problem with this is that you often want to provide a
 distance upper bound as well as a number of nearest neighbors.

 Absolutely. I just think k should default to None, when
 distance_upper_bound is specified. k=None could be interpreted as k=1
 when distance_uppper_bound is not specified.

That seems very confusing. Better perhaps to have a function
query_all_neighbors, even if it's just a wrapper.

Anne
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Proposal: scipy.spatial

2008-10-01 Thread Anne Archibald
2008/10/1 Barry Wark [EMAIL PROTECTED]:

 Thanks for taking this on. The scikits.ann has licensing issues (as
 noted above), so it would be nice to have a clean-room implementation
 in scipy. I am happy to port the scikits.ann API to the final API that
 you choose, however, if you think that would be helpful.

That's a nice idea. I'm not totally sure yet how much it's going to be
possible for different implementations to be plug-in replacements, but
it sure would be nice if users could use ANN transparently.

Anne
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Upper and lower envelopes

2008-09-30 Thread Anne Archibald
2008/9/30 bevan [EMAIL PROTECTED]:
 Hello,

 I have some XY data.  I would like to generate the equations for an upper and
 lower envelope that excludes a percentage of the data points.

 I would like to define the slope of the envelope line (say 3) and then have my
 code find the intercept that fits my requirements (say 5% of data below the
 lower envelope).  This would then give me the equation and I could plot the
 upper and lower envelopes.


 I hope this makes sense.  Thanks for any help.

For this particular problem - where you know the slope - it's not too
hard. If the slope is b, and your points are x and y, compute y-b*x,
then sort that array, and choose the 5th and 95th percentile values.
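
An untested sketch, with some stand-in data for x and y:

import numpy as np

x = np.random.uniform(0, 10, 1000)      # stand-in data
y = 3*x + np.random.normal(0, 1, 1000)

b = 3.0                                 # the chosen envelope slope
r = np.sort(y - b*x)                    # intercept of a slope-b line through each point
lower = r[int(0.05*len(r))]             # ~5% of points lie below y = b*x + lower
upper = r[int(0.95*len(r))]             # ~5% of points lie above y = b*x + upper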

Anne
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Proposal: scipy.spatial

2008-09-30 Thread Anne Archibald
2008/9/30 Peter [EMAIL PROTECTED]:
 On Tue, Sep 30, 2008 at 5:10 AM, Christopher Barker
 [EMAIL PROTECTED] wrote:

 Anne Archibald wrote:
 I suggest the creation of
 a new submodule of scipy, scipy.spatial,

 +1

 Here's one to consider:
 http://pypi.python.org/pypi/Rtree
 and perhaps other stuff from:
 http://trac.gispython.org/spatialindex/wiki
 which I think is LGPL -- can scipy use that?

 There is also a KDTree module in Biopython (which is under a BSD/MIT
 style licence),
 http://biopython.org/SRC/biopython/Bio/KDTree/

 The current version is in C, there is an older version available in
 the CVS history in C++ too,
 http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/KDTree/?cvsroot=biopython

I think the profusion of different implementations is an argument for
including this in scipy. I think it is also an argument for providing
a standard interface with (at least potentially) several different
implementations. At the moment, that proposed interface looks like:

T = KDTree(data)

distances, indices = T.query(xs) # single nearest neighbor

distances, indices = T.query(xs, k=10) # ten nearest neighbors

distances, indices = T.query(xs, k=None, distance_upper_bound=1.0) #
all within 1 of x

In the first two cases, missing neighbors are represented with an
infinite distance and an invalid index. In the last case, distances
and indices are both either lists (if there's only one query point) or
object arrays of lists (if there are many query points). If only one
neighbor is requested, the array does not have a dimension of length 1
in the which-neighbor position. If (potentially) many neighbors are
returned, they are sorted by distance, nearest first.

What do you think of this interface?

It may make sense to provide additional kinds of query - nearest
neighbors between two trees, for example - some of which would be
available only for some implementations.

Anne
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Proposal: scipy.spatial

2008-09-30 Thread Anne Archibald
2008/9/30 Gael Varoquaux [EMAIL PROTECTED]:
 On Tue, Sep 30, 2008 at 05:31:17PM -0400, Anne Archibald wrote:
 T = KDTree(data)

 distances, indices = T.query(xs) # single nearest neighbor

 distances, indices = T.query(xs, k=10) # ten nearest neighbors

 distances, indices = T.query(xs, k=None, distance_upper_bound=1.0) #
 all within 1 of x

 What do you think of this interface?

 k=None in the third call to T.query seems redundant. It should be
 possible to put some logic so that the call is simply

 distances, indices = T.query(xs, distance_upper_bound=1.0)

Well, the problem with this is that you often want to provide a
distance upper bound as well as a number of nearest neighbors. For
example, suppose you are trying to find the point of closest approach
of a path to the set of points stored in a kd-tree. You do the first
query normally, but then since the second point is close to the first,
you can accelerate the search dramatically by providing an upper bound
of the distance from the first point's nearest neighbor to the second
point. With a little luck, this will save ever having to visit much of
the tree.
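
An untested sketch of that trick, using the eventual compiled tree
(scipy.spatial.cKDTree here) as a stand-in for the interface above:

import numpy as np
from scipy.spatial import cKDTree

points = np.random.randn(10000, 3)                       # the fixed point set
path = np.cumsum(0.01*np.random.randn(500, 3), axis=0)   # a slowly wandering path

T = cKDTree(points)
best = np.inf
prev = None
for pt in path:
    if prev is None:
        d, j = T.query(pt)
    else:
        # the new nearest-neighbor distance cannot exceed the old one
        # plus the step just taken, so pass that as an upper bound
        step = np.sqrt(((pt - prev)**2).sum())
        d, j = T.query(pt, distance_upper_bound=d + step + 1e-12)
    best = min(best, d)
    prev = pt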

It is also possible, of course, to provide wrappers to query() that do
just one thing; I had this in mind more for fiddling the output
(returning only the distance to the nearest neighbor, for instance).

Anne
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] Proposal: scipy.spatial

2008-09-29 Thread Anne Archibald
Hi,

Once again there has been a thread on the numpy/scipy mailing lists
requesting (essentially) some form of spatial data structure. Pointers
have been posted to ANN (sadly LGPLed and in C++) as well as a handful
of pure-python implementations of kd-trees. I suggest the creation of
a new submodule of scipy, scipy.spatial, to contain spatial data
structures and algorithms. Specifically, I propose it contain a
kd-tree implementation providing nearest-neighbor, approximate
nearest-neighbor, and all-points-near queries. I have a few other
suggestions for things it might contain, but kd-trees seem like a good
first step.

2008/9/27 Nathan Bell [EMAIL PROTECTED]:
 On Sat, Sep 27, 2008 at 11:18 PM, Anne Archibald
 [EMAIL PROTECTED] wrote:

 I think a kd-tree implementation would be a valuable addition to
 scipy, perhaps in a submodule scipy.spatial that might eventually
 contain other spatial data structures and algorithms. What do you
 think? Should we have one? Should it be based on Sturla Molden's code,
 if the license permits? I am willing to contribute one, if not.

 +1

Judging that your vote and mine are enough in the absence of
dissenting voices, I have written an implementation based on yours,
Sturla Molden's, and the algorithms described by the authors of the
ANN library. Before integrating it into scipy, though, I'd like to
send it around for comments.

Particular issues:

* It's pure python right now, but with some effort it could be
partially or completely cythonized. This is probably a good idea in
the long run. In the meantime I have crudely vectorized it so that
users can at least avoid looping in their own code.
* It is able to return the r nearest neighbors, optionally subject to
a maximum distance limit, as well as approximate nearest neighbors.
* It is not currently set up to conveniently return all points within
some fixed distance of the target point, but this could easily be
added.
* It returns distances and indices of nearest neighbors in the original array.
* The testing code is, frankly, a mess. I need to look into using nose
in a sensible fashion.
* The license is the scipy license.

I am particularly concerned about providing a convenient return
format. The natural return from a query is a list of neighbors, since
it may have variable length (there may not be r elements in the tree,
or you might have supplied a maximum distance which doesn't contain r
points). For a single query, it's simple to return a python list
(should it be sorted? currently it's a heap). But if you want to
vectorize the process, storing the results in an array becomes
cumbersome. One option is an object array full of lists; another,
which I currently use, is an array with nonexistent points marked
with an infinite distance and an invalid index. A third would be to
return masked arrays. How do you recommend handling this variable (or
potentially-variable) sized output?

 If you're implementing one, I would highly recommend the
 left-balanced kd-tree.
 http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/2535/pdf/imm2535.pdf

Research suggests that at least in high dimension a more geometric
balancing criterion can produce vastly better results. I used the
sliding midpoint rule, which doesn't allow such a numerical
balancing but does ensure that you don't have too many long skinny
cells (since a sphere tends to cut very many of these).

I've also been thinking about what else would go in scipy.spatial. I
think it would be valuable to have a reasonably efficient distance
matrix function (we seem to get that question a lot, and the answer's
not trivial) as well as a sparse distance matrix function based on the
kd-trees. The module could potentially also swallow the (currently
sandboxed?) delaunay code.

Anne
# Copyright Anne M. Archibald 2008
# Released under the scipy license
import numpy as np
from heapq import heappush, heappop

def distance_p(x,y,p=2):
    if p==np.inf:
        return np.amax(np.abs(y-x),axis=-1)
    elif p==1:
        return np.sum(np.abs(y-x),axis=-1)
    else:
        return np.sum(np.abs(y-x)**p,axis=-1)

def distance(x,y,p=2):
    if p==np.inf or p==1:
        return distance_p(x,y,p)
    else:
        return distance_p(x,y,p)**(1./p)


class KDTree(object):
    """kd-tree for quick nearest-neighbor lookup

    This class provides an index into a set of k-dimensional points
    which can be used to rapidly look up the nearest neighbors of any
    point.

    The algorithm used is described in Maneewongvatana and Mount 1999.
    The general idea is that the kd-tree is a binary trie, each of whose
    nodes represents an axis-aligned hyperrectangle. Each node specifies
    an axis and splits the set of points based on whether their coordinate
    along that axis is greater than or less than a particular value.

    During construction, the axis and splitting point are chosen by the
    "sliding midpoint" rule, which ensures that the cells do not all
    become long and thin.

    The tree

Re: [Numpy-discussion] Are there command similar as Matlab find command?

2008-09-29 Thread Anne Archibald
2008/9/30 frank wang [EMAIL PROTECTED]:

 I really do not know the difference of debug mode and the pdb debugger. To
 me, it seems that pdb is only way to debug the python code. How do the
 expert of numpy/python debug their code? Are there any more efficient way to
 debug the code in python world? I used to use matlab which comes with a nice
 debugger. But now I really want to try the open source software and I found
 python/numpy is a nice tool. The only lagging piece of python/numpy
 comparing with matlab is the powerful debuggign capability.

I think you will find that the debugger, pdb, is a good analog to
matlab's debugging mode. The way I usually debug a piece of code looks
something like:

[in window A] edit foomodule.py to add a function frobnicate()
[in window B, which has a running ipython session]
>>> import foomodule
>>> reload(foomodule)
>>> frobnicate()
some exception happens
>>> %pdb
>>> frobnicate()
exception happens again
(pdb) print some_variable
47
[in window A] wait a minute, some_variable should be 13 at this
point... edit foomodule.py to fix the bug
[in window B]
>>> reload(foomodule)
>>> frobnicate()
21


Alternatively, I write a test_foomodule.py, which contains a suite of
unit tests. I run these tests and get dropped into a debugger when one
fails, and I try to track it down.

The main problem you ran into is that the debugger is not just any old
python prompt. It has its own commands. It also, confusingly, will
accept python commands, as long as they don't conflict with an
internal command. I got into the habit of, when I wanted to run some
python thing, writing
(pdb) p 17+13
rather than just
(pdb) 17+13
If you had simply written
(pdb) p where(a>13)
everything would have been fine.

Anne
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] nonuniform scatter operations

2008-09-28 Thread Anne Archibald
2008/9/28 Geoffrey Irving [EMAIL PROTECTED]:

 Is there an efficient way to implement a nonuniform gather operation
 in numpy?  Specifically, I want to do something like

 n,m = 100,1000
 X = random.uniform(size=n)
 K = random.randint(n, size=m)
 Y = random.uniform(size=m)

 for k,y in zip(K,Y):
     X[k] += y

 but I want it to be fast.  The naive attempt X[K] += Y does not
 work, since the slice assumes the indices don't repeat.

I believe histogram can be persuaded to do this.
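
Something along these lines (untested; I may be misremembering the
weights argument):

import numpy as np

n, m = 100, 1000
X = np.random.uniform(size=n)
K = np.random.randint(n, size=m)
Y = np.random.uniform(size=m)

# sum the Y values whose index falls in each bin [k, k+1)
sums, edges = np.histogram(K, bins=np.arange(n+1), weights=Y)
X += sums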

Anne
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Minimum distance between 2 paths in 3D

2008-09-27 Thread Anne Archibald
2008/9/27 Andrea Gavana [EMAIL PROTECTED]:

I was wondering if someone had any suggestions/references/snippets
 of code on how to find the minimum distance between 2 paths in 3D.
 Basically, for every path, I have I series of points (x, y, z) and I
 would like to know if there is a better way, other than comparing
 point by point the 2 paths, to find the minimum distance between them.

If you treat the points as simply two clouds of points (that is,
ignore the fact that they represent points on paths), spatial data
structures can make this kind of question faster. For example a
kd-tree can be constructed in something like O(n log n) time, and you
can answer questions like what is the closest point in this set to
(a,b,c)? in something like O(log n) time. So you could do this by
putting one set of points in a kd-tree and just running queries
against each point in the other. There exist other spatial data
structures - octrees, voxel grids, bounding-volume hierarchies, other
binary space partitioning trees, as well as various more specialized
ones. As the number of dimensions becomes large they become
substantially more difficult to work with, and you do need to balance
construction time against lookup time, but they are definitely worth
thinking about in your problem. Sadly, no such data structure exists
in either numpy or scipy, though biopython apparently has an
implementation of kd-trees.
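
For what it's worth, an untested sketch using such a kd-tree (here the
scipy one that eventually grew out of these discussions; biopython's
would look similar):

import numpy as np
from scipy.spatial import cKDTree    # any kd-tree with a query() method would do

path1 = np.random.randn(1000, 3)     # points along the first path
path2 = np.random.randn(2000, 3)     # points along the second path

tree = cKDTree(path1)
dists, idx = tree.query(path2)       # nearest path1 point for every path2 point
min_dist = dists.min()               # minimum point-to-point distance between the paths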


Good luck,
Anne
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Minimum distance between 2 paths in 3D

2008-09-27 Thread Anne Archibald
2008/9/27 Robert Kern [EMAIL PROTECTED]:
 On Sat, Sep 27, 2008 at 15:23, Anne Archibald [EMAIL PROTECTED] wrote:
 2008/9/27 Andrea Gavana [EMAIL PROTECTED]:

I was wondering if someone had any suggestions/references/snippets
 of code on how to find the minimum distance between 2 paths in 3D.
 Basically, for every path, I have I series of points (x, y, z) and I
 would like to know if there is a better way, other than comparing
 point by point the 2 paths, to find the minimum distance between them.

 If you treat the points as simply two clouds of points (that is,
 ignore the fact that they represent points on paths), spatial data
 structures can make this kind of question faster. For example a
 kd-tree can be constructed in something like O(n log n) time, and you
 can answer questions like what is the closest point in this set to
 (a,b,c)? in something like O(log n) time. So you could do this by
 putting one set of points in a kd-tree and just running queries
 against each point in the other. There exist other spatial data
 structures - octrees, voxel grids, bounding-volume hierarchies, other
 binary space partitioning trees, as well as various more specialized
 ones. As the number of dimensions becomes large they become
 substantially more difficult to work with, and you do need to balance
 construction time against lookup time, but they are definitely worth
 thinking about in your problem. Sadly, no such data structure exists
 in either numpy or scipy, though biopython apparently has an
 implementation of kd-trees.

 Sturla Molden has just contributed a pure Python+numpy implementation
 on the Scipy Cookbook.

  http://www.scipy.org/Cookbook/KDTree

I think a kd-tree implementation would be a valuable addition to
scipy, perhaps in a submodule scipy.spatial that might eventually
contain other spatial data structures and algorithms. What do you
think? Should we have one? Should it be based on Sturla Molden's code,
if the license permits? I am willing to contribute one, if not.

Anne
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Proper NaN handling in max and co: a redux

2008-09-26 Thread Anne Archibald
2008/9/26 David Cournapeau [EMAIL PROTECTED]:
 Charles R Harris wrote:

 I'm also wondering about the sign ufunc. It should probably return nan
 for nans, but -1,0,1 are the current values. We also need to decide
 which end of the sorted values the nans should go to. I'm a bit
 partial to the beginning but the end would be fine with me, it might
 even be a bit more natural as searchsorted would find the beginning of
 the nans by default.

I would think sign should return NaN (does it not now?) unless its
return type is integer, in which case I can't see a better answer than
raising an exception (we certainly don't want it silently swallowing
NaNs).

 Note that in both suggest approaches, sort with NaN would raise an
 error. We could then provide a nansort, which ignore the NaN, and deal
 with your case; would it be hard to have an option for the beginning vs
 the end, or would that be difficult ?

Is it really a good idea to duplicate the maskedarray sorting code?

Anne
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Standard functions (z-score) on nan (again)

2008-09-25 Thread Anne Archibald
2008/9/25 Peter Saffrey [EMAIL PROTECTED]:
 I've bodged my way through my median problems (see previous postings). Now I
 need to take a z-score of an array that might contain nans. At the moment, if
 the array, which is 7000 elements, contains 1 nan or more, all the results 
 come
 out as nan.

 My other problem is that my array is indexed from somewhere else - position 0
 holds the value for a particular id, position 1 holds the value for a 
 particular
 id and so on, so if I remove the nans, all the indices are messed up. Can
 anybody suggest a sensible fix for this problem?

 I know I should really be using masked arrays, but I can't seem to find a
 masked-array version of zs. Is there one?

Just keep an array of indices:

c = ~np.isnan(A)

Aclean = A[c]
indices = np.arange(len(A))[c]

Then the original index of Aclean[i] is indices[i].

Anne
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Medians that ignore values

2008-09-20 Thread Anne Archibald
2008/9/19 Eric Firing [EMAIL PROTECTED]:
 Pierre GM wrote:

 It seems to me that there are pragmatic reasons
 why people work with NaNs for missing values,
 that perhaps shd not be dismissed so quickly.
 But maybe I am overlooking a simple solution.

 nansomething solutions tend to be considerably faster, that might be one
 reason. A lack of visibility of numpy.ma could be a second. In any case, I
 can't but agree with other posters: a NaN in an array usually means something
 went astray.

 Additional reasons for using nans:

 1) years of experience with Matlab, in which using nan for missing
 values is the standard idiom.

Users are already retraining to use zero-based indexing; I don't think
asking them to use a full-featured masked array package is an
unreasonable retraining burden, particularly since this idiom breaks
as soon as they want to work with arrays of integers or records.

 2) convenient interfacing with extension code in C or C++.

 The latter is a factor in the present use of nan in matplotlib; using
 nan for missing values in an array passed into extension code saves
 having to pass and process a second (mask) array.  It is fast and simple.

How hard is it to pass an array where the masked values have been
filled with nans? It's certainly easy to go the other way (mask all
nans). I think this is less painful than supporting two
differently-featured sets of functions for dealing with arrays
containing some invalid values.

Anne
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Medians that ignore values

2008-09-19 Thread Anne Archibald
2008/9/19 David Cournapeau [EMAIL PROTECTED]:
 Anne Archibald wrote:

 That was in amax/amin. Pretty much every other function that does
 comparisons needs to be fixed to work with nans. In some cases it's
 not even clear how: where should a sort put the nans in an array?

 The problem is more on how the functions use sort than sort itself in
 the case of median. There can't be a 'good' way to put nan in sort, for
 example, since nans cannot be ordered.

Well, for example, you might ask that all the non-nan elements be in
order, even if you don't specify where the nan goes.

 I don't know about the best strategy: either we fix every function using
 comparison, handling nan as a special case as you mentioned, or there
 may be a more clever thing to do to avoid special casing everywhere. I
 don't have a clear idea of how many functions rely on ordering in numpy.

You can always just set numpy to raise an exception whenever it comes
across a nan. In fact, apart from the difficulty of correctly frobbing
numpy's floating-point handling, how reasonable is it for (say) median
to just run as it is now, but if an exception is thrown, fall back to
a nan-aware version?

Anne
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Medians that ignore values

2008-09-19 Thread Anne Archibald
2008/9/19 Pierre GM [EMAIL PROTECTED]:
 On Friday 19 September 2008 03:11:05 David Cournapeau wrote:

 Hm, I am always puzzled when I think about nan handling :) It always
 seems there is no good answer.

 Which is why we have masked arrays, of course ;)

I think the numpy attitude to nans should be that they are unexpected
bogus values that signify that something went wrong with the
calculation somewhere. They can be left in place for most operations,
but any operation that depends on the value should (ideally) return
nan, or failing that, raise an exception. (If users want exceptions
all the time, that's what seterr is for.) If people want to flag bad
data, let's tell them to use masked arrays.

So by this rule amax/maximum/mean/median should all return nan when
there's a nan in their input; I don't think it's reasonable for sort
to return an array full of nans, so I think its default behaviour
should be to raise an exception if there's a nan. It's valuable (for
example in median) to be able to sort them all to the end, but I don't
think this should be the default. If people want nanmin, I would be
tempted to tell them to use masked arrays (is there a convenience
function that makes a masked array with a mask everywhere the data is
nan?).
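
If I remember right, numpy.ma.masked_invalid does more or less this;
an untested example:

import numpy as np

a = np.array([1.0, np.nan, 3.0, 2.0])
am = np.ma.masked_invalid(a)    # masked wherever the data is nan (or inf)
am.min()                        # 1.0, with the nan simply ignored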

I am assuming that appropriate masked sort/amax/maximum/mean/median
exist already. They're definitely needed, so how much effort is it
worth putting in to duplicate that functionality with nans instead of
masked elements?

Anne
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Generating random samples without repeats

2008-09-19 Thread Anne Archibald
2008/9/19 Paul Moore [EMAIL PROTECTED]:
 Robert Kern robert.kern at gmail.com writes:
 On Thu, Sep 18, 2008 at 16:55, Paul Moore pf_moore at yahoo.co.uk wrote:
  I want to generate a series of random samples, to do simulations based
  on them. Essentially, I want to be able to produce a SAMPLESIZE * N
  matrix, where each row of N values consists of either
 
  1. Integers between 1 and M (simulating M rolls of an N-sided die), or
  2. A sample of N numbers between 1 and M without repeats (simulating
 deals of N cards from an M-card deck).
 
  Example (1) is easy, numpy.random.random_integers(1, M, (SAMPLESIZE, N))
 
  But I can't find an obvious equivalent for (2). Am I missing something
  glaringly obvious? I'm using numpy - is there maybe something in scipy I
  should be looking at?

 numpy.array([(numpy.random.permutation(M) + 1)[:N]
 for i in range(SAMPLESIZE)])


 Thanks.

 And yet, this takes over 70s and peaks at around 400M memory use, whereas the
 equivalent for (1)

 numpy.random.random_integers(1,M,(SAMPLESIZE,N))

 takes less than half a second, and negligible working memory (both end up
 allocating an array of the same size, but your suggestion consumes temporary
 working memory - I suspect, but can't prove, that the time taken comes from
 memory allocations rather than computation.)

 As a one-off cost initialising my data, it's not a disaster, but I anticipate
 using idioms like this later in my calculations as well, where the costs could
 hurt more.

 If I'm going to need to write C code, are there any good examples of this? (I
 guess the source for numpy.random is a good place to start).

This was discussed on one of the mailing lists several months ago. It
turns out that there is no simple way to efficiently choose without
replacement in numpy/scipy. I posted a hack that does this somewhat
efficiently (if SAMPLESIZE > M/2, choose the first SAMPLESIZE of a
permutation; if SAMPLESIZE < M/2, choose with replacement and redraw any
duplicates) but it's not vectorized across many sample sets. Is your
problem large M or large N? what is SAMPLESIZE/M?
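
For reference, a rough, untested sketch of the redraw-duplicates half,
for a single draw of n distinct values from range(M):

import numpy as np

def sample_without_replacement(M, n):
    # draw n distinct integers from 0..M-1; efficient when n is much
    # smaller than M, since duplicates are then rare
    chosen = np.unique(np.random.randint(M, size=n))
    while len(chosen) < n:
        extra = np.random.randint(M, size=n - len(chosen))
        chosen = np.unique(np.concatenate((chosen, extra)))
    return np.random.permutation(chosen)   # undo the sorting that unique() does

Add 1 to the result if you want values between 1 and M.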

Anne
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Medians that ignore values

2008-09-19 Thread Anne Archibald
2008/9/19 David Cournapeau [EMAIL PROTECTED]:

 I guess my formulation was poor: I never use NaN as missing values
 because I never use missing values, which is why I wanted the opinion of
 people who use NaN in a different manner (because I don't have a good
 idea on how those people would like to see numpy behave). I was
 certainly not arguing they should not be use for the purpose of missing
 value.

I, on the other hand, was making specifically that suggestion: users
should not use nans to indicate missing values. Users should use
masked arrays to indicate missing values.

 The problem with NaN is that you cannot mix the missing value behavior
 and the error behavior. Dealing with them in a consistent manner is
 difficult. Because numpy is a general numerical computation tool, I
 think that NaN should be propagated and never ignored *by default*. If
 you have NaN because of divide by 0, etc... it should not be ignored at
 all. But if you want it to ignore, then numpy should make it possible:

- max, min: should return NaN if NaN is in the array, or maybe even
 fail ?
- argmax, argmin ?
- sort: should fail ?
- mean, std, variance: should return NaN
- median: should fail (to be consistent if sort fails) ? Should
 return NaN ?

This part I pretty much agree with.

 We could then add an argument to failing functions to tell them either
 to ignore NaN/put them at some special location (like R does, for
 example). The ones I am not sure are median and argmax/argmin. For
 median, failing when sort does is consistent; but this can break a lot
 of code. For argmin/argmax, failing is the most logical, but OTOH,
 making argmin/argmax failing and not max/min is not consistent either.
 Breaking the code is maybe not that bad because currently, neither
 max/min nor argmax/argmin nor sort does return a meaningful function.
 Does that sound reasonable to you ?

The problem with this approach is that all those decisions need to be
made and all that code needs to be implemented for masked arrays. In
fact I suspect that it has already been done in that case. So really
what you are suggesting here is that we duplicate all this effort to
implement the same functions for nans as we have for masked arrays.
It's important, too, that the masked array implementation and the nan
implementation behave the same way, or users will become badly
confused. Who gets the task of keeping the two implementations in
sync?

The current situation is that numpy has two ways to indicate bad data
for floating-point arrays: nans and masked arrays. We can't get rid of
either: nans appear on their own, and masked arrays are the only way
to mark bad data in non-floating-point arrays. We can try to make them
behave the same, which will be a lot of work to provide redundant
capabilities. Or we can make them behave drastically differently.
Masked arrays clearly need to be able to handle masked values flexibly
and explicitly. So I think nans should be handled simply and
conservatively: propagate them if possible, raise if not.

If users are concerned about performance, it's worth noting that on
some machines nans force a fallback to software floating-point
handling, with a corresponding very large performance hit. This
includes some but not all x86 (and I think x86-64) CPUs. How this
compares to the performance of masked arrays is not clear.

Anne
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Medians that ignore values

2008-09-18 Thread Anne Archibald
2008/9/18 David Cournapeau [EMAIL PROTECTED]:
 Peter Saffrey wrote:

 Is this the correct behavior for median with nan?

 That's the expected behavior, at least :) (this is also the expected
 behavior of most math packages I know, including matlab and R, so this
 should not be too surprising if you have used those).

I don't think I agree:

In [4]: np.median([1,3,nan])
Out[4]: 3.0

In [5]: np.median([1,nan,3])
Out[5]: nan

In [6]: np.median([nan,1,3])
Out[6]: 1.0

I think the expected behaviour would be for all of these to return nan.

Anne
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] subsampling array without loops

2008-08-25 Thread Anne Archibald
2008/8/22 Catherine Moroney [EMAIL PROTECTED]:
 I'm looking for a way to acccomplish the following task without lots
 of loops involved, which are really slowing down my code.

 I have a 128x512 array which I want to break down into 2x2 squares.
 Then, for each 2x2 square I want to do some simple calculations
 such as finding the maximum value, which I then store in a 64x256
 array.

You should be able to do some of this with reshape and transpose:
In [1]: import numpy as np

In [3]: A = np.zeros((128,512))

In [4]: B = np.reshape(A,(64,2,256,2))

In [5]: C = np.transpose(B,(0,2,1,3))

In [6]: C.shape
Out[6]: (64, 256, 2, 2)

Now C[i,j] is the 2 by 2 array
[ [A[2*i,2*j], A[2*i,2*j+1]],
  [A[2*i+1,2*j], A[2*i+1, 2*j+1]] ]
(or maybe I got those indices the wrong way around.)

Now most operations you want to do on the two-by-two matrices can be
done, using ufuncs, on the whole array at once. For example if you
wanted the max of each two-by-two matrix, you'd write:

In [7]: D = np.amax(np.amax(C,axis=-1),axis=-1)

In [8]: D.shape
Out[8]: (64, 256)

Anne
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] doing zillions of 3x3 determinants fast

2008-08-24 Thread Anne Archibald
2008/8/25 Daniel Lenski [EMAIL PROTECTED]:
 On Mon, 25 Aug 2008 03:48:54 +, Daniel Lenski wrote:
   * it's fast enough for 100,000 determinants, but it bogs due to
 all the temporary arrays when I try to do 1,000,000 determinants
 (=72 MB array)

 I've managed to reduce the memory usage significantly by getting the
 number of temporary arrays down to exactly two:

 def det3(ar):
     a=ar[...,0,0]; b=ar[...,0,1]; c=ar[...,0,2]
     d=ar[...,1,0]; e=ar[...,1,1]; f=ar[...,1,2]
     g=ar[...,2,0]; h=ar[...,2,1]; i=ar[...,2,2]

     t=a.copy(); t*=e; t*=i; tot =t
     t=b.copy(); t*=f; t*=g; tot+=t
     t=c.copy(); t*=d; t*=h; tot+=t
     t=g.copy(); t*=e; t*=c; tot-=t
     t=h.copy(); t*=f; t*=a; tot-=t
     t=i.copy(); t*=d; t*=b; tot-=t

     return tot

 Now it runs very fast with 1,000,000 determinants to do (10X the time
 required to do 100,000) but I'm still worried about the stability with
 real-world data.

As far as I know, that's the best you can do (though come to think of
it, a 3D determinant is a cross-product followed by a dot product, so
you might be able to cheat and use some built-in routines). It's a
current limitation of numpy that there is not much support for doing
many linear algebra problems. We do have perennial discussions about
improving support for array-of-matrices calculations, and in fact
currently in numpy SVN is code to allow C code doing a 3x3 determinant
to be easily made into a generalized ufunc that does ufunc-like
broadcasting and iteration over all but the last two axes.
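
The cross-product cheat would look something like this (untested, and
it may or may not beat the hand-rolled version above on memory):

import numpy as np

def det3_cross(ar):
    # ar has shape (..., 3, 3); det = row0 . (row1 x row2)
    return np.sum(ar[..., 0, :] * np.cross(ar[..., 1, :], ar[..., 2, :]), axis=-1)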

Anne
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Generalized ufuncs?

2008-08-17 Thread Anne Archibald
2008/8/17 Robert Kern [EMAIL PROTECTED]:

 I suggested that we move it to a branch for the time being so we can
 play with it and come up with examples of its use. If you have
 examples that you have already written, I would love to see them. I,
 for one, am amenable to seeing this in 1.2.0, but only if we push back
 the release by at least a few weeks.

This is not a worked example, but this is exactly what is needed to
make possible the arrays of matrices functions that were discussed
some time ago. For example, users frequently want to do something like
multiply an array of matrices by an array of matrices; that is not
currently feasible without a very large temporary array (or a loop).

But I think you were looking for examples of code using the interface,
to see whether it worked well.

Anne
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NumPy 1.2.0b2 released

2008-08-15 Thread Anne Archibald
2008/8/15 Andrew Dalke [EMAIL PROTECTED]:
 On Aug 15, 2008, at 6:41 PM, Andrew Dalke wrote:
 I don't think it's enough.  I don't like environmental
 variable tricks like that.  My tests suggest:
current SVN: 0.12 seconds
my patch: 0.10 seconds
removing some top-level imports: 0.09 seconds
my patch and removing some
   additional top-level imports: 0.08 seconds (this is a guess)


 First, I reverted my patch, so my import times went from
 0.10 second to 0.12 seconds.

 Turns out I didn't revert everything.

 As of the SVN version from 10 minutes ago, import numpy on my
 machine takes 0.18 seconds, not 0.12 seconds.  My patch should
 cut the import time by about 30-40% more from what it is.
 On some machines.  Your milage may vary :)

I realize this is already a very complicated issue, but it's worth
pointing out that the times you measure are not necessarily the times
users care about. These numbers are once everything is loaded into
disk cache. They don't reflect, say, interactive startup time, or time
it takes in a script that uses substantial disk access (i.e. which
fills the cache with something else). I realize this is the only
available basis for comparison, but do keep in mind that improvements
of a few milliseconds here may make a much larger difference in
practice - or a much smaller difference.

Anne
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] min() of array containing NaN

2008-08-14 Thread Anne Archibald
2008/8/14 Norbert Nemec [EMAIL PROTECTED]:
 Travis E. Oliphant wrote:
 NAN's don't play well with comparisons because comparison with them is
 undefined.See  numpy.nanmin

 This is not true! Each single comparison with a NaN has a well defined
 outcome. The difficulty is only that certain logical assumptions do not
 hold any more when NaNs are involved (e.g. [A<B] is not equivalent with
 [not(A>=B)]). Assuming an IEEE compliant processor and C compiler, it
 should be possible to code a NaN safe min routine without additional
 overhead.

Sadly, it's not possible without extra overhead. Specifically: the
NaN-ignorant implementation does a single comparison between each
array element and a placeholder, and decides based on the result which
to keep. If you try to rewrite the comparison to do the right thing
when a NaN is involved, you get stuck: any comparison with a NaN on
either side always returns False, so you cannot distinguish between
the temporary being a NaN and the new element being a non-NaN (keep
the temporary) and the temporary being a non-NaN and the new element
being a NaN (replace the temporary). If you're willing to do two
tests, sure, but that's overhead (and probably comparable to an
isnan). If you're willing to do arithmetic you might even be able to
pull it off, since NaNs tend to propagate:
if (new<min) min -= (min-new);
Whether the speed of this is worth its impenetrability I couldn't say.

What do compilers' min builtins do with NaNs? This might well be
faster than an if statement even in the absence of NaNs...

Anne
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] min() of array containing NaN

2008-08-12 Thread Anne Archibald
2008/8/12 Stéfan van der Walt [EMAIL PROTECTED]:
 Hi Andrew

 2008/8/12 Andrew Dalke [EMAIL PROTECTED]:
 This is buggy for the case of a list containing only NaNs.

  >>> import numpy as np
  >>> np.NAN
  nan
  >>> np.min([np.NAN])
  nan
  >>> np.nanmin([np.NAN])
  inf
  >>>

 Thanks for the report.  This should be fixed in r5630.

Er, is this actually a bug? I would instead consider the fact that
np.min([]) raises an exception a bug of sorts - the identity of min is
inf. Really nanmin of an array containing only nans should be the same
as an empty array; both should be infinity.

I guess this is a problem for types that don't have an infinity
(returning maxint is a poor substitute), but what is the correct
behaviour here?

Anne
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] min() of array containing NaN

2008-08-12 Thread Anne Archibald
2008/8/12 Joe Harrington [EMAIL PROTECTED]:

 So, I endorse extending min() and all other statistical routines to
 handle NaNs, possibly with a switch to turn it on if a suitably fast
 algorithm cannot be found (which is competitor IDL's solution).
 Certainly without a switch the default behavior should be to return
 NaN, not to return some random value, if a NaN is present.  Otherwise
 the user may never know a NaN is present, and therefore has to check
 every use for NaNs.  That constant manual NaN checking is slower and
 more error-prone than any numerical speed advantage.

 So to sum, proposed for statistical routines:
 if NaN is not present, return value
 if NaN is present, return NaN
 if NaN is present and nan=True, return value ignoring all NaNs

 OR:
 if NaN is not present, return value
 if NaN is present, return value ignoring all NaNs
 if NaN is present and nan=True, return NaN

 I'd prefer the latter.  IDL does the former and it is a pain to do
 /nan all the time.  However, the latter might trip up the unwary,
 whereas the former never does.

 This would apply at least to:
 min
 max
 sum
 prod
 mean
 median
 std
 and possibly many others.

For almost all of these the current behaviour is to propagate NaNs
arithmetically. For example, the sum of anything with a NaN is NaN. I
think this is perfectly sufficient, given how easy it is to strip out
NaNs if that's what you want. The issue that started this thread (and
the many other threads that have come up as users stub their toes on
this behaviour) is that min (and other functions based on comparisons)
do not propagate NaNs. If you do np.amin(A) and A contains NaNs, you
can't count on getting a NaN back, unlike np.mean or np.std. The fact
that you get some random value, not the minimum, just adds insult to
injury. (It is probably also true that the value you get back depends
on how the array is stored in memory.)

It really isn't very hard to replace
np.sum(A)
with
np.sum(A[~np.isnan(A)])
if you want to ignore NaNs instead of propagating them. So I don't
feel a need for special code in sum() that treats NaN as 0. I would be
content if the comparison-based functions propagated NaNs
appropriately.
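
That is, the two behaviours are already easy to get explicitly:

import numpy as np

A = np.array([1.0, np.nan, 3.0])
print(np.sum(A))                  # nan -- arithmetic routines propagate the NaN
print(np.sum(A[~np.isnan(A)]))    # 4.0 -- strip the NaNs yourself if that's what you want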

If you did decide it was essential to make versions of the functions
that removed NaNs, it would get you most of the way there to add an
optional keyword argument to ufuncs' reduce method that skipped NaNs.

Anne
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] confusion on importing numpy

2008-08-06 Thread Anne Archibald
2008/8/6 Eric Firing [EMAIL PROTECTED]:

 While I agree with the other posters that import * is not preferred,
 if you want to use it, the solution is to use amin and amax, which are
 provided precisely to avoid the conflict.  Just as arange is a numpy
 analog of range, amin and amax are numpy analogs of min and max.

There are actually several analogs of min and max, depending on what
you want. np.minimum(a,b) takes the elementwise minimum, that is,
np.minimum([1,2,3],[3,2,1]) is [1,2,1]. np.amin(a) finds the minimum
value in the array a. (It's equivalent to np.minimum.reduce(a).)
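
A quick illustration:

import numpy as np

print(np.minimum([1, 2, 3], [3, 2, 1]))   # [1 2 1] -- elementwise minimum
print(np.amin([1, 2, 3]))                 # 1 -- minimum over the whole array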

One reason automatically overriding min is a Bad Idea is that you will
certainly shoot yourself in the foot:

In [4]: numpy.min(2,1)
Out[4]: 2

In [5]: min(2,1)
Out[5]: 1

Note that this does not raise any kind of error, and it sometimes
returns the right value. It's especially bad if you do, say,
numpy.max(a,0) and expect the result to be always positive: now you
get a bug in your program that only turns up with exceptional values.
numpy.min is not the same as python's min, and using it as a
substitute will get you in trouble.

Anne
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NumPy date/time types and the resolution concept

2008-07-15 Thread Anne Archibald
2008/7/15 Francesc Alted [EMAIL PROTECTED]:

 Maybe is only that.  But by using the term 'frequency' I tend to think
 that you are expecting to have one entry (observation) in your array
 for each time 'tick' since time start.  OTOH, the term 'resolution'
 doesn't have this implication, and only states the precision of the
 timestamp.

 Well, after reading the mails from Chris and Anne, I think the best is
 that the origin would be kept as an int64 with a resolution of
 microseconds (for compatibility with the ``datetime`` module, as I've
 said before).

A couple of details worth pointing out: we don't need a zillion
resolutions. One that's as good as the world time standards, and one
that spans an adequate length of time should cover it. After all, the
only reason for not using the highest available resolution is if you
want to cover a larger range of times. So there is no real need for
microseconds and milliseconds and seconds and days and weeks and...

There is also no need for the origin to be kept with a resolution as
high as microseconds; seconds would do just fine, since if necessary
it can be interpreted as exactly 7000 seconds after the epoch even
if you are using femtoseconds elsewhere.

Anne
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NumPy date/time types and the resolution concept

2008-07-14 Thread Anne Archibald
2008/7/14 Francesc Alted [EMAIL PROTECTED]:

 After pondering about the opinions about the first proposal, the idea we
 are incubating is to complement the ``datetime64`` with a 'resolution'
 metainfo.  The ``datetime64`` will still be based on a int64 type, but
 the meaning of the 'ticks' would depend on a 'resolution' property.

This is an interesting idea. To be useful, though, you would also need
a flexible offset defining the zero of time. After all, the reason
not to just always use (say) femtosecond accuracy is that 2**64
femtoseconds is only about five hours. So if you're going to use
femtosecond steps, you really want to choose your start point
carefully. (It's also worth noting that there is little need for more
time accuracy than atomic clocks can provide, since anyone looking for
more than that is going to be doing some tricky metrology anyway.)
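
To put rough numbers on the tradeoff (back-of-the-envelope, 2**64 ticks):

for name, tick in [("femtoseconds", 1e-15), ("nanoseconds", 1e-9), ("microseconds", 1e-6)]:
    span = 2.0 ** 64 * tick    # total span covered, in seconds
    print("%s: %.3g hours, %.3g years" % (name, span / 3600.0, span / (3600.0 * 24 * 365.25)))
    # femtoseconds: ~5 hours; nanoseconds: ~585 years; microseconds: ~585000 years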

One might take guidance from the FITS format, which represents (arrays
of) quantities as (usually) fixed-point numbers, but has a global
scale and offset parameter for each array. This allows one to
accurately represent many common arrays with relatively few bits. The
FITS libraries transparently convert these quantities. Of course, this
isn't so convenient if you don't have basic machine datatypes with
enough precision to handle all the quantities of interest.

Anne
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] RFC: A proposal for implementing some date/time types in NumPy

2008-07-11 Thread Anne Archibald
2008/7/11 Pierre GM [EMAIL PROTECTED]:

 A final note on time scales
 ---
 Wow, indeed. In environmental sciences (my side) and in finances (Matt's), we
 very rarely have a need for that precision, thankfully...

We do, sometimes, in pulsar astronomy. But I think it's reasonable to
force us to use our own custom time representations. For example, I've
dealt with time series representing (X-ray) photon arrival times,
expressed in modified Julian days. The effect of representing
microsecond-scale quantities as differences between numbers on the
order of fifty thousand days I leave to your imagination. But 64 bits
was certainly not enough.
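
A quick back-of-the-envelope check of why:

import numpy as np

eps = np.finfo(np.float64).eps          # relative spacing of adjacent doubles, ~2.2e-16
print(50000.0 * eps * 86400.0 * 1e6)    # spacing near MJD 50000, in microseconds: about 1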

More, there are even more subtle effects relating to differences
between TAI and TT, issues relating to whether your times are measured
at the earth's barycenter or the solar system barycenter... A
date/time class that tries to do everything is quickly going to become
unfeasibly complicated.

Anne
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] RFC: A proposal for implementing some date/time types in NumPy

2008-07-11 Thread Anne Archibald
2008/7/11 Jon Wright [EMAIL PROTECTED]:

 Timezones are a heck of a problem if you want to be accurate. You are
 talking about nanosecond resolutions, however, atomic clocks in orbit
 apparently suffer from relativistic corrections of the order 38000
 nanoseconds per day [1]. What will you do about data recorded on the
 international space station? Getting into time formats at this level
 seems to be rather complicated - there is no absolute time you can
 reference to - it is all relative :-)

This particular issue turns up in pulsar timing; if you use X-ray
observations, the satellite's orbit introduces all kinds of time
variations. If you care about this you need to think about using (say)
TAI, referenced to (say) the solar system barycenter (if there were no
mass there). You can do all this, but I think it's out of scope for an
ordinary date/time class.

Anne
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] expected a single-segment buffer object

2008-07-10 Thread Anne Archibald
2008/7/9 Robert Kern [EMAIL PROTECTED]:

 Because that's just what a buffer= argument *is*. It is not a place
 for presenting the starting pointer to exotically-strided memory. Use
 __array_interface__s to describe the full range of representable
 memory. See below.

Aha! Is this stuff documented somewhere?

 I was about a week ahead of you. See numpy/lib/stride_tricks.py in the trunk.

Nice! Unfortunately it can't quite do what I want... for the linear
algebra I need something that can broadcast all but certain axes. For
example, take an array of matrices and an array of vectors. The
array axes need broadcasting, but you can't broadcast on all axes
without (incorrectly) turning the vector into a matrix. I've written a
(messy) implementation, but the corner cases are giving me headaches.
I'll let you know when I have something that works.
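
The effect I'm after looks something like this (spelled here with newaxis
and sum, which wastes memory but shows the shape bookkeeping; the shapes
are made up for illustration):

import numpy as np

matrices = np.random.rand(5, 1, 3, 3)    # an array of 3x3 matrices
vectors = np.random.rand(4, 3)           # an array of 3-vectors
# broadcast the leading axes, do a matrix-vector product over the last axes
products = np.sum(matrices * vectors[..., np.newaxis, :], axis=-1)
print(products.shape)                    # (5, 4, 3)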

Thanks,
Anne
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] expected a single-segment buffer object

2008-07-10 Thread Anne Archibald
2008/7/10 Charles R Harris [EMAIL PROTECTED]:

 On Thu, Jul 10, 2008 at 9:33 AM, Anne Archibald [EMAIL PROTECTED]
 wrote:

 2008/7/9 Robert Kern [EMAIL PROTECTED]:

  Because that's just what a buffer= argument *is*. It is not a place
  for presenting the starting pointer to exotically-strided memory. Use
  __array_interface__s to describe the full range of representable
  memory. See below.

 Aha! Is this stuff documented somewhere?

  I was about a week ahead of you. See numpy/lib/stride_tricks.py in the
  trunk.

 Nice! Unfortunately it can't quite do what I want... for the linear
 algebra I need something that can broadcast all but certain axes. For
 example, take an array of matrices and an array of vectors. The
 array axes need broadcasting, but you can't broadcast on all axes
 without (incorrectly) turning the vector into a matrix. I've written a
 (messy) implementation, but the corner cases are giving me headaches.
 I'll let you know when I have something that works.

 I think something like a matrix/vector dtype would be another way to go for
 stacks of matrices and vectors. It would have to be a user defined type to
 fit into the current type hierarchy for ufuncs, but I think the base
 machinery is there with the generic inner loops.

There's definitely room for discussion about how such a linear algebra
system should ultimately work. In the short term, though, I think it's
valuable to create a prototype system, even if much of the looping has
to be in python, just to figure out how it should look to the user.

For example, I'm not sure whether a matrix/vector dtype would make
easy things like indexing operations - fancy indexing spanning both
array and matrix axes, for example. dtypes can also be a little
impenetrable to use, so even if this is how elementwise linear algebra
is ultimately implemented, we may want to provide a different user
frontend.

My idea for a first sketch was simply to provide (for example)
np.elementwise_linalg.dot_mm, which interprets both its arguments as
arrays of matrices and yields an array of matrices. A second sketch
might be a subclass MatrixArray, which could serve much as Matrix does
now, to tell various functions which axes to do linear algebra over
and which axes to iterate over.
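
For instance, dot_mm might start life as something as simple as this
(an untested sketch of the sort of thing I mean, not an existing numpy
function):

import numpy as np

def dot_mm(a, b):
    # treat a as an array of (n, k) matrices and b as an array of (k, m) matrices;
    # the leading axes broadcast, the last two behave like matrix multiplication
    return np.sum(a[..., :, :, np.newaxis] * b[..., np.newaxis, :, :], axis=-2)

print(dot_mm(np.random.rand(4, 2, 3), np.random.rand(4, 3, 5)).shape)   # (4, 2, 5)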

There's also the question of how to make these efficient; one could of
course write a C wrapper for each linear algebra function that simply
iterates, but your suggestion of using the ufunc machinery with a
matrix dtype might be better. Or, if cython acquires sensible
polymorphism over numpy dtypes, a cython wrapper might be the way to
go. But I think establishing what it should look like to users is a
good first step, and I think that's best done with sample
implementations.

Anne
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] huge array calculation speed

2008-07-10 Thread Anne Archibald
2008/7/10 Dan Lussier [EMAIL PROTECTED]:

 I am relatively new to numpy and am having trouble with the speed of
 a specific array based calculation that I'm trying to do.

 What I'm trying to do is to calculate the total potential
 energy and coordination number of each atom within a relatively large
 simulation.  Each atom is at a position (x,y,z) given by a row in a
 large array (approximately 1e6 by 3) and presently I have no
 information about its nearest neighbours, so each atom's position must be
 checked against all others before cutting the list down prior to
 calculating the energy.

The way you've implemented this goes as the square of the number of
atoms. This is going to absolutely kill performance, and you can spend
weeks trying to optimize the code for a factor of two speedup. I would
look hard for algorithms that do this in less than O(n^2).

This problem of finding the neighbours of an object has seen
substantial research, and there are a variety of well-established
techniques covering many possible situations. My knowledge of it is
far from up-to-date, but since you have only a three-dimensional
problem, you could try a three-dimensional grid (if your atoms aren't
too clumpy) or octrees (if they are somewhat clumpy); kd-trees are
probably overkill (since they're designed for higher-dimensional
problems).

 My current numpy code is below and in it I have tried to push as much
 of the work for this computation into compiled C (ideally) of numpy.
 However it is still taking a very long time at approx 1 second per
 row.  At this speed even some simple speed ups like weave  won't
 really help.

 Are there any other numpy ways that it would be possible to calculate
 this, or should I start looking at going to C/C++?

 I am also left wondering if there is something wrong with my
 installation of numpy (i.e. not making use of lapack/ATLAS).  Other
 than that if it is relevant - I am running 32 bit x86 Centos 5.1
 linux on a dual Xeon 3.0 GHz Dell tower with 4 GB of memory.

Unfortunately, implementing most of the algorithms in the literature
within numpy is going to be fairly cumbersome. But I think there are
better algorithms you could try:

* Put all your atoms in a grid. Ideally the cells would be about twice
the size of your cutoff radius, so that for each atom you need to pull
out all atoms in at most eight cells (the 2x2x2 block nearest it) for
checking against the cutoff. I think this can be done fairly
efficiently in numpy.

* Sort the atoms along a coordinate and use searchsorted to pull out
those for which that coordinate is within the cutoff. This should get
you down to reasonably modest lists fairly quickly (a rough sketch of
this follows below).

There is of course a tradeoff between preprocessing time and lookup time.
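
A rough sketch of the second suggestion, with made-up positions (not
tuned for speed, just to show the idea):

import numpy as np

np.random.seed(0)
pos = np.random.rand(100000, 3) * 50.0     # made-up atom positions in a 50x50x50 box
cutoff = 2.5

order = np.argsort(pos[:, 0])
pos = pos[order]                           # atoms sorted by their x coordinate
x = pos[:, 0]

def neighbours(i):
    # candidates whose x coordinate lies within the cutoff, then an exact distance test
    lo = np.searchsorted(x, x[i] - cutoff, side='left')
    hi = np.searchsorted(x, x[i] + cutoff, side='right')
    d2 = np.sum((pos[lo:hi] - pos[i]) ** 2, axis=1)
    return lo + np.flatnonzero((d2 < cutoff ** 2) & (d2 > 0))

print(len(neighbours(0)))                  # number of atoms within the cutoff of atom 0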

We seem to get quite a few posts from people wanting some kind of
spatial data structure (whether they know it or not). Would it make
sense to come up with some kind of compiled spatial data structure to
include in scipy?

Anne
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] alterdot and restoredot

2008-07-09 Thread Anne Archibald
2008/7/9 Robert Kern [EMAIL PROTECTED]:

 - Which operations do the functions exactly affect?
  It seems that alterdot sets the dot function slot to a BLAS
  version, but what operations does this affect?

 dot(), vdot(), and innerproduct() on C-contiguous arrays which are
 Matrix-Matrix, Matrix-Vector or Vector-Vector products.

Really? Not, say, tensordot()?

Anne
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] element-wise logical operations on numpy arrays

2008-07-09 Thread Anne Archibald
2008/7/9 Catherine Moroney [EMAIL PROTECTED]:

 I have a question about performing element-wise logical operations
 on numpy arrays.

 If a, b and c are numpy arrays of the same size, does the
 following
 syntax work?

 mask = (a > 1.0) & ((b > 3.0) | (c > 10.0))

 It seems to be performing correctly, but the documentation that I've
 read
 indicates that & and | are for bitwise operations, not element-by-
 element operations in arrays.

 I'm trying to avoid using logical_and and logical_or because they
 make the code more cumbersome and difficult to read.  Are & and |
 acceptable substitutes for numpy arrays?

Yes. Unfortunately it is impossible to make python's usual logical
operators, and, or, etcetera, behave correctly on numpy arrays. So
the decision was made to use the bitwise operators to express logical
operations on boolean arrays. If you like, you can think of boolean
arrays as containing single bits, so that the bitwise operators *are*
the logical operators.

Confusing, but I'm afraid there really isn't anything the numpy
developers can do about it, besides write good documentation.
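
So your expression is fine; just keep the parentheses around each
comparison, since & and | bind more tightly than the comparison
operators. For example (made-up values):

import numpy as np

a = np.array([0.5, 1.5, 2.5])
b = np.array([2.0, 4.0, 1.0])
c = np.array([5.0, 20.0, 3.0])
mask = (a > 1.0) & ((b > 3.0) | (c > 10.0))   # parentheses around each comparison are required
print(mask)                                   # [False  True False]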

Good luck,
Anne
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Multiplying every 3 elements by a vector?

2008-07-09 Thread Anne Archibald
2008/7/9 Charles R Harris [EMAIL PROTECTED]:

 On Wed, Jul 9, 2008 at 2:34 PM, Marlin Rowley [EMAIL PROTECTED]
 wrote:

 Thanks Chuck, but I wasn't quite clear with my question.

 You answered exactly according to what I asked, but I failed to mention
 needing the dot product instead of just the product.

 So,

 v dot A = v'

 v'[0] = v[0]*A[0] + v[1]*A[1] + v[2]*A[2]
 v'[1] = v[0]*A[3] + v[1]*A[4] + v[2]*A[5]
 v'[2] = v[0]*A[6] + v[1]*A[7] + v[2]*A[8]


 There is no built in method for this specific problem (stacks of vectors and
 matrices), but you can make things work:

 sum(A.reshape((-1,3))*v, axis=1)

 You can do lots of interesting things using such manipulations and newaxis.
 For instance, multiplying stacks of matrices by stacks of matrices etc. I
 put up a post of such things once if you are interested.

This particular instance can be viewed as a matrix multiplication
(np.dot(A.reshape((-1,3)),v) I think).
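
A quick check with made-up numbers:

import numpy as np

A = np.arange(9.0)               # a 3x3 matrix stored as a flat array
v = np.array([1.0, 2.0, 3.0])
print(np.sum(A.reshape((-1, 3)) * v, axis=1))   # [  8.  26.  44.]
print(np.dot(A.reshape((-1, 3)), v))            # the same thing as a matrix-vector product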

Anne
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] expected a single-segment buffer object

2008-07-09 Thread Anne Archibald
Hi,

When trying to construct an ndarray, I sometimes run into the
more-or-less mystifying error expected a single-segment buffer
object:

Out[54]: (0, 16, 8)
In [55]: A=np.zeros(2); A=A[np.newaxis,...];
np.ndarray(strides=A.strides,shape=A.shape,buffer=A,dtype=A.dtype)
---
<type 'exceptions.TypeError'>            Traceback (most recent call last)

/home/peridot/<ipython console> in <module>()

<type 'exceptions.TypeError'>: expected a single-segment buffer object

In [56]: A.strides
Out[56]: (0, 8)

That is, when I try to construct an ndarray based on an array with a
zero stride, I get this mystifying error. Zero-strided arrays appear
naturally when one uses newaxis, but they are valuable in their own
right (for example for broadcasting purposes). So it's a bit awkward
to have this error appearing when one tries to feed them to
ndarray.__new__ as a buffer. I can, I think, work around it by
removing all axes with stride zero:

def bufferize(A):
    idx = []
    for v in A.strides:
        if v == 0:
            idx.append(0)
        else:
            idx.append(slice(None, None, None))
    return A[tuple(idx)]
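
With that workaround the construction above should go through, something
like (untested):

import numpy as np

A = np.zeros(2)[np.newaxis, ...]          # shape (1, 2), strides (0, 8)
B = bufferize(A)                          # contiguous, so acceptable as buffer=
C = np.ndarray(strides=A.strides, shape=A.shape, buffer=B, dtype=A.dtype)
print(C.strides)                          # (0, 8)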

Is there any reason for this restriction?

Thanks,
Anne
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] expected a single-segment buffer object

2008-07-09 Thread Anne Archibald
2008/7/9 Robert Kern [EMAIL PROTECTED]:

 Yes, the buffer interface, at least the subset that ndarray()
 consumes, requires that all of the data be contiguous in memory.
 array_as_buffer() checks for that using PyArray_ISONESEGMENT(), which
 looks like this:

 #define PyArray_ISONESEGMENT(m) (PyArray_NDIM(m) == 0 ||                \
                                  PyArray_CHKFLAGS(m, NPY_CONTIGUOUS) || \
                                  PyArray_CHKFLAGS(m, NPY_FORTRAN))

 Trying to get a buffer object from anything that is neither C- or
 Fortran-contiguous will fail. E.g.

 In [1]: from numpy import *

 In [2]: A = arange(10)

 In [3]: B = A[::2]

 In [4]: ndarray(strides=B.strides, shape=B.shape, buffer=B, dtype=B.dtype)
 ---
 TypeError Traceback (most recent call last)

 /Users/rkern/today/<ipython console> in <module>()

 TypeError: expected a single-segment buffer object

Is this really necessary? What does making this restriction gain? It
certainly means that many arrays whose storage is a contiguous block
of memory can still not be used (just permute the axes of a 3d array,
say; it may even be possible for an array to be in C contiguous order
but for the flag not to be set), but how is one to construct exotic
slices of an array that is strided in memory? (The real part of a
complex array, say.)

I suppose one could follow the linked list of .bases up to the
original ndarray, which should normally be C- or Fortran-contiguous,
then work out the offset, but even this may not always work: what if
the original array was constructed with non-C-contiguous strides from
some preexisting buffer?

If the concern is that this allows users to shoot themselves in the
foot, it's worth noting that even with the current setup you can
easily fabricate strides and shapes that go outside the allocated part
of memory.

 What is the use case, here? One rarely has to use the ndarray
 constructor by itself. For example, the result you seem to want from
 the call you make above can be done just fine with .view().

I was presenting a simple example. I was actually trying to use
zero-strided arrays to implement broadcasting.  The code was rather
long, but essentially what it was meant to do was generate a view of
an array in which an axis of length one had been replaced by an axis
of length m with stride zero. (The point of all this was to create a
class like vectorize that was suitable for use on, for example,
np.linalg.inv().) But I also ran into this problem while writing
segmentaxis.py, the code to produce a matrix of sliding windows.
(See http://www.scipy.org/Cookbook/SegmentAxis .) There I caught the
exception and copied the array (unnecessarily) if this came up.
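
For what it's worth, the kind of view I was after can be spelled with
numpy.lib.stride_tricks.as_strided, something like:

import numpy as np
from numpy.lib.stride_tricks import as_strided

A = np.zeros((4, 1, 3))
m = 5
# replace the length-1 axis by a length-m axis with stride 0: a broadcast view, no copy
B = as_strided(A, shape=(4, m, 3), strides=(A.strides[0], 0, A.strides[2]))
print(B.shape, B.strides)                 # (4, 5, 3) (24, 0, 8)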

Anne
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] chararray behavior

2008-07-08 Thread Anne Archibald
2008/7/8 Alan McIntyre [EMAIL PROTECTED]:
 On Tue, Jul 8, 2008 at 1:29 PM, Travis E. Oliphant
 [EMAIL PROTECTED] wrote:
 Alan McIntyre wrote:
 2. The behavior of __mul__ seems odd:

 What is odd about this?

 It is patterned after

   'a' * 3
   'a' * 4
   'a' * 5

 for regular python strings.

 That's what I would have expected, but for N >= 4, Q*N is the same as Q*4.

In particular, the returned type is always a string of length four,
which is very peculiar - why four?  I realize that variable-length
strings are a problem (object arrays, I guess?), as is returning
arrays of varying dtypes (strings of length N), but this definitely
violates the principle of least surprise...

Anne
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion

