[Numpy-discussion] vstack and hstack performance penalty

2014-01-24 Thread Dinesh Vadhia
When using vstack or hstack on large arrays, are there any performance
penalties, e.g. does it take longer time-wise or make a copy of an array during
the operation?


Re: [Numpy-discussion] vstack and hstack performance penalty

2014-01-24 Thread Sebastian Berg
On Fri, 2014-01-24 at 06:13 -0800, Dinesh Vadhia wrote:
 When using vstack or hstack on large arrays, are there any
 performance penalties, e.g. does it take longer time-wise or make a copy of an
 array during the operation?

No, they all use concatenate. There are only constant overheads on top
of the necessary data copying. Though performance may vary because of
memory order, etc.
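For illustration, the equivalence (and hence the identical copying cost) can be
checked with a minimal sketch like this:

import numpy as np

a = np.arange(6).reshape(2, 3)
b = np.arange(6, 12).reshape(2, 3)

# For 2-D inputs, vstack/hstack give the same result as concatenate along
# axis 0 / axis 1; the cost is dominated by the single data copy either way.
assert np.array_equal(np.vstack((a, b)), np.concatenate((a, b), axis=0))
assert np.array_equal(np.hstack((a, b)), np.concatenate((a, b), axis=1))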

- Sebastian




[Numpy-discussion] Catching out-of-memory error before it happens

2014-01-24 Thread Dinesh Vadhia
I want to write a general exception handler to warn if too much data is being
loaded for the RAM size of a machine for a successful numpy array operation to
take place.  For example, the program multiplies two floating point arrays A
and B which are populated with loadtxt.  While the data is being loaded, I want
to continuously check that the data volume doesn't pass a threshold that will
cause an out-of-memory error during the A*B operation.  The known variables are
the amount of memory available, the data type (floats in this case) and the
numpy array operation to be performed.  It seems this requires knowledge of the
internal memory requirements of each numpy operation.  For the sake of
simplicity, we can ignore the other memory needs of the program.  Is this possible?


Re: [Numpy-discussion] Catching out-of-memory error before it happens

2014-01-24 Thread Nathaniel Smith
There is no reliable way to predict how much memory an arbitrary numpy
operation will need, no. However, in most cases the main memory cost will
be simply the need to store the input and output arrays; for large arrays,
all other allocations should be negligible.
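As a rough illustration of that rule of thumb, a pre-check can just add up the
sizes of the input and output arrays (a sketch only; psutil is an assumed
external package used to query free memory, and the 0.8 headroom factor is
arbitrary):

import numpy as np
import psutil  # assumption: external package, only used for available memory

def rough_required_bytes(shape_a, shape_b, dtype=np.float64):
    # Count only the input and output arrays, per the rule of thumb above.
    itemsize = np.dtype(dtype).itemsize
    n_a = int(np.prod(shape_a)) * itemsize
    n_b = int(np.prod(shape_b)) * itemsize
    n_out = max(n_a, n_b)  # elementwise A*B output, ignoring broadcasting
    return n_a + n_b + n_out

needed = rough_required_bytes((50000, 1000), (50000, 1000))
available = psutil.virtual_memory().available
if needed > 0.8 * available:  # leave some headroom
    raise MemoryError("estimated %d MiB needed, %d MiB available"
                      % (needed // 2**20, available // 2**20))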

The most effective way to avoid running out of memory, therefore, is to
avoid creating temporary arrays, by using only in-place operations.

E.g., if a and b each require N bytes of RAM, then the rough memory
requirements are:

c = a + b: 3N
c = a + 2*b: 4N
a += b: 2N
np.add(a, b, out=a): 2N
b *= 2; a += b: 2N

Note that simply loading a and b requires 2N memory, so the latter code
samples are near-optimal.

Of course some calculations do require the use of temporary storage space...
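A minimal sketch of the in-place variants above (array sizes are arbitrary):

import numpy as np

a = np.random.rand(10 ** 7)  # each array is the "N bytes" in the accounting above
b = np.random.rand(10 ** 7)

np.add(a, b, out=a)  # a = a + b with no temporary: ~2N total

# a + 2*b with no extra temporary, at the cost of clobbering b: still ~2N
b *= 2
a += b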

-n


Re: [Numpy-discussion] Catching out-of-memory error before it happens

2014-01-24 Thread Francesc Alted

Yeah, numexpr is pretty cool for avoiding temporaries in an easy way:

https://github.com/pydata/numexpr
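For example, the expression discussed above can be evaluated with numexpr
roughly like this (a sketch):

import numpy as np
import numexpr as ne

a = np.random.rand(10 ** 6)
b = np.random.rand(10 ** 6)

# numexpr evaluates the expression in blocks, so no array-sized temporary
# for 2*b or a + 2*b is ever materialized.
c = ne.evaluate("a + 2*b")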

Francesc


--
Francesc Alted



Re: [Numpy-discussion] Catching out-of-memory error before it happens

2014-01-24 Thread Chris Barker - NOAA Federal
c = a + b: 3N
c = a + 2*b: 4N

Does python garbage collect mid-expression? I.e.:

C = (a + 2*b) + b

4N or 5N?

Also note that when memory gets tight, fragmentation can be a problem, i.e.
if two size-N arrays were just freed, you still may not be able to
allocate a size-2N array. This seems to be worse on Windows; not sure why.

a += b: 2N
np.add(a, b, out=a): 2N
b *= 2; a += b: 2N

Note that simply loading a and b requires 2N memory, so the latter code
samples are near-optimal.

And it will run quite a bit faster for large arrays -- pushing that memory
around takes time.

-Chris


Re: [Numpy-discussion] vstack and hstack performance penalty

2014-01-24 Thread Dinesh Vadhia
If A is very large and B is very small then np.concatenate(A, B) will copy
B's data over to A which would take less time than the other way around - is
that so?

Does 'memory order' mean that it depends on sufficient contiguous
memory being available for B otherwise it will be fragmented or something
else? 



Re: [Numpy-discussion] vstack and hstack performance penalty

2014-01-24 Thread Robert Kern
On Fri, Jan 24, 2014 at 4:01 PM, Dinesh Vadhia dineshbvad...@hotmail.com
wrote:

 If A is very large and B is very small then np.concatenate(A, B) will copy
 B's data over to A which would take less time than the other way around - is
 that so?

No, neither array is modified in-place. A new array is created and both A
and B are copied into it. The order is largely unimportant.

 Does 'memory order' mean that it depends on sufficient contiguous
 memory being available for B otherwise it will be fragmented or something
 else?

No, the output is never fragmented. numpy arrays may be strided, but never
fragmented arbitrarily to fit into a fragmented address space.

http://docs.scipy.org/doc/numpy/reference/arrays.ndarray.html#internal-memory-layout-of-an-ndarray

The issue is what axis the concatenation happens on. If it's the first axis
(and both inputs are contiguous), then it only takes two memcpy() calls to
copy the data, one for each input, because the regions where they go into
the output are juxtaposed. If you concatenate on one of the other axes,
though, then the memory regions for A and B will be interleaved and you
have to do 2*N memory copies (N being some number depending on the shape).
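A small sketch of the two cases (the shapes here are arbitrary):

import numpy as np

A = np.zeros((1000, 3))
B = np.zeros((2000, 3))
C = np.zeros((1000, 5))

# First axis: each C-contiguous input lands in one contiguous block of the
# output -- essentially two big copies.
ab = np.concatenate((A, B), axis=0)   # shape (3000, 3)

# Another axis: rows of A and C are interleaved in the output, so the copy
# happens in many smaller per-row chunks instead.
ac = np.concatenate((A, C), axis=1)   # shape (1000, 8)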

--
Robert Kern


Re: [Numpy-discussion] Catching out-of-memory error before it happens

2014-01-24 Thread Nathaniel Smith
On 24 Jan 2014 15:57, Chris Barker - NOAA Federal chris.bar...@noaa.gov
wrote:


 c = a + b: 3N
 c = a + 2*b: 4N

 Does python garbage collect mid-expression? I.e. :

 C = (a + 2*b) + b

 4 or 5 N?

It should be collected as soon as the reference gets dropped, so 4N. (This
is the advantage of a greedy refcounting collector.)
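For comparison, splitting the expression and updating in place keeps the peak
at roughly 3N (a sketch, using the same accounting as above):

import numpy as np

a = np.random.rand(10 ** 7)
b = np.random.rand(10 ** 7)

# C = (a + 2*b) + b, rewritten to peak at ~3N instead of ~4N:
c = 2 * b   # one new N-byte array; a, b and c now live -> ~3N
c += a      # in place
c += b      # in place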

 Also note that when memory gets tight, fragmentation can be a problem,
 i.e. if two size-N arrays were just freed, you still may not be able to
 allocate a size-2N array. This seems to be worse on Windows, not sure why.

If your arrays are big enough that you're worried that making a stray copy
will ENOMEM, then you *shouldn't* have to worry about fragmentation -
malloc will give each array its own virtual mapping, which can be backed by
discontinuous physical memory. (I guess it's possible windows has a somehow
shoddy VM system and this isn't true, but that seems unlikely these days?)

Memory fragmentation is more a problem if you're allocating lots of small
objects of varying sizes.

On 32 bit, virtual address fragmentation could also be a problem, but if
you're working with giant data sets then you need 64 bits anyway :-).

-n


[Numpy-discussion] np.array creation: unexpected behaviour

2014-01-24 Thread Emanuele Olivetti
Hi,

I just came across this unexpected behaviour when creating
a np.array() from two other np.arrays of different shape.
Have a look at this example:

import numpy as np
a = np.zeros(3)
b = np.zeros((2,3))
c = np.zeros((3,2))
ab = np.array([a, b])
print ab.shape, ab.dtype
ac = np.array([a, c], dtype=np.object)
print ac.shape, ac.dtype
ac_no_dtype = np.array([a, c])
print ac_no_dtype.shape, ac_no_dtype.dtype

The output, with NumPy v1.6.1 (Ubuntu 12.04) is:

(2,) object
(2, 3) object
Traceback (most recent call last):
  File "/tmp/numpy_bug.py", line 9, in <module>
    ac_no_dtype = np.array([a, c])
ValueError: setting an array element with a sequence.


The result for 'ab' is what I expect. The one for 'ac' is
a bit surprising. The one for 'ac_no_dtype' is even more surprising.

Is this an expected behaviour?

Best,

Emanuele



Re: [Numpy-discussion] np.array creation: unexpected behaviour

2014-01-24 Thread josef . pktd
On Fri, Jan 24, 2014 at 11:30 AM, Emanuele Olivetti
emanu...@relativita.com wrote:

The exception for ac_no_dtype is what I always expected, since it's not
a rectangular array. It usually happened when I made a mistake.
**Unfortunately**, in newer numpy versions it will also create an object
array, AFAIR.

Josef




Re: [Numpy-discussion] Catching out-of-memory error before it happens

2014-01-24 Thread Dinesh Vadhia
So, with the example case, the approximate memory cost for an in-place 
operation would be:

A *= B : 2N

But, if the original A or B is to remain unchanged then it will be:

C = A * B : 3N ?



Re: [Numpy-discussion] Catching out-of-memory error before it happens

2014-01-24 Thread Nathaniel Smith
Yes.


Re: [Numpy-discussion] Catching out-of-memory error before it happens

2014-01-24 Thread Dinesh Vadhia
Francesc: Thanks. I looked at numexpr a few years back but it didn't support 
array slicing/indexing.  Has that changed?




Re: [Numpy-discussion] (no subject)

2014-01-24 Thread Ralf Gommers
On Thu, Jan 23, 2014 at 11:58 PM, jennifer stone
jenny.stone...@gmail.com wrote:





 Scipy doesn't have a function for the Laplace transform, it has only a
 Laplace distribution in scipy.stats and a Laplace filter in scipy.ndimage.
 An inverse Laplace transform would be very welcome I'd think - it has real
 world applications, and there's no good implementation in any open source
 library as far as I can tell. It's probably doable, but not the easiest
 topic for a GSoC I think. From what I can find, the paper Numerical
 Transform Inversion Using Gaussian Quadrature from den Iseger contains
 what's considered the current state of the art algorithm. Browsing that
 gives a reasonable idea of the difficulty of implementing `ilaplace`.


 A brief scan through the paper Numerical Transform Inversion Using
 Gaussian Quadrature from den Iseger does indicate the complexity of the
 algorithm. But GSoC project or not, can't we work on it, step by step? I
 would love to see a contender for Matlab's ilaplace on the open source front!!


Yes, it would be quite nice to have. So if you're interested, by all means
give it a go. An issue for a GSoC will be how to maximize the chance of
success - typically merging smaller PRs frequently helps a lot in that
respect, but we can't merge an ilaplace implementation step by step.


 You can have a look at https://github.com/scipy/scipy/pull/2908/files for
 ideas. Most of the things that need improving or we really think we should
 have in Scipy are listed there. Possible topics are not restricted to that
 list though - it's more important that you pick something you're
 interested
 in and have the required background and coding skills for.


 Thanks a lot for the roadmap. Of the options provided, I found the
 'Cython'ization of Cluster great. Would it be possible to do it as the
 Summer project if I spend the month learning Cython?


There are a couple of things to consider. Your proposal should be neither
too easy nor too ambitious for one summer. Cythonizing cluster is probably
not enough for a full summer of work, especially if you can re-use some
Cython code that David WF or other people already have. So some new
functionality can be added to your proposal. The other important point is
that you need to find a mentor. Cluster is one of the smaller modules that
doesn't see a lot of development, and most of the core devs may not know it so
well. A good proposal may help find an interested mentor. I suggest you
start early with a draft proposal, and iterate a few times based on
feedback on this list.

You may want to have a look at your email client settings by the way, your
replies seem to start new threads.

Cheers,
Ralf




Re: [Numpy-discussion] Catching out-of-memory error before it happens

2014-01-24 Thread Chris Barker
On Fri, Jan 24, 2014 at 8:25 AM, Nathaniel Smith n...@pobox.com wrote:

 If your arrays are big enough that you're worried that making a stray copy
 will ENOMEM, then you *shouldn't* have to worry about fragmentation -
 malloc will give each array its own virtual mapping, which can be backed by
 discontinuous physical memory. (I guess it's possible windows has a somehow
 shoddy VM system and this isn't true, but that seems unlikely these days?)

All I know is that when I push the limits with memory on a 32 bit Windows
system, it often crashes out even though I've never seen more than about 1GB
of memory use by the application -- I would have thought that would
be plenty of overhead.

I also know that I've reached limits on Windows32 well before OS-X 32, but
that may be because, IIUC, Windows32 only allows 2GB per process, whereas
OS-X32 allows 4GB per process.

Memory fragmentation is more a problem if you're allocating lots of small
 objects of varying sizes.

It could be that's what I've been doing

On 32 bit, virtual address fragmentation could also be a problem, but if
 you're working with giant data sets then you need 64 bits anyway :-).

Well, giant is defined relative to the system capabilities... but yes, if
you're pushing the limits of a 32 bit system, the easiest thing to do is
go to 64 bits and some more memory!

-CHB

-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/ORR   (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov


Re: [Numpy-discussion] Text array dtype for numpy

2014-01-24 Thread Chris Barker
Oscar,

Cool stuff, thanks!

I'm wondering though what the use-case really is. The py3 text model
(actually the py2 one, too) is quite clear that you want users to think
of, and work with, text as text -- and not care how things are encoded in
the underlying implementation. You only want the user to think about
encodings on I/O -- transferring stuff between systems where you can't
avoid it. And you might choose different encodings based on different needs.

So why have a different, the-user-needs-to-think-about-encodings numpy
 dtype? We already have 'U' for full-on unicode support for text. There is
a good argument for a more compact internal representation for text
compatible with one-byte-per-char encoding, thus the suggestion for such a
dtype. But I don't see the need for quite this. Maybe I'm not being a
creative enough thinker.

Also, we may want numpy to interact at a low level with other libs that
might have binary encoded text (HDF, etc) -- in which case we need a bytes
dtype that can store that data, and perhaps encoding and decoding ufuncs.

If we want a more efficient and compact unicode implementation then the
py3 one is a good place to start -- it's pretty slick! Though maybe harder
to do in numpy, as text in numpy probably wouldn't be immutable.

To make a slightly more concrete proposal, I've implemented a pure
 Python ndarray subclass that I believe can consistently handle
 text/bytes in Python 3.


this scares me right there -- is it text or bytes??? We really don't want
something that is both.


 The idea is that the array has an encoding. It stores strings as
 bytes. The bytes are encoded/decoded on insertion/access. Methods
 accessing the binary content of the array will see the encoded bytes.
 Methods accessing the elements of the array will see unicode strings.

 I believe it would not be as hard to implement as the proposals for
 variable length string arrays.


except that with some encodings, the number of bytes required is a function
of what the content of the text is -- so it either has to be variable
length, or a fixed number of bytes, which is not a fixed number
of characters -- which requires both careful truncation (a pain), and
surprising results for users (why can't I fit 10 characters in a length-10
text object? And I can if they are different characters?)


 The one caveat is that it will strip
 null characters from the end of any string.


which is fatal, but you do want a new dtype after all, which presumably
wouldn't do that.

-Chris


-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/ORR   (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov


Re: [Numpy-discussion] Catching out-of-memory error before it happens

2014-01-24 Thread Nathaniel Smith
On Fri, Jan 24, 2014 at 10:29 PM, Chris Barker chris.bar...@noaa.gov wrote:

Oh, yeah, common confusion. Allowing 2 GiB of address space per
process doesn't mean you can actually practically use 2 GiB of
*memory* per process, esp. if you're allocating/deallocating a mix of
large and small objects, because address space fragmentation will kill
you way before that. The memory is there, but there isn't anywhere to slot
it into the process's address space. So you don't need to add more
memory, just switch to a 64-bit OS.

On 64-bit you have oodles of address space, so the memory manager can
easily slot in large objects far away from small objects, and it's
only fragmentation within each small-object arena that hurts. A good
malloc will keep this overhead down pretty low though -- certainly
less than the factor of two you're thinking about.

-n


[Numpy-discussion] Comparison changes

2014-01-24 Thread Sebastian Berg
Hi all,

in https://github.com/numpy/numpy/pull/3514 I proposed some changes to
the comparison operators. This includes:

1. Comparison with None will broadcast in the future, so that `arr ==
None` will actually compare all elements to None. (A FutureWarning for
now)

2. I added that == and != will give FutureWarning when an error was
raised. In the future they should not silence these errors anymore. (For
example shape mismatches)

3. We used to use PyObject_RichCompareBool for equality which includes
an identity check. I propose to not do that identity check since we have
elementwise equality (returning an object array for objects would be
nice in some ways, but I think that is only an option for a dedicated
function). The reason is that for example

 a = np.array([np.array([1, 2, 3]), 1])
 b = np.array([np.array([1, 2, 3]), 1])
 a == b

will happen to work if it happens to be that `a[0] is b[0]`. This
currently has no deprecation, since the logic is in the inner loop and I
am not sure whether it is easy to add one cleanly there.
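To illustrate point 1, a small sketch of the proposed semantics (the commented
result reflects the proposal, not necessarily the numpy version you have
installed):

import numpy as np

arr = np.array([1, None, 3], dtype=object)

# Currently `arr == None` ignores the elements and returns a single scalar
# False.  Under the proposal, None broadcasts like any other scalar, so the
# comparison is elementwise and should match:
expected = np.array([x is None for x in arr])   # array([False,  True, False])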

Are there objections/comments to these changes?

Regards,

Sebastian



Re: [Numpy-discussion] Comparison changes

2014-01-24 Thread Nathaniel Smith
On 25 Jan 2014 00:05, Sebastian Berg sebast...@sipsolutions.net wrote:

 Hi all,

 in https://github.com/numpy/numpy/pull/3514 I proposed some changes to
 the comparison operators. This includes:

 1. Comparison with None will broadcast in the future, so that `arr ==
 None` will actually compare all elements to None. (A FutureWarning for
 now)

 2. I added that == and != will give FutureWarning when an error was
 raised. In the future they should not silence these errors anymore. (For
 example shape mismatches)

This can just be a DeprecationWarning, because the only change is to raise
new errors.

 3. We used to use PyObject_RichCompareBool for equality which includes
 an identity check. I propose to not do that identity check since we have
 elementwise equality (returning an object array for objects would be
 nice in some ways, but I think that is only an option for a dedicated
 function). The reason is that for example

  a = np.array([np.array([1, 2, 3]), 1])
  b = np.array([np.array([1, 2, 3]), 1])
  a == b

 will happen to work if it happens to be that `a[0] is b[0]`. This
 currently has no deprecation, since the logic is in the inner loop and I
 am not sure if it is easy to add well there.

Surely any environment where we can call PyObject_RichCompareBool is an
environment where we can issue a warning...?

-n


Re: [Numpy-discussion] np.array creation: unexpected behaviour

2014-01-24 Thread Stéfan van der Walt
On Fri, 24 Jan 2014 17:30:33 +0100, Emanuele Olivetti wrote:
 I just came across this unexpected behaviour when creating
 a np.array() from two other np.arrays of different shape.

The tuple parsing for the construction of new numpy arrays is pretty
tricky/hairy, and doesn't always do exactly what you'd expect.

The easiest workaround is probably to pre-allocate the array:

In [24]: data = [a, c]
In [25]: x = np.empty(len(data), dtype=object)
In [26]: x[:] = data
In [27]: x.shape
Out[27]: (2,)

Regards
Stéfan



Re: [Numpy-discussion] Text array dtype for numpy

2014-01-24 Thread josef . pktd
On Fri, Jan 24, 2014 at 5:43 PM, Chris Barker chris.bar...@noaa.gov wrote:
 Oscar,

 Cool stuff, thanks!

 I'm wondering though what the use-case really is. The py3 text model
 (actually the py2 one, too) is quite clear that you want users to think of,
 and work with, text as text -- and not care how things are encoded in the
 underlying implementation. You only want the user to think about encodings
 on I/O -- transferring stuff between systems where you can't avoid it. And
 you might choose different encodings based on different needs.

 So why have a different, the-user-needs-to-think-about-encodings numpy
 dtype? We already have 'U' for full-on unicode support for text. There is a
 good argument for a more compact internal representation for text compatible
 with one-byte-per-char encoding, thus the suggestion for such a dtype. But I
 don't see the need for quite this. Maybe I'm not being a creative enough
 thinker.

In my opinion something like Oscar's class would be very useful (with
some adjustments, especially making it easy to create an S view or put
an encoding view on top of an S array).

(Disclaimer: My only experience is in converting some examples in
statsmodels to bytes in py 3 and to play with some examples.)

My guess is that 'S'/bytes is very convenient for library code,
because it doesn't care about encodings (assuming we have enough
control that all bytes are in the same encoding), and we don't have
any overhead to convert to strings when comparing or working with
byte strings.
'S' is also very flexible because it doesn't tie us down to a minimum
size for the encoding nor any specific encoding.

The problem with 'S'/bytes is in input/output and interactive work, as
in the examples of Tom Aldcroft. The textarray dtype would allow us to
view any 'S' array so we can have text/string interaction with python
and get the correct encoding on input and output.

Whether you live in an ascii, latin1, cp1252, iso8859_5 or in any
other world, you could get your favorite minimal memory
S/bytes/strings.

I think this is useful as a complement to the current 'S' type, and to
make that more useful on python 3, independent of whatever other
small-memory unicode dtype with a predefined encoding numpy might get.
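As a small sketch of that kind of round trip between 'S' storage and text
(using the existing np.char.encode/np.char.decode helpers; latin1 is just an
example encoding):

import numpy as np

text = np.array([u'abc', u'münchen'])   # a 'U' (unicode) array
raw = np.char.encode(text, 'latin1')    # 'S' array, one byte per character
back = np.char.decode(raw, 'latin1')    # 'U' array again, same contents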


 Also, we may want numpy to interact at a low level with other libs that
 might have binary encoded text (HDF, etc) -- in which case we need a bytes
 dtype that can store that data, and perhaps encoding and decoding ufuncs.

 If we want a more efficient and compact unicode implementation then the py3
 one is a good place to start -- it's pretty slick! Though maybe harder to do
 in numpy, as text in numpy probably wouldn't be immutable.

 To make a slightly more concrete proposal, I've implemented a pure
 Python ndarray subclass that I believe can consistently handle
 text/bytes in Python 3.


 this scares me right there -- is it text or bytes??? We really don't want
 something that is both.

Most users won't care about the internal representation of anything.
But when we want or find it useful, we can view the memory with any
compatible dtype. That is, with numpy we always also have the raw bytes.
And there are lots of ways to shoot yourself.

Why would you want to do that?:
>>> a = np.arange(5)
>>> b = a.view('S4')
>>> b[1] = 'h'
>>> a
array([  0, 104,   2,   3,   4])

>>> a[1] = 'h'
Traceback (most recent call last):
  File "<pyshell#22>", line 1, in <module>
    a[1] = 'h'
ValueError: invalid literal for int() with base 10: 'h'



 The idea is that the array has an encoding. It stores strings as
 bytes. The bytes are encoded/decoded on insertion/access. Methods
 accessing the binary content of the array will see the encoded bytes.
 Methods accessing the elements of the array will see unicode strings.

 I believe it would not be as hard to implement as the proposals for
 variable length string arrays.


 except that with some encodings, the number of bytes required is a function
 of what the content of the text is -- so it either has to be variable
 length, or a fixed number of bytes, which is not a fixed number of
 characters -- which requires both careful truncation (a pain), and surprising
 results for users (why can't I fit 10 characters in a length-10 text
 object? And I can if they are different characters?)

Not really different from other places where you have to pay attention
to the underlying dtype; it's a question of providing the underlying
information (like itemsize).

1 - 1e-20 -- I had code like that when I wasn't thinking properly or
wasn't paying enough attention to what I was typing.



 The one caveat is that it will strip
 null characters from the end of any string.


 which is fatal, but you do want a new dtype after all, which presumably
 wouldn't do that.

The only place so far that I found where this really hurts is in the
decode examples (with utf32LE for example).
That's why I think numpy needs to have decode/encode functions, so it
can access the bytes before they are null truncated, besides being