[Numpy-discussion] One-byte string dtype: third time's the charm?

2015-02-22 Thread Aldcroft, Thomas
The idea of a one-byte string dtype has been extensively discussed twice
before, with a lot of good input and ideas, but no action [1, 2].

tl;dr: Perfect is the enemy of good.  Can numpy just add a one-byte string
dtype named 's' that uses latin-1 encoding as a bridge to enable Python 3
usage in the near term?

A key consequence of not having a one-byte string dtype is that handling
ASCII data stored in binary formats such as HDF or FITS is basically broken
in Python 3.  Packages like h5py, pytables, and astropy.io.fits all return
text data arrays with the numpy 'S' type, and in fact have no direct
support for the numpy wide unicode 'U' type.  In Python 3, the 'S' type
array cannot be compared with the Python str type, so that something like
below fails:

  mask = (names_array == "john")  # FAIL

Problems like this are now showing up in the wild [3].  Workarounds are
also showing up, like a way to easily convert from 'S' to 'U' within
astropy Tables [4], but this is really not a desirable way to go.
Gigabyte-sized string data arrays are not uncommon, so converting to UCS-4
is a real memory and performance hit.
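A minimal sketch of the trade-off described above, assuming a small made-up column of names (`names_array` here is illustrative data, not output from h5py or astropy). It compares against a bytes literal, which works with the 'S' dtype, and then shows the fourfold growth in item size after converting to 'U'. The result of comparing an 'S' array directly with a str differs between numpy versions, so that case is left out.

```python
import numpy as np

# A small stand-in for a column of names read from an HDF or FITS file.
names_array = np.array([b"john", b"paul", b"ringo"], dtype="S5")

# Comparing against a bytes literal works with the 'S' dtype.
mask = names_array == b"john"

# Converting to the wide 'U' dtype makes str comparisons work, but every
# character now occupies four bytes instead of one.
names_unicode = names_array.astype("U")
mask_unicode = names_unicode == "john"

print(mask.tolist(), mask_unicode.tolist())
print(names_array.itemsize, names_unicode.itemsize)
```

For a gigabyte-sized 'S' column the same conversion would need roughly four gigabytes for the 'U' copy.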

For a good top-level summary of much of the previous thread discussion, see
[5] from Chris Barker.  Condensing this down to just a few points:

- *Changing* the behavior of the existing 'S' type is going to break code
and seems a bad idea.
- *Adding*  a new dtype 's' will work and allow highly performant
conversion from 'S' to 's' via view().
- Using the latin-1 encoding will minimize code breakage vis-a-vis what
works in Python 2 [6].
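The points above about latin-1 and view() can be illustrated with a short sketch. It relies only on the fact that latin-1 maps each byte value to the code point with the same number. The proposed 's' dtype does not exist in numpy, so `np.uint8` is used below purely as a stand-in to show that view() reinterprets memory without copying.

```python
import numpy as np

# Every possible byte value, such as uint8 data stored in a bytestring.
all_bytes = bytes(range(256))

# latin-1 maps byte N to code point N, so decoding never fails and
# encoding the result gives back the original bytes.
text = all_bytes.decode("latin-1")
round_trip = text.encode("latin-1")

# UTF-8 rejects many byte sequences, so the same data cannot be decoded.
try:
    all_bytes.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Reinterpreting an 'S' array with view() copies nothing. uint8 stands in
# here for the proposed 's' dtype, which does not exist in numpy.
names = np.array([b"abc", b"xyz"], dtype="S3")
raw = names.view(np.uint8)

print(round_trip == all_bytes, utf8_ok, np.shares_memory(names, raw))
```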

Using latin-1 is a pragmatic compromise that provides continuity to allow
scientists to run their existing code in Python 3 and have things just
work.  It isn't perfect and it should not be the end of the story, but it
would be good.  This single issue is the *only* thing blocking me and my
team from using Python 3 in operations.

As a final point, I don't know the numpy internals at all, but it *seems*
like this proposal is one of the easiest to implement amongst those that
were discussed.

Cheers,
Tom

[1]:
http://mail.scipy.org/pipermail/numpy-discussion/2014-January/068622.html
[2]: http://mail.scipy.org/pipermail/numpy-discussion/2014-July/070574.html
[3]: https://github.com/astropy/astropy/issues/3311
[4]:
http://astropy.readthedocs.org/en/latest/api/astropy.table.Table.html#astropy.table.Table.convert_bytestring_to_unicode
[5]: http://mail.scipy.org/pipermail/numpy-discussion/2014-July/070631.html
[6]: It is not uncommon to store uint8 data in a bytestring
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] One-byte string dtype: third time's the charm?

2015-02-22 Thread Sturla Molden
On 22/02/15 19:21, Aldcroft, Thomas wrote:

 Problems like this are now showing up in the wild [3].  Workarounds are
 also showing up, like a way to easily convert from 'S' to 'U' within
 astropy Tables [4], but this is really not a desirable way to go.
 Gigabyte-sized string data arrays are not uncommon, so converting to
 UCS-4 is a real memory and performance hit.

Why UCS-4? Python's internal flexible string representation will
use ASCII for ASCII text.

By PEP 393 an application should not assume an internal string 
representation at all:

https://www.python.org/dev/peps/pep-0393/

If the problem is PEP 393 violation in NumPy string or unicode dtype, we 
shouldn't violate it even further by adding a latin-1 encoded ascii 
string. We should let Python represent strings as it wants, and it will 
not bloat.

I am -1 on adding latin-1 and +1 on making the unicode dtype PEP 393
compliant if it is not. And on Python 3 'U' and 'S' should just be synonyms.

You can also store an array of bytes with uint8. Then you can decode it 
however you like to make a Python string. If it is encoded as latin-1 
then decode it as latin-1:


In [1]: import numpy as np

In [2]: ascii_bytestr = 'The quick brown fox jumps over the lazy dog'.encode('latin-1')

In [3]: numpy_bytestr = np.array(memoryview(ascii_bytestr))

In [4]: numpy_bytestr.dtype, numpy_bytestr.shape
Out[4]: (dtype('uint8'), (43,))

In [5]: bytes(numpy_bytestr).decode('latin-1')
Out[5]: 'The quick brown fox jumps over the lazy dog'


Sturla



Re: [Numpy-discussion] One-byte string dtype: third time's the charm?

2015-02-22 Thread Nathaniel Smith
On Sun, Feb 22, 2015 at 11:29 AM, Sturla Molden sturla.mol...@gmail.com wrote:
 On 22/02/15 19:21, Aldcroft, Thomas wrote:

 Problems like this are now showing up in the wild [3].  Workarounds are
 also showing up, like a way to easily convert from 'S' to 'U' within
 astropy Tables [4], but this is really not a desirable way to go.
 Gigabyte-sized string data arrays are not uncommon, so converting to
 UCS-4 is a real memory and performance hit.

 Why UCS-4? Python's internal flexible string representation will
 use ASCII for ASCII text.

This is a discussion about how strings are represented as bit-patterns
inside ndarrays; the internal storage representation used by 'str' is
irrelevant.

-n

-- 
Nathaniel J. Smith -- http://vorpus.org


Re: [Numpy-discussion] One-byte string dtype: third time's the charm?

2015-02-22 Thread Robert Kern
On Sun, Feb 22, 2015 at 7:29 PM, Sturla Molden sturla.mol...@gmail.com
wrote:

 On 22/02/15 19:21, Aldcroft, Thomas wrote:

  Problems like this are now showing up in the wild [3].  Workarounds are
  also showing up, like a way to easily convert from 'S' to 'U' within
  astropy Tables [4], but this is really not a desirable way to go.
  Gigabyte-sized string data arrays are not uncommon, so converting to
  UCS-4 is a real memory and performance hit.

 Why UCS-4? Python's internal flexible string representation will
 use ASCII for ASCII text.

numpy's 'U' dtype is UCS-4, and this is what Thomas is referring to, not
Python's string type. It cannot have a flexible representation as it *is*
the representation. Python 3's `str` type is opaque, so it can freely
choose how to represent the data in memory. numpy dtypes transparently
describe how the data is represented in memory.
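The distinction can be seen directly by looking at the bytes behind each dtype. This is a small illustrative sketch, assuming a native byte order 'U' array; the variable names are made up for the example.

```python
import numpy as np

u = np.array(["abc"], dtype="U3")   # numpy unicode: four bytes per character
s = np.array([b"abc"], dtype="S3")  # numpy bytes: one byte per character

# Viewing the 'U' buffer as uint32 exposes one code point per element,
# which is exactly the UCS-4 layout the dtype describes.
code_points = u.view(np.uint32).tolist()

# Viewing the 'S' buffer as uint8 exposes the raw bytes.
raw_bytes = s.view(np.uint8).tolist()

print(u.itemsize, s.itemsize)
print(code_points, raw_bytes)
```

Both views show the same numbers for ASCII text; only the width of each stored element differs.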

--
Robert Kern


Re: [Numpy-discussion] One-byte string dtype: third time's the charm?

2015-02-22 Thread Sturla Molden
On 22/02/15 21:04, Robert Kern wrote:

 Python 3's `str` type is opaque, so it can
 freely choose how to represent the data in memory. numpy dtypes
 transparently describe how the data is represented in memory.

Hm, yes, that is a good point.


Sturla



Re: [Numpy-discussion] One-byte string dtype: third time's the charm?

2015-02-22 Thread Nathaniel Smith
On Sun, Feb 22, 2015 at 2:46 PM, Charles R Harris charlesr.har...@gmail.com
wrote:

 On Sun, Feb 22, 2015 at 3:40 PM, Aldcroft, Thomas
 aldcr...@head.cfa.harvard.edu wrote:



 On Sun, Feb 22, 2015 at 2:52 PM, Nathaniel Smith n...@pobox.com wrote:

 On Sun, Feb 22, 2015 at 10:21 AM, Aldcroft, Thomas
 aldcr...@head.cfa.harvard.edu wrote:
  The idea of a one-byte string dtype has been extensively discussed
  twice
  before, with a lot of good input and ideas, but no action [1, 2].
 
  tl;dr: Perfect is the enemy of good. Can numpy just add a one-byte
  string
  dtype named 's' that uses latin-1 encoding as a bridge to enable
Python
  3
  usage in the near term?

 I think this is a good idea. I think overall it would be good for
 numpy to switch to using variable-length strings in most cases (cf.
 pandas), which is a different kind of change, but fixed-length 8-bit
 encoded text is obviously a common on-disk format in scientific
 applications, so numpy will still need some way to deal with it
 conveniently. In the long run we'd like to have more flexibility (e.g.
 allowing choice of character encoding), but since this proposal is a
 subset of that functionality, then it won't interfere with later
 improvements. I can see an argument for utf8 over latin1, but it
 really doesn't matter that much so whatever, blue and purple bikesheds
 are both fine.

 The tricky bit here is "just" :-). Do you want to implement this? Do
 you know someone who does? It's possible but will be somewhat
 annoying, since to do it directly without refactoring how dtypes work
 first then you'll have to add lots of copy-paste code to all the
 different ufuncs.


 I would be happy to have a go at this, with the caveat that someone who
 understands numpy would need to get me started with a minimal prototype.
 From there I can do the annoying copy-paste for ufuncs etc, writing tests
 and docs. I'm assuming that with a prototype then the rest can be done
 without any deep understanding of numpy internals (which I do not have).

 - Tom



 The last two new types added to numpy were float16 and datetime64. Might be
 worth looking at the steps needed to implement those. There was also a user
 type, `rational` that got added, that could also provide a template. Maybe
 we need to have a way to add 'numpy certified' user data types. It might
 also be possible to reuse the `c` data type, currently implemented as `S1`
 IIRC, but that could cause some problems.

float16 and rational probably aren't too relevant because they are
fixed-size types, and variable-size dtypes are much trickier. datetime64
will be more similar, but also add its own irrelevant complexities -- you
might be best off just looking at how S and U work and copying them.

-n

-- 
Nathaniel J. Smith -- http://vorpus.org


Re: [Numpy-discussion] One-byte string dtype: third time's the charm?

2015-02-22 Thread Aldcroft, Thomas
On Sun, Feb 22, 2015 at 5:56 PM, Aldcroft, Thomas 
aldcr...@head.cfa.harvard.edu wrote:



 On Sun, Feb 22, 2015 at 5:46 PM, Charles R Harris 
 charlesr.har...@gmail.com wrote:



 On Sun, Feb 22, 2015 at 3:40 PM, Aldcroft, Thomas 
 aldcr...@head.cfa.harvard.edu wrote:



 On Sun, Feb 22, 2015 at 2:52 PM, Nathaniel Smith n...@pobox.com wrote:

 On Sun, Feb 22, 2015 at 10:21 AM, Aldcroft, Thomas
 aldcr...@head.cfa.harvard.edu wrote:
  The idea of a one-byte string dtype has been extensively discussed
 twice
  before, with a lot of good input and ideas, but no action [1, 2].
 
  tl;dr: Perfect is the enemy of good.  Can numpy just add a one-byte
 string
  dtype named 's' that uses latin-1 encoding as a bridge to enable
 Python 3
  usage in the near term?

 I think this is a good idea. I think overall it would be good for
 numpy to switch to using variable-length strings in most cases (cf.
 pandas), which is a different kind of change, but fixed-length 8-bit
 encoded text is obviously a common on-disk format in scientific
 applications, so numpy will still need some way to deal with it
 conveniently. In the long run we'd like to have more flexibility (e.g.
 allowing choice of character encoding), but since this proposal is a
 subset of that functionality, then it won't interfere with later
 improvements. I can see an argument for utf8 over latin1, but it
 really doesn't matter that much so whatever, blue and purple bikesheds
 are both fine.

 The tricky bit here is "just" :-). Do you want to implement this? Do
 you know someone who does? It's possible but will be somewhat
 annoying, since to do it directly without refactoring how dtypes work
 first then you'll have to add lots of copy-paste code to all the
 different ufuncs.


 I would be happy to have a go at this, with the caveat that someone
 who understands numpy would need to get me started with a minimal
 prototype.  From there I can do the annoying copy-paste for ufuncs etc,
 writing tests and docs.  I'm assuming that with a prototype then the rest
 can be done without any deep understanding of numpy internals (which I do
 not have).

 - Tom



 The last two new types added to numpy were float16 and datetime64. Might
 be worth looking at the steps needed to implement those. There was also a
 user type, `rational` that got added, that could also provide a template.
 Maybe we need to have a way to add 'numpy certified' user data types. It
 might also be possible to reuse the `c` data type, currently implemented as
 `S1` IIRC, but that could cause some problems.


 OK I'll have a look at those.


On second thought..  Maybe I'm being naive, but I think that starting from
scratch looking at entirely new dtypes is harder than it needs to be, or at
least not the most straightforward path [EDIT: just saw email from Nathan
agreeing here].  What is being proposed is essentially:

- For Python 2, the 's' type is exactly a clone of 'S'.  In other words 's'
will interface with Python as a bytes (aka str) object just like 'S'.
- For Python 3, the 's' type is internally the same as 'S' (np.bytes_) in
all operations, but interfaces with Python as a latin-1 encoded string.  So
the only difference is at the interface layer with Python (initialization,
comparison, iteration, etc).

So as a starting point we would want to clone 'S' to 's', then fix up the
interface to Python 3.  Does that sound about right?
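To make the idea concrete, here is a rough Python-level sketch of that interface layer. `Latin1Column` is a hypothetical wrapper invented for illustration, not a numpy API and not how a real dtype would be implemented in C. It stores data as 'S' and converts to and from str with latin-1 only at the boundary (initialization, comparison, scalar indexing, iteration).

```python
import numpy as np


class Latin1Column:
    """Hypothetical wrapper, not a numpy API: bytes inside, str at the edges."""

    def __init__(self, values):
        encoded = [v.encode("latin-1") if isinstance(v, str) else v
                   for v in values]
        self.data = np.array(encoded, dtype="S")  # internally identical to 'S'

    def __eq__(self, other):
        # Encode str operands so the comparison runs on the one-byte storage.
        if isinstance(other, str):
            other = other.encode("latin-1")
        return self.data == other

    def __getitem__(self, index):
        # Scalar indexing only, to keep the sketch short.
        return self.data[index].decode("latin-1")

    def __iter__(self):
        return (item.decode("latin-1") for item in self.data)


names = Latin1Column(["john", "paul", "Zo\u00eb"])
mask = names == "john"
print(mask.tolist(), names[2], list(names))
```

The storage stays at one byte per character, and user code only ever sees str objects.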

- Tom



 Thanks,
 Tom



 Chuck







Re: [Numpy-discussion] One-byte string dtype: third time's the charm?

2015-02-22 Thread Nathaniel Smith
On Feb 22, 2015 3:39 PM, Aldcroft, Thomas aldcr...@head.cfa.harvard.edu
wrote:



 On Sun, Feb 22, 2015 at 5:56 PM, Aldcroft, Thomas 
aldcr...@head.cfa.harvard.edu wrote:



 On Sun, Feb 22, 2015 at 5:46 PM, Charles R Harris 
charlesr.har...@gmail.com wrote:



 On Sun, Feb 22, 2015 at 3:40 PM, Aldcroft, Thomas 
aldcr...@head.cfa.harvard.edu wrote:



 On Sun, Feb 22, 2015 at 2:52 PM, Nathaniel Smith n...@pobox.com wrote:

 On Sun, Feb 22, 2015 at 10:21 AM, Aldcroft, Thomas
 aldcr...@head.cfa.harvard.edu wrote:
  The idea of a one-byte string dtype has been extensively discussed
twice
  before, with a lot of good input and ideas, but no action [1, 2].
 
  tl;dr: Perfect is the enemy of good.  Can numpy just add a one-byte
string
  dtype named 's' that uses latin-1 encoding as a bridge to enable
Python 3
  usage in the near term?

 I think this is a good idea. I think overall it would be good for
 numpy to switch to using variable-length strings in most cases (cf.
 pandas), which is a different kind of change, but fixed-length 8-bit
 encoded text is obviously a common on-disk format in scientific
 applications, so numpy will still need some way to deal with it
 conveniently. In the long run we'd like to have more flexibility (e.g.
 allowing choice of character encoding), but since this proposal is a
 subset of that functionality, then it won't interfere with later
 improvements. I can see an argument for utf8 over latin1, but it
 really doesn't matter that much so whatever, blue and purple bikesheds
 are both fine.

 The tricky bit here is "just" :-). Do you want to implement this? Do
 you know someone who does? It's possible but will be somewhat
 annoying, since to do it directly without refactoring how dtypes work
 first then you'll have to add lots of copy-paste code to all the
 different ufuncs.


 I would be happy to have a go at this, with the caveat that someone
who understands numpy would need to get me started with a minimal
prototype.  From there I can do the annoying copy-paste for ufuncs etc,
writing tests and docs.  I'm assuming that with a prototype then the rest
can be done without any deep understanding of numpy internals (which I do
not have).

 - Tom



 The last two new types added to numpy were float16 and datetime64.
Might be worth looking at the steps needed to implement those. There was
also a user type, `rational` that got added, that could also provide a
template. Maybe we need to have a way to add 'numpy certified' user data
types. It might also be possible to reuse the `c` data type, currently
implemented as `S1` IIRC, but that could cause some problems.


 OK I'll have a look at those.


 On second thought..  Maybe I'm being naive, but I think that starting
from scratch looking at entirely new dtypes is harder than it needs to be,
or at least not the most straightforward path [EDIT: just saw email from
Nathan agreeing here].  What is being proposed is essentially:

 - For Python 2, the 's' type is exactly a clone of 'S'.  In other words
's' will interface with Python as a bytes (aka str) object just like 'S'.
 - For Python 3, the 's' type is internally the same as 'S' (np.bytes_) in
all operations, but interfaces with Python as a latin-1 encoded string.  So
the only difference is at the interface layer with Python (initialization,
comparison, iteration, etc).

 So as a starting point we would want to clone 'S' to 's', then fix up the
interface to Python 3.  Does that sound about right?

Sounds reasonable to me.

You'll also want to consider interactions between the dtypes -- mixed
operations like
  array(a, dtype="s") == array(a, dtype="U")
should do the right thing, and casting 's' -> 'U' ditto.
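Since the 's' dtype is only a proposal, the sketch below uses the existing 'S' and 'U' dtypes with explicit conversion through `np.char.decode` and `np.char.encode` to show what "the right thing" could look like for latin-1 data. The array contents are made up for the example.

```python
import numpy as np

s_arr = np.array([b"a", b"caf\xe9"], dtype="S4")     # latin-1 encoded bytes
u_arr = np.array(["a", "caf\u00e9"], dtype="U4")     # the same text as unicode

# Explicit decode from one-byte storage to the wide unicode dtype.
decoded = np.char.decode(s_arr, "latin-1")
equal = decoded == u_arr

# The reverse cast, from 'U' back to one byte per character.
encoded = np.char.encode(u_arr, "latin-1")

print(equal.tolist(), (encoded == s_arr).tolist())
```

A real 's' dtype would presumably perform the equivalent of these conversions implicitly during mixed comparisons and casts.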

-n


Re: [Numpy-discussion] One-byte string dtype: third time's the charm?

2015-02-22 Thread Aldcroft, Thomas
On Sun, Feb 22, 2015 at 2:52 PM, Nathaniel Smith n...@pobox.com wrote:

 On Sun, Feb 22, 2015 at 10:21 AM, Aldcroft, Thomas
 aldcr...@head.cfa.harvard.edu wrote:
  The idea of a one-byte string dtype has been extensively discussed twice
  before, with a lot of good input and ideas, but no action [1, 2].
 
  tl;dr: Perfect is the enemy of good.  Can numpy just add a one-byte
 string
  dtype named 's' that uses latin-1 encoding as a bridge to enable Python 3
  usage in the near term?

 I think this is a good idea. I think overall it would be good for
 numpy to switch to using variable-length strings in most cases (cf.
 pandas), which is a different kind of change, but fixed-length 8-bit
 encoded text is obviously a common on-disk format in scientific
 applications, so numpy will still need some way to deal with it
 conveniently. In the long run we'd like to have more flexibility (e.g.
 allowing choice of character encoding), but since this proposal is a
 subset of that functionality, then it won't interfere with later
 improvements. I can see an argument for utf8 over latin1, but it
 really doesn't matter that much so whatever, blue and purple bikesheds
 are both fine.

 The tricky bit here is "just" :-). Do you want to implement this? Do
 you know someone who does? It's possible but will be somewhat
 annoying, since to do it directly without refactoring how dtypes work
 first then you'll have to add lots of copy-paste code to all the
 different ufuncs.


I would be happy to have a go at this, with the caveat that someone who
understands numpy would need to get me started with a minimal prototype.
From there I can do the annoying copy-paste for ufuncs etc, writing tests
and docs.  I'm assuming that with a prototype then the rest can be done
without any deep understanding of numpy internals (which I do not have).

- Tom



 -n

 --
 Nathaniel J. Smith -- http://vorpus.org



Re: [Numpy-discussion] One-byte string dtype: third time's the charm?

2015-02-22 Thread Charles R Harris
On Sun, Feb 22, 2015 at 3:40 PM, Aldcroft, Thomas 
aldcr...@head.cfa.harvard.edu wrote:



 On Sun, Feb 22, 2015 at 2:52 PM, Nathaniel Smith n...@pobox.com wrote:

 On Sun, Feb 22, 2015 at 10:21 AM, Aldcroft, Thomas
 aldcr...@head.cfa.harvard.edu wrote:
  The idea of a one-byte string dtype has been extensively discussed twice
  before, with a lot of good input and ideas, but no action [1, 2].
 
  tl;dr: Perfect is the enemy of good.  Can numpy just add a one-byte
 string
  dtype named 's' that uses latin-1 encoding as a bridge to enable Python
 3
  usage in the near term?

 I think this is a good idea. I think overall it would be good for
 numpy to switch to using variable-length strings in most cases (cf.
 pandas), which is a different kind of change, but fixed-length 8-bit
 encoded text is obviously a common on-disk format in scientific
 applications, so numpy will still need some way to deal with it
 conveniently. In the long run we'd like to have more flexibility (e.g.
 allowing choice of character encoding), but since this proposal is a
 subset of that functionality, then it won't interfere with later
 improvements. I can see an argument for utf8 over latin1, but it
 really doesn't matter that much so whatever, blue and purple bikesheds
 are both fine.

 The tricky bit here is "just" :-). Do you want to implement this? Do
 you know someone who does? It's possible but will be somewhat
 annoying, since to do it directly without refactoring how dtypes work
 first then you'll have to add lots of copy-paste code to all the
 different ufuncs.


 I would be happy to have a go at this, with the caveat that someone who
 understands numpy would need to get me started with a minimal prototype.
 From there I can do the annoying copy-paste for ufuncs etc, writing tests
 and docs.  I'm assuming that with a prototype then the rest can be done
 without any deep understanding of numpy internals (which I do not have).

 - Tom



The last two new types added to numpy were float16 and datetime64. Might be
worth looking at the steps needed to implement those. There was also a user
type, `rational` that got added, that could also provide a template. Maybe
we need to have a way to add 'numpy certified' user data types. It might
also be possible to reuse the `c` data type, currently implemented as `S1`
IIRC, but that could cause some problems.

Chuck


Re: [Numpy-discussion] One-byte string dtype: third time's the charm?

2015-02-22 Thread Nathaniel Smith
On Sun, Feb 22, 2015 at 2:42 PM, Charles R Harris
charlesr.har...@gmail.com wrote:

 On Sun, Feb 22, 2015 at 12:52 PM, Nathaniel Smith n...@pobox.com wrote:

 On Sun, Feb 22, 2015 at 10:21 AM, Aldcroft, Thomas
 aldcr...@head.cfa.harvard.edu wrote:
  The idea of a one-byte string dtype has been extensively discussed twice
  before, with a lot of good input and ideas, but no action [1, 2].
 
  tl;dr: Perfect is the enemy of good.  Can numpy just add a one-byte
  string
  dtype named 's' that uses latin-1 encoding as a bridge to enable Python
  3
  usage in the near term?

 I think this is a good idea. I think overall it would be good for
 numpy to switch to using variable-length strings in most cases (cf.
 pandas), which is a different kind of change, but fixed-length 8-bit
 encoded text is obviously a common on-disk format in scientific
 applications, so numpy will still need some way to deal with it
 conveniently. In the long run we'd like to have more flexibility (e.g.
 allowing choice of character encoding), but since this proposal is a
 subset of that functionality, then it won't interfere with later
 improvements. I can see an argument for utf8 over latin1, but it
 really doesn't matter that much so whatever, blue and purple bikesheds
 are both fine.

 The tricky bit here is "just" :-). Do you want to implement this? Do
 you know someone who does? It's possible but will be somewhat
 annoying, since to do it directly without refactoring how dtypes work
 first then you'll have to add lots of copy-paste code to all the
 different ufuncs.

 We're also running out of letters for types. We need to decide on how to
 extend that representation. It would seem straightforward to just start
 using multiple letters, but there is a lot of code that uses things like `for
 dt in 'efdg':`. Can we perhaps introduce an extended dtype structure, maybe
 with some ideas from dynd and versioning?

I don't mind using 's' for this particular case, but in general I
think we should de-emphasise the string representations, and even
allow new dtypes to forgo them entirely. We have all of Python to work
with. It's much nicer for users and for us to write things like

dtype=np.someclass(special_option=True)

instead of

dtype="SC[S_O=T]"

or whatever weird ad-hoc syntax we come up with.

(Obviously there are some details to work out with things like the
.npy format, but these seem solveable.)
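For context, a brief sketch of how the one-letter codes and Python objects already coexist in numpy. It uses only existing calls; `datetime64[ms]` is picked as an example of a dtype whose parameter is embedded in its string form.

```python
import numpy as np

# The kind of loop over one-letter codes mentioned in the thread.
float_dtypes = [np.dtype(code) for code in "efdg"]
names = [dt.name for dt in float_dtypes]

# The same dtype can be requested through a Python type instead of a letter.
same = np.dtype("e") == np.dtype(np.float16)

# datetime64 carries its parameter inside the string form.
ms = np.dtype("datetime64[ms]")
unit, count = np.datetime_data(ms)

print(names[:3], same, unit, count)
```

The name of the last entry (`'g'`, long double) depends on the platform, so only the first three names are printed.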

-n

-- 
Nathaniel J. Smith -- http://vorpus.org


Re: [Numpy-discussion] One-byte string dtype: third time's the charm?

2015-02-22 Thread Nathaniel Smith
On Sun, Feb 22, 2015 at 10:21 AM, Aldcroft, Thomas
aldcr...@head.cfa.harvard.edu wrote:
 The idea of a one-byte string dtype has been extensively discussed twice
 before, with a lot of good input and ideas, but no action [1, 2].

 tl;dr: Perfect is the enemy of good.  Can numpy just add a one-byte string
 dtype named 's' that uses latin-1 encoding as a bridge to enable Python 3
 usage in the near term?

I think this is a good idea. I think overall it would be good for
numpy to switch to using variable-length strings in most cases (cf.
pandas), which is a different kind of change, but fixed-length 8-bit
encoded text is obviously a common on-disk format in scientific
applications, so numpy will still need some way to deal with it
conveniently. In the long run we'd like to have more flexibility (e.g.
allowing choice of character encoding), but since this proposal is a
subset of that functionality, then it won't interfere with later
improvements. I can see an argument for utf8 over latin1, but it
really doesn't matter that much so whatever, blue and purple bikesheds
are both fine.

The tricky bit here is "just" :-). Do you want to implement this? Do
you know someone who does? It's possible but will be somewhat
annoying, since to do it directly without refactoring how dtypes work
first then you'll have to add lots of copy-paste code to all the
different ufuncs.

-n

-- 
Nathaniel J. Smith -- http://vorpus.org


Re: [Numpy-discussion] One-byte string dtype: third time's the charm?

2015-02-22 Thread Sturla Molden
On 22/02/15 20:57, Nathaniel Smith wrote:

 This is a discussion about how strings are represented as bit-patterns
 inside ndarrays; the internal storage representation used by 'str' is
 irrelevant.

I thought it would be clever to just use the same internal 
representation as Python would choose. But obviously it is not. UTF-8 
would fail because it is not regularly stored. And every string in an 
ndarray will need to have the same encoding, but Python might think 
otherwise.
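The point about UTF-8 not being regularly stored can be shown with a short sketch. The word and the field width are made up for the example. A four-character word fits a four-byte field in latin-1 but not in UTF-8, and numpy truncates the extra byte silently.

```python
import numpy as np

word = "caf\u00e9"                 # four characters
latin1 = word.encode("latin-1")    # one byte per character
utf8 = word.encode("utf-8")        # the accented letter takes two bytes

# A fixed-width field sized for four characters cuts the UTF-8 bytes
# in the middle of the last character.
field = np.array([utf8], dtype="S4")
truncated = field[0]

try:
    truncated.decode("utf-8")
    decodable = True
except UnicodeDecodeError:
    decodable = False

print(len(latin1), len(utf8), truncated, decodable)
```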

Sturla




Re: [Numpy-discussion] One-byte string dtype: third time's the charm?

2015-02-22 Thread Charles R Harris
On Sun, Feb 22, 2015 at 12:52 PM, Nathaniel Smith n...@pobox.com wrote:

 On Sun, Feb 22, 2015 at 10:21 AM, Aldcroft, Thomas
 aldcr...@head.cfa.harvard.edu wrote:
  The idea of a one-byte string dtype has been extensively discussed twice
  before, with a lot of good input and ideas, but no action [1, 2].
 
  tl;dr: Perfect is the enemy of good.  Can numpy just add a one-byte
 string
  dtype named 's' that uses latin-1 encoding as a bridge to enable Python 3
  usage in the near term?

 I think this is a good idea. I think overall it would be good for
 numpy to switch to using variable-length strings in most cases (cf.
 pandas), which is a different kind of change, but fixed-length 8-bit
 encoded text is obviously a common on-disk format in scientific
 applications, so numpy will still need some way to deal with it
 conveniently. In the long run we'd like to have more flexibility (e.g.
 allowing choice of character encoding), but since this proposal is a
 subset of that functionality, then it won't interfere with later
 improvements. I can see an argument for utf8 over latin1, but it
 really doesn't matter that much so whatever, blue and purple bikesheds
 are both fine.

 The tricky bit here is "just" :-). Do you want to implement this? Do
 you know someone who does? It's possible but will be somewhat
 annoying, since to do it directly without refactoring how dtypes work
 first then you'll have to add lots of copy-paste code to all the
 different ufuncs.


We're also running out of letters for types. We need to decide on how to
extend that representation. It would seem straightforward to just start
using multiple letters, but there is a lot of code that uses things like
`for dt in 'efdg':`. Can we perhaps introduce an extended dtype structure,
maybe with some ideas from dynd and versioning?

Chuck


Re: [Numpy-discussion] One-byte string dtype: third time's the charm?

2015-02-22 Thread Aldcroft, Thomas
On Sun, Feb 22, 2015 at 5:46 PM, Charles R Harris charlesr.har...@gmail.com
 wrote:



 On Sun, Feb 22, 2015 at 3:40 PM, Aldcroft, Thomas 
 aldcr...@head.cfa.harvard.edu wrote:



 On Sun, Feb 22, 2015 at 2:52 PM, Nathaniel Smith n...@pobox.com wrote:

 On Sun, Feb 22, 2015 at 10:21 AM, Aldcroft, Thomas
 aldcr...@head.cfa.harvard.edu wrote:
  The idea of a one-byte string dtype has been extensively discussed
 twice
  before, with a lot of good input and ideas, but no action [1, 2].
 
  tl;dr: Perfect is the enemy of good.  Can numpy just add a one-byte
 string
  dtype named 's' that uses latin-1 encoding as a bridge to enable
 Python 3
  usage in the near term?

 I think this is a good idea. I think overall it would be good for
 numpy to switch to using variable-length strings in most cases (cf.
 pandas), which is a different kind of change, but fixed-length 8-bit
 encoded text is obviously a common on-disk format in scientific
 applications, so numpy will still need some way to deal with it
 conveniently. In the long run we'd like to have more flexibility (e.g.
 allowing choice of character encoding), but since this proposal is a
 subset of that functionality, then it won't interfere with later
 improvements. I can see an argument for utf8 over latin1, but it
 really doesn't matter that much so whatever, blue and purple bikesheds
 are both fine.

 The tricky bit here is "just" :-). Do you want to implement this? Do
 you know someone who does? It's possible but will be somewhat
 annoying, since to do it directly without refactoring how dtypes work
 first then you'll have to add lots of copy-paste code to all the
 different ufuncs.


 I would be happy to have a go at this, with the caveat that someone who
 understands numpy would need to get me started with a minimal prototype.
 From there I can do the annoying copy-paste for ufuncs etc, writing tests
 and docs.  I'm assuming that with a prototype then the rest can be done
 without any deep understanding of numpy internals (which I do not have).

 - Tom



 The last two new types added to numpy were float16 and datetime64. Might
 be worth looking at the steps needed to implement those. There was also a
 user type, `rational` that got added, that could also provide a template.
 Maybe we need to have a way to add 'numpy certified' user data types. It
 might also be possible to reuse the `c` data type, currently implemented as
 `S1` IIRC, but that could cause some problems.


OK I'll have a look at those.

Thanks,
Tom



 Chuck






[Numpy-discussion] np.nonzero behavior with multidimensional arrays

2015-02-22 Thread Jaime Fernández del Río
This was raised in SO today:

http://stackoverflow.com/questions/28663142/why-is-np-wheres-result-read-only-for-multi-dimensional-arrays/28664009

np.nonzero (and np.where for boolean arrays) behave differently for 1-D and
higher dimensional arrays:

In the first case, a tuple with a single well-behaved base ndarray is returned:

>>> a = np.ma.array(range(6))
>>> np.where(a > 3)
(array([4, 5]),)
>>> np.where(a > 3)[0].flags
  C_CONTIGUOUS : True
  F_CONTIGUOUS : True
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  UPDATEIFCOPY : False

In the second, a tuple with as many arrays as dimensions in the passed
array is returned, but the arrays are not base ndarrays, but of the same
subtype as was passed to the function. These arrays are also set as
non-writeable:

>>> np.where(a.reshape(2, 3) > 3)
(masked_array(data = [1 1],
 mask = False,
   fill_value = 99)
, masked_array(data = [1 2],
 mask = False,
   fill_value = 99)
)
>>> np.where(a.reshape(2, 3) > 3)[0].flags
  C_CONTIGUOUS : False
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : False
  ALIGNED : True
  UPDATEIFCOPY : False

I can't think of any reason that justifies this difference, and believe
they should be made to return similar results. My feeling is that the
proper behavior is the 1-D one, and that the behavior for multidimensional
arrays should match it. Can anyone think of a reason that justifies the
current behavior?
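As a possible workaround until the behavior is unified, each returned index array can be copied into a base ndarray with `np.array`, which by default copies and drops the subclass. This is only a sketch of that idea; what `np.where` itself returns for masked input may differ between numpy versions.

```python
import numpy as np

# Plain ndarray input: one index array per dimension.
a = np.arange(6).reshape(2, 3)
rows, cols = np.nonzero(a > 3)

# Masked input: copy each result into a base ndarray that owns its data.
m = np.ma.array(np.arange(6)).reshape(2, 3)
plain = [np.array(idx) for idx in np.where(m > 3)]

print(rows.tolist(), cols.tolist())
print([type(idx).__name__ for idx in plain],
      [idx.flags.writeable for idx in plain])
```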

Jaime

-- 
(\__/)
( O.o)
(  ) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes
de dominación mundial.