Re: [Numpy-discussion] String type again.

2014-07-18 Thread Chris Barker
On Wed, Jul 16, 2014 at 3:48 AM, Todd toddr...@gmail.com wrote: On Jul 16, 2014 11:43 AM, Chris Barker chris.bar...@noaa.gov wrote: So numpy should have dtypes to match these. We're a bit stuck, however, because 'S' mapped to the py2 string type, which no longer exists in py3. Sorry not

Re: [Numpy-discussion] String type again.

2014-07-18 Thread Nathaniel Smith
On Tue, Jul 15, 2014 at 7:40 PM, Aldcroft, Thomas aldcr...@head.cfa.harvard.edu wrote: On Sat, Jul 12, 2014 at 8:02 PM, Nathaniel Smith n...@pobox.com wrote: OTOH, fixed length nul padded latin1 would be useful for various flat file reading tasks. As one of the original agitators for this,
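The flat-file use case mentioned in this message can be sketched with the standard library alone. This is a minimal illustration, not numpy code: the record layout (an 8-byte name followed by a 4-byte code) and the sample bytes are assumptions made up for the example.

```python
import struct

# Hypothetical fixed-width record: 8-byte name field, 4-byte code field,
# both NUL-padded and encoded as Latin-1 (layout assumed for illustration).
RECORD = struct.Struct("8s4s")

def read_record(raw):
    """Split one record into fields, strip NUL padding, decode as Latin-1."""
    return tuple(
        field.rstrip(b"\x00").decode("latin-1")
        for field in RECORD.unpack(raw)
    )

raw = b"Z\xfcrich\x00\x00CH\x00\x00"  # 12 bytes: 8 for the name, 4 for the code
print(read_record(raw))
```

Because Latin-1 assigns a character to every byte value, the decode step cannot fail, whatever the file contains.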

Re: [Numpy-discussion] String type again.

2014-07-18 Thread Nathaniel Smith
On Thu, Jul 17, 2014 at 10:05 PM, Chris Barker chris.bar...@noaa.gov wrote: A bit of a higher-level view of the issues at hand. Python has three relevant data types: a unicode type (unicode in py2, str in py3); a one-byte-per-char string type (py2 string); a bytes type. The big problem is

Re: [Numpy-discussion] String type again.

2014-07-18 Thread Aldcroft, Thomas
On Tue, Jul 15, 2014 at 11:15 AM, Charles R Harris charlesr.har...@gmail.com wrote: On Tue, Jul 15, 2014 at 5:26 AM, Sebastian Berg sebast...@sipsolutions.net wrote: On Sa, 2014-07-12 at 12:17 -0500, Charles R Harris wrote: As previous posts have pointed out, Numpy's `S` type is

Re: [Numpy-discussion] String type again.

2014-07-18 Thread Aldcroft, Thomas
On Thu, Jul 17, 2014 at 11:52 AM, Nathaniel Smith n...@pobox.com wrote: On Tue, Jul 15, 2014 at 7:40 PM, Aldcroft, Thomas aldcr...@head.cfa.harvard.edu wrote: On Sat, Jul 12, 2014 at 8:02 PM, Nathaniel Smith n...@pobox.com wrote: OTOH, fixed length nul padded latin1 would be useful for

Re: [Numpy-discussion] String type again.

2014-07-18 Thread Julian Taylor
On Thu, Jul 17, 2014 at 5:48 PM, Nathaniel Smith n...@pobox.com wrote: On Tue, Jul 15, 2014 at 4:29 PM, Charles R Harris charlesr.har...@gmail.com wrote: Thinking more about it, the easiest thing to do might be to make the S dtype a UTF-8 encoding. Most of the machinery to deal with that is

Re: [Numpy-discussion] String type again.

2014-07-18 Thread Nathaniel Smith
On Tue, Jul 15, 2014 at 4:29 PM, Charles R Harris charlesr.har...@gmail.com wrote: Thinking more about it, the easiest thing to do might be to make the S dtype a UTF-8 encoding. Most of the machinery to deal with that is already in place. That change might affect some users though, and we might

Re: [Numpy-discussion] String type again.

2014-07-18 Thread Andrew Collette
Hi Chris, A Latin-1 based 'a' type would have similar problems. Maybe not -- latin1 is fixed width. Yes, Latin-1 is fixed width, but the issue is that when writing to a fixed-width UTF8 string in HDF5, it will expand, possibly losing data. What I would like to avoid is a situation where a
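The expansion problem described here can be illustrated without HDF5 at all. The sketch below is only an assumption about how a fixed-width UTF-8 field might be filled; `fit_utf8` is a hypothetical helper, not an h5py or HDF5 function.

```python
def fit_utf8(text, width):
    """Encode text as UTF-8 into exactly `width` bytes, NUL-padded.

    If the encoded form is too long it is cut at the last complete
    character that fits, so data can be lost.
    """
    encoded = text.encode("utf-8")
    if len(encoded) > width:
        # Dropping the incomplete trailing sequence keeps the bytes valid UTF-8.
        encoded = encoded[:width].decode("utf-8", errors="ignore").encode("utf-8")
    return encoded.ljust(width, b"\x00")

word = "r\u00e9sum\u00e9"                  # 6 characters, 6 bytes in Latin-1
print(len(word.encode("latin-1")))         # width needed as Latin-1
print(len(word.encode("utf-8")))           # width needed as UTF-8
print(fit_utf8(word, 6))                   # the final character no longer fits
```

A string that exactly fills a 6-byte Latin-1 field needs 8 bytes as UTF-8, so writing it into a 6-byte UTF-8 field drops the last character.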

Re: [Numpy-discussion] String type again.

2014-07-18 Thread Chris Barker
On Fri, Jul 18, 2014 at 9:07 AM, Pauli Virtanen p...@iki.fi wrote: Another approach would be to add a new 1-byte unicode you can't do unicode in 1-byte -- so what does this mean, exactly? This also is not perfect, since array(['foo']) on Py2 should for backward compatibility continue

Re: [Numpy-discussion] String type again.

2014-07-18 Thread Chris Barker
On Fri, Jul 18, 2014 at 9:32 AM, Andrew Collette andrew.colle...@gmail.com wrote: A Latin-1 based 'a' type would have similar problems. Maybe not -- latin1 is fixed width. Yes, Latin-1 is fixed width, but the issue is that when writing to a fixed-width UTF8 string in HDF5, it will

Re: [Numpy-discussion] String type again.

2014-07-18 Thread Nathaniel Smith
On Fri, Jul 18, 2014 at 5:54 PM, Chris Barker chris.bar...@noaa.gov wrote: This is why I see no downside to latin-1 -- if you don't use the 127 code points, it's the same thing -- if you do, you get some extra handy characters. The only difference is that a proper ascii type would not let

Re: [Numpy-discussion] String type again.

2014-07-18 Thread Pauli Virtanen
On 18.07.2014 19:33, Chris Barker wrote: On Fri, Jul 18, 2014 at 9:07 AM, Pauli Virtanen p...@iki.fi wrote: Another approach would be to add a new 1-byte unicode you can't do unicode in 1-byte -- so what does this mean, exactly? The first 256 unicode code points, which happen to coincide
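The coincidence between the first 256 Unicode code points and Latin-1 is easy to check in plain Python. This is a small verification sketch rather than anything proposed in the thread; the variable names are arbitrary.

```python
# Every possible byte value, in order.
all_bytes = bytes(range(256))

# Decoding as Latin-1 maps byte value b to the code point with the same number.
decoded = all_bytes.decode("latin-1")
same_numbers = all(ord(ch) == b for ch, b in zip(decoded, all_bytes))

# Encoding back recovers the original bytes exactly.
round_trips = decoded.encode("latin-1") == all_bytes

print(same_numbers, round_trips)
```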

Re: [Numpy-discussion] String type again.

2014-07-18 Thread Andrew Collette
Hi Chris, Again, they shouldn't do that, they should be pushing a 10-character string into something -- and utf-8 is going to (possibly) truncate that. That's an HDF/utf-8 limitation that people are going to have to deal with. I think you're suggesting that numpy follow the HDF model, so that

Re: [Numpy-discussion] String type again.

2014-07-18 Thread Charles R Harris
On Fri, Jul 18, 2014 at 10:59 AM, Nathaniel Smith n...@pobox.com wrote: On Fri, Jul 18, 2014 at 5:54 PM, Chris Barker chris.bar...@noaa.gov wrote: This is why I see no downside to latin-1 -- if you don't use the 127 code points, it's the same thing -- if you do, you get some extra handy

Re: [Numpy-discussion] String type again.

2014-07-18 Thread Chris Barker
On Fri, Jul 18, 2014 at 9:59 AM, Nathaniel Smith n...@pobox.com wrote: IMO the extra characters aren't the most compelling argument for latin1 over ascii. Latin1 gives the nice assurance that if some jerk *does* give me an ascii file that somewhere has some byte with the 8th bit set, then I
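The difference between a strict ASCII decode and a Latin-1 decode when a stray high-bit byte turns up can be sketched as follows. The sample bytes are invented for illustration; only standard codecs are used.

```python
raw = b"temp\xb0C"  # nominally ASCII data containing one byte with the 8th bit set

try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    # A strict ASCII decoder rejects any byte above 0x7F.
    ascii_ok = False

text = raw.decode("latin-1")          # never raises: every byte has a code point
unchanged = text.encode("latin-1") == raw

print(ascii_ok, repr(text), unchanged)
```

The Latin-1 path may show an unexpected character, but the original bytes survive a read and write cycle unmodified.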

Re: [Numpy-discussion] String type again.

2014-07-18 Thread Chris Barker
On Fri, Jul 18, 2014 at 9:59 AM, Nathaniel Smith n...@pobox.com wrote: IMO the extra characters aren't the most compelling argument for latin1 over ascii. Latin1 gives the nice assurance that if some jerk *does* give me an ascii file that somewhere has some byte with the 8th bit set, then I

Re: [Numpy-discussion] String type again.

2014-07-18 Thread Andrew Collette
Hi Chris, What it would do is push the problem from the HDF5-numpy interface to the python-numpy interface. I'm not sure that's a good trade off. Maybe I'm being too paranoid about the truncation issue. We already perform truncation when going from e.g. vlen to fixed-width strings in

Re: [Numpy-discussion] String type again.

2014-07-18 Thread Chris Barker
On Fri, Jul 18, 2014 at 12:52 PM, Andrew Collette andrew.colle...@gmail.com wrote: What it would do is push the problem from the HDF5-numpy interface to the python-numpy interface. I'm not sure that's a good trade off. Maybe I'm being too paranoid about the truncation issue.

Re: [Numpy-discussion] String type again.

2014-07-18 Thread Andrew Collette
Hi Chris, Actually, I agree about the truncation issue, but it's a question of where to put it -- I'm suggesting that I don't want it at the python-numpy interface. Yes, that's a good point. Of course, by using Latin-1 rather than UTF-8 we can't support all Unicode code points (hence the ?

Re: [Numpy-discussion] String type again.

2014-07-18 Thread Charles R Harris
On Fri, Jul 18, 2014 at 3:30 PM, Andrew Collette andrew.colle...@gmail.com wrote: Hi Chris, Actually, I agree about the truncation issue, but it's a question of where to put it -- I'm suggesting that I don't want it at the python-numpy interface. Yes, that's a good point. Of course,

Re: [Numpy-discussion] String type again.

2014-07-17 Thread Joseph Martinot-Lagarde
On 15/07/2014 18:18, Chris Barker wrote: (or does HDF support var-length elements?) It does: http://www.hdfgroup.org/HDF5/doc/TechNotes/VLTypes.html

Re: [Numpy-discussion] String type again.

2014-07-17 Thread Andrew Collette
Hi, good argument for ASCII, but utf-8 is a bad idea, as there is no 1:1 correspondence between length of string in bytes and length in characters -- as numpy needs to pre-allocate a defined number of bytes for a dtype, there is a disconnect between the user and numpy as to how long a string is
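The mismatch between character count and byte count under UTF-8 can be shown directly. The sample strings are arbitrary, and `utf8_width` is a hypothetical helper name used only for this sketch.

```python
def utf8_width(text):
    """Number of bytes needed to store text as UTF-8."""
    return len(text.encode("utf-8"))

samples = ["abc", "n\u00e9", "\u65e5\u672c"]  # ASCII, one accented letter, two CJK characters
for s in samples:
    # Character count and byte count agree only for pure ASCII.
    print(repr(s), len(s), utf8_width(s))
```

A dtype that reserves a fixed number of bytes therefore cannot promise a fixed number of characters, which is the disconnect the message describes.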

Re: [Numpy-discussion] String type again.

2014-07-17 Thread Charles R Harris
On Tue, Jul 15, 2014 at 5:26 AM, Sebastian Berg sebast...@sipsolutions.net wrote: On Sa, 2014-07-12 at 12:17 -0500, Charles R Harris wrote: As previous posts have pointed out, Numpy's `S` type is currently treated as a byte string, which leads to more complicated code in python3. OTOH, the

Re: [Numpy-discussion] String type again.

2014-07-17 Thread Todd
On Jul 16, 2014 11:43 AM, Chris Barker chris.bar...@noaa.gov wrote: So numpy should have dtypes to match these. We're a bit stuck, however, because 'S' mapped to the py2 string type, which no longer exists in py3. Sorry not running py3 to see what 'S' does now, but I know it's a bit broken, and may

Re: [Numpy-discussion] String type again.

2014-07-17 Thread Aldcroft, Thomas
On Wed, Jul 16, 2014 at 6:48 AM, Todd toddr...@gmail.com wrote: On Jul 16, 2014 11:43 AM, Chris Barker chris.bar...@noaa.gov wrote: So numpy should have dtypes to match these. We're a bit stuck, however, because 'S' mapped to the py2 string type, which no longer exists in py3. Sorry not

Re: [Numpy-discussion] String type again.

2014-07-17 Thread Chris Barker
On Tue, Jul 15, 2014 at 4:26 AM, Sebastian Berg sebast...@sipsolutions.net wrote: Just wondering, couldn't we have a type which actually has an (arbitrary, python supported) encoding (and bytes might even just be a special case of no encoding)? well, then we're back to the core issue here:

Re: [Numpy-discussion] String type again.

2014-07-17 Thread Stephan Hoyer
On Mon, Jul 14, 2014 at 10:00 AM, Olivier Grisel olivier.gri...@ensta.org wrote: 2014-07-13 19:05 GMT+02:00 Alexander Belopolsky ndar...@mac.com: I've been toying with the idea of creating an array type for interned strings. In many applications dealing with large arrays of variable size

Re: [Numpy-discussion] String type again.

2014-07-16 Thread Aldcroft, Thomas
On Sat, Jul 12, 2014 at 8:02 PM, Nathaniel Smith n...@pobox.com wrote: On 12 Jul 2014 23:06, Charles R Harris charlesr.har...@gmail.com wrote: As previous posts have pointed out, Numpy's `S` type is currently treated as a byte string, which leads to more complicated code in python3. OTOH,

Re: [Numpy-discussion] String type again.

2014-07-16 Thread Chris Barker
On Mon, Jul 14, 2014 at 10:39 AM, Andrew Collette andrew.colle...@gmail.com wrote: For storing data in HDF5 (PyTables or h5py), it would be somewhat cleaner if either ASCII or UTF-8 are used, as these are the only two charsets officially supported by the library. good argument for ASCII,

Re: [Numpy-discussion] String type again.

2014-07-16 Thread Jeff Reback
In 0.15.0 pandas will have full-fledged support for categoricals, which in effect allow you to map a smaller number of strings to integers. This is now in pandas master: http://pandas-docs.github.io/pandas-docs-travis/categorical.html Feedback welcome! On Jul 14, 2014, at 1:00 PM, Olivier Grisel
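The general idea of mapping a small set of distinct strings to integers can be sketched in a few lines. This is not how pandas implements categoricals; `encode_categories` is a hypothetical helper written only to illustrate the concept.

```python
def encode_categories(values):
    """Return (codes, categories): one integer code per value plus the lookup table."""
    categories = []   # distinct strings, in order of first appearance
    index = {}        # string -> its position in `categories`
    codes = []
    for value in values:
        if value not in index:
            index[value] = len(categories)
            categories.append(value)
        codes.append(index[value])
    return codes, categories

codes, categories = encode_categories(["red", "blue", "red", "red", "blue"])
print(codes, categories)

# The original strings can be recovered from the codes and the table.
restored = [categories[c] for c in codes]
```

Each distinct string is stored once, and the array itself holds only small integers.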

Re: [Numpy-discussion] String type again.

2014-07-16 Thread Charles R Harris
On Tue, Jul 15, 2014 at 9:15 AM, Charles R Harris charlesr.har...@gmail.com wrote: On Tue, Jul 15, 2014 at 5:26 AM, Sebastian Berg sebast...@sipsolutions.net wrote: On Sa, 2014-07-12 at 12:17 -0500, Charles R Harris wrote: As previous posts have pointed out, Numpy's `S` type is

Re: [Numpy-discussion] String type again.

2014-07-16 Thread Chris Barker - NOAA Federal
But HDF5 additionally has a fixed-storage-width UTF8 type, so we could map to a NumPy fixed-storage-width type trivially. Sure -- this is why *nix uses utf-8 for filenames -- it can just be a char*. But that just punts the problem to client code. I think a UTF-8 string type does not match the

Re: [Numpy-discussion] String type again.

2014-07-15 Thread Andrew Collette
Hi Chuck, This note proposes to adapt the currently existing 'a' type letter, currently aliased to 'S', as a new fixed encoding dtype. Python 3.3 introduced two one byte internal representations for unicode strings, ascii and latin1. Ascii has the advantage that it is a subset of UTF-8,

Re: [Numpy-discussion] String type again.

2014-07-15 Thread Chris Barker
On Sat, Jul 12, 2014 at 10:17 AM, Charles R Harris charlesr.har...@gmail.com wrote: As previous posts have pointed out, Numpy's `S` type is currently treated as a byte string, which leads to more complicated code in python3. Also, a byte string in py3 is not, in fact the same as the py2

Re: [Numpy-discussion] String type again.

2014-07-15 Thread Olivier Grisel
2014-07-13 19:05 GMT+02:00 Alexander Belopolsky ndar...@mac.com: On Sat, Jul 12, 2014 at 8:02 PM, Nathaniel Smith n...@pobox.com wrote: I feel like for most purposes, what we *really* want is a variable length string dtype (I.e., where each element can be a different length.). I've been

Re: [Numpy-discussion] String type again.

2014-07-15 Thread Sebastian Berg
On Sa, 2014-07-12 at 12:17 -0500, Charles R Harris wrote: As previous posts have pointed out, Numpy's `S` type is currently treated as a byte string, which leads to more complicated code in python3. OTOH, the unicode type is stored as UCS4, which consumes a lot of space, especially for ascii

Re: [Numpy-discussion] String type again.

2014-07-14 Thread Alexander Belopolsky
On Sat, Jul 12, 2014 at 8:02 PM, Nathaniel Smith n...@pobox.com wrote: I feel like for most purposes, what we *really* want is a variable length string dtype (I.e., where each element can be a different length.). I've been toying with the idea of creating an array type for interned strings.

Re: [Numpy-discussion] String type again.

2014-07-13 Thread Nathaniel Smith
On 12 Jul 2014 23:06, Charles R Harris charlesr.har...@gmail.com wrote: As previous posts have pointed out, Numpy's `S` type is currently treated as a byte string, which leads to more complicated code in python3. OTOH, the unicode type is stored as UCS4, which consumes a lot of space, especially

[Numpy-discussion] String type again.

2014-07-12 Thread Charles R Harris
As previous posts have pointed out, Numpy's `S` type is currently treated as a byte string, which leads to more complicated code in python3. OTOH, the unicode type is stored as UCS4, which consumes a lot of space, especially for ascii strings. This note proposes to adapt the currently existing 'a'
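The space cost of UCS4 for ASCII-only text can be estimated with the standard codecs. This sketch does not use numpy; it assumes that UTF-32 without a byte-order mark is a fair stand-in for a UCS4 layout, and the sample string is arbitrary.

```python
text = "numpy"

ucs4_bytes = text.encode("utf-32-le")   # 4 bytes per code point, no BOM
one_byte = text.encode("latin-1")       # 1 byte per code point

print(len(ucs4_bytes), len(one_byte))   # storage needed for the same 5 characters
```

For text restricted to the first 256 code points, a one-byte-per-character encoding needs a quarter of the storage.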