Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Robert Kern
On Wed, Apr 26, 2017 at 4:49 PM, Nathaniel Smith wrote: > > On Apr 26, 2017 12:09 PM, "Robert Kern" wrote: >> It's worthwhile enough that both major HDF5 bindings don't support Unicode arrays, despite user requests for years. The sticking point seems to be

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Chris Barker
On Wed, Apr 26, 2017 at 4:30 PM, Stephan Hoyer wrote: > > Sorry, I remain unconvinced (for the reasons that Robert, Nathaniel and > myself have already given), but we seem to be talking past each other here. > yeah -- I think it's not clear what the use cases we are talking

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Chris Barker
On Wed, Apr 26, 2017 at 10:45 AM, Robert Kern wrote: > >>> > The maximum length of an UTF-8 character is 4 bytes, so we could use > that to size arrays by character length. The advantage over UTF-32 is that > it is easily compressible, probably by a factor of 4 in many
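The sizing argument quoted above can be sketched roughly as follows. This is only an illustration under stated assumptions: `to_fixed_utf8` and `N_CHARS` are hypothetical names, not anything proposed in the thread, and zero-padding to 4 bytes per character is one possible reading of "size arrays by character length".

```python
N_CHARS = 10  # hypothetical field width, in characters


def to_fixed_utf8(text: str, n_chars: int = N_CHARS) -> bytes:
    """Encode `text` as UTF-8 into a zero-padded field of 4 * n_chars bytes.

    A code point never needs more than 4 bytes in UTF-8, so any string of
    at most n_chars code points is guaranteed to fit.
    """
    if len(text) > n_chars:
        raise ValueError("text has more code points than the field allows")
    return text.encode("utf-8").ljust(4 * n_chars, b"\x00")


# The highest code point, U+10FFFF, takes exactly 4 bytes in UTF-8.
widest = chr(0x10FFFF).encode("utf-8")

# Mostly-ASCII text leaves most of the field as zero padding, which is
# the part a general-purpose compressor would be expected to shrink well.
field = to_fixed_utf8("numpy")
payload_bytes = len(field.rstrip(b"\x00"))
```

Here the in-memory field is as wide as the UTF-32 one (40 bytes for 10 characters); any saving would come only from compressing the padding, not from the layout itself.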

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Robert Kern
On Wed, Apr 26, 2017 at 11:38 AM, Sebastian Berg wrote: > I remember talking with a colleague about something like that. And > basically an annoying thing there was that if you strip the zero bytes > in a zero padded string, some encodings (UTF16) may need one of the
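The zero-stripping problem mentioned for UTF-16 can be reproduced with plain Python bytes. A minimal sketch, assuming little-endian UTF-16 and zero-byte padding; `naive_strip` and `strip_code_units` are hypothetical helper names used only for illustration.

```python
def naive_strip(field: bytes) -> bytes:
    """Strip trailing zero bytes one byte at a time."""
    return field.rstrip(b"\x00")


def strip_code_units(field: bytes, unit: int = 2) -> bytes:
    """Strip trailing padding only in whole code units of `unit` bytes."""
    end = len(field)
    while end >= unit and field[end - unit:end] == b"\x00" * unit:
        end -= unit
    return field[:end]


def decodes_cleanly(data: bytes) -> bool:
    """Return True if `data` is valid little-endian UTF-16."""
    try:
        data.decode("utf-16-le")
        return True
    except UnicodeDecodeError:
        return False


# "A" is b"A\x00" in UTF-16-LE: its second byte is itself a zero byte.
field = "A".encode("utf-16-le").ljust(8, b"\x00")

broken = naive_strip(field)        # also eats the high byte of "A"
intact = strip_code_units(field)   # keeps the full 2-byte code unit
```

Byte-wise stripping leaves an odd number of bytes that no longer decodes, while stripping in 2-byte units preserves the character.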

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Robert Kern
On Wed, Apr 26, 2017 at 2:15 AM, Julian Taylor < jtaylor.deb...@googlemail.com> wrote: > Indeed, > Most of this discussion is irrelevant to numpy. > Numpy only really deals with the in memory storage of strings. And in > that it is limited to fixed length strings (in bytes/codepoints). > How you

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Chris Barker - NOAA Federal
> > I DO recommend Latin-1 as a default encoding ONLY for "mostly ascii, with > > a few extra characters" data. With all the sloppiness over the years, there > > are way too many files like that. > > That sloppiness that you mention is precisely the "unknown encoding" problem. Exactly -- but

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Eric Wieser
> I think we can implement viewers for strings as ndarray subclasses. Then one > could > do `my_string_array.view(latin_1)`, and so on. Essentially that just > changes the default > encoding of the 'S' array. That could also work for uint8 arrays if needed. > > Chuck To handle structured

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Aldcroft, Thomas
On Tue, Apr 25, 2017 at 7:11 PM, Chris Barker - NOAA Federal < chris.bar...@noaa.gov> wrote: > > On Apr 25, 2017, at 12:38 PM, Nathaniel Smith wrote: > > > Eh... First, on Windows and MacOS, filenames are natively Unicode. > > Yeah, though once they are stored in a text file --

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Charles R Harris
On Tue, Apr 25, 2017 at 5:50 PM, Robert Kern wrote: > On Tue, Apr 25, 2017 at 3:47 PM, Chris Barker - NOAA Federal < > chris.bar...@noaa.gov> wrote: > > >> Presumably you're getting byte strings (with unknown encoding. > > > > No -- this is for creating and using mostly

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Nathaniel Smith
On Tue, Apr 25, 2017 at 4:11 PM, Chris Barker - NOAA Federal wrote: >> On Apr 25, 2017, at 12:38 PM, Nathaniel Smith wrote: > >> Eh... First, on Windows and MacOS, filenames are natively Unicode. > > Yeah, though once they are stored in a text file -- who

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Chris Barker - NOAA Federal
> On Apr 25, 2017, at 12:38 PM, Nathaniel Smith wrote: > Eh... First, on Windows and MacOS, filenames are natively Unicode. Yeah, though once they are stored in a text file -- who the heck knows? That may be simply unsolvable. > s. And then from in Python, if you want to

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Nathaniel Smith
On Apr 25, 2017 9:35 AM, "Chris Barker" wrote: - filenames File names are one of the key reasons folks struggled with the python3 data model (particularly on *nix) and why 'surrogateescape' was added. It's pretty common to store filenames in with our data, and thus in
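For readers unfamiliar with the 'surrogateescape' error handler mentioned above, here is a small sketch of what it does for a filename whose bytes are not valid UTF-8. The filename is made up for illustration.

```python
# A hypothetical *nix filename containing a byte that is not valid UTF-8.
raw_name = b"data_\xff.txt"

# Strict decoding rejects the stray byte ...
try:
    raw_name.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# ... while surrogateescape smuggles it through as a lone surrogate
# (0xFF becomes U+DCFF), so the str can be carried around in Python.
as_text = raw_name.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_tripped = as_text.encode("utf-8", errors="surrogateescape")
```

The resulting `str` is not valid Unicode text (encoding it strictly fails), which is part of why storing such filenames in a Unicode array dtype is awkward.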

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Charles R Harris
On Tue, Apr 25, 2017 at 12:52 PM, Robert Kern wrote: > On Tue, Apr 25, 2017 at 11:18 AM, Charles R Harris < > charlesr.har...@gmail.com> wrote: > > > > On Tue, Apr 25, 2017 at 11:34 AM, Anne Archibald < > peridot.face...@gmail.com> wrote: > > >> Clearly there is a need for

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Eric Wieser
Chuck: That sounds like something we want to deprecate, for the same reason that python3 no longer allows str(b'123') to do the right thing. Specifically, it seems like astype should always be forbidden to go between unicode and byte arrays - so that would need to be written as: In [1]: a =

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Anne Archibald
On Tue, Apr 25, 2017 at 7:52 PM Phil Hodge wrote: > On 04/25/2017 01:34 PM, Anne Archibald wrote: > > I know they're not numpy-compatible, but FITS header values are > > space-padded; does that occur elsewhere? > > Strings in FITS headers are delimited by single quotes. Some

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Anne Archibald
On Tue, Apr 25, 2017 at 6:36 PM Chris Barker wrote: > > This is essentially my rant about use-case (2): > > A compact dtype for mostly-ascii text: > I'm a little confused about exactly what you're trying to do. Do you need your in-memory format for this data to be

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Ambrose LI
2017-04-25 12:34 GMT-04:00 Chris Barker : > I am totally euro-centric, but as I understand it, that is the whole point > of the desire for a compact one-byte-per character encoding. If there is a > strong need for other 1-byte encodings (shift-JIS, maybe?) then maybe we >

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Robert Kern
On Mon, Apr 24, 2017 at 4:09 PM, Stephan Hoyer wrote: > > On Mon, Apr 24, 2017 at 11:13 AM, Chris Barker wrote: >>> >>> On the other hand, if this is the use-case, perhaps we really want an encoding closer to "Python 2" string, i.e, "unknown", to let this

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Robert Kern
On Mon, Apr 24, 2017 at 10:04 AM, Chris Barker wrote: > > On Fri, Apr 21, 2017 at 2:34 PM, Stephan Hoyer wrote: > >>> In this case, we want something compatible with Python's string (i.e. full Unicode supporting) and I think should be as transparent as

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Aldcroft, Thomas
On Mon, Apr 24, 2017 at 2:47 PM, Robert Kern wrote: > On Mon, Apr 24, 2017 at 10:51 AM, Aldcroft, Thomas < > aldcr...@head.cfa.harvard.edu> wrote: > > > > On Mon, Apr 24, 2017 at 1:04 PM, Chris Barker > wrote: > > >> - round-tripping of binary data

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Robert Kern
On Mon, Apr 24, 2017 at 11:21 AM, Chris Barker wrote: > > On Mon, Apr 24, 2017 at 10:51 AM, Aldcroft, Thomas < aldcr...@head.cfa.harvard.edu> wrote: >>> >>> BTW -- maybe we should keep the pathological use-case in mind: really short strings. I think we are all thinking in

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Stephan Hoyer
On Mon, Apr 24, 2017 at 10:04 AM, Chris Barker wrote: > latin-1 or latin-9 buys you (over ASCII): > > ... > > - round-tripping of binary data (at least with Python's encoding/decoding) > -- ANY string of bytes can be decoded as latin-1 and re-encoded to get the > same
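The round-tripping property claimed for latin-1 is easy to check with Python's own codecs. A minimal sketch:

```python
# Every possible byte value, 0x00 through 0xFF.
all_bytes = bytes(range(256))

# latin-1 maps byte value N to code point N, so decoding never fails ...
text = all_bytes.decode("latin-1")

# ... and re-encoding gives back exactly the bytes we started with.
round_tripped = text.encode("latin-1")

# The same input is not valid UTF-8, so a strict UTF-8 decode raises.
try:
    all_bytes.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
```

Note that this only shows the bytes survive; it says nothing about whether the decoded characters are the ones the author of the data intended.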

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Chris Barker
On Fri, Apr 21, 2017 at 2:34 PM, Stephan Hoyer wrote: > In this case, we want something compatible with Python's string (i.e. full >> Unicode supporting) and I think should be as transparent as possible. >> Python's string has made the decision to present a character oriented

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-21 Thread Stephan Hoyer
On Fri, Apr 21, 2017 at 11:34 AM, Chris Barker wrote: > 1) Use with/from Python -- both creating and working with numpy arrays. > > In this case, we want something compatible with Python's string (i.e. full > Unicode supporting) and I think should be as transparent as

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-21 Thread Chris Barker
I just re-read the "Utf-8" manifesto, and it helped me clarify my thoughts: 1) most of it is focused on utf-8 vs utf-16. And that is a strong argument -- utf-16 is the worst of both worlds. 2) it isn't really addressing how to deal with fixed-size string storage as needed by numpy. It does

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Robert Kern
On Thu, Apr 20, 2017 at 12:51 PM, Stephan Hoyer wrote: > > On Thu, Apr 20, 2017 at 12:17 PM, Robert Kern wrote: >> >> On Thu, Apr 20, 2017 at 12:05 PM, Stephan Hoyer wrote: >> > >> > On Thu, Apr 20, 2017 at 11:53 AM, Robert Kern

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Robert Kern
On Thu, Apr 20, 2017 at 12:17 PM, Anne Archibald wrote: > > On Thu, Apr 20, 2017 at 8:55 PM Robert Kern wrote: >> For example, to my understanding, FITS files more or less follow numpy assumptions for its string columns (i.e. uniform-length).

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Julian Taylor
On 20.04.2017 20:53, Robert Kern wrote: > On Thu, Apr 20, 2017 at 6:15 AM, Julian Taylor > > > wrote: > >> Do you have comments on how to go forward, in particular in regards to >> new dtype vs modify np.unicode? > > Can we

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Anne Archibald
On Thu, Apr 20, 2017 at 8:55 PM Robert Kern wrote: > On Thu, Apr 20, 2017 at 6:15 AM, Julian Taylor < > jtaylor.deb...@googlemail.com> wrote: > > > Do you have comments on how to go forward, in particular in regards to > > new dtype vs modify np.unicode? > > Can we restate

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Eric Wieser
Perhaps `np.encoded_str[encoding]` as the name for the new type, if we decide a new type is necessary? Am I right in thinking that the general problem here is that it's very easy to discard metadata when working with dtypes, and that by adding metadata to `unicode_`, we risk existing code

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Anne Archibald
On Thu, Apr 20, 2017 at 8:17 PM Julian Taylor wrote: > I probably should have formulated my goal with the proposal a bit better, I am > not very interested in a repetition of which encoding to use debate. > In the end what will be done allows any encoding via a dtype with

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Stephan Hoyer
On Thu, Apr 20, 2017 at 11:53 AM, Robert Kern wrote: > I don't know of a format off-hand that works with numpy uniform-length > strings and Unicode as well. HDF5 (to my recollection) supports arrays of > NULL-terminated, uniform-length ASCII like FITS, but only

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Robert Kern
On Thu, Apr 20, 2017 at 6:15 AM, Julian Taylor < jtaylor.deb...@googlemail.com> wrote: > Do you have comments on how to go forward, in particular in regards to > new dtype vs modify np.unicode? Can we restate the use cases explicitly? I feel like we ended up with the current sub-optimal

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Antoine Pitrou
On Thu, 20 Apr 2017 10:26:13 -0700 Stephan Hoyer wrote: > > I agree with Anne here. Variable-length encoding would be great to have, > but even fixed length UTF-8 (in terms of memory usage, not characters) > would solve NumPy's Python 3 string problem. NumPy's memory model

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Eric Wieser
> if you truncate a utf-8 bytestring, you may get invalid data Note that in general truncating unicode codepoints is not a safe operation either, as combining characters are a thing. So I don't think this is a good argument against UTF8. Also, is silent truncation a thing that we want to allow
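Both truncation hazards can be shown in a few lines. This is a sketch only: `truncate_utf8` is a hypothetical helper, not an existing numpy or Python function, and it handles sequence boundaries but not combining characters.

```python
def truncate_utf8(data: bytes, max_bytes: int) -> bytes:
    """Truncate valid UTF-8 to at most max_bytes without splitting a sequence."""
    if len(data) <= max_bytes:
        return data
    end = max_bytes
    # If the first dropped byte is a continuation byte (10xxxxxx), we are
    # in the middle of a sequence: back up to the start of that sequence.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]


encoded = "na\u00efve".encode("utf-8")   # b'na\xc3\xafve', 6 bytes

# Naive byte slicing cuts the 2-byte sequence for U+00EF in half.
try:
    encoded[:3].decode("utf-8")
    naive_ok = True
except UnicodeDecodeError:
    naive_ok = False

safe = truncate_utf8(encoded, 3)         # drops the whole partial sequence

# Truncating by code point is always decodable, but can still change the
# text: "e" followed by U+0301 (combining acute) loses its accent.
combining = "e\u0301"
cut = combining[:1]
```

So byte-level truncation needs care to stay valid, and code-point-level truncation stays valid but may still alter what the reader sees.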

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Chris Barker
On Thu, Apr 20, 2017 at 10:26 AM, Stephan Hoyer wrote: > I agree with Anne here. Variable-length encoding would be great to have, > but even fixed length UTF-8 (in terms of memory usage, not characters) > would solve NumPy's Python 3 string problem. NumPy's memory model needs a

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Neal Becker
I'm no unicode expert, but can't we truncate unicode strings so that only valid characters are included? On Thu, Apr 20, 2017 at 1:32 PM Chris Barker wrote: > On Thu, Apr 20, 2017 at 9:47 AM, Anne Archibald > wrote: > >> Is there any reason

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Stephan Hoyer
Julian -- thanks for taking this on. NumPy's handling of strings on Python 3 certainly needs fixing. On Thu, Apr 20, 2017 at 9:47 AM, Anne Archibald wrote: > Variable-length encodings, of which UTF-8 is obviously the one that makes > good handling essential, are

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Anne Archibald
On Thu, Apr 20, 2017 at 3:17 PM Julian Taylor wrote: > To please everyone I think we need to go with a dtype that supports > multiple encodings via metadata, similar to how datetime supports > multiple units. > E.g.: 'U10[latin1]' are 10 characters in latin1

[Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Julian Taylor
Hello, As you probably know numpy does not deal well with strings in Python3. The np.string type is actually zero terminated bytes and not a string. In Python2 this happened to work out as it treats bytes and strings the same way. But in Python3 this type is pretty hard to work with as each time