On Wed, Jul 16, 2014 at 3:48 AM, Todd toddr...@gmail.com wrote:
On Jul 16, 2014 11:43 AM, Chris Barker chris.bar...@noaa.gov wrote:
So numpy should have dtypes to match these. We're a bit stuck, however,
because 'S' mapped to the py2 string type, which no longer exists in py3.
Sorry not
On Tue, Jul 15, 2014 at 7:40 PM, Aldcroft, Thomas
aldcr...@head.cfa.harvard.edu wrote:
On Sat, Jul 12, 2014 at 8:02 PM, Nathaniel Smith n...@pobox.com wrote:
OTOH, fixed length nul padded latin1 would be useful for various flat file
reading tasks.
As one of the original agitators for this,
On Thu, Jul 17, 2014 at 10:05 PM, Chris Barker chris.bar...@noaa.gov wrote:
A bit of a higher-level view of the issues at hand.
Python has three relevant data types:
A unicode type (unicode in py2, str in py3)
A one-byte-per-char string type (py2 string)
A bytes type
The big problem is
On Tue, Jul 15, 2014 at 11:15 AM, Charles R Harris
charlesr.har...@gmail.com wrote:
On Tue, Jul 15, 2014 at 5:26 AM, Sebastian Berg
sebast...@sipsolutions.net wrote:
On Sat, 2014-07-12 at 12:17 -0500, Charles R Harris wrote:
As previous posts have pointed out, Numpy's `S` type is
On Thu, Jul 17, 2014 at 11:52 AM, Nathaniel Smith n...@pobox.com wrote:
On Tue, Jul 15, 2014 at 7:40 PM, Aldcroft, Thomas
aldcr...@head.cfa.harvard.edu wrote:
On Sat, Jul 12, 2014 at 8:02 PM, Nathaniel Smith n...@pobox.com wrote:
OTOH, fixed length nul padded latin1 would be useful for
On Thu, Jul 17, 2014 at 5:48 PM, Nathaniel Smith n...@pobox.com wrote:
On Tue, Jul 15, 2014 at 4:29 PM, Charles R Harris
charlesr.har...@gmail.com wrote:
Thinking more about it, the easiest thing to do might be to make the S dtype
a UTF-8 encoding. Most of the machinery to deal with that is
On Tue, Jul 15, 2014 at 4:29 PM, Charles R Harris
charlesr.har...@gmail.com wrote:
Thinking more about it, the easiest thing to do might be to make the S dtype
a UTF-8 encoding. Most of the machinery to deal with that is already in
place. That change might affect some users though, and we might
Hi Chris,
A Latin-1 based 'a' type
would have similar problems.
Maybe not -- latin1 is fixed width.
Yes, Latin-1 is fixed width, but the issue is that when writing to a
fixed-width UTF8 string in HDF5, it will expand, possibly losing data.
What I would like to avoid is a situation where a
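A minimal sketch of the expansion problem described above, assuming a fixed-width field that holds a set number of UTF-8 bytes and is NUL-padded. The helper name `fit_utf8` is hypothetical and is not part of HDF5, h5py, or numpy.

```python
def fit_utf8(text, width):
    """Encode text as UTF-8 into exactly `width` bytes.

    Hypothetical helper: truncates on a character boundary when the
    encoded form is too long, then pads with NUL bytes.
    """
    raw = text.encode("utf-8")[:width]
    # If the cut landed inside a multi-byte sequence, drop the partial tail.
    raw = raw.decode("utf-8", errors="ignore").encode("utf-8")
    return raw.ljust(width, b"\x00")


# Ten characters that each fit in one Latin-1 byte but need two UTF-8 bytes.
text = "\u00e9" * 10
stored = fit_utf8(text, 10)
recovered = stored.rstrip(b"\x00").decode("utf-8")
```

Here the ten-character string fits a 10-byte Latin-1 field exactly, but only the first five characters survive in a 10-byte UTF-8 field.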
On Fri, Jul 18, 2014 at 9:07 AM, Pauli Virtanen p...@iki.fi wrote:
Another approach would be to add a new 1-byte unicode
you can't do unicode in 1-byte -- so what does this mean, exactly?
This also is not perfect, since array(['foo']) on Py2 should for
backward compatibility continue
On Fri, Jul 18, 2014 at 9:32 AM, Andrew Collette andrew.colle...@gmail.com
wrote:
A Latin-1 based 'a' type
would have similar problems.
Maybe not -- latin1 is fixed width.
Yes, Latin-1 is fixed width, but the issue is that when writing to a
fixed-width UTF8 string in HDF5, it will
On Fri, Jul 18, 2014 at 5:54 PM, Chris Barker chris.bar...@noaa.gov wrote:
This is why I see no downside to latin-1 -- if you don't use the 127 code
points, it's the same thing -- if you do, you get some extra handy
characters. The only difference is that a proper ascii type would not let
On 18.07.2014 19:33, Chris Barker wrote:
On Fri, Jul 18, 2014 at 9:07 AM, Pauli Virtanen p...@iki.fi
wrote:
Another approach would be to add a new 1-byte unicode
you can't do unicode in 1-byte -- so what does this mean, exactly?
The first 256 unicode code points, which happen to coincide
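The correspondence mentioned above can be checked with plain Python; this is only an illustration using the standard codecs, not a proposed numpy implementation.

```python
# Every byte value 0..255 decodes under Latin-1 to the code point
# with the same number, so one byte always means one character.
all_bytes = bytes(range(256))
decoded = all_bytes.decode("latin-1")

same_numbers = [ord(ch) for ch in decoded] == list(range(256))
one_byte_each = len(decoded) == len(all_bytes)
# The lower half agrees with ASCII.
ascii_prefix_agrees = decoded[:128] == all_bytes[:128].decode("ascii")
```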
Hi Chris,
Again, they shouldn't do that, they should be pushing a 10-character string
into something -- and utf-8 is going to (possibly) truncate that. That's an
HDF/utf-8 limitation that people are going to have to deal with. I think
you're suggesting that numpy follow the HDF model, so that
On Fri, Jul 18, 2014 at 10:59 AM, Nathaniel Smith n...@pobox.com wrote:
On Fri, Jul 18, 2014 at 5:54 PM, Chris Barker chris.bar...@noaa.gov
wrote:
This is why I see no downside to latin-1 -- if you don't use the 127
code
points, it's the same thing -- if you do, you get some extra handy
On Fri, Jul 18, 2014 at 9:59 AM, Nathaniel Smith n...@pobox.com wrote:
IMO the extra characters aren't the most compelling argument for
latin1 over ascii. Latin1 gives the nice assurance that if some jerk
*does* give me an ascii file that somewhere has some byte with the
8th bit set, then I
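A small sketch of that assurance, assuming a made-up record containing one byte with the 8th bit set. The function name `roundtrips` and the sample bytes are hypothetical.

```python
def roundtrips(raw, encoding):
    """Return True if raw bytes decode and re-encode unchanged."""
    try:
        return raw.decode(encoding).encode(encoding) == raw
    except UnicodeDecodeError:
        return False


# Hypothetical "ascii" record with a stray high-bit byte (0xB0).
raw = b"temp=21\xb0C"

results = {enc: roundtrips(raw, enc) for enc in ("ascii", "utf-8", "latin-1")}
```

Only the Latin-1 codec accepts the record and hands back the same bytes; the ASCII and UTF-8 codecs both reject the stray byte.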
Hi Chris,
What it would do is push the problem from the HDF5-numpy interface to the
python-numpy interface.
I'm not sure that's a good trade off.
Maybe I'm being too paranoid about the truncation issue. We already
perform truncation when going from e.g. vlen to fixed-width strings in
On Fri, Jul 18, 2014 at 12:52 PM, Andrew Collette andrew.colle...@gmail.com
wrote:
What it would do is push the problem from the HDF5-numpy interface to
the
python-numpy interface.
I'm not sure that's a good trade off.
Maybe I'm being too paranoid about the truncation issue.
Hi Chris,
Actually, I agree about the truncation issue, but it's a question of where
to put it -- I'm suggesting that I don't want it at the python-numpy
interface.
Yes, that's a good point. Of course, by using Latin-1 rather than
UTF-8 we can't support all Unicode code points (hence the ?
On Fri, Jul 18, 2014 at 3:30 PM, Andrew Collette andrew.colle...@gmail.com
wrote:
Hi Chris,
Actually, I agree about the truncation issue, but it's a question of
where
to put it -- I'm suggesting that I don't want it at the python-numpy
interface.
Yes, that's a good point. Of course,
On 15/07/2014 18:18, Chris Barker wrote:
(or does HDF support var-length
elements?)
It does: http://www.hdfgroup.org/HDF5/doc/TechNotes/VLTypes.html
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
Hi,
good argument for ASCII, but utf-8 is a bad idea, as there is no 1:1
correspondence between length of string in bytes and length in characters
-- as numpy needs to pre-allocate a defined number of bytes for a dtype,
there is a disconnect between the user and numpy as to how long a string is
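To make the mismatch concrete, here is a short sketch comparing character counts with encoded byte counts. The helper `encoded_lengths` and the sample strings are illustrative assumptions only.

```python
def encoded_lengths(text):
    """Return (characters, UTF-8 bytes, Latin-1 bytes or None)."""
    try:
        latin1 = len(text.encode("latin-1"))
    except UnicodeEncodeError:
        latin1 = None  # not representable in Latin-1
    return len(text), len(text.encode("utf-8")), latin1


# Escapes keep the source ASCII-only: "cafe" with e-acute, and three CJK characters.
samples = ["numpy", "caf\u00e9", "\u65e5\u672c\u8a9e"]

for s in samples:
    print(ascii(s), encoded_lengths(s))
```

A dtype that reserves N bytes can therefore promise N characters under Latin-1, but under UTF-8 the number of characters that fit depends on the content.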
On Tue, Jul 15, 2014 at 5:26 AM, Sebastian Berg sebast...@sipsolutions.net
wrote:
On Sat, 2014-07-12 at 12:17 -0500, Charles R Harris wrote:
As previous posts have pointed out, Numpy's `S` type is currently
treated as a byte string, which leads to more complicated code in
python3. OTOH, the
On Jul 16, 2014 11:43 AM, Chris Barker chris.bar...@noaa.gov wrote:
So numpy should have dtypes to match these. We're a bit stuck, however,
because 'S' mapped to the py2 string type, which no longer exists in py3.
Sorry not running py3 to see what 'S' does now, but I know it's a bit broken,
and may
On Wed, Jul 16, 2014 at 6:48 AM, Todd toddr...@gmail.com wrote:
On Jul 16, 2014 11:43 AM, Chris Barker chris.bar...@noaa.gov wrote:
So numpy should have dtypes to match these. We're a bit stuck, however,
because 'S' mapped to the py2 string type, which no longer exists in py3.
Sorry not
On Tue, Jul 15, 2014 at 4:26 AM, Sebastian Berg sebast...@sipsolutions.net
wrote:
Just wondering, couldn't we have a type which actually has an
(arbitrary, python supported) encoding (and bytes might even just be a
special case of no encoding)?
well, then we're back to the core issue here:
On Mon, Jul 14, 2014 at 10:00 AM, Olivier Grisel olivier.gri...@ensta.org
wrote:
2014-07-13 19:05 GMT+02:00 Alexander Belopolsky ndar...@mac.com:
I've been toying with the idea of creating an array type for interned
strings. In many applications dealing with large arrays of variable size
On Sat, Jul 12, 2014 at 8:02 PM, Nathaniel Smith n...@pobox.com wrote:
On 12 Jul 2014 23:06, Charles R Harris charlesr.har...@gmail.com
wrote:
As previous posts have pointed out, Numpy's `S` type is currently
treated as a byte string, which leads to more complicated code in python3.
OTOH,
On Mon, Jul 14, 2014 at 10:39 AM, Andrew Collette andrew.colle...@gmail.com
wrote:
For storing data in HDF5 (PyTables or h5py), it would be somewhat
cleaner if either ASCII or UTF-8 are used, as these are the only two
charsets officially supported by the library.
good argument for ASCII,
in 0.15.0 pandas will have full-fledged support for categoricals, which in
effect allow you to map a smaller number of strings to integers
this is now in pandas master
http://pandas-docs.github.io/pandas-docs-travis/categorical.html
feedback welcome!
On Jul 14, 2014, at 1:00 PM, Olivier Grisel
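The idea of mapping a small set of distinct strings to integers can be sketched in plain Python. This is not pandas' implementation; `encode_categories` is a hypothetical helper, and it numbers categories in first-seen order.

```python
def encode_categories(values):
    """Map each distinct string to an integer code.

    Returns (categories, codes), where codes[i] indexes into
    categories to recover values[i].
    """
    categories = []
    index = {}
    codes = []
    for value in values:
        if value not in index:
            index[value] = len(categories)
            categories.append(value)
        codes.append(index[value])
    return categories, codes


categories, codes = encode_categories(["low", "high", "low", "mid", "high"])
restored = [categories[c] for c in codes]
```

Each distinct string is stored once, and the array itself holds only fixed-size integers.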
On Tue, Jul 15, 2014 at 9:15 AM, Charles R Harris charlesr.har...@gmail.com
wrote:
On Tue, Jul 15, 2014 at 5:26 AM, Sebastian Berg
sebast...@sipsolutions.net wrote:
On Sat, 2014-07-12 at 12:17 -0500, Charles R Harris wrote:
As previous posts have pointed out, Numpy's `S` type is
But HDF5
additionally has a fixed-storage-width UTF8 type, so we could map to a
NumPy fixed-storage-width type trivially.
Sure -- this is why *nix uses utf-8 for filenames -- it can just be a
char*. But that just punts the problem to client code.
I think a UTF-8 string type does not match the
Hi Chuck,
This note proposes to adapt the currently existing 'a'
type letter, currently aliased to 'S', as a new fixed encoding dtype. Python
3.3 introduced two one byte internal representations for unicode strings,
ascii and latin1. Ascii has the advantage that it is a subset of UTF-8,
On Sat, Jul 12, 2014 at 10:17 AM, Charles R Harris
charlesr.har...@gmail.com wrote:
As previous posts have pointed out, Numpy's `S` type is currently treated
as a byte string, which leads to more complicated code in python3.
Also, a byte string in py3 is not, in fact, the same as the py2
2014-07-13 19:05 GMT+02:00 Alexander Belopolsky ndar...@mac.com:
On Sat, Jul 12, 2014 at 8:02 PM, Nathaniel Smith n...@pobox.com wrote:
I feel like for most purposes, what we *really* want is a variable length
string dtype (i.e., where each element can be a different length).
I've been
On Sat, 2014-07-12 at 12:17 -0500, Charles R Harris wrote:
As previous posts have pointed out, Numpy's `S` type is currently
treated as a byte string, which leads to more complicated code in
python3. OTOH, the unicode type is stored as UCS4, which consumes a
lot of space, especially for ascii
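A rough illustration of the space cost, using only Python's codecs rather than numpy itself. It assumes a fixed-width field padded with NULs, with UTF-32 standing in for UCS4 storage; `storage_bytes` is a hypothetical helper.

```python
def storage_bytes(text, width):
    """Bytes per element for a fixed-width field of `width` characters."""
    padded = text.ljust(width, "\x00")
    return {
        "ucs4": len(padded.encode("utf-32-le")),  # 4 bytes per character
        "latin1": len(padded.encode("latin-1")),  # 1 byte per character
    }


sizes = storage_bytes("numpy", 8)
```

For an 8-character field, the 4-byte-per-character layout needs 32 bytes per element while a one-byte encoding needs 8.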
On Sat, Jul 12, 2014 at 8:02 PM, Nathaniel Smith n...@pobox.com wrote:
I feel like for most purposes, what we *really* want is a variable length
string dtype (i.e., where each element can be a different length).
I've been toying with the idea of creating an array type for interned
strings.
On 12 Jul 2014 23:06, Charles R Harris charlesr.har...@gmail.com wrote:
As previous posts have pointed out, Numpy's `S` type is currently treated
as a byte string, which leads to more complicated code in python3. OTOH,
the unicode type is stored as UCS4, which consumes a lot of space,
especially
As previous posts have pointed out, Numpy's `S` type is currently treated
as a byte string, which leads to more complicated code in python3. OTOH,
the unicode type is stored as UCS4, which consumes a lot of space,
especially for ascii strings. This note proposes to adapt the currently
existing 'a'