[Numpy-discussion] vstack and hstack performance penalty
When using vstack or hstack on large arrays, are there any performance penalties, e.g. does it take longer time-wise, or does it make a copy of an array during the operation?
Re: [Numpy-discussion] vstack and hstack performance penalty
On Fri, 2014-01-24 at 06:13 -0800, Dinesh Vadhia wrote:
> When using vstack or hstack on large arrays, are there any performance penalties, e.g. does it take longer or make a copy of an array during the operation?

No, they all use concatenate. There are only constant overheads on top of the necessary data copying, though performance may vary because of memory order, etc.

- Sebastian
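(For illustration, a small sketch showing that, for 2-D inputs, the stacking functions behave like concatenation along an axis; the array shapes below are made up:)

import numpy as np

a = np.ones((1000, 3))
b = np.zeros((500, 3))

v1 = np.vstack([a, b])                  # stack rows
v2 = np.concatenate([a, b], axis=0)
assert v1.shape == (1500, 3) and (v1 == v2).all()

c = np.zeros((1000, 2))
h1 = np.hstack([a, c])                  # stack columns (for 2-D inputs)
h2 = np.concatenate([a, c], axis=1)
assert h1.shape == (1000, 5) and (h1 == h2).all()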
[Numpy-discussion] Catching out-of-memory error before it happens
I want to write a general exception handler that warns if too much data is being loaded for the RAM size of a machine for a NumPy array operation to succeed. For example, the program multiplies two floating-point arrays A and B which are populated with loadtxt. While the data is being loaded, I want to continuously check that the data volume doesn't pass a threshold that would cause an out-of-memory error during the A*B operation. The known variables are the amount of memory available, the data type (floats in this case), and the NumPy operation to be performed. It seems this requires knowledge of the internal memory requirements of each NumPy operation. For the sake of simplicity, we can ignore the other memory needs of the program. Is this possible?
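(A very rough sketch of the kind of pre-flight check being asked about; the shape, dtype and budget are hypothetical, and, as the replies below note, the factor of 3 only accounts for the named input and output arrays, not for any temporaries:)

import numpy as np

def rough_bytes_needed(shape, dtype=np.float64, n_arrays=3):
    # A, B and the elementwise product A*B share the same shape and
    # dtype, so roughly 3 arrays' worth of memory is needed.
    return n_arrays * int(np.prod(shape)) * np.dtype(dtype).itemsize

memory_budget = 4 * 1024**3        # hypothetical: allow up to 4 GiB
shape = (50000, 10000)
if rough_bytes_needed(shape) > memory_budget:
    raise MemoryError("A*B would likely exceed the memory budget")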
Re: [Numpy-discussion] Catching out-of-memory error before it happens
There is no reliable way to predict how much memory an arbitrary numpy operation will need, no. However, in most cases the main memory cost will simply be the need to store the input and output arrays; for large arrays, all other allocations should be negligible.

The most effective way to avoid running out of memory, therefore, is to avoid creating temporary arrays, by using only in-place operations. E.g., if a and b each require N bytes of RAM, then the (rough) memory requirements are:

c = a + b: 3N
c = a + 2*b: 4N
a += b: 2N
np.add(a, b, out=a): 2N
b *= 2; a += b: 2N

Note that simply loading a and b requires 2N memory, so the latter code samples are near-optimal. Of course some calculations do require the use of temporary storage space...

-n

On 24 Jan 2014 15:19, Dinesh Vadhia dineshbvad...@hotmail.com wrote:
> I want to write a general exception handler that warns if too much data is being loaded for the RAM size of a machine for a NumPy array operation to succeed. [...]
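(As an illustration of the 3N-versus-2N difference above -- not from the original mail, the array size is arbitrary:)

import numpy as np

n = 10 * 1000 * 1000
a = np.random.rand(n)      # N bytes of data (8*n here)
b = np.random.rand(n)      # another N

c = a + b                  # ~3N peak: a, b and the new output c
a += b                     # ~2N peak: the result overwrites a, no new array
np.add(a, b, out=a)        # same as above, spelled explicitly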
Re: [Numpy-discussion] Catching out-of-memory error before it happens
Yeah, numexpr is pretty cool for avoiding temporaries in an easy way: https://github.com/pydata/numexpr

Francesc

On 24/01/14 16:30, Nathaniel Smith wrote:
> There is no reliable way to predict how much memory an arbitrary numpy operation will need, no. However, in most cases the main memory cost will simply be the need to store the input and output arrays; for large arrays, all other allocations should be negligible. [...]

-- Francesc Alted
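(A minimal usage sketch, assuming a working numexpr install; not part of Francesc's mail:)

import numpy as np
import numexpr as ne

a = np.random.rand(1000000)
b = np.random.rand(1000000)

# The whole expression is evaluated in small blocks, so the intermediate
# 2*b is never materialized as a full-size temporary array.
c = ne.evaluate("a + 2*b")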
Re: [Numpy-discussion] Catching out-of-memory error before it happens
> c = a + b: 3N
> c = a + 2*b: 4N

Does Python garbage collect mid-expression? I.e., for

C = (a + 2*b) + b

is that 4N or 5N?

Also note that when memory gets tight, fragmentation can be a problem. I.e., if two size-N arrays were just freed, you still may not be able to allocate a size-2N array. This seems to be worse on Windows, not sure why.

> a += b: 2N
> np.add(a, b, out=a): 2N
> b *= 2; a += b: 2N
>
> Note that simply loading a and b requires 2N memory, so the latter code samples are near-optimal.

And they will run quite a bit faster for large arrays -- pushing that memory around takes time.

-Chris
Re: [Numpy-discussion] vstack and hstack performance penalty
If A is very large and B is very small, then np.concatenate(A, B) will copy B's data over to A, which would take less time than the other way around -- is that so? Does 'memory order' mean that it depends on sufficient contiguous memory being available for B, otherwise it will be fragmented, or something else?
Re: [Numpy-discussion] vstack and hstack performance penalty
On Fri, Jan 24, 2014 at 4:01 PM, Dinesh Vadhia dineshbvad...@hotmail.com wrote:
> If A is very large and B is very small, then np.concatenate(A, B) will copy B's data over to A, which would take less time than the other way around -- is that so?

No, neither array is modified in-place. A new array is created and both A and B are copied into it. The order is largely unimportant.

> Does 'memory order' mean that it depends on sufficient contiguous memory being available for B, otherwise it will be fragmented, or something else?

No, the output is never fragmented. numpy arrays may be strided, but never fragmented arbitrarily to fit into a fragmented address space.

http://docs.scipy.org/doc/numpy/reference/arrays.ndarray.html#internal-memory-layout-of-an-ndarray

The issue is which axis the concatenation happens on. If it's the first axis (and both inputs are contiguous), then it only takes two memcpy() calls to copy the data, one for each input, because the regions they occupy in the output are juxtaposed. If you concatenate on one of the other axes, though, then the memory regions for A and B will be interleaved and you have to do 2*N memory copies (N being some number depending on the shape).

-- Robert Kern
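(A small sketch of the two cases Robert describes; the shapes below are arbitrary:)

import numpy as np

A = np.ones((1000, 4))      # C-contiguous
B = np.zeros((10, 4))

# First-axis concatenation: A's block and B's block sit side by side in
# the output, so the copy is essentially two large memcpy() calls.
out0 = np.concatenate((A, B), axis=0)

# Second-axis concatenation: each output row interleaves a piece of A
# and a piece of C, so many smaller copies are needed.
C = np.zeros((1000, 2))
out1 = np.concatenate((A, C), axis=1)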
Re: [Numpy-discussion] Catching out-of-memory error before it happens
On 24 Jan 2014 15:57, Chris Barker - NOAA Federal chris.bar...@noaa.gov wrote:
>> c = a + b: 3N
>> c = a + 2*b: 4N
>
> Does Python garbage collect mid-expression? I.e., for C = (a + 2*b) + b, is that 4N or 5N?

It should be collected as soon as the reference gets dropped, so 4N. (This is the advantage of a greedy refcounting collector.)

> Also note that when memory gets tight, fragmentation can be a problem. I.e., if two size-N arrays were just freed, you still may not be able to allocate a size-2N array. This seems to be worse on Windows, not sure why.

If your arrays are big enough that you're worried that making a stray copy will ENOMEM, then you *shouldn't* have to worry about fragmentation -- malloc will give each array its own virtual mapping, which can be backed by discontiguous physical memory. (I guess it's possible Windows has a somehow shoddy VM system and this isn't true, but that seems unlikely these days?)

Memory fragmentation is more of a problem if you're allocating lots of small objects of varying sizes.

On 32 bit, virtual address fragmentation could also be a problem, but if you're working with giant data sets then you need 64 bits anyway :-).

-n
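(To make the 4N case concrete -- a hand-rolled in-place rewrite, not from the original mail, that keeps the peak at roughly 3N:)

import numpy as np

n = 1000000
a = np.random.rand(n)
b = np.random.rand(n)

# c = (a + 2*b) + b peaks at ~4N: a, b, the temporary (a + 2*b) and the
# final result briefly coexist; the temporary is freed as soon as its
# refcount drops.

# In-place rewrite, peak ~3N:
c = 2 * b        # the only new allocation
c += a
c += b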
[Numpy-discussion] np.array creation: unexpected behaviour
Hi, I just came across this unexpected behaviour when creating a np.array() from two other np.arrays of different shape. Have a look at this example:

import numpy as np
a = np.zeros(3)
b = np.zeros((2,3))
c = np.zeros((3,2))
ab = np.array([a, b])
print ab.shape, ab.dtype
ac = np.array([a, c], dtype=np.object)
print ac.shape, ac.dtype
ac_no_dtype = np.array([a, c])
print ac_no_dtype.shape, ac_no_dtype.dtype

The output, with NumPy v1.6.1 (Ubuntu 12.04), is:

(2,) object
(2, 3) object
Traceback (most recent call last):
  File "/tmp/numpy_bug.py", line 9, in <module>
    ac_no_dtype = np.array([a, c])
ValueError: setting an array element with a sequence.

The result for 'ab' is what I expect. The one for 'ac' is a bit surprising. The one for 'ac_no_dtype' is even more surprising. Is this expected behaviour?

Best, Emanuele
Re: [Numpy-discussion] np.array creation: unexpected behaviour
On Fri, Jan 24, 2014 at 11:30 AM, Emanuele Olivetti emanu...@relativita.com wrote:
> Hi, I just came across this unexpected behaviour when creating a np.array() from two other np.arrays of different shape. [...]
> The result for 'ab' is what I expect. The one for 'ac' is a bit surprising. The one for 'ac_no_dtype' is even more surprising. Is this expected behaviour?

The exception for ac_no_dtype is what I always expected, since it's not a rectangular array. It usually happens when I've made a mistake. **Unfortunately**, in newer numpy versions it will also create an object array, AFAIR.

Josef
Re: [Numpy-discussion] Catching out-of-memory error before it happens
So, with the example case, the approximate memory cost for an in-place operation would be:

A *= B : 2N

But, if the original A or B is to remain unchanged, then it will be:

C = A * B : 3N ?
Re: [Numpy-discussion] Catching out-of-memory error before it happens
Yes.

On 24 Jan 2014 17:19, Dinesh Vadhia dineshbvad...@hotmail.com wrote:
> So, with the example case, the approximate memory cost for an in-place operation would be A *= B : 2N, but if the original A or B is to remain unchanged then it will be C = A * B : 3N?
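(Spelled out for the multiplication case from the question -- a sketch with arbitrary sizes:)

import numpy as np

n = 5000000
A = np.random.rand(n)
B = np.random.rand(n)

C = A * B                    # ~3N: A and B are preserved, new array C is allocated
np.multiply(A, B, out=A)     # ~2N: the product overwrites A (A *= B is equivalent)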
Re: [Numpy-discussion] Catching out-of-memory error before it happens
Francesc: Thanks. I looked at numexpr a few years back but it didn't support array slicing/indexing. Has that changed?
Re: [Numpy-discussion] (no subject)
On Thu, Jan 23, 2014 at 11:58 PM, jennifer stone jenny.stone...@gmail.com wrote:
>> Scipy doesn't have a function for the Laplace transform, it has only a Laplace distribution in scipy.stats and a Laplace filter in scipy.ndimage. An inverse Laplace transform would be very welcome I'd think - it has real-world applications, and there's no good implementation in any open source library as far as I can tell. It's probably doable, but not the easiest topic for a GSoC I think. From what I can find, the paper "Numerical Transform Inversion Using Gaussian Quadrature" from den Iseger contains what's considered the current state-of-the-art algorithm. Browsing that gives a reasonable idea of the difficulty of implementing `ilaplace`.
>
> A brief scan through the paper "Numerical Transform Inversion Using Gaussian Quadrature" from den Iseger does indicate the complexity of the algorithm. But GSoC project or not, can't we work on it, step by step? I would love to see a contender for MATLAB's ilaplace on the open source front!!

Yes, it would be quite nice to have. So if you're interested, by all means give it a go. An issue for a GSoC will be how to maximize the chance of success - typically merging smaller PRs frequently helps a lot in that respect, but we can't merge an ilaplace implementation step by step.

>> You can have a look at https://github.com/scipy/scipy/pull/2908/files for ideas. Most of the things that need improving or we really think we should have in Scipy are listed there. Possible topics are not restricted to that list though - it's more important that you pick something you're interested in and have the required background and coding skills for.
>
> Thanks a lot for the roadmap. Of the options provided, I found the 'Cython'ization of Cluster great. Would it be possible to do it as the Summer project if I spend the month learning Cython?
>
> Regards, Janani

There are a couple of things to consider. Your proposal should be neither too easy nor too ambitious for one summer. Cythonizing cluster is probably not enough for a full summer of work, especially if you can re-use some Cython code that David WF or other people already have. So some new functionality can be added to your proposal.

The other important point is that you need to find a mentor. Cluster is one of the smaller modules that doesn't see a lot of development, and one that most of the core devs may not know so well. A good proposal may help find an interested mentor. I suggest you start early with a draft proposal and iterate a few times based on feedback on this list.

You may want to have a look at your email client settings, by the way; your replies seem to start new threads.

Cheers, Ralf
Re: [Numpy-discussion] Catching out-of-memory error before it happens
On Fri, Jan 24, 2014 at 8:25 AM, Nathaniel Smith n...@pobox.com wrote:
> If your arrays are big enough that you're worried that making a stray copy will ENOMEM, then you *shouldn't* have to worry about fragmentation -- malloc will give each array its own virtual mapping, which can be backed by discontiguous physical memory.

All I know is that when I push the limits with memory on a 32-bit Windows system, it often crashes out when I've never seen more than about 1GB of memory use by the application -- I would have thought that would be plenty of overhead.

I also know that I've reached limits on 32-bit Windows well before 32-bit OS X, but that may be because, IIUC, 32-bit Windows only allows 2GB per process, whereas 32-bit OS X allows 4GB per process.

> Memory fragmentation is more of a problem if you're allocating lots of small objects of varying sizes.

It could be that that's what I've been doing.

> On 32 bit, virtual address fragmentation could also be a problem, but if you're working with giant data sets then you need 64 bits anyway :-).

Well, "giant" is defined relative to the system capabilities... but yes, if you're pushing the limits of a 32-bit system, the easiest thing to do is go to 64 bits and some more memory!

-CHB

-- Christopher Barker, Ph.D., Oceanographer, NOAA Emergency Response Division, chris.bar...@noaa.gov
Re: [Numpy-discussion] Text array dtype for numpy
Oscar,

Cool stuff, thanks! I'm wondering though what the use-case really is. The py3 text model (actually the py2 one, too) is quite clear that you want users to think of, and work with, text as text -- and not care how things are encoded in the underlying implementation. You only want the user to think about encodings on I/O -- transferring stuff between systems where you can't avoid it. And you might choose different encodings based on different needs.

So why have a different, the-user-needs-to-think-about-encodings numpy dtype? We already have 'U' for full-on unicode support for text. There is a good argument for a more compact internal representation for text compatible with a one-byte-per-char encoding, thus the suggestion for such a dtype. But I don't see the need for quite this. Maybe I'm not being a creative enough thinker.

Also, we may want numpy to interact at a low level with other libs that might have binary encoded text (HDF, etc.) -- in which case we need a bytes dtype that can store that data, and perhaps encoding and decoding ufuncs.

If we want a more efficient and compact unicode implementation then the py3 one is a good place to start -- it's pretty slick! Though maybe harder to do in numpy, as text in numpy probably wouldn't be immutable.

> To make a slightly more concrete proposal, I've implemented a pure Python ndarray subclass that I believe can consistently handle text/bytes in Python 3.

This scares me right there -- is it text or bytes??? We really don't want something that is both.

> The idea is that the array has an encoding. It stores strings as bytes. The bytes are encoded/decoded on insertion/access. Methods accessing the binary content of the array will see the encoded bytes. Methods accessing the elements of the array will see unicode strings. I believe it would not be as hard to implement as the proposals for variable length string arrays.

Except that with some encodings, the number of bytes required is a function of what the content of the text is -- so it either has to be variable length, or a fixed number of bytes, which is not a fixed number of characters. That requires both careful truncation (a pain) and surprising results for users ("why can't I fit 10 characters in a length-10 text object? And I can if they are different characters?").

> The one caveat is that it will strip null characters from the end of any string.

Which is fatal, but you do want a new dtype after all, which presumably wouldn't do that.

-Chris
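(A concrete illustration of the fixed-bytes-versus-fixed-characters point -- plain Python, nothing numpy-specific, and not from the original mail:)

# 10 characters can need anywhere from 10 to 40 bytes in UTF-8:
print(len("abcdefghij".encode("utf-8")))    # 10 bytes for 10 ASCII characters
print(len(("é" * 10).encode("utf-8")))      # 20 bytes for 10 characters
print(len(("日" * 10).encode("utf-8")))      # 30 bytes for 10 characters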
Re: [Numpy-discussion] Catching out-of-memory error before it happens
On Fri, Jan 24, 2014 at 10:29 PM, Chris Barker chris.bar...@noaa.gov wrote:
> All I know is that when I push the limits with memory on a 32-bit Windows system, it often crashes out when I've never seen more than about 1GB of memory use by the application -- I would have thought that would be plenty of overhead.
>
> I also know that I've reached limits on 32-bit Windows well before 32-bit OS X, but that may be because, IIUC, 32-bit Windows only allows 2GB per process, whereas 32-bit OS X allows 4GB per process.

Oh, yeah, common confusion. Allowing 2 GiB of address space per process doesn't mean you can actually, practically use 2 GiB of *memory* per process, esp. if you're allocating/deallocating a mix of large and small objects, because address space fragmentation will kill you way before that. The memory is there, but there isn't anywhere to slot it into the process's address space. So you don't need to add more memory, just switch to a 64-bit OS.

On 64-bit you have oodles of address space, so the memory manager can easily slot in large objects far away from small objects, and it's only fragmentation within each small-object arena that hurts. A good malloc will keep this overhead down pretty low, though -- certainly less than the factor of two you're thinking about.

-n
[Numpy-discussion] Comparison changes
Hi all,

in https://github.com/numpy/numpy/pull/3514 I proposed some changes to the comparison operators. These include:

1. Comparison with None will broadcast in the future, so that `arr == None` will actually compare all elements to None. (A FutureWarning for now.)

2. I added that == and != will give a FutureWarning when an error was raised. In the future they should not silence these errors anymore. (For example, shape mismatches.)

3. We used to use PyObject_RichCompareBool for equality, which includes an identity check. I propose to not do that identity check since we have elementwise equality (returning an object array for objects would be nice in some ways, but I think that is only an option for a dedicated function). The reason is that, for example,

a = np.array([np.array([1, 2, 3]), 1])
b = np.array([np.array([1, 2, 3]), 1])
a == b

will happen to work if it happens to be that `a[0] is b[0]`. This currently has no deprecation, since the logic is in the inner loop and I am not sure if it is easy to add one there.

Are there objections/comments to these changes?

Regards, Sebastian
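(A sketch of the identity-check pitfall in point 3, building the object arrays explicitly to sidestep the construction ambiguities discussed elsewhere in this digest; the exact exception type may vary with the NumPy version:)

import numpy as np

x = np.array([1, 2, 3])

a = np.empty(2, dtype=object)
a[0], a[1] = x, 1
b = np.empty(2, dtype=object)
b[0], b[1] = x, 1            # b[0] is the very same object as a[0]

print(a == b)                # happens to work: identity short-circuits the comparison

b[0] = np.array([1, 2, 3])   # equal but distinct array
try:
    a == b                   # elementwise comparison of the nested arrays
except ValueError as exc:    # cannot be collapsed to a single bool
    print("comparison failed:", exc)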
Re: [Numpy-discussion] Comparison changes
On 25 Jan 2014 00:05, Sebastian Berg sebast...@sipsolutions.net wrote:
> 2. I added that == and != will give a FutureWarning when an error was raised. In the future they should not silence these errors anymore. (For example, shape mismatches.)

This can just be a DeprecationWarning, because the only change is to raise new errors.

> 3. We used to use PyObject_RichCompareBool for equality, which includes an identity check. [...] This currently has no deprecation, since the logic is in the inner loop and I am not sure if it is easy to add one there.

Surely any environment where we can call PyObject_RichCompareBool is an environment where we can issue a warning...?

-n
Re: [Numpy-discussion] np.array creation: unexpected behaviour
On Fri, 24 Jan 2014 17:30:33 +0100, Emanuele Olivetti wrote:
> I just came across this unexpected behaviour when creating a np.array() from two other np.arrays of different shape.

The tuple parsing for the construction of new numpy arrays is pretty tricky/hairy, and doesn't always do exactly what you'd expect. The easiest workaround is probably to pre-allocate the array:

In [24]: data = [a, c]
In [25]: x = np.empty(len(data), dtype=object)
In [26]: x[:] = data
In [27]: x.shape
Out[27]: (2,)

Regards, Stéfan
Re: [Numpy-discussion] Text array dtype for numpy
On Fri, Jan 24, 2014 at 5:43 PM, Chris Barker chris.bar...@noaa.gov wrote:
> Oscar,
>
> Cool stuff, thanks! I'm wondering though what the use-case really is. The py3 text model (actually the py2 one, too) is quite clear that you want users to think of, and work with, text as text -- and not care how things are encoded in the underlying implementation. You only want the user to think about encodings on I/O -- transferring stuff between systems where you can't avoid it. [...]
>
> So why have a different, the-user-needs-to-think-about-encodings numpy dtype? We already have 'U' for full-on unicode support for text. There is a good argument for a more compact internal representation for text compatible with a one-byte-per-char encoding, thus the suggestion for such a dtype. But I don't see the need for quite this. Maybe I'm not being a creative enough thinker.

In my opinion something like Oscar's class would be very useful (with some adjustments, especially making it easy to create an 'S' view or put an encoding view on top of an 'S' array). (Disclaimer: my only experience is in converting some examples in statsmodels to bytes in py3 and playing with some examples.)

My guess is that 'S'/bytes is very convenient for library code, because it doesn't care about encodings (assuming we have enough control that all bytes are in the same encoding), and we don't have any overhead to convert to strings when comparing or working with byte strings. 'S' is also very flexible because it doesn't tie us down to a minimum size for the encoding nor to any specific encoding. The problem with 'S'/bytes is in input/output and interactive work, as in the examples of Tom Aldcroft. The textarray dtype would allow us to view any 'S' array so we can have text/string interaction with Python and get the correct encoding on input and output. Whether you live in an ascii, latin1, cp1252, iso8859_5 or any other world, you could get your favorite minimal-memory S/bytes/strings.

I think this is useful as a complement to the current 'S' type, and to make it more useful on Python 3, independent of what other small-memory unicode dtype with a predefined encoding numpy could get.

> Also, we may want numpy to interact at a low level with other libs that might have binary encoded text (HDF, etc.) -- in which case we need a bytes dtype that can store that data, and perhaps encoding and decoding ufuncs. [...]
>
>> To make a slightly more concrete proposal, I've implemented a pure Python ndarray subclass that I believe can consistently handle text/bytes in Python 3.
>
> This scares me right there -- is it text or bytes??? We really don't want something that is both.

Most users won't care about the internal representation of anything. But when we want to, or find it useful, we can view the memory with any compatible dtype. That is, with numpy we always also have the raw bytes. And there are lots of ways to shoot yourself in the foot -- why would you want to do that?:

a = np.arange(5)
b = a.view('S4')
b[1] = 'h'
a
array([  0, 104,   2,   3,   4])
a[1] = 'h'
Traceback (most recent call last):
  File "<pyshell#22>", line 1, in <module>
    a[1] = 'h'
ValueError: invalid literal for int() with base 10: 'h'

>> The idea is that the array has an encoding. It stores strings as bytes. The bytes are encoded/decoded on insertion/access. Methods accessing the binary content of the array will see the encoded bytes. Methods accessing the elements of the array will see unicode strings. I believe it would not be as hard to implement as the proposals for variable length string arrays.
>
> Except that with some encodings, the number of bytes required is a function of what the content of the text is -- so it either has to be variable length, or a fixed number of bytes, which is not a fixed number of characters. That requires both careful truncation (a pain) and surprising results for users ("why can't I fit 10 characters in a length-10 text object? And I can if they are different characters?").

Not really different from other places where you have to pay attention to the underlying dtype, and a question of providing the underlying information (like itemsize).

1 - 1e-20

I had code like that when I wasn't thinking properly or wasn't paying enough attention to what I was typing.

>> The one caveat is that it will strip null characters from the end of any string.
>
> Which is fatal, but you do want a new dtype after all, which presumably wouldn't do that.

The only place so far where I found that this really hurts is in the decode examples (with utf32LE for example). That's why I think numpy needs to have decode/encode functions, so it can access the bytes before they are null-truncated, besides being