Re: [Numpy-discussion] loadtxt and usecols
On 11/11/2015 18:38, Sebastian Berg wrote: Sounds fine to me, and considering the squeeze logic (which I think is unfortunate, but it is not something you can easily change), I would be for simply adding logic to accept a single integral argument and otherwise not change anything. [...] As said before, the other/additional thing that might be very helpful is trying to give a more useful error message. I've modified my PR to (hopefully) match these requests. https://github.com/numpy/numpy/pull/6656 Regards. -- Irvin ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org https://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt and usecols
On Di, 2015-11-10 at 17:39 +0100, Irvin Probst wrote: > On 10/11/2015 16:52, Daπid wrote: > > 42, is exactly the same as (42,) If you want a tuple of > > tuples, you have to do ((42,),), but then it raises: TypeError: list > > indices must be integers, not tuple. > > My bad, I wrote that too fast, please forget this. > > > I think loadtxt should be a tool to read text files in the least > > surprising fashion, and a text file is a 1 or 2D container, so it > > shouldn't return any other shapes. > > And I *do* agree with the "shouldn't return any other shapes" part of > your phrase. What I was trying to say, admitedly with a very bogus > example, is that either loadtxt() should always output an array whose > shape matches the shape of the object passed to usecol or it should > never do it, and I'm if favor of never. Sounds fine to me, and considering the squeeze logic (which I think is unfortunate, but it is not something you can easily change), I would be for simply adding logic to accept a single integral argument and otherwise not change anything. I am personally against the flattening and even the array-like logic [1] currently in the PR, it seems like arbitrary generality for my taste without any obvious application. As said before, the other/additional thing that might be very helpful is trying to give a more useful error message. - Sebastian [1] Almost all 1-d array-likes will be sequences/iterables in any case, those that are not are so obscure that there is no point in explicitly supporting them. > I'm perfectly aware that what I suggest would break the current behavior > of usecols=(2,) so I know it does not have the slightest probability of > being accepted but still, I think that the "least surprising fashion" is > to always return an 2-D array because for many, many, many people a text > data file has N lines and M columns and N=1 or M=1 is not a specific case. > > Anyway I will of course modify my PR according to any decision made here. > > In your example: > > > > a=[[[2,],[],[],],[],[],[]] > > foo=np.loadtxt("CONCARNEAU_2010.txt", usecols=a) > > > > What would the shape of foo be? > > As I said in my previous email: > > > should just work and return me a 2-D (or 1-D if you like) array with > the data I asked for > > So, 1-D or 2-D it is up to you, but as long as there is no ambiguity in > which columns the user is asking for it should imho work. > > Regards. > ___ > NumPy-Discussion mailing list > NumPy-Discussion@scipy.org > https://mail.scipy.org/mailman/listinfo/numpy-discussion signature.asc Description: This is a digitally signed message part ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org https://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt and usecols
Just pointing out np.loadtxt(..., ndmin=2) will always return a 2D array. Notice that without that option, the result is effectively squeezed. So if you don't specify that option, and you load up a CSV file with only one row, you will get a very differently shaped array than if you load up a CSV file with two rows. Ben Root On Tue, Nov 10, 2015 at 10:07 AM, Irvin Probst < irvin.pro...@ensta-bretagne.fr> wrote: > On 10/11/2015 14:17, Sebastian Berg wrote: > >> Actually, it is the "sequence special case" type ;). (matlab does not >> have this, since matlab always returns 2-D I realized). >> >> As I said, if usecols is like indexing, the result should mimic: >> >> arr = np.loadtxt(f) >> arr = arr[usecols] >> >> in which case a 1-D array is returned if you put in a scalar into >> usecols (and you could even generalize usecols to higher dimensional >> array-likes). >> The way you implemented it -- which is fine, but I want to stress that >> there is a real decision being made here --, you always see it as a >> sequence but allow a scalar for convenience (i.e. always return a 2-D >> array). It is a `sequence of ints or int` type argument and not an >> array-like argument in my opinion. >> > > I think we have two separate problems here: > > The first one is whether loadtxt should always return a 2D array or should > it match the shape of the usecol argument. From a CS guy point of view I do > understand your concern here. Now from a teacher point of view I know many > people expect to get a "matrix" (thank you Matlab...) and the "purity" of > matching the dimension of the usecol variable will be seen by many people > [1] as a nerdy useless heavyness noone cares of (no offense). So whatever > you, seadoned numpy devs from this mailing list, decide I think it should > be explained in the docstring with a very clear wording. > > My own opinion on this first problem is that loadtxt() should always > return a 2D array, no less, no more. If I write np.loadtxt(f)[42] it means > I want to read the whole file and then I explicitely ask for transforming > the 2-D array loadtxt() returned into a 1-D array. Otoh if I write > loadtxt(f, usecol=42) it means I don't want to read the other columns and I > want only this one, but it does not mean that I want to change the returned > array from 2-D to 1-D. I know this new behavior might break a lot of > existing code as usecol=(42,) used to return a 1-D array, but > usecol=42, also returns a 1-D array so the current behavior is not > consistent imho. > > The second problem is about the wording in the docstring, when I see > "sequence of int or int" I uderstand I will have to cast into a 1-D python > list whatever wicked N-dimensional object I use to store my column indexes, > or hope list(my_object) will do it fine. On the other hand when I read > "array-like" the function is telling me I don't have to worry about my > object, as long as numpy knows how to cast it into an array it will be fine. > > Anyway I think something like that: > > import numpy as np > a=[[[2,],[],[],],[],[],[]] > foo=np.loadtxt("CONCARNEAU_2010.txt", usecols=a) > > should just work and return me a 2-D (or 1-D if you like) array with the > data I asked for and I don't think "a" here is an int or a sequence of int > (but it's a good example of why loadtxt() should not match the shape of the > usecol argument). > > To make it short, let the reading function read the data in a consistent > and predictible way and then let the user explicitely change the data's > shape into anything he likes. > > Regards. > > [1] read non CS people trying to switch to numpy/scipy > > ___ > NumPy-Discussion mailing list > NumPy-Discussion@scipy.org > https://mail.scipy.org/mailman/listinfo/numpy-discussion > ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org https://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt and usecols
On Di, 2015-11-10 at 10:24 -0500, Benjamin Root wrote: > Just pointing out np.loadtxt(..., ndmin=2) will always return a 2D > array. Notice that without that option, the result is effectively > squeezed. So if you don't specify that option, and you load up a CSV > file with only one row, you will get a very differently shaped array > than if you load up a CSV file with two rows. > Oh, well I personally think that default squeeze is an abomination :). Anyway, I just wanted to point out that it is two different possible logics, and we have to pick one. I have a slight preference for the indexing/array-like interpretation, but I am aware that from a usage point of view the sequence one is likely better. I could throw in another option: Throw an explicit error instead of the general. Anyway, I *really* do not have an opinion about what is better. Array-like would only suggest that you also accept buffer interface objects or array_interface stuff. Which in this case is really unnecessary I think. - Sebastian > > Ben Root > > > On Tue, Nov 10, 2015 at 10:07 AM, Irvin Probst >wrote: > On 10/11/2015 14:17, Sebastian Berg wrote: > Actually, it is the "sequence special case" type ;). > (matlab does not > have this, since matlab always returns 2-D I > realized). > > As I said, if usecols is like indexing, the result > should mimic: > > arr = np.loadtxt(f) > arr = arr[usecols] > > in which case a 1-D array is returned if you put in a > scalar into > usecols (and you could even generalize usecols to > higher dimensional > array-likes). > The way you implemented it -- which is fine, but I > want to stress that > there is a real decision being made here --, you > always see it as a > sequence but allow a scalar for convenience (i.e. > always return a 2-D > array). It is a `sequence of ints or int` type > argument and not an > array-like argument in my opinion. > > I think we have two separate problems here: > > The first one is whether loadtxt should always return a 2D > array or should it match the shape of the usecol argument. > From a CS guy point of view I do understand your concern here. > Now from a teacher point of view I know many people expect to > get a "matrix" (thank you Matlab...) and the "purity" of > matching the dimension of the usecol variable will be seen by > many people [1] as a nerdy useless heavyness noone cares of > (no offense). So whatever you, seadoned numpy devs from this > mailing list, decide I think it should be explained in the > docstring with a very clear wording. > > My own opinion on this first problem is that loadtxt() should > always return a 2D array, no less, no more. If I write > np.loadtxt(f)[42] it means I want to read the whole file and > then I explicitely ask for transforming the 2-D array > loadtxt() returned into a 1-D array. Otoh if I write > loadtxt(f, usecol=42) it means I don't want to read the other > columns and I want only this one, but it does not mean that I > want to change the returned array from 2-D to 1-D. I know this > new behavior might break a lot of existing code as > usecol=(42,) used to return a 1-D array, but > usecol=42, also returns a 1-D array so the current > behavior is not consistent imho. > > The second problem is about the wording in the docstring, when > I see "sequence of int or int" I uderstand I will have to cast > into a 1-D python list whatever wicked N-dimensional object I > use to store my column indexes, or hope list(my_object) will > do it fine. On the other hand when I read "array-like" the > function is telling me I don't have to worry about my object, > as long as numpy knows how to cast it into an array it will be > fine. > > Anyway I think something like that: > > import numpy as np > a=[[[2,],[],[],],[],[],[]] > foo=np.loadtxt("CONCARNEAU_2010.txt", usecols=a) > > should just work and return me a 2-D (or 1-D if you like) > array with the data I asked for and I don't think "a" here is > an int or a sequence of int (but it's a good example of why > loadtxt() should not match the shape of the usecol argument). > > To make it short, let the reading function read the data in a >
Re: [Numpy-discussion] loadtxt and usecols
On 10/11/2015 14:17, Sebastian Berg wrote: Actually, it is the "sequence special case" type ;). (matlab does not have this, since matlab always returns 2-D I realized). As I said, if usecols is like indexing, the result should mimic: arr = np.loadtxt(f) arr = arr[usecols] in which case a 1-D array is returned if you put in a scalar into usecols (and you could even generalize usecols to higher dimensional array-likes). The way you implemented it -- which is fine, but I want to stress that there is a real decision being made here --, you always see it as a sequence but allow a scalar for convenience (i.e. always return a 2-D array). It is a `sequence of ints or int` type argument and not an array-like argument in my opinion. I think we have two separate problems here: The first one is whether loadtxt should always return a 2D array or should it match the shape of the usecol argument. From a CS guy point of view I do understand your concern here. Now from a teacher point of view I know many people expect to get a "matrix" (thank you Matlab...) and the "purity" of matching the dimension of the usecol variable will be seen by many people [1] as a nerdy useless heavyness noone cares of (no offense). So whatever you, seadoned numpy devs from this mailing list, decide I think it should be explained in the docstring with a very clear wording. My own opinion on this first problem is that loadtxt() should always return a 2D array, no less, no more. If I write np.loadtxt(f)[42] it means I want to read the whole file and then I explicitely ask for transforming the 2-D array loadtxt() returned into a 1-D array. Otoh if I write loadtxt(f, usecol=42) it means I don't want to read the other columns and I want only this one, but it does not mean that I want to change the returned array from 2-D to 1-D. I know this new behavior might break a lot of existing code as usecol=(42,) used to return a 1-D array, but usecol=42, also returns a 1-D array so the current behavior is not consistent imho. The second problem is about the wording in the docstring, when I see "sequence of int or int" I uderstand I will have to cast into a 1-D python list whatever wicked N-dimensional object I use to store my column indexes, or hope list(my_object) will do it fine. On the other hand when I read "array-like" the function is telling me I don't have to worry about my object, as long as numpy knows how to cast it into an array it will be fine. Anyway I think something like that: import numpy as np a=[[[2,],[],[],],[],[],[]] foo=np.loadtxt("CONCARNEAU_2010.txt", usecols=a) should just work and return me a 2-D (or 1-D if you like) array with the data I asked for and I don't think "a" here is an int or a sequence of int (but it's a good example of why loadtxt() should not match the shape of the usecol argument). To make it short, let the reading function read the data in a consistent and predictible way and then let the user explicitely change the data's shape into anything he likes. Regards. [1] read non CS people trying to switch to numpy/scipy ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org https://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt and usecols
On 10 November 2015 at 16:07, Irvin Probstwrote: > I know this new behavior might break a lot of existing code as > usecol=(42,) used to return a 1-D array, but usecol=42, also > returns a 1-D array so the current behavior is not consistent imho. 42, is exactly the same as (42,) If you want a tuple of tuples, you have to do ((42,),), but then it raises: TypeError: list indices must be integers, not tuple. What numpy cares about is that whatever object you give it is iterable, and its entries are ints, so usecol={0:'a', 5:'b'} is perfectly valid. I think loadtxt should be a tool to read text files in the least surprising fashion, and a text file is a 1 or 2D container, so it shouldn't return any other shapes. Any fancy stuff one may want to do with the output should be done with the typical indexing tricks. If I want a single column, I would first be very surprised if I got a 2D array (I was bitten by this design in MATLAB many many times). For the rare cases where I do want a "fake" 2D array, I can make it explicit by expanding it with arr[:, np.newaxis], and then I know that the shape will be (N, 1) and not (1, N). Thus, usecols should be int or sequence of ints, and the result 1 or 2D. In your example: a=[[[2,],[],[],],[],[],[]] foo=np.loadtxt("CONCARNEAU_2010.txt", usecols=a) What would the shape of foo be? /David. ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org https://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt and usecols
On 10/11/2015 16:52, Daπid wrote: 42, is exactly the same as (42,) If you want a tuple of tuples, you have to do ((42,),), but then it raises: TypeError: list indices must be integers, not tuple. My bad, I wrote that too fast, please forget this. I think loadtxt should be a tool to read text files in the least surprising fashion, and a text file is a 1 or 2D container, so it shouldn't return any other shapes. And I *do* agree with the "shouldn't return any other shapes" part of your phrase. What I was trying to say, admitedly with a very bogus example, is that either loadtxt() should always output an array whose shape matches the shape of the object passed to usecol or it should never do it, and I'm if favor of never. I'm perfectly aware that what I suggest would break the current behavior of usecols=(2,) so I know it does not have the slightest probability of being accepted but still, I think that the "least surprising fashion" is to always return an 2-D array because for many, many, many people a text data file has N lines and M columns and N=1 or M=1 is not a specific case. Anyway I will of course modify my PR according to any decision made here. In your example: a=[[[2,],[],[],],[],[],[]] foo=np.loadtxt("CONCARNEAU_2010.txt", usecols=a) What would the shape of foo be? As I said in my previous email: > should just work and return me a 2-D (or 1-D if you like) array with the data I asked for So, 1-D or 2-D it is up to you, but as long as there is no ambiguity in which columns the user is asking for it should imho work. Regards. ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org https://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt and usecols
On Mo, 2015-11-09 at 20:36 +0100, Ralf Gommers wrote: > > > On Mon, Nov 9, 2015 at 7:42 PM, Benjamin Root> wrote: > My personal rule for flexible inputs like that is that it > should be encouraged so long as it does not introduce > ambiguity. Furthermore, Allowing a scalar as an input doesn't > add a congitive disconnect on the user on how to specify > multiple columns. Therefore, I'd give this a +1. > > > On Mon, Nov 9, 2015 at 4:15 AM, Irvin Probst > wrote: > Hi, > I've recently seen many students, coming from Matlab, > struggling against the usecols argument of loadtxt. > Most of them tried something like: > loadtxt("foo.bar", usecols=2) or the ones with better > documentation reading skills tried loadtxt("foo.bar", > usecols=(2)) but none of them understood they had to > write usecols=[2] or usecols=(2,). > > Is there a policy in numpy stating that this kind of > arguments must be sequences ? > > > There isn't. In many/most cases it's array_like, which means scalar, > sequence or array. > Agree, I think we have, or should have, to types of things there (well, three since we certainly have "must be sequence"). Args such as "axes" which is typically just one, so we allow scalar, but can often be generalized to a sequence. And things that are array-likes (and broadcasting). So, if this is an array-like, however, the "correct" result could be different by broadcasting between `1` and `(1,)` analogous to indexing the full array with usecols: usecols=1 result: array([2, 3, 4, 5]) usecols=(1,) result [1]: array([[2, 3, 4, 5]]) since a scalar row (so just one row) is read and not a 2D array. I tend to say it should be an array-like argument and not a generalized sequence argument, just wanted to note that, since I am not sure what matlab does. - Sebastian [1] could go further and do `usecols=[[1]]` and get `array([[[2, 3, 4, 5]]])` > > I think that being able to an int or a sequence when a > single column is needed would make this function a bit > more user friendly for beginners. I would gladly > submit a PR if noone disagrees. > > +1 > > > Ralf > > > > > ___ > NumPy-Discussion mailing list > NumPy-Discussion@scipy.org > https://mail.scipy.org/mailman/listinfo/numpy-discussion signature.asc Description: This is a digitally signed message part ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org https://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt and usecols
On 10/11/2015 09:19, Sebastian Berg wrote: since a scalar row (so just one row) is read and not a 2D array. I tend to say it should be an array-like argument and not a generalized sequence argument, just wanted to note that, since I am not sure what matlab does. Hi, By default Matlab reads everything, silently fails on what can't be converted into a float and the user has to guess what was read or not. Say you have a file like this: 2010-01-01 00:00:00 3.026 2010-01-01 01:00:00 4.049 2010-01-01 02:00:00 4.865 >> M=load('CONCARNEAU_2010.txt'); >> M(1:3,:) ans = 1.0e+03 * 2.0100 00.0030 2.01000.00100.0040 2.01000.00200.0049 I think this is a terrible way of doing it even if newcomers might find this handy. There are of course optionnal arguments (even regexps !) but to my knowledge almost no Matlab user even knows these arguments are there. Anyway, I made a PR here https://github.com/numpy/numpy/pull/6656 with usecols as an array-like. Regards. ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org https://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt and usecols
On Di, 2015-11-10 at 10:24 +0100, Irvin Probst wrote: > On 10/11/2015 09:19, Sebastian Berg wrote: > > since a scalar row (so just one row) is read and not a 2D array. I tend > > to say it should be an array-like argument and not a generalized > > sequence argument, just wanted to note that, since I am not sure what > > matlab does. > > Hi, > By default Matlab reads everything, silently fails on what can't be > converted into a float and the user has to guess what was read or not. > Say you have a file like this: > > 2010-01-01 00:00:00 3.026 > 2010-01-01 01:00:00 4.049 > 2010-01-01 02:00:00 4.865 > > > >> M=load('CONCARNEAU_2010.txt'); > >> M(1:3,:) > > ans = > > 1.0e+03 * > > 2.0100 00.0030 > 2.01000.00100.0040 > 2.01000.00200.0049 > > > I think this is a terrible way of doing it even if newcomers might find > this handy. There are of course optionnal arguments (even regexps !) but > to my knowledge almost no Matlab user even knows these arguments are there. > > Anyway, I made a PR here https://github.com/numpy/numpy/pull/6656 with > usecols as an array-like. > Actually, it is the "sequence special case" type ;). (matlab does not have this, since matlab always returns 2-D I realized). As I said, if usecols is like indexing, the result should mimic: arr = np.loadtxt(f) arr = arr[usecols] in which case a 1-D array is returned if you put in a scalar into usecols (and you could even generalize usecols to higher dimensional array-likes). The way you implemented it -- which is fine, but I want to stress that there is a real decision being made here --, you always see it as a sequence but allow a scalar for convenience (i.e. always return a 2-D array). It is a `sequence of ints or int` type argument and not an array-like argument in my opinion. - Sebastian > Regards. > ___ > NumPy-Discussion mailing list > NumPy-Discussion@scipy.org > https://mail.scipy.org/mailman/listinfo/numpy-discussion > signature.asc Description: This is a digitally signed message part ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org https://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt and usecols
My personal rule for flexible inputs like that is that it should be encouraged so long as it does not introduce ambiguity. Furthermore, Allowing a scalar as an input doesn't add a congitive disconnect on the user on how to specify multiple columns. Therefore, I'd give this a +1. On Mon, Nov 9, 2015 at 4:15 AM, Irvin Probstwrote: > Hi, > I've recently seen many students, coming from Matlab, struggling against > the usecols argument of loadtxt. Most of them tried something like: > loadtxt("foo.bar", usecols=2) or the ones with better documentation > reading skills tried loadtxt("foo.bar", usecols=(2)) but none of them > understood they had to write usecols=[2] or usecols=(2,). > > Is there a policy in numpy stating that this kind of arguments must be > sequences ? I think that being able to an int or a sequence when a single > column is needed would make this function a bit more user friendly for > beginners. I would gladly submit a PR if noone disagrees. > > Regards. > > -- > Irvin > ___ > NumPy-Discussion mailing list > NumPy-Discussion@scipy.org > https://mail.scipy.org/mailman/listinfo/numpy-discussion > ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org https://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt and usecols
On Mon, Nov 9, 2015 at 7:42 PM, Benjamin Rootwrote: > My personal rule for flexible inputs like that is that it should be > encouraged so long as it does not introduce ambiguity. Furthermore, > Allowing a scalar as an input doesn't add a congitive disconnect on the > user on how to specify multiple columns. Therefore, I'd give this a +1. > > On Mon, Nov 9, 2015 at 4:15 AM, Irvin Probst < > irvin.pro...@ensta-bretagne.fr> wrote: > >> Hi, >> I've recently seen many students, coming from Matlab, struggling against >> the usecols argument of loadtxt. Most of them tried something like: >> loadtxt("foo.bar", usecols=2) or the ones with better documentation >> reading skills tried loadtxt("foo.bar", usecols=(2)) but none of them >> understood they had to write usecols=[2] or usecols=(2,). >> >> Is there a policy in numpy stating that this kind of arguments must be >> sequences ? > > There isn't. In many/most cases it's array_like, which means scalar, sequence or array. > I think that being able to an int or a sequence when a single column is >> needed would make this function a bit more user friendly for beginners. I >> would gladly submit a PR if noone disagrees. >> > +1 Ralf ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org https://mail.scipy.org/mailman/listinfo/numpy-discussion
[Numpy-discussion] loadtxt and usecols
Hi, I've recently seen many students, coming from Matlab, struggling against the usecols argument of loadtxt. Most of them tried something like: loadtxt("foo.bar", usecols=2) or the ones with better documentation reading skills tried loadtxt("foo.bar", usecols=(2)) but none of them understood they had to write usecols=[2] or usecols=(2,). Is there a policy in numpy stating that this kind of arguments must be sequences ? I think that being able to an int or a sequence when a single column is needed would make this function a bit more user friendly for beginners. I would gladly submit a PR if noone disagrees. Regards. -- Irvin ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org https://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt ndmin option
On 31 May 2011, at 21:28, Pierre GM wrote: On May 31, 2011, at 6:37 PM, Derek Homeier wrote: On 31 May 2011, at 18:25, Pierre GM wrote: On May 31, 2011, at 5:52 PM, Derek Homeier wrote: I think stuff like multiple delimiters should have been dealt with before, as the right place to insert the ndmin code (which includes the decision to squeeze or not to squeeze as well as to add additional dimensions, if required) would be right at the end before the 'unpack' switch, or rather replacing the bit: if usemask: output = output.view(MaskedArray) output._mask = outputmask if unpack: return output.squeeze().T return output.squeeze() But there it's already not clear to me how to deal with the MaskedArray case... Oh, easy. You need to replace only the last three lines of genfromtxt with the ones from loadtxt (808-833). Then, if usemask is True, you need to use ma.atleast_Xd instead of np.atleast_Xd. Et voilà. Comments: * I would raise an exception if ndmin isn't correct *before* trying to read the file... * You could define a `collapse_function` that would be `np.atleast_1d`, `np.atleast_2d`, `ma.atleast_1d`... depending on the values of `usemask` and `ndmin`... Thanks, that helped to clean up a little bit. If you have any question about numpy.ma, don't hesitate to contact me directly. Thanks for the directions! I was not sure about the usemask case because it presently does not invoke .squeeze() either... The idea is that if `usemask` is True, you build a second array (the mask), that you attach to your main array at the very end (in the `output=output.view(MaskedArray), output._mask = mask` combo...). Afterwards, it's a regular MaskedArray that supports the .squeeze() method... OK, in both cases output.squeeze() is now used if ndimndmin and usemask is False - at least it does not break any tests, so it seems to work with MaskedArrays as well. On a possibly related note, genfromtxt also treats the 'unpack'ing of structured arrays differently from loadtxt (which returns a list of arrays in that case) - do you know if this is on purpose, or also rather missing functionality (I guess it might break recfromtxt()...)? Keep in mind that I haven't touched genfromtxt since 8-10 months or so. I wouldn't be surprised that it were lagging a bit behind loadtxt in terms of development. Yes, there'll be some tweaking to do for recfromtxt (it's OK for now if `ndmin` and `unpack` are the defaults) and others, but nothing major. Well, at long last I got to implement the above and added the corresponding tests for genfromtxt - with the exception of the dimension-0 cases, since genfromtxt raises an error on empty files. There already is a comment it should perhaps better return an empty array, so I am putting that idea up for discussion here again. I tried to devise a very basic test with masked arrays, just added to test_withmissing now. I also implemented the same unpacking behaviour for structured arrays and just made recfromtxt set unpack=False to work (or should it issue a warning?). The patches are up for review as commit 8ac01636 in my iocnv-wildcard branch: https://github.com/dhomeier/numpy/compare/master...iocnv-wildcard Cheers, Derek ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt ndmin option
Derek Homeier wrote: Hi Chris, On 31 May 2011, at 13:56, cgraves wrote: I've downloaded the latest numpy (1.6.0) and loadtxt has the ndmin option, however neither genfromtxt nor recfromtxt, which use loadtxt, have it. Should they have inherited the option? Who can make it happen? you are mistaken, genfromtxt is not using loadtxt (and could not possibly, since it has the more complex parser to handle missing data); thus ndmin could not be inherited automatically. It certainly would make sense to provide the same functionality for genfromtxt (which should then be inherited by [nd,ma,rec]fromtxt), so I'd go ahead and file a feature (enhancement) request. I can't promise I can take care of it myself, as I am less familiar with genfromtxt, but I'd certainly have a look at it. Does anyone have an opinion whether this is a case for reopening (yet again...) http://projects.scipy.org/numpy/ticket/1562 or create a new ticket? Thanks Derek. That would be greatly appreciated! Based on the follow-up messages in this thread, it looks like (hopefully) there will not be too much additional work in implementing it. For now I'll just use the temporary fix, a .reshape(-1), on any recfromtxt's that might read in a single row of data.. Kind regards, Chris -- View this message in context: http://old.nabble.com/loadtxt-savetxt-tickets-tp31238871p31769169.html Sent from the Numpy-discussion mailing list archive at Nabble.com. ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt ndmin option
Ralf Gommers-2 wrote: On Fri, May 6, 2011 at 12:57 PM, Derek Homeier de...@astro.physik.uni-goettingen.de wrote: On 6 May 2011, at 07:53, Ralf Gommers wrote: Looks okay, and I agree that it's better to fix it now. The timing is a bit unfortunate though, just after RC2. I'll have closer look tomorrow and if it can go in, probably tag RC3. If in the meantime a few more people could test this, that would be helpful. Ralf I agree, wish I had time to push this before rc2. I could add the explanatory comments mentioned above and switch to use the atleast_[12]d() solution, test that and push it in a couple of minutes, or should I better leave it as is now for testing? Quick follow-up: I just applied the above changes, added some tests to cover Ben's test cases and tested this with 1.6.0rc2 on OS X 10.5 i386+ppc + 10.6 x86_64 (Python2.7+3.2). So I'd be ready to push it to my repo and do my (first) pull request... Go ahead, I'll have a look at it tonight. Thanks for testing on several Pythons, that definitely helps. Done, the request only appears on my repo https://github.com/dhomeier/numpy/ is that correct? If someone could test it on Linux and Windows as well... Committed, thanks for all the work. The pull request was in the wrong place, that's a minor flaw in the github UI. After you press Pull Request you need to read the small print to see where it's going. Cheers, Ralf Dear all, I've downloaded the latest numpy (1.6.0) and loadtxt has the ndmin option, however neither genfromtxt nor recfromtxt, which use loadtxt, have it. Should they have inherited the option? Who can make it happen? Best, Chris -- View this message in context: http://old.nabble.com/loadtxt-savetxt-tickets-tp31238871p31740152.html Sent from the Numpy-discussion mailing list archive at Nabble.com. ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt ndmin option
Hi Chris, On 31 May 2011, at 13:56, cgraves wrote: I've downloaded the latest numpy (1.6.0) and loadtxt has the ndmin option, however neither genfromtxt nor recfromtxt, which use loadtxt, have it. Should they have inherited the option? Who can make it happen? you are mistaken, genfromtxt is not using loadtxt (and could not possibly, since it has the more complex parser to handle missing data); thus ndmin could not be inherited automatically. It certainly would make sense to provide the same functionality for genfromtxt (which should then be inherited by [nd,ma,rec]fromtxt), so I'd go ahead and file a feature (enhancement) request. I can't promise I can take care of it myself, as I am less familiar with genfromtxt, but I'd certainly have a look at it. Does anyone have an opinion whether this is a case for reopening (yet again...) http://projects.scipy.org/numpy/ticket/1562 or create a new ticket? Cheers, Derek ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt ndmin option
On May 31, 2011, at 4:53 PM, Derek Homeier wrote: Hi Chris, On 31 May 2011, at 13:56, cgraves wrote: I've downloaded the latest numpy (1.6.0) and loadtxt has the ndmin option, however neither genfromtxt nor recfromtxt, which use loadtxt, have it. Should they have inherited the option? Who can make it happen? you are mistaken, genfromtxt is not using loadtxt (and could not possibly, since it has the more complex parser to handle missing data); thus ndmin could not be inherited automatically. It certainly would make sense to provide the same functionality for genfromtxt (which should then be inherited by [nd,ma,rec]fromtxt), so I'd go ahead and file a feature (enhancement) request. I can't promise I can take care of it myself, as I am less familiar with genfromtxt, but I'd certainly have a look at it. Oh, that shouldn't be too difficult: 'ndmin' tells whether the array must be squeezed before being returned, right ? You can add some test at the very end of genfromtxt to check what to do with the output (whether to squeeze it or not, whether to transpose it or not)... If you don't mind doing it, I'd be quite grateful (I don't have time to work on numpy these days, much to my regret). Don't forget to change the user manual as well... ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt ndmin option
On 05/31/2011 10:18 AM, Pierre GM wrote: On May 31, 2011, at 4:53 PM, Derek Homeier wrote: Hi Chris, On 31 May 2011, at 13:56, cgraves wrote: I've downloaded the latest numpy (1.6.0) and loadtxt has the ndmin option, however neither genfromtxt nor recfromtxt, which use loadtxt, have it. Should they have inherited the option? Who can make it happen? you are mistaken, genfromtxt is not using loadtxt (and could not possibly, since it has the more complex parser to handle missing data); thus ndmin could not be inherited automatically. It certainly would make sense to provide the same functionality for genfromtxt (which should then be inherited by [nd,ma,rec]fromtxt), so I'd go ahead and file a feature (enhancement) request. I can't promise I can take care of it myself, as I am less familiar with genfromtxt, but I'd certainly have a look at it. Oh, that shouldn't be too difficult: 'ndmin' tells whether the array must be squeezed before being returned, right ? You can add some test at the very end of genfromtxt to check what to do with the output (whether to squeeze it or not, whether to transpose it or not)... If you don't mind doing it, I'd be quite grateful (I don't have time to work on numpy these days, much to my regret). Don't forget to change the user manual as well... ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion (Different function so different ticket.) Sure you can change the end of the code but that may hide various problem. Unlike loadtxt, genfromtxt has a lot of flexibility especially handling missing values and using converter functions. So I think that some examples must be provided that can not be handled by providing a suitable converter or that require multiple assumptions about input file (such as having more than one delimiter). Bruce ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt ndmin option
On Tue, May 31, 2011 at 10:33 AM, Bruce Southey bsout...@gmail.com wrote: On 05/31/2011 10:18 AM, Pierre GM wrote: On May 31, 2011, at 4:53 PM, Derek Homeier wrote: Hi Chris, On 31 May 2011, at 13:56, cgraves wrote: I've downloaded the latest numpy (1.6.0) and loadtxt has the ndmin option, however neither genfromtxt nor recfromtxt, which use loadtxt, have it. Should they have inherited the option? Who can make it happen? you are mistaken, genfromtxt is not using loadtxt (and could not possibly, since it has the more complex parser to handle missing data); thus ndmin could not be inherited automatically. It certainly would make sense to provide the same functionality for genfromtxt (which should then be inherited by [nd,ma,rec]fromtxt), so I'd go ahead and file a feature (enhancement) request. I can't promise I can take care of it myself, as I am less familiar with genfromtxt, but I'd certainly have a look at it. Oh, that shouldn't be too difficult: 'ndmin' tells whether the array must be squeezed before being returned, right ? You can add some test at the very end of genfromtxt to check what to do with the output (whether to squeeze it or not, whether to transpose it or not)... If you don't mind doing it, I'd be quite grateful (I don't have time to work on numpy these days, much to my regret). Don't forget to change the user manual as well... ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion (Different function so different ticket.) Sure you can change the end of the code but that may hide various problem. Unlike loadtxt, genfromtxt has a lot of flexibility especially handling missing values and using converter functions. So I think that some examples must be provided that can not be handled by providing a suitable converter or that require multiple assumptions about input file (such as having more than one delimiter). Bruce At this point, I wonder if it might be smarter to create a .atleast_Nd() function and use that everywhere it is needed. Having similar logic tailored for each loading function might be a little dangerous if bug fixes are made to one, but not the others. Ben Root ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt ndmin option
On 31 May 2011, at 17:33, Bruce Southey wrote: It certainly would make sense to provide the same functionality for genfromtxt (which should then be inherited by [nd,ma,rec]fromtxt), so I'd go ahead and file a feature (enhancement) request. I can't promise I can take care of it myself, as I am less familiar with genfromtxt, but I'd certainly have a look at it. Oh, that shouldn't be too difficult: 'ndmin' tells whether the array must be squeezed before being returned, right ? You can add some test at the very end of genfromtxt to check what to do with the output (whether to squeeze it or not, whether to transpose it or not)... If you don't mind doing it, I'd be quite grateful (I don't have time to work on numpy these days, much to my regret). Don't forget to change the user manual as well... ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion (Different function so different ticket.) Sure you can change the end of the code but that may hide various problem. Unlike loadtxt, genfromtxt has a lot of flexibility especially handling missing values and using converter functions. So I think that some examples must be provided that can not be handled by providing a suitable converter or that require multiple assumptions about input file (such as having more than one delimiter). I think stuff like multiple delimiters should have been dealt with before, as the right place to insert the ndmin code (which includes the decision to squeeze or not to squeeze as well as to add additional dimensions, if required) would be right at the end before the 'unpack' switch, or rather replacing the bit: if usemask: output = output.view(MaskedArray) output._mask = outputmask if unpack: return output.squeeze().T return output.squeeze() But there it's already not clear to me how to deal with the MaskedArray case... Cheers, Derek ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt ndmin option
On 31 May 2011, at 17:45, Benjamin Root wrote: At this point, I wonder if it might be smarter to create a .atleast_Nd() function and use that everywhere it is needed. Having similar logic tailored for each loading function might be a little dangerous if bug fixes are made to one, but not the others. Like a generalised version of .atleast_1d / .atleast_2d? It would also have to include an .atmost_Nd functionality of some sorts, to replace the .squeeze(), generally a good idea (e.g. something like np.atleast_Nd(X, ndmin=0, ndmax=-1), where the default is not to reduce the maximum number of dimensions...). But for the io routines the situation is a bit more complex, since different shapes are expected to be returned depending on the text input (e.g. (1, N) for a single row vs. (N, 1) for a single column of data). Cheers, Derek ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt ndmin option
On May 31, 2011, at 5:52 PM, Derek Homeier wrote: I think stuff like multiple delimiters should have been dealt with before, as the right place to insert the ndmin code (which includes the decision to squeeze or not to squeeze as well as to add additional dimensions, if required) would be right at the end before the 'unpack' switch, or rather replacing the bit: if usemask: output = output.view(MaskedArray) output._mask = outputmask if unpack: return output.squeeze().T return output.squeeze() But there it's already not clear to me how to deal with the MaskedArray case... Oh, easy. You need to replace only the last three lines of genfromtxt with the ones from loadtxt (808-833). Then, if usemask is True, you need to use ma.atleast_Xd instead of np.atleast_Xd. Et voilà. Comments: * I would raise an exception if ndmin isn't correct *before* trying to read the file... * You could define a `collapse_function` that would be `np.atleast_1d`, `np.atleast_2d`, `ma.atleast_1d`... depending on the values of `usemask` and `ndmin`... If you have any question about numpy.ma, don't hesitate to contact me directly. ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt ndmin option
On 31 May 2011, at 18:25, Pierre GM wrote: On May 31, 2011, at 5:52 PM, Derek Homeier wrote: I think stuff like multiple delimiters should have been dealt with before, as the right place to insert the ndmin code (which includes the decision to squeeze or not to squeeze as well as to add additional dimensions, if required) would be right at the end before the 'unpack' switch, or rather replacing the bit: if usemask: output = output.view(MaskedArray) output._mask = outputmask if unpack: return output.squeeze().T return output.squeeze() But there it's already not clear to me how to deal with the MaskedArray case... Oh, easy. You need to replace only the last three lines of genfromtxt with the ones from loadtxt (808-833). Then, if usemask is True, you need to use ma.atleast_Xd instead of np.atleast_Xd. Et voilà. Comments: * I would raise an exception if ndmin isn't correct *before* trying to read the file... * You could define a `collapse_function` that would be `np.atleast_1d`, `np.atleast_2d`, `ma.atleast_1d`... depending on the values of `usemask` and `ndmin`... If you have any question about numpy.ma, don't hesitate to contact me directly. Thanks for the directions! I was not sure about the usemask case because it presently does not invoke .squeeze() either... On a possibly related note, genfromtxt also treats the 'unpack'ing of structured arrays differently from loadtxt (which returns a list of arrays in that case) - do you know if this is on purpose, or also rather missing functionality (I guess it might break recfromtxt()...)? Cheers, Derek ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt ndmin option
On May 31, 2011, at 6:37 PM, Derek Homeier wrote: On 31 May 2011, at 18:25, Pierre GM wrote: On May 31, 2011, at 5:52 PM, Derek Homeier wrote: I think stuff like multiple delimiters should have been dealt with before, as the right place to insert the ndmin code (which includes the decision to squeeze or not to squeeze as well as to add additional dimensions, if required) would be right at the end before the 'unpack' switch, or rather replacing the bit: if usemask: output = output.view(MaskedArray) output._mask = outputmask if unpack: return output.squeeze().T return output.squeeze() But there it's already not clear to me how to deal with the MaskedArray case... Oh, easy. You need to replace only the last three lines of genfromtxt with the ones from loadtxt (808-833). Then, if usemask is True, you need to use ma.atleast_Xd instead of np.atleast_Xd. Et voilà. Comments: * I would raise an exception if ndmin isn't correct *before* trying to read the file... * You could define a `collapse_function` that would be `np.atleast_1d`, `np.atleast_2d`, `ma.atleast_1d`... depending on the values of `usemask` and `ndmin`... If you have any question about numpy.ma, don't hesitate to contact me directly. Thanks for the directions! I was not sure about the usemask case because it presently does not invoke .squeeze() either... The idea is that if `usemask` is True, you build a second array (the mask), that you attach to your main array at the very end (in the `output=output.view(MaskedArray), output._mask = mask` combo...). Afterwards, it's a regular MaskedArray that supports the .squeeze() method... On a possibly related note, genfromtxt also treats the 'unpack'ing of structured arrays differently from loadtxt (which returns a list of arrays in that case) - do you know if this is on purpose, or also rather missing functionality (I guess it might break recfromtxt()...)? Keep in mind that I haven't touched genfromtxt since 8-10 months or so. I wouldn't be surprised that it were lagging a bit behind loadtxt in terms of development. Yes, there'll be some tweaking to do for recfromtxt (it's OK for now if `ndmin` and `unpack` are the defaults) and others, but nothing major. ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt ndmin option
On Fri, May 6, 2011 at 12:57 PM, Derek Homeier de...@astro.physik.uni-goettingen.de wrote: On 6 May 2011, at 07:53, Ralf Gommers wrote: Looks okay, and I agree that it's better to fix it now. The timing is a bit unfortunate though, just after RC2. I'll have closer look tomorrow and if it can go in, probably tag RC3. If in the meantime a few more people could test this, that would be helpful. Ralf I agree, wish I had time to push this before rc2. I could add the explanatory comments mentioned above and switch to use the atleast_[12]d() solution, test that and push it in a couple of minutes, or should I better leave it as is now for testing? Quick follow-up: I just applied the above changes, added some tests to cover Ben's test cases and tested this with 1.6.0rc2 on OS X 10.5 i386+ppc + 10.6 x86_64 (Python2.7+3.2). So I'd be ready to push it to my repo and do my (first) pull request... Go ahead, I'll have a look at it tonight. Thanks for testing on several Pythons, that definitely helps. Done, the request only appears on my repo https://github.com/dhomeier/numpy/ is that correct? If someone could test it on Linux and Windows as well... Committed, thanks for all the work. The pull request was in the wrong place, that's a minor flaw in the github UI. After you press Pull Request you need to read the small print to see where it's going. Cheers, Ralf ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt ndmin option
On 6 May 2011, at 07:53, Ralf Gommers wrote: Looks okay, and I agree that it's better to fix it now. The timing is a bit unfortunate though, just after RC2. I'll have closer look tomorrow and if it can go in, probably tag RC3. If in the meantime a few more people could test this, that would be helpful. Ralf I agree, wish I had time to push this before rc2. I could add the explanatory comments mentioned above and switch to use the atleast_[12]d() solution, test that and push it in a couple of minutes, or should I better leave it as is now for testing? Quick follow-up: I just applied the above changes, added some tests to cover Ben's test cases and tested this with 1.6.0rc2 on OS X 10.5 i386+ppc + 10.6 x86_64 (Python2.7+3.2). So I'd be ready to push it to my repo and do my (first) pull request... Go ahead, I'll have a look at it tonight. Thanks for testing on several Pythons, that definitely helps. Done, the request only appears on my repo https://github.com/dhomeier/numpy/ is that correct? If someone could test it on Linux and Windows as well... Cheers, Derek ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt ndmin option
On Wed, May 4, 2011 at 11:08 PM, Paul Anton Letnes paul.anton.let...@gmail.com wrote: On 4. mai 2011, at 20.33, Benjamin Root wrote: On Wed, May 4, 2011 at 7:54 PM, Derek Homeier de...@astro.physik.uni-goettingen.de wrote: On 05.05.2011, at 2:40AM, Paul Anton Letnes wrote: But: Isn't the numpy.atleast_2d and numpy.atleast_1d functions written for this? Shouldn't we reuse them? Perhaps it's overkill, and perhaps it will reintroduce the 'transposed' problem? Yes, good point, one could replace the X.shape = (X.size, ) with X = np.atleast_1d(X), but for the ndmin=2 case, we'd need to replace X.shape = (X.size, 1) with X = np.atleast_2d(X).T - not sure which solution is more efficient in terms of memory access etc... Cheers, Derek I can confirm that the current behavior is not sufficient for all of the original corner cases that ndmin was supposed to address. Keep in mind that np.loadtxt takes a one-column data file and a one-row data file down to the same shape. I don't see how the current code is able to produce the correct array shape when ndmin=2. Do we have some sort of counter in loadtxt for counting the number of rows and columns read? Could we use those to help guide the ndmin=2 case? I think that using atleast_1d(X) might be a bit overkill, but it would be very clear as to the code's intent. I don't think we have to worry about memory usage if we limit its use to only situations where ndmin is greater than the number of dimensions of the array. In those cases, the array is either an empty result, a scalar value (in which memory access is trivial), or 1-d (in which a transpose is cheap). What if one does things the other way around - avoid calling squeeze until _after_ doing the atleast_Nd() magic? That way the row/column information should be conserved, right? Also, we avoid transposing, memory use, ... Oh, and someone could conceivably have a _looong_ 1D file, but would want it read as a 2D array. Paul @Derek, good catch with noticing the error in the tests. We do still need to handle the case I mentioned, however. I have attached an example script to demonstrate the issue. In this script, I would expect the second-to-last array to be a shape of (1, 5). I believe that the single-row, multi-column case would actually be the more common type of edge-case encountered by users than the others. Therefore, I believe that this ndmin fix is not adequate until this is addressed. @Paul, we can't call squeeze after doing the atleast_Nd() magic. That would just undo whatever we had just done. Also, wrt the transpose, a (1, 10) array looks the same in memory as a (10, 1) array, right? Ben Root loadtest.py Description: Binary data ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt ndmin option
On Thu, May 5, 2011 at 10:49 AM, Benjamin Root ben.r...@ou.edu wrote: On Wed, May 4, 2011 at 11:08 PM, Paul Anton Letnes paul.anton.let...@gmail.com wrote: On 4. mai 2011, at 20.33, Benjamin Root wrote: On Wed, May 4, 2011 at 7:54 PM, Derek Homeier de...@astro.physik.uni-goettingen.de wrote: On 05.05.2011, at 2:40AM, Paul Anton Letnes wrote: But: Isn't the numpy.atleast_2d and numpy.atleast_1d functions written for this? Shouldn't we reuse them? Perhaps it's overkill, and perhaps it will reintroduce the 'transposed' problem? Yes, good point, one could replace the X.shape = (X.size, ) with X = np.atleast_1d(X), but for the ndmin=2 case, we'd need to replace X.shape = (X.size, 1) with X = np.atleast_2d(X).T - not sure which solution is more efficient in terms of memory access etc... Cheers, Derek I can confirm that the current behavior is not sufficient for all of the original corner cases that ndmin was supposed to address. Keep in mind that np.loadtxt takes a one-column data file and a one-row data file down to the same shape. I don't see how the current code is able to produce the correct array shape when ndmin=2. Do we have some sort of counter in loadtxt for counting the number of rows and columns read? Could we use those to help guide the ndmin=2 case? I think that using atleast_1d(X) might be a bit overkill, but it would be very clear as to the code's intent. I don't think we have to worry about memory usage if we limit its use to only situations where ndmin is greater than the number of dimensions of the array. In those cases, the array is either an empty result, a scalar value (in which memory access is trivial), or 1-d (in which a transpose is cheap). What if one does things the other way around - avoid calling squeeze until _after_ doing the atleast_Nd() magic? That way the row/column information should be conserved, right? Also, we avoid transposing, memory use, ... Oh, and someone could conceivably have a _looong_ 1D file, but would want it read as a 2D array. Paul @Derek, good catch with noticing the error in the tests. We do still need to handle the case I mentioned, however. I have attached an example script to demonstrate the issue. In this script, I would expect the second-to-last array to be a shape of (1, 5). I believe that the single-row, multi-column case would actually be the more common type of edge-case encountered by users than the others. Therefore, I believe that this ndmin fix is not adequate until this is addressed. Apologies Derek, your patch does address the issue I raised. Ben Root ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt ndmin option
On 5. mai 2011, at 08.49, Benjamin Root wrote: On Wed, May 4, 2011 at 11:08 PM, Paul Anton Letnes paul.anton.let...@gmail.com wrote: On 4. mai 2011, at 20.33, Benjamin Root wrote: On Wed, May 4, 2011 at 7:54 PM, Derek Homeier de...@astro.physik.uni-goettingen.de wrote: On 05.05.2011, at 2:40AM, Paul Anton Letnes wrote: But: Isn't the numpy.atleast_2d and numpy.atleast_1d functions written for this? Shouldn't we reuse them? Perhaps it's overkill, and perhaps it will reintroduce the 'transposed' problem? Yes, good point, one could replace the X.shape = (X.size, ) with X = np.atleast_1d(X), but for the ndmin=2 case, we'd need to replace X.shape = (X.size, 1) with X = np.atleast_2d(X).T - not sure which solution is more efficient in terms of memory access etc... Cheers, Derek I can confirm that the current behavior is not sufficient for all of the original corner cases that ndmin was supposed to address. Keep in mind that np.loadtxt takes a one-column data file and a one-row data file down to the same shape. I don't see how the current code is able to produce the correct array shape when ndmin=2. Do we have some sort of counter in loadtxt for counting the number of rows and columns read? Could we use those to help guide the ndmin=2 case? I think that using atleast_1d(X) might be a bit overkill, but it would be very clear as to the code's intent. I don't think we have to worry about memory usage if we limit its use to only situations where ndmin is greater than the number of dimensions of the array. In those cases, the array is either an empty result, a scalar value (in which memory access is trivial), or 1-d (in which a transpose is cheap). What if one does things the other way around - avoid calling squeeze until _after_ doing the atleast_Nd() magic? That way the row/column information should be conserved, right? Also, we avoid transposing, memory use, ... Oh, and someone could conceivably have a _looong_ 1D file, but would want it read as a 2D array. Paul @Derek, good catch with noticing the error in the tests. We do still need to handle the case I mentioned, however. I have attached an example script to demonstrate the issue. In this script, I would expect the second-to-last array to be a shape of (1, 5). I believe that the single-row, multi-column case would actually be the more common type of edge-case encountered by users than the others. Therefore, I believe that this ndmin fix is not adequate until this is addressed. @Paul, we can't call squeeze after doing the atleast_Nd() magic. That would just undo whatever we had just done. Also, wrt the transpose, a (1, 10) array looks the same in memory as a (10, 1) array, right? Agree. I thought more along the lines of (pseudocode-ish) if ndmin == 0: squeeze() if ndmin == 1: atleast_1D() elif ndmin == 2: atleast_2D() else: I don't rightly know what would go here, maybe raise ValueError? That would avoid the squeeze call before the atleast_Nd magic. But the code was changed, so I think my comment doesn't make sense anymore. It's probably fine the way it is! Paul ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt ndmin option
On Thu, May 5, 2011 at 1:08 PM, Paul Anton Letnes paul.anton.let...@gmail.com wrote: On 5. mai 2011, at 08.49, Benjamin Root wrote: On Wed, May 4, 2011 at 11:08 PM, Paul Anton Letnes paul.anton.let...@gmail.com wrote: On 4. mai 2011, at 20.33, Benjamin Root wrote: On Wed, May 4, 2011 at 7:54 PM, Derek Homeier de...@astro.physik.uni-goettingen.de wrote: On 05.05.2011, at 2:40AM, Paul Anton Letnes wrote: But: Isn't the numpy.atleast_2d and numpy.atleast_1d functions written for this? Shouldn't we reuse them? Perhaps it's overkill, and perhaps it will reintroduce the 'transposed' problem? Yes, good point, one could replace the X.shape = (X.size, ) with X = np.atleast_1d(X), but for the ndmin=2 case, we'd need to replace X.shape = (X.size, 1) with X = np.atleast_2d(X).T - not sure which solution is more efficient in terms of memory access etc... Cheers, Derek I can confirm that the current behavior is not sufficient for all of the original corner cases that ndmin was supposed to address. Keep in mind that np.loadtxt takes a one-column data file and a one-row data file down to the same shape. I don't see how the current code is able to produce the correct array shape when ndmin=2. Do we have some sort of counter in loadtxt for counting the number of rows and columns read? Could we use those to help guide the ndmin=2 case? I think that using atleast_1d(X) might be a bit overkill, but it would be very clear as to the code's intent. I don't think we have to worry about memory usage if we limit its use to only situations where ndmin is greater than the number of dimensions of the array. In those cases, the array is either an empty result, a scalar value (in which memory access is trivial), or 1-d (in which a transpose is cheap). What if one does things the other way around - avoid calling squeeze until _after_ doing the atleast_Nd() magic? That way the row/column information should be conserved, right? Also, we avoid transposing, memory use, ... Oh, and someone could conceivably have a _looong_ 1D file, but would want it read as a 2D array. Paul @Derek, good catch with noticing the error in the tests. We do still need to handle the case I mentioned, however. I have attached an example script to demonstrate the issue. In this script, I would expect the second-to-last array to be a shape of (1, 5). I believe that the single-row, multi-column case would actually be the more common type of edge-case encountered by users than the others. Therefore, I believe that this ndmin fix is not adequate until this is addressed. @Paul, we can't call squeeze after doing the atleast_Nd() magic. That would just undo whatever we had just done. Also, wrt the transpose, a (1, 10) array looks the same in memory as a (10, 1) array, right? Agree. I thought more along the lines of (pseudocode-ish) if ndmin == 0: squeeze() if ndmin == 1: atleast_1D() elif ndmin == 2: atleast_2D() else: I don't rightly know what would go here, maybe raise ValueError? That would avoid the squeeze call before the atleast_Nd magic. But the code was changed, so I think my comment doesn't make sense anymore. It's probably fine the way it is! Paul I have thought of that too, but the problem with that approach is that after reading the file, X will have 2 or 3 dimensions, regardless of how many singleton dims were in the file. A squeeze will always be needed. Also, the purpose of squeeze is opposite that of the atleast_*d() functions: squeeze reduces dimensions, while atleast_*d will add dimensions. Therefore, I re-iterate... the patch by Derek gets the job done. I have tested it for a wide variety of inputs for both regular arrays and record arrays. Is there room for improvements? Yes, but I think that can wait for later. Derek's patch however fixes an important bug in the ndmin implementation and should be included for the release. Ben Root ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt ndmin option
On Thu, May 5, 2011 at 9:18 PM, Benjamin Root ben.r...@ou.edu wrote: On Thu, May 5, 2011 at 1:08 PM, Paul Anton Letnes paul.anton.let...@gmail.com wrote: On 5. mai 2011, at 08.49, Benjamin Root wrote: On Wed, May 4, 2011 at 11:08 PM, Paul Anton Letnes paul.anton.let...@gmail.com wrote: On 4. mai 2011, at 20.33, Benjamin Root wrote: On Wed, May 4, 2011 at 7:54 PM, Derek Homeier de...@astro.physik.uni-goettingen.de wrote: On 05.05.2011, at 2:40AM, Paul Anton Letnes wrote: But: Isn't the numpy.atleast_2d and numpy.atleast_1d functions written for this? Shouldn't we reuse them? Perhaps it's overkill, and perhaps it will reintroduce the 'transposed' problem? Yes, good point, one could replace the X.shape = (X.size, ) with X = np.atleast_1d(X), but for the ndmin=2 case, we'd need to replace X.shape = (X.size, 1) with X = np.atleast_2d(X).T - not sure which solution is more efficient in terms of memory access etc... Cheers, Derek I can confirm that the current behavior is not sufficient for all of the original corner cases that ndmin was supposed to address. Keep in mind that np.loadtxt takes a one-column data file and a one-row data file down to the same shape. I don't see how the current code is able to produce the correct array shape when ndmin=2. Do we have some sort of counter in loadtxt for counting the number of rows and columns read? Could we use those to help guide the ndmin=2 case? I think that using atleast_1d(X) might be a bit overkill, but it would be very clear as to the code's intent. I don't think we have to worry about memory usage if we limit its use to only situations where ndmin is greater than the number of dimensions of the array. In those cases, the array is either an empty result, a scalar value (in which memory access is trivial), or 1-d (in which a transpose is cheap). What if one does things the other way around - avoid calling squeeze until _after_ doing the atleast_Nd() magic? That way the row/column information should be conserved, right? Also, we avoid transposing, memory use, ... Oh, and someone could conceivably have a _looong_ 1D file, but would want it read as a 2D array. Paul @Derek, good catch with noticing the error in the tests. We do still need to handle the case I mentioned, however. I have attached an example script to demonstrate the issue. In this script, I would expect the second-to-last array to be a shape of (1, 5). I believe that the single-row, multi-column case would actually be the more common type of edge-case encountered by users than the others. Therefore, I believe that this ndmin fix is not adequate until this is addressed. @Paul, we can't call squeeze after doing the atleast_Nd() magic. That would just undo whatever we had just done. Also, wrt the transpose, a (1, 10) array looks the same in memory as a (10, 1) array, right? Agree. I thought more along the lines of (pseudocode-ish) if ndmin == 0: squeeze() if ndmin == 1: atleast_1D() elif ndmin == 2: atleast_2D() else: I don't rightly know what would go here, maybe raise ValueError? That would avoid the squeeze call before the atleast_Nd magic. But the code was changed, so I think my comment doesn't make sense anymore. It's probably fine the way it is! Paul I have thought of that too, but the problem with that approach is that after reading the file, X will have 2 or 3 dimensions, regardless of how many singleton dims were in the file. A squeeze will always be needed. Also, the purpose of squeeze is opposite that of the atleast_*d() functions: squeeze reduces dimensions, while atleast_*d will add dimensions. Therefore, I re-iterate... the patch by Derek gets the job done. I have tested it for a wide variety of inputs for both regular arrays and record arrays. Is there room for improvements? Yes, but I think that can wait for later. Derek's patch however fixes an important bug in the ndmin implementation and should be included for the release. Two questions: can you point me to the patch/ticket, and is this a regression? Thanks, Ralf ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt ndmin option
On Thu, May 5, 2011 at 2:33 PM, Ralf Gommers ralf.gomm...@googlemail.comwrote: On Thu, May 5, 2011 at 9:18 PM, Benjamin Root ben.r...@ou.edu wrote: On Thu, May 5, 2011 at 1:08 PM, Paul Anton Letnes paul.anton.let...@gmail.com wrote: On 5. mai 2011, at 08.49, Benjamin Root wrote: On Wed, May 4, 2011 at 11:08 PM, Paul Anton Letnes paul.anton.let...@gmail.com wrote: On 4. mai 2011, at 20.33, Benjamin Root wrote: On Wed, May 4, 2011 at 7:54 PM, Derek Homeier de...@astro.physik.uni-goettingen.de wrote: On 05.05.2011, at 2:40AM, Paul Anton Letnes wrote: But: Isn't the numpy.atleast_2d and numpy.atleast_1d functions written for this? Shouldn't we reuse them? Perhaps it's overkill, and perhaps it will reintroduce the 'transposed' problem? Yes, good point, one could replace the X.shape = (X.size, ) with X = np.atleast_1d(X), but for the ndmin=2 case, we'd need to replace X.shape = (X.size, 1) with X = np.atleast_2d(X).T - not sure which solution is more efficient in terms of memory access etc... Cheers, Derek I can confirm that the current behavior is not sufficient for all of the original corner cases that ndmin was supposed to address. Keep in mind that np.loadtxt takes a one-column data file and a one-row data file down to the same shape. I don't see how the current code is able to produce the correct array shape when ndmin=2. Do we have some sort of counter in loadtxt for counting the number of rows and columns read? Could we use those to help guide the ndmin=2 case? I think that using atleast_1d(X) might be a bit overkill, but it would be very clear as to the code's intent. I don't think we have to worry about memory usage if we limit its use to only situations where ndmin is greater than the number of dimensions of the array. In those cases, the array is either an empty result, a scalar value (in which memory access is trivial), or 1-d (in which a transpose is cheap). What if one does things the other way around - avoid calling squeeze until _after_ doing the atleast_Nd() magic? That way the row/column information should be conserved, right? Also, we avoid transposing, memory use, ... Oh, and someone could conceivably have a _looong_ 1D file, but would want it read as a 2D array. Paul @Derek, good catch with noticing the error in the tests. We do still need to handle the case I mentioned, however. I have attached an example script to demonstrate the issue. In this script, I would expect the second-to-last array to be a shape of (1, 5). I believe that the single-row, multi-column case would actually be the more common type of edge-case encountered by users than the others. Therefore, I believe that this ndmin fix is not adequate until this is addressed. @Paul, we can't call squeeze after doing the atleast_Nd() magic. That would just undo whatever we had just done. Also, wrt the transpose, a (1, 10) array looks the same in memory as a (10, 1) array, right? Agree. I thought more along the lines of (pseudocode-ish) if ndmin == 0: squeeze() if ndmin == 1: atleast_1D() elif ndmin == 2: atleast_2D() else: I don't rightly know what would go here, maybe raise ValueError? That would avoid the squeeze call before the atleast_Nd magic. But the code was changed, so I think my comment doesn't make sense anymore. It's probably fine the way it is! Paul I have thought of that too, but the problem with that approach is that after reading the file, X will have 2 or 3 dimensions, regardless of how many singleton dims were in the file. A squeeze will always be needed. Also, the purpose of squeeze is opposite that of the atleast_*d() functions: squeeze reduces dimensions, while atleast_*d will add dimensions. Therefore, I re-iterate... the patch by Derek gets the job done. I have tested it for a wide variety of inputs for both regular arrays and record arrays. Is there room for improvements? Yes, but I think that can wait for later. Derek's patch however fixes an important bug in the ndmin implementation and should be included for the release. Two questions: can you point me to the patch/ticket, and is this a regression? Thanks, Ralf I don't know if he did a pull-request or not, but here is the link he provided earlier in the thread. https://github.com/dhomeier/numpy/compare/master...ndmin-cols Technically, this is not a regression as the ndmin feature is new in this release. However, the problem that ndmin is supposed to address is not fixed by the current implementation for the rc. Essentially, a single-row, multi-column file with ndmin=2 comes out as a Nx1 array which is the same result for a multi-row, single-column file. My feeling is that if we let the current implementation stand as is, and developers use it in their code, then fixing it in a later
Re: [Numpy-discussion] loadtxt ndmin option
On Thu, May 5, 2011 at 9:46 PM, Benjamin Root ben.r...@ou.edu wrote: On Thu, May 5, 2011 at 2:33 PM, Ralf Gommers ralf.gomm...@googlemail.comwrote: On Thu, May 5, 2011 at 9:18 PM, Benjamin Root ben.r...@ou.edu wrote: On Thu, May 5, 2011 at 1:08 PM, Paul Anton Letnes paul.anton.let...@gmail.com wrote: On 5. mai 2011, at 08.49, Benjamin Root wrote: On Wed, May 4, 2011 at 11:08 PM, Paul Anton Letnes paul.anton.let...@gmail.com wrote: On 4. mai 2011, at 20.33, Benjamin Root wrote: On Wed, May 4, 2011 at 7:54 PM, Derek Homeier de...@astro.physik.uni-goettingen.de wrote: On 05.05.2011, at 2:40AM, Paul Anton Letnes wrote: But: Isn't the numpy.atleast_2d and numpy.atleast_1d functions written for this? Shouldn't we reuse them? Perhaps it's overkill, and perhaps it will reintroduce the 'transposed' problem? Yes, good point, one could replace the X.shape = (X.size, ) with X = np.atleast_1d(X), but for the ndmin=2 case, we'd need to replace X.shape = (X.size, 1) with X = np.atleast_2d(X).T - not sure which solution is more efficient in terms of memory access etc... Cheers, Derek I can confirm that the current behavior is not sufficient for all of the original corner cases that ndmin was supposed to address. Keep in mind that np.loadtxt takes a one-column data file and a one-row data file down to the same shape. I don't see how the current code is able to produce the correct array shape when ndmin=2. Do we have some sort of counter in loadtxt for counting the number of rows and columns read? Could we use those to help guide the ndmin=2 case? I think that using atleast_1d(X) might be a bit overkill, but it would be very clear as to the code's intent. I don't think we have to worry about memory usage if we limit its use to only situations where ndmin is greater than the number of dimensions of the array. In those cases, the array is either an empty result, a scalar value (in which memory access is trivial), or 1-d (in which a transpose is cheap). What if one does things the other way around - avoid calling squeeze until _after_ doing the atleast_Nd() magic? That way the row/column information should be conserved, right? Also, we avoid transposing, memory use, ... Oh, and someone could conceivably have a _looong_ 1D file, but would want it read as a 2D array. Paul @Derek, good catch with noticing the error in the tests. We do still need to handle the case I mentioned, however. I have attached an example script to demonstrate the issue. In this script, I would expect the second-to-last array to be a shape of (1, 5). I believe that the single-row, multi-column case would actually be the more common type of edge-case encountered by users than the others. Therefore, I believe that this ndmin fix is not adequate until this is addressed. @Paul, we can't call squeeze after doing the atleast_Nd() magic. That would just undo whatever we had just done. Also, wrt the transpose, a (1, 10) array looks the same in memory as a (10, 1) array, right? Agree. I thought more along the lines of (pseudocode-ish) if ndmin == 0: squeeze() if ndmin == 1: atleast_1D() elif ndmin == 2: atleast_2D() else: I don't rightly know what would go here, maybe raise ValueError? That would avoid the squeeze call before the atleast_Nd magic. But the code was changed, so I think my comment doesn't make sense anymore. It's probably fine the way it is! Paul I have thought of that too, but the problem with that approach is that after reading the file, X will have 2 or 3 dimensions, regardless of how many singleton dims were in the file. A squeeze will always be needed. Also, the purpose of squeeze is opposite that of the atleast_*d() functions: squeeze reduces dimensions, while atleast_*d will add dimensions. Therefore, I re-iterate... the patch by Derek gets the job done. I have tested it for a wide variety of inputs for both regular arrays and record arrays. Is there room for improvements? Yes, but I think that can wait for later. Derek's patch however fixes an important bug in the ndmin implementation and should be included for the release. Two questions: can you point me to the patch/ticket, and is this a regression? Thanks, Ralf I don't know if he did a pull-request or not, but here is the link he provided earlier in the thread. https://github.com/dhomeier/numpy/compare/master...ndmin-cols Technically, this is not a regression as the ndmin feature is new in this release. Yes right, I forgot this was a recent change. However, the problem that ndmin is supposed to address is not fixed by the current implementation for the rc. Essentially, a single-row, multi-column file with ndmin=2 comes out as a Nx1 array which is the same result for a multi-row, single-column
Re: [Numpy-discussion] loadtxt ndmin option
On 5 May 2011, at 22:53, Derek Homeier wrote: However, the problem that ndmin is supposed to address is not fixed by the current implementation for the rc. Essentially, a single- row, multi-column file with ndmin=2 comes out as a Nx1 array which is the same result for a multi-row, single-column file. My feeling is that if we let the current implementation stand as is, and developers use it in their code, then fixing it in a later release would introduce more problems (maybe the devels would transpose the result themselves or something). Better to fix it now in rc with the two lines of code (and the correction to the tests), then to introduce a buggy feature that will be hard to fix in future releases, IMHO. Looks okay, and I agree that it's better to fix it now. The timing is a bit unfortunate though, just after RC2. I'll have closer look tomorrow and if it can go in, probably tag RC3. If in the meantime a few more people could test this, that would be helpful. Ralf I agree, wish I had time to push this before rc2. I could add the explanatory comments mentioned above and switch to use the atleast_[12]d() solution, test that and push it in a couple of minutes, or should I better leave it as is now for testing? Quick follow-up: I just applied the above changes, added some tests to cover Ben's test cases and tested this with 1.6.0rc2 on OS X 10.5 i386+ppc + 10.6 x86_64 (Python2.7+3.2). So I'd be ready to push it to my repo and do my (first) pull request... Cheers, Derek ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt ndmin option
On Fri, May 6, 2011 at 12:12 AM, Derek Homeier de...@astro.physik.uni-goettingen.de wrote: On 5 May 2011, at 22:53, Derek Homeier wrote: However, the problem that ndmin is supposed to address is not fixed by the current implementation for the rc. Essentially, a single- row, multi-column file with ndmin=2 comes out as a Nx1 array which is the same result for a multi-row, single-column file. My feeling is that if we let the current implementation stand as is, and developers use it in their code, then fixing it in a later release would introduce more problems (maybe the devels would transpose the result themselves or something). Better to fix it now in rc with the two lines of code (and the correction to the tests), then to introduce a buggy feature that will be hard to fix in future releases, IMHO. Looks okay, and I agree that it's better to fix it now. The timing is a bit unfortunate though, just after RC2. I'll have closer look tomorrow and if it can go in, probably tag RC3. If in the meantime a few more people could test this, that would be helpful. Ralf I agree, wish I had time to push this before rc2. I could add the explanatory comments mentioned above and switch to use the atleast_[12]d() solution, test that and push it in a couple of minutes, or should I better leave it as is now for testing? Quick follow-up: I just applied the above changes, added some tests to cover Ben's test cases and tested this with 1.6.0rc2 on OS X 10.5 i386+ppc + 10.6 x86_64 (Python2.7+3.2). So I'd be ready to push it to my repo and do my (first) pull request... Go ahead, I'll have a look at it tonight. Thanks for testing on several Pythons, that definitely helps. Ralf ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt ndmin option
Hi Paul, I've got back to your suggestion re. the ndmin flag for loadtxt from a few weeks ago... On 27.03.2011, at 12:09PM, Paul Anton Letnes wrote: 1562: I attach a possible patch. This could also be the default behavior to my mind, since the function caller can simply call numpy.squeeze if needed. Changing default behavior would probably break old code, however. See comments on Trac as well. Your patch is better, but there is one thing I disagree with. 808if X.ndim ndmin: 809if ndmin == 1: 810X.shape = (X.size, ) 811elif ndmin == 2: 812X.shape = (X.size, 1) The last line should be: 812X.shape = (1, X.size) If someone wants a 2D array out, they would most likely expect a one-row file to come out as a one-row array, not the other way around. IMHO. I think you are completely right for the test case with one row. More generally though, since a file of N rows and M columns is read into an array of shape (N, M), ndmin=2 should enforce X.shape = (1, X.size) for single-row input, and X.shape = (X.size, 1) for single-column input. I thought this would be handled automatically by preserving the original 2 dimensions, but apparently with single-row/multi-column input an extra dimension 1 is prepended when the array is returned from the parser. I've put up a fix for this at https://github.com/dhomeier/numpy/compare/master...ndmin-cols and also tested the patch against 1.6.0.rc2. Cheers, Derek ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt ndmin option
On 4. mai 2011, at 17.34, Derek Homeier wrote: Hi Paul, I've got back to your suggestion re. the ndmin flag for loadtxt from a few weeks ago... On 27.03.2011, at 12:09PM, Paul Anton Letnes wrote: 1562: I attach a possible patch. This could also be the default behavior to my mind, since the function caller can simply call numpy.squeeze if needed. Changing default behavior would probably break old code, however. See comments on Trac as well. Your patch is better, but there is one thing I disagree with. 808if X.ndim ndmin: 809if ndmin == 1: 810X.shape = (X.size, ) 811elif ndmin == 2: 812X.shape = (X.size, 1) The last line should be: 812X.shape = (1, X.size) If someone wants a 2D array out, they would most likely expect a one-row file to come out as a one-row array, not the other way around. IMHO. I think you are completely right for the test case with one row. More generally though, since a file of N rows and M columns is read into an array of shape (N, M), ndmin=2 should enforce X.shape = (1, X.size) for single-row input, and X.shape = (X.size, 1) for single-column input. I thought this would be handled automatically by preserving the original 2 dimensions, but apparently with single-row/multi-column input an extra dimension 1 is prepended when the array is returned from the parser. I've put up a fix for this at https://github.com/dhomeier/numpy/compare/master...ndmin-cols and also tested the patch against 1.6.0.rc2. Cheers, Derek Looks sensible to me at least! But: Isn't the numpy.atleast_2d and numpy.atleast_1d functions written for this? Shouldn't we reuse them? Perhaps it's overkill, and perhaps it will reintroduce the 'transposed' problem? Paul ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt ndmin option
On 05.05.2011, at 2:40AM, Paul Anton Letnes wrote: But: Isn't the numpy.atleast_2d and numpy.atleast_1d functions written for this? Shouldn't we reuse them? Perhaps it's overkill, and perhaps it will reintroduce the 'transposed' problem? Yes, good point, one could replace the X.shape = (X.size, ) with X = np.atleast_1d(X), but for the ndmin=2 case, we'd need to replace X.shape = (X.size, 1) with X = np.atleast_2d(X).T - not sure which solution is more efficient in terms of memory access etc... Cheers, Derek ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt ndmin option
On Wed, May 4, 2011 at 7:54 PM, Derek Homeier de...@astro.physik.uni-goettingen.de wrote: On 05.05.2011, at 2:40AM, Paul Anton Letnes wrote: But: Isn't the numpy.atleast_2d and numpy.atleast_1d functions written for this? Shouldn't we reuse them? Perhaps it's overkill, and perhaps it will reintroduce the 'transposed' problem? Yes, good point, one could replace the X.shape = (X.size, ) with X = np.atleast_1d(X), but for the ndmin=2 case, we'd need to replace X.shape = (X.size, 1) with X = np.atleast_2d(X).T - not sure which solution is more efficient in terms of memory access etc... Cheers, Derek I can confirm that the current behavior is not sufficient for all of the original corner cases that ndmin was supposed to address. Keep in mind that np.loadtxt takes a one-column data file and a one-row data file down to the same shape. I don't see how the current code is able to produce the correct array shape when ndmin=2. Do we have some sort of counter in loadtxt for counting the number of rows and columns read? Could we use those to help guide the ndmin=2 case? I think that using atleast_1d(X) might be a bit overkill, but it would be very clear as to the code's intent. I don't think we have to worry about memory usage if we limit its use to only situations where ndmin is greater than the number of dimensions of the array. In those cases, the array is either an empty result, a scalar value (in which memory access is trivial), or 1-d (in which a transpose is cheap). Ben Root ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt ndmin option
On 4. mai 2011, at 20.33, Benjamin Root wrote: On Wed, May 4, 2011 at 7:54 PM, Derek Homeier de...@astro.physik.uni-goettingen.de wrote: On 05.05.2011, at 2:40AM, Paul Anton Letnes wrote: But: Isn't the numpy.atleast_2d and numpy.atleast_1d functions written for this? Shouldn't we reuse them? Perhaps it's overkill, and perhaps it will reintroduce the 'transposed' problem? Yes, good point, one could replace the X.shape = (X.size, ) with X = np.atleast_1d(X), but for the ndmin=2 case, we'd need to replace X.shape = (X.size, 1) with X = np.atleast_2d(X).T - not sure which solution is more efficient in terms of memory access etc... Cheers, Derek I can confirm that the current behavior is not sufficient for all of the original corner cases that ndmin was supposed to address. Keep in mind that np.loadtxt takes a one-column data file and a one-row data file down to the same shape. I don't see how the current code is able to produce the correct array shape when ndmin=2. Do we have some sort of counter in loadtxt for counting the number of rows and columns read? Could we use those to help guide the ndmin=2 case? I think that using atleast_1d(X) might be a bit overkill, but it would be very clear as to the code's intent. I don't think we have to worry about memory usage if we limit its use to only situations where ndmin is greater than the number of dimensions of the array. In those cases, the array is either an empty result, a scalar value (in which memory access is trivial), or 1-d (in which a transpose is cheap). What if one does things the other way around - avoid calling squeeze until _after_ doing the atleast_Nd() magic? That way the row/column information should be conserved, right? Also, we avoid transposing, memory use, ... Oh, and someone could conceivably have a _looong_ 1D file, but would want it read as a 2D array. Paul ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt/savetxt tickets
On 03/31/2011 12:02 PM, Derek Homeier wrote: On 31 Mar 2011, at 17:03, Bruce Southey wrote: This is an invalid ticket because the docstring clearly states that in 3 different, yet critical places, that missing values are not handled here: Each row in the text file must have the same number of values. genfromtxt : Load data with missing values handled as specified. This function aims to be a fast reader for simply formatted files. The `genfromtxt` function provides more sophisticated handling of, e.g., lines with missing values. Really I am trying to separate the usage of loadtxt and genfromtxt to avoid unnecessary duplication and confusion. Part of this is historical because loadtxt was added in 2007 and genfromtxt was added in 2009. So really certain features of loadtxt have been 'kept' for backwards compatibility purposes yet these features can be 'abused' to handle missing data. But I really consider that any missing values should cause loadtxt to fail. OK, I was not aware of the design issues of loadtxt vs. genfromtxt - you could probably say also for historical reasons since I have not used genfromtxt much so far. Anyway the docstring statement Converters can also be used to provide a default value for missing data: then appears quite misleading, or an invitation to abuse, if you will. This should better be removed from the documentation then, or users explicitly discouraged from using converters instead of genfromtxt (I don't see how you could completely prevent using converters in this way). The patch is incorrect because it should not include a space in the split() as indicated in the comment by the original reporter. Of The split('\r\n') alone caused test_dtype_with_object(self) to fail, probably because it relies on stripping the blanks. But maybe the test is ill- formed? course a corrected patch alone still is not sufficient to address the problem without the user providing the correct converter. Also you start to run into problems with multiple delimiters (such as one space versus two spaces) so you start down the path to add all the features that duplicate genfromtxt. Given that genfromtxt provides that functionality more conveniently, I agree again users should be encouraged to use this instead of converters. But the actual tab-problem causes in fact an issue not related to missing values at all (well, depending on what you call a missing value). I am describing an example on the ticket. Cheers, Derek ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion Okay I see that 1071 got closed which I am fine with. I think that your following example should be a test because the two spaces should not be removed with a tab delimiter: np.loadtxt(StringIO(aa\tbb\n \t \ncc\t), delimiter='\t', dtype=np.dtype([('label', 'S4'), ('comment', 'S4')])) Thanks very much for fixing this! Bruce ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt/savetxt tickets
On Mon, Apr 4, 2011 at 9:59 AM, Bruce Southey bsout...@gmail.com wrote: On 03/31/2011 12:02 PM, Derek Homeier wrote: On 31 Mar 2011, at 17:03, Bruce Southey wrote: This is an invalid ticket because the docstring clearly states that in 3 different, yet critical places, that missing values are not handled here: Each row in the text file must have the same number of values. genfromtxt : Load data with missing values handled as specified. This function aims to be a fast reader for simply formatted files. The `genfromtxt` function provides more sophisticated handling of, e.g., lines with missing values. Really I am trying to separate the usage of loadtxt and genfromtxt to avoid unnecessary duplication and confusion. Part of this is historical because loadtxt was added in 2007 and genfromtxt was added in 2009. So really certain features of loadtxt have been 'kept' for backwards compatibility purposes yet these features can be 'abused' to handle missing data. But I really consider that any missing values should cause loadtxt to fail. OK, I was not aware of the design issues of loadtxt vs. genfromtxt - you could probably say also for historical reasons since I have not used genfromtxt much so far. Anyway the docstring statement Converters can also be used to provide a default value for missing data: then appears quite misleading, or an invitation to abuse, if you will. This should better be removed from the documentation then, or users explicitly discouraged from using converters instead of genfromtxt (I don't see how you could completely prevent using converters in this way). The patch is incorrect because it should not include a space in the split() as indicated in the comment by the original reporter. Of The split('\r\n') alone caused test_dtype_with_object(self) to fail, probably because it relies on stripping the blanks. But maybe the test is ill- formed? course a corrected patch alone still is not sufficient to address the problem without the user providing the correct converter. Also you start to run into problems with multiple delimiters (such as one space versus two spaces) so you start down the path to add all the features that duplicate genfromtxt. Given that genfromtxt provides that functionality more conveniently, I agree again users should be encouraged to use this instead of converters. But the actual tab-problem causes in fact an issue not related to missing values at all (well, depending on what you call a missing value). I am describing an example on the ticket. Cheers, Derek ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion Okay I see that 1071 got closed which I am fine with. I think that your following example should be a test because the two spaces should not be removed with a tab delimiter: np.loadtxt(StringIO(aa\tbb\n \t \ncc\t), delimiter='\t', dtype=np.dtype([('label', 'S4'), ('comment', 'S4')])) Make a test and we'll put it in. Chuck ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt/savetxt tickets
On 04/04/2011 11:20 AM, Charles R Harris wrote: On Mon, Apr 4, 2011 at 9:59 AM, Bruce Southey bsout...@gmail.com mailto:bsout...@gmail.com wrote: On 03/31/2011 12:02 PM, Derek Homeier wrote: On 31 Mar 2011, at 17:03, Bruce Southey wrote: This is an invalid ticket because the docstring clearly states that in 3 different, yet critical places, that missing values are not handled here: Each row in the text file must have the same number of values. genfromtxt : Load data with missing values handled as specified. This function aims to be a fast reader for simply formatted files. The `genfromtxt` function provides more sophisticated handling of, e.g., lines with missing values. Really I am trying to separate the usage of loadtxt and genfromtxt to avoid unnecessary duplication and confusion. Part of this is historical because loadtxt was added in 2007 and genfromtxt was added in 2009. So really certain features of loadtxt have been 'kept' for backwards compatibility purposes yet these features can be 'abused' to handle missing data. But I really consider that any missing values should cause loadtxt to fail. OK, I was not aware of the design issues of loadtxt vs. genfromtxt - you could probably say also for historical reasons since I have not used genfromtxt much so far. Anyway the docstring statement Converters can also be used to provide a default value for missing data: then appears quite misleading, or an invitation to abuse, if you will. This should better be removed from the documentation then, or users explicitly discouraged from using converters instead of genfromtxt (I don't see how you could completely prevent using converters in this way). The patch is incorrect because it should not include a space in the split() as indicated in the comment by the original reporter. Of The split('\r\n') alone caused test_dtype_with_object(self) to fail, probably because it relies on stripping the blanks. But maybe the test is ill- formed? course a corrected patch alone still is not sufficient to address the problem without the user providing the correct converter. Also you start to run into problems with multiple delimiters (such as one space versus two spaces) so you start down the path to add all the features that duplicate genfromtxt. Given that genfromtxt provides that functionality more conveniently, I agree again users should be encouraged to use this instead of converters. But the actual tab-problem causes in fact an issue not related to missing values at all (well, depending on what you call a missing value). I am describing an example on the ticket. Cheers, Derek ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org mailto:NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion Okay I see that 1071 got closed which I am fine with. I think that your following example should be a test because the two spaces should not be removed with a tab delimiter: np.loadtxt(StringIO(aa\tbb\n \t \ncc\t), delimiter='\t', dtype=np.dtype([('label', 'S4'), ('comment', 'S4')])) Make a test and we'll put it in. Chuck I know! Trying to write one made me realize that loadtxt is not handling string arrays correctly. So I have to check more on this as I think loadtxt is giving a 1-d array instead of a 2-d array. I do agree with you Pierre but this is a nice corner case that Derek raised where a space does not necessarily mean a missing value when there is a tab delimiter: data = StringIO(aa\tbb\n \t \ncc\tdd) dt=np.dtype([('label', 'S2'), ('comment', 'S2')]) test = np.loadtxt(data, delimiter=\t, dtype=dt) control = np.array([['aa','bb'], [' ', ' '],['cc','dd']], dtype=dt) So 'test' and 'control' should give the same array. Bruce ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt/savetxt tickets
On Mon, Apr 4, 2011 at 11:01 AM, Bruce Southey bsout...@gmail.com wrote: On 04/04/2011 11:20 AM, Charles R Harris wrote: On Mon, Apr 4, 2011 at 9:59 AM, Bruce Southey bsout...@gmail.com wrote: On 03/31/2011 12:02 PM, Derek Homeier wrote: On 31 Mar 2011, at 17:03, Bruce Southey wrote: This is an invalid ticket because the docstring clearly states that in 3 different, yet critical places, that missing values are not handled here: Each row in the text file must have the same number of values. genfromtxt : Load data with missing values handled as specified. This function aims to be a fast reader for simply formatted files. The `genfromtxt` function provides more sophisticated handling of, e.g., lines with missing values. Really I am trying to separate the usage of loadtxt and genfromtxt to avoid unnecessary duplication and confusion. Part of this is historical because loadtxt was added in 2007 and genfromtxt was added in 2009. So really certain features of loadtxt have been 'kept' for backwards compatibility purposes yet these features can be 'abused' to handle missing data. But I really consider that any missing values should cause loadtxt to fail. OK, I was not aware of the design issues of loadtxt vs. genfromtxt - you could probably say also for historical reasons since I have not used genfromtxt much so far. Anyway the docstring statement Converters can also be used to provide a default value for missing data: then appears quite misleading, or an invitation to abuse, if you will. This should better be removed from the documentation then, or users explicitly discouraged from using converters instead of genfromtxt (I don't see how you could completely prevent using converters in this way). The patch is incorrect because it should not include a space in the split() as indicated in the comment by the original reporter. Of The split('\r\n') alone caused test_dtype_with_object(self) to fail, probably because it relies on stripping the blanks. But maybe the test is ill- formed? course a corrected patch alone still is not sufficient to address the problem without the user providing the correct converter. Also you start to run into problems with multiple delimiters (such as one space versus two spaces) so you start down the path to add all the features that duplicate genfromtxt. Given that genfromtxt provides that functionality more conveniently, I agree again users should be encouraged to use this instead of converters. But the actual tab-problem causes in fact an issue not related to missing values at all (well, depending on what you call a missing value). I am describing an example on the ticket. Cheers, Derek ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion Okay I see that 1071 got closed which I am fine with. I think that your following example should be a test because the two spaces should not be removed with a tab delimiter: np.loadtxt(StringIO(aa\tbb\n \t \ncc\t), delimiter='\t', dtype=np.dtype([('label', 'S4'), ('comment', 'S4')])) Make a test and we'll put it in. Chuck I know! Trying to write one made me realize that loadtxt is not handling string arrays correctly. So I have to check more on this as I think loadtxt is giving a 1-d array instead of a 2-d array. Tests often have that side effect. snip Chuck ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt/savetxt tickets
On Thu, Mar 31, 2011 at 4:53 AM, Charles R Harris charlesr.har...@gmail.com wrote: On Sun, Mar 27, 2011 at 4:09 AM, Paul Anton Letnes paul.anton.let...@gmail.com wrote: On 26. mars 2011, at 21.44, Derek Homeier wrote: Hi Paul, having had a look at the other tickets you dug up, My opinions are my own, and in detail, they are: 1752: I attach a possible patch. FWIW, I agree with the request. The patch is written to be compatible with the fix in ticket #1562, but I did not test that yet. Tested, see also my comments on Trac. Great! 1731: This seems like a rather trivial feature enhancement. I attach a possible patch. Agreed. Haven't tested it though. Great! 1616: The suggested patch seems reasonable to me, but I do not have a full list of what objects loadtxt supports today as opposed to what this patch will support. Looks like you got this one. Just remember to make it compatible with #1752. Should be easy. 1562: I attach a possible patch. This could also be the default behavior to my mind, since the function caller can simply call numpy.squeeze if needed. Changing default behavior would probably break old code, however. See comments on Trac as well. Your patch is better, but there is one thing I disagree with. 808 if X.ndim ndmin: 809 if ndmin == 1: 810 X.shape = (X.size, ) 811 elif ndmin == 2: 812 X.shape = (X.size, 1) The last line should be: 812 X.shape = (1, X.size) If someone wants a 2D array out, they would most likely expect a one-row file to come out as a one-row array, not the other way around. IMHO. 1458: The fix suggested in the ticket seems reasonable, but I have never used record arrays, so I am not sure of this. There were some issues with Python3, and I also had some general reservations as noted on Trac - basically, it makes 'unpack' equivalent to transposing for 2D-arrays, but to splitting into fields for 1D-recarrays. My question was, what's going to happen when you get to 2D-recarrays? Currently this is not an issue since loadtxt can only read 2D regular or 1D structured arrays. But this might change if the data block functionality (see below) were to be implemented - data could then be returned as 3D arrays or 2D structured arrays... Still, it would probably make most sense (or at least give the widest functionality) to have 'unpack=True' always return a list or iterator over columns. OK, I don't know recarrays, as I said. 1445: Adding this functionality could break old code, as some old datafiles may have empty lines which are now simply ignored. I do not think the feature is a good idea. It could rather be implemented as a separate function. 1107: I do not see the need for this enhancement. In my eyes, the usecols kwarg does this and more. Perhaps I am misunderstanding something here. Agree about #1445, and the bit about 'usecols' - 'numcols' would just provide a shorter call to e.g. read the first 20 columns of a file (well, not even that much over 'usecols=range(20)'...), don't think that justifies an extra argument. But the 'datablocks' provides something new, that a number of people seem to miss from e.g. gnuplot (including me, actually ;-). And it would also satisfy the request from #1445 without breaking backwards compatibility. I've been wondering if could instead specify the separator lines through the parameter, e.g. blocksep=['None', 'blank','invalid'], not sure if that would make it more useful... What about writing a separate function, e.g. loadblocktxt, and have it separate the chunks and call loadtxt for each chunk? Just a thought. Another possibility would be to write a function that would let you load a set of text files in a directory, and return a dict of datasets, one per file. One could write a similar save-function, too. They would just need to call loadtxt/savetxt on a per-file basis. 1071: It is not clear to me whether loadtxt is supposed to support missing values in the fashion indicated in the ticket. In principle it should at least allow you to, by the use of converters as described there. The problem is, the default delimiter is described as 'any whitespace', which in the present implementation obviously includes any number of blanks or tabs. These are therefore treated differently from delimiters like ',' or ''. I'd reckon there are too many people actually relying on this behaviour to silently change it (e.g. I know plenty of tables with columns separated by either one or several tabs depending on the length of the previous entry). But the tab is apparently also treated differently if explicitly specified with delimiter='\t' - and in that case using a converter à la {2: lambda s: float(s or 'Nan')} is working for fields in the middle of the line, but not at
Re: [Numpy-discussion] loadtxt/savetxt tickets
On Wed, Mar 30, 2011 at 9:53 PM, Charles R Harris charlesr.har...@gmail.com wrote: On Sun, Mar 27, 2011 at 4:09 AM, Paul Anton Letnes paul.anton.let...@gmail.com wrote: On 26. mars 2011, at 21.44, Derek Homeier wrote: Hi Paul, having had a look at the other tickets you dug up, [snip] 1071: It is not clear to me whether loadtxt is supposed to support missing values in the fashion indicated in the ticket. In principle it should at least allow you to, by the use of converters as described there. The problem is, the default delimiter is described as 'any whitespace', which in the present implementation obviously includes any number of blanks or tabs. These are therefore treated differently from delimiters like ',' or ''. I'd reckon there are too many people actually relying on this behaviour to silently change it (e.g. I know plenty of tables with columns separated by either one or several tabs depending on the length of the previous entry). But the tab is apparently also treated differently if explicitly specified with delimiter='\t' - and in that case using a converter à la {2: lambda s: float(s or 'Nan')} is working for fields in the middle of the line, but not at the end - clearly warrants improvement. I've prepared a patch working for Python3 as well. Great! This is an invalid ticket because the docstring clearly states that in 3 different, yet critical places, that missing values are not handled here: Each row in the text file must have the same number of values. genfromtxt : Load data with missing values handled as specified. This function aims to be a fast reader for simply formatted files. The `genfromtxt` function provides more sophisticated handling of, e.g., lines with missing values. Really I am trying to separate the usage of loadtxt and genfromtxt to avoid unnecessary duplication and confusion. Part of this is historical because loadtxt was added in 2007 and genfromtxt was added in 2009. So really certain features of loadtxt have been 'kept' for backwards compatibility purposes yet these features can be 'abused' to handle missing data. But I really consider that any missing values should cause loadtxt to fail. The patch is incorrect because it should not include a space in the split() as indicated in the comment by the original reporter. Of course a corrected patch alone still is not sufficient to address the problem without the user providing the correct converter. Also you start to run into problems with multiple delimiters (such as one space versus two spaces) so you start down the path to add all the features that duplicate genfromtxt. Bruce ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt/savetxt tickets
On Thu, Mar 31, 2011 at 5:03 PM, Bruce Southey bsout...@gmail.com wrote: On Wed, Mar 30, 2011 at 9:53 PM, Charles R Harris charlesr.har...@gmail.com wrote: On Sun, Mar 27, 2011 at 4:09 AM, Paul Anton Letnes paul.anton.let...@gmail.com wrote: On 26. mars 2011, at 21.44, Derek Homeier wrote: Hi Paul, having had a look at the other tickets you dug up, [snip] 1071: It is not clear to me whether loadtxt is supposed to support missing values in the fashion indicated in the ticket. In principle it should at least allow you to, by the use of converters as described there. The problem is, the default delimiter is described as 'any whitespace', which in the present implementation obviously includes any number of blanks or tabs. These are therefore treated differently from delimiters like ',' or ''. I'd reckon there are too many people actually relying on this behaviour to silently change it (e.g. I know plenty of tables with columns separated by either one or several tabs depending on the length of the previous entry). But the tab is apparently also treated differently if explicitly specified with delimiter='\t' - and in that case using a converter à la {2: lambda s: float(s or 'Nan')} is working for fields in the middle of the line, but not at the end - clearly warrants improvement. I've prepared a patch working for Python3 as well. Great! This is an invalid ticket because the docstring clearly states that in 3 different, yet critical places, that missing values are not handled here: Each row in the text file must have the same number of values. genfromtxt : Load data with missing values handled as specified. This function aims to be a fast reader for simply formatted files. The `genfromtxt` function provides more sophisticated handling of, e.g., lines with missing values. Really I am trying to separate the usage of loadtxt and genfromtxt to avoid unnecessary duplication and confusion. Part of this is historical because loadtxt was added in 2007 and genfromtxt was added in 2009. So really certain features of loadtxt have been 'kept' for backwards compatibility purposes yet these features can be 'abused' to handle missing data. But I really consider that any missing values should cause loadtxt to fail. I agree with you Bruce, but it would be easier to discuss this on the tickets instead of here. Could you add your comments there please? Ralf The patch is incorrect because it should not include a space in the split() as indicated in the comment by the original reporter. Of course a corrected patch alone still is not sufficient to address the problem without the user providing the correct converter. Also you start to run into problems with multiple delimiters (such as one space versus two spaces) so you start down the path to add all the features that duplicate genfromtxt. ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt/savetxt tickets
On 03/31/2011 10:08 AM, Ralf Gommers wrote: On Thu, Mar 31, 2011 at 5:03 PM, Bruce Southeybsout...@gmail.com wrote: On Wed, Mar 30, 2011 at 9:53 PM, Charles R Harris charlesr.har...@gmail.com wrote: On Sun, Mar 27, 2011 at 4:09 AM, Paul Anton Letnes paul.anton.let...@gmail.com wrote: On 26. mars 2011, at 21.44, Derek Homeier wrote: Hi Paul, having had a look at the other tickets you dug up, [snip] 1071: It is not clear to me whether loadtxt is supposed to support missing values in the fashion indicated in the ticket. In principle it should at least allow you to, by the use of converters as described there. The problem is, the default delimiter is described as 'any whitespace', which in the present implementation obviously includes any number of blanks or tabs. These are therefore treated differently from delimiters like ',' or ''. I'd reckon there are too many people actually relying on this behaviour to silently change it (e.g. I know plenty of tables with columns separated by either one or several tabs depending on the length of the previous entry). But the tab is apparently also treated differently if explicitly specified with delimiter='\t' - and in that case using a converter à la {2: lambda s: float(s or 'Nan')} is working for fields in the middle of the line, but not at the end - clearly warrants improvement. I've prepared a patch working for Python3 as well. Great! This is an invalid ticket because the docstring clearly states that in 3 different, yet critical places, that missing values are not handled here: Each row in the text file must have the same number of values. genfromtxt : Load data with missing values handled as specified. This function aims to be a fast reader for simply formatted files. The `genfromtxt` function provides more sophisticated handling of, e.g., lines with missing values. Really I am trying to separate the usage of loadtxt and genfromtxt to avoid unnecessary duplication and confusion. Part of this is historical because loadtxt was added in 2007 and genfromtxt was added in 2009. So really certain features of loadtxt have been 'kept' for backwards compatibility purposes yet these features can be 'abused' to handle missing data. But I really consider that any missing values should cause loadtxt to fail. I agree with you Bruce, but it would be easier to discuss this on the tickets instead of here. Could you add your comments there please? Ralf 'Easier' seems a contradiction when you have use captcha... Sure I will add more comments there. Bruce ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt/savetxt tickets
On 31 Mar 2011, at 17:03, Bruce Southey wrote: This is an invalid ticket because the docstring clearly states that in 3 different, yet critical places, that missing values are not handled here: Each row in the text file must have the same number of values. genfromtxt : Load data with missing values handled as specified. This function aims to be a fast reader for simply formatted files. The `genfromtxt` function provides more sophisticated handling of, e.g., lines with missing values. Really I am trying to separate the usage of loadtxt and genfromtxt to avoid unnecessary duplication and confusion. Part of this is historical because loadtxt was added in 2007 and genfromtxt was added in 2009. So really certain features of loadtxt have been 'kept' for backwards compatibility purposes yet these features can be 'abused' to handle missing data. But I really consider that any missing values should cause loadtxt to fail. OK, I was not aware of the design issues of loadtxt vs. genfromtxt - you could probably say also for historical reasons since I have not used genfromtxt much so far. Anyway the docstring statement Converters can also be used to provide a default value for missing data: then appears quite misleading, or an invitation to abuse, if you will. This should better be removed from the documentation then, or users explicitly discouraged from using converters instead of genfromtxt (I don't see how you could completely prevent using converters in this way). The patch is incorrect because it should not include a space in the split() as indicated in the comment by the original reporter. Of The split('\r\n') alone caused test_dtype_with_object(self) to fail, probably because it relies on stripping the blanks. But maybe the test is ill- formed? course a corrected patch alone still is not sufficient to address the problem without the user providing the correct converter. Also you start to run into problems with multiple delimiters (such as one space versus two spaces) so you start down the path to add all the features that duplicate genfromtxt. Given that genfromtxt provides that functionality more conveniently, I agree again users should be encouraged to use this instead of converters. But the actual tab-problem causes in fact an issue not related to missing values at all (well, depending on what you call a missing value). I am describing an example on the ticket. Cheers, Derek ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt/savetxt tickets
On 03/31/2011 12:02 PM, Derek Homeier wrote: On 31 Mar 2011, at 17:03, Bruce Southey wrote: This is an invalid ticket because the docstring clearly states that in 3 different, yet critical places, that missing values are not handled here: Each row in the text file must have the same number of values. genfromtxt : Load data with missing values handled as specified. This function aims to be a fast reader for simply formatted files. The `genfromtxt` function provides more sophisticated handling of, e.g., lines with missing values. Really I am trying to separate the usage of loadtxt and genfromtxt to avoid unnecessary duplication and confusion. Part of this is historical because loadtxt was added in 2007 and genfromtxt was added in 2009. So really certain features of loadtxt have been 'kept' for backwards compatibility purposes yet these features can be 'abused' to handle missing data. But I really consider that any missing values should cause loadtxt to fail. OK, I was not aware of the design issues of loadtxt vs. genfromtxt - you could probably say also for historical reasons since I have not used genfromtxt much so far. Anyway the docstring statement Converters can also be used to provide a default value for missing data: then appears quite misleading, or an invitation to abuse, if you will. This should better be removed from the documentation then, or users explicitly discouraged from using converters instead of genfromtxt (I don't see how you could completely prevent using converters in this way). The patch is incorrect because it should not include a space in the split() as indicated in the comment by the original reporter. Of The split('\r\n') alone caused test_dtype_with_object(self) to fail, probably because it relies on stripping the blanks. But maybe the test is ill- formed? course a corrected patch alone still is not sufficient to address the problem without the user providing the correct converter. Also you start to run into problems with multiple delimiters (such as one space versus two spaces) so you start down the path to add all the features that duplicate genfromtxt. Given that genfromtxt provides that functionality more conveniently, I agree again users should be encouraged to use this instead of converters. But the actual tab-problem causes in fact an issue not related to missing values at all (well, depending on what you call a missing value). I am describing an example on the ticket. Cheers, Derek ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion I am really not disagreeing that much with you. Rather that, as you have shown, it is very easy to increase the complexity of examples that loadtxt does not handle. By missing value I mean when no data value is stored for the variable in the current observation (via Wikipedia) since encoded missing values (such as '.', 'NA' and 'NaN') can be recovered. Bruce ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt/savetxt tickets
On Thu, Mar 31, 2011 at 8:42 AM, Ralf Gommers ralf.gomm...@googlemail.comwrote: On Thu, Mar 31, 2011 at 4:53 AM, Charles R Harris charlesr.har...@gmail.com wrote: On Sun, Mar 27, 2011 at 4:09 AM, Paul Anton Letnes paul.anton.let...@gmail.com wrote: snip If you look in Trac under All Tickets by Milestone you'll find all nine tickets together under 1.6.0. Five are bug fixes, four are enhancements. There are some missing tests, but all tickets have proposed patches. OK. I changed 1562 to enhancement because it adds a keyword. With that change the current status looks like this. Bug Fixes: 1163 -- convert int64 correctly 1458 -- make loadtxt unpack structured arrays 1071 -- loadtxt fails if the last column contains empty value, under discussion 1565 -- duplicate of 1163 Enhancements: 1107 -- support for blocks of data, adds two keywords. 1562 -- add ndmin keyword to aid in getting correct dimensions, doesn't apply on top of previous. 1616 -- remove use of readline so input isn't restricted to files. 1731 -- allow loadtxt to read given number of rows, adds keyword. 1752 -- return empty array when empty file encountered, conflicts with 1616. Some of this might should go into genfromtxt. None of the patches have tests. Chuck ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt/savetxt tickets
On Sun, Mar 27, 2011 at 4:09 AM, Paul Anton Letnes paul.anton.let...@gmail.com wrote: On 26. mars 2011, at 21.44, Derek Homeier wrote: Hi Paul, having had a look at the other tickets you dug up, My opinions are my own, and in detail, they are: 1752: I attach a possible patch. FWIW, I agree with the request. The patch is written to be compatible with the fix in ticket #1562, but I did not test that yet. Tested, see also my comments on Trac. Great! 1731: This seems like a rather trivial feature enhancement. I attach a possible patch. Agreed. Haven't tested it though. Great! 1616: The suggested patch seems reasonable to me, but I do not have a full list of what objects loadtxt supports today as opposed to what this patch will support. Looks like you got this one. Just remember to make it compatible with #1752. Should be easy. 1562: I attach a possible patch. This could also be the default behavior to my mind, since the function caller can simply call numpy.squeeze if needed. Changing default behavior would probably break old code, however. See comments on Trac as well. Your patch is better, but there is one thing I disagree with. 808if X.ndim ndmin: 809if ndmin == 1: 810X.shape = (X.size, ) 811elif ndmin == 2: 812X.shape = (X.size, 1) The last line should be: 812X.shape = (1, X.size) If someone wants a 2D array out, they would most likely expect a one-row file to come out as a one-row array, not the other way around. IMHO. 1458: The fix suggested in the ticket seems reasonable, but I have never used record arrays, so I am not sure of this. There were some issues with Python3, and I also had some general reservations as noted on Trac - basically, it makes 'unpack' equivalent to transposing for 2D-arrays, but to splitting into fields for 1D-recarrays. My question was, what's going to happen when you get to 2D-recarrays? Currently this is not an issue since loadtxt can only read 2D regular or 1D structured arrays. But this might change if the data block functionality (see below) were to be implemented - data could then be returned as 3D arrays or 2D structured arrays... Still, it would probably make most sense (or at least give the widest functionality) to have 'unpack=True' always return a list or iterator over columns. OK, I don't know recarrays, as I said. 1445: Adding this functionality could break old code, as some old datafiles may have empty lines which are now simply ignored. I do not think the feature is a good idea. It could rather be implemented as a separate function. 1107: I do not see the need for this enhancement. In my eyes, the usecols kwarg does this and more. Perhaps I am misunderstanding something here. Agree about #1445, and the bit about 'usecols' - 'numcols' would just provide a shorter call to e.g. read the first 20 columns of a file (well, not even that much over 'usecols=range(20)'...), don't think that justifies an extra argument. But the 'datablocks' provides something new, that a number of people seem to miss from e.g. gnuplot (including me, actually ;-). And it would also satisfy the request from #1445 without breaking backwards compatibility. I've been wondering if could instead specify the separator lines through the parameter, e.g. blocksep=['None', 'blank','invalid'], not sure if that would make it more useful... What about writing a separate function, e.g. loadblocktxt, and have it separate the chunks and call loadtxt for each chunk? Just a thought. Another possibility would be to write a function that would let you load a set of text files in a directory, and return a dict of datasets, one per file. One could write a similar save-function, too. They would just need to call loadtxt/savetxt on a per-file basis. 1071: It is not clear to me whether loadtxt is supposed to support missing values in the fashion indicated in the ticket. In principle it should at least allow you to, by the use of converters as described there. The problem is, the default delimiter is described as 'any whitespace', which in the present implementation obviously includes any number of blanks or tabs. These are therefore treated differently from delimiters like ',' or ''. I'd reckon there are too many people actually relying on this behaviour to silently change it (e.g. I know plenty of tables with columns separated by either one or several tabs depending on the length of the previous entry). But the tab is apparently also treated differently if explicitly specified with delimiter='\t' - and in that case using a converter à la {2: lambda s: float(s or 'Nan')} is working for fields in the middle of the line, but not at the end - clearly warrants improvement. I've prepared a patch working for Python3
Re: [Numpy-discussion] loadtxt/savetxt tickets
On Sun, Mar 27, 2011 at 12:09 PM, Paul Anton Letnes paul.anton.let...@gmail.com wrote: I am sure someone has been using this functionality to convert floats to ints. Changing will break their code. I am not sure how big a deal that would be. Also, I am of the opinion that one should _first_ write a program that works _correctly_, and only afterwards worry about performance. While I'd agree in most cases, keep in mind that np.loadtxt is supposed to be a fast but simpler alternative to np.genfromtxt. If np.loadtxt becomes much slower, there's not much need to keep these separate any longer. Regards Stéfan ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt/savetxt tickets
On 26. mars 2011, at 21.44, Derek Homeier wrote: Hi Paul, having had a look at the other tickets you dug up, My opinions are my own, and in detail, they are: 1752: I attach a possible patch. FWIW, I agree with the request. The patch is written to be compatible with the fix in ticket #1562, but I did not test that yet. Tested, see also my comments on Trac. Great! 1731: This seems like a rather trivial feature enhancement. I attach a possible patch. Agreed. Haven't tested it though. Great! 1616: The suggested patch seems reasonable to me, but I do not have a full list of what objects loadtxt supports today as opposed to what this patch will support. Looks like you got this one. Just remember to make it compatible with #1752. Should be easy. 1562: I attach a possible patch. This could also be the default behavior to my mind, since the function caller can simply call numpy.squeeze if needed. Changing default behavior would probably break old code, however. See comments on Trac as well. Your patch is better, but there is one thing I disagree with. 808if X.ndim ndmin: 809if ndmin == 1: 810X.shape = (X.size, ) 811elif ndmin == 2: 812X.shape = (X.size, 1) The last line should be: 812X.shape = (1, X.size) If someone wants a 2D array out, they would most likely expect a one-row file to come out as a one-row array, not the other way around. IMHO. 1458: The fix suggested in the ticket seems reasonable, but I have never used record arrays, so I am not sure of this. There were some issues with Python3, and I also had some general reservations as noted on Trac - basically, it makes 'unpack' equivalent to transposing for 2D-arrays, but to splitting into fields for 1D-recarrays. My question was, what's going to happen when you get to 2D-recarrays? Currently this is not an issue since loadtxt can only read 2D regular or 1D structured arrays. But this might change if the data block functionality (see below) were to be implemented - data could then be returned as 3D arrays or 2D structured arrays... Still, it would probably make most sense (or at least give the widest functionality) to have 'unpack=True' always return a list or iterator over columns. OK, I don't know recarrays, as I said. 1445: Adding this functionality could break old code, as some old datafiles may have empty lines which are now simply ignored. I do not think the feature is a good idea. It could rather be implemented as a separate function. 1107: I do not see the need for this enhancement. In my eyes, the usecols kwarg does this and more. Perhaps I am misunderstanding something here. Agree about #1445, and the bit about 'usecols' - 'numcols' would just provide a shorter call to e.g. read the first 20 columns of a file (well, not even that much over 'usecols=range(20)'...), don't think that justifies an extra argument. But the 'datablocks' provides something new, that a number of people seem to miss from e.g. gnuplot (including me, actually ;-). And it would also satisfy the request from #1445 without breaking backwards compatibility. I've been wondering if could instead specify the separator lines through the parameter, e.g. blocksep=['None', 'blank','invalid'], not sure if that would make it more useful... What about writing a separate function, e.g. loadblocktxt, and have it separate the chunks and call loadtxt for each chunk? Just a thought. Another possibility would be to write a function that would let you load a set of text files in a directory, and return a dict of datasets, one per file. One could write a similar save-function, too. They would just need to call loadtxt/savetxt on a per-file basis. 1071: It is not clear to me whether loadtxt is supposed to support missing values in the fashion indicated in the ticket. In principle it should at least allow you to, by the use of converters as described there. The problem is, the default delimiter is described as 'any whitespace', which in the present implementation obviously includes any number of blanks or tabs. These are therefore treated differently from delimiters like ',' or ''. I'd reckon there are too many people actually relying on this behaviour to silently change it (e.g. I know plenty of tables with columns separated by either one or several tabs depending on the length of the previous entry). But the tab is apparently also treated differently if explicitly specified with delimiter='\t' - and in that case using a converter à la {2: lambda s: float(s or 'Nan')} is working for fields in the middle of the line, but not at the end - clearly warrants improvement. I've prepared a patch working for Python3 as well. Great! 1163: 1565: These tickets seem to have the same origin of the problem. I attach one
Re: [Numpy-discussion] loadtxt/savetxt tickets
Hi! I have had a look at the list of numpy.loadtxt tickets. I have never contributed to numpy before, so I may be doing stupid things - don't be afraid to let me know! My opinions are my own, and in detail, they are: 1752: I attach a possible patch. FWIW, I agree with the request. The patch is written to be compatible with the fix in ticket #1562, but I did not test that yet. 1731: This seems like a rather trivial feature enhancement. I attach a possible patch. 1616: The suggested patch seems reasonable to me, but I do not have a full list of what objects loadtxt supports today as opposed to what this patch will support. 1562: I attach a possible patch. This could also be the default behavior to my mind, since the function caller can simply call numpy.squeeze if needed. Changing default behavior would probably break old code, however. 1458: The fix suggested in the ticket seems reasonable, but I have never used record arrays, so I am not sure of this. 1445: Adding this functionality could break old code, as some old datafiles may have empty lines which are now simply ignored. I do not think the feature is a good idea. It could rather be implemented as a separate function. 1107: I do not see the need for this enhancement. In my eyes, the usecols kwarg does this and more. Perhaps I am misunderstanding something here. 1071: It is not clear to me whether loadtxt is supposed to support missing values in the fashion indicated in the ticket. 1163: 1565: These tickets seem to have the same origin of the problem. I attach one possible patch. The previously suggested patches that I've seen will not correctly convert floats to ints, which I believe my patch will. I hope you find this useful! Is there some way of submitting the patches for review in a more convenient fashion than e-mail? Cheers, Paul. 1562.patch Description: Binary data 1163.patch Description: Binary data 1731.patch Description: Binary data 1752.patch Description: Binary data On 25. mars 2011, at 16.06, Charles R Harris wrote: Hi All, Could someone with an interest in loadtxt/savetxt look through the associated tickets? A search on the tickets using either of those keys will return fairly lengthy lists. Chuck ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt/savetxt tickets
Hi, Thanks! On Sat, 26 Mar 2011 13:11:46 +0100, Paul Anton Letnes wrote: [clip] I hope you find this useful! Is there some way of submitting the patches for review in a more convenient fashion than e-mail? You can attach them on the trac to each ticket. That way they'll be easy to find later on. Pauli ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt/savetxt tickets
Hi, On 26 Mar 2011, at 14:36, Pauli Virtanen wrote: On Sat, 26 Mar 2011 13:11:46 +0100, Paul Anton Letnes wrote: [clip] I hope you find this useful! Is there some way of submitting the patches for review in a more convenient fashion than e-mail? You can attach them on the trac to each ticket. That way they'll be easy to find later on. I've got some comments on 1562, and I'd attach a revised patch then - just a general question: should I then change Milestone to 1.6.0 and Version to 'devel'? 1562: I attach a possible patch. This could also be the default behavior to my mind, since the function caller can simply call numpy.squeeze if needed. Changing default behavior would probably break old code, Seems the fastest solution unless someone wants to change numpy.squeeze as well. But the present patch does not call np.squeeze any more at all, so I propose to restore that behaviour for X.ndim ndmin to remain really backwards compatible. It also seems easier to code when making the default ndmin=0. Cheers, Derek ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt/savetxt tickets
Hi again, On 26 Mar 2011, at 15:20, Derek Homeier wrote: 1562: I attach a possible patch. This could also be the default behavior to my mind, since the function caller can simply call numpy.squeeze if needed. Changing default behavior would probably break old code, Seems the fastest solution unless someone wants to change numpy.squeeze as well. But the present patch does not call np.squeeze any more at all, so I propose to restore that behaviour for X.ndim ndmin to remain really backwards compatible. It also seems easier to code when making the default ndmin=0. I've got another somewhat general question: since it would probably be nice to have a test for this, I found one could simply add something along the lines of assert_equal(a.shape, x.shape) to test_io.py - test_shaped_dtype(self) or should one generally create a new test for such things (might still be better in this case, since test_shaped_dtype does not really test different ndim)? Cheers, Derek ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt/savetxt tickets
Hi Derek! On 26. mars 2011, at 15.48, Derek Homeier wrote: Hi again, On 26 Mar 2011, at 15:20, Derek Homeier wrote: 1562: I attach a possible patch. This could also be the default behavior to my mind, since the function caller can simply call numpy.squeeze if needed. Changing default behavior would probably break old code, Seems the fastest solution unless someone wants to change numpy.squeeze as well. But the present patch does not call np.squeeze any more at all, so I propose to restore that behaviour for X.ndim ndmin to remain really backwards compatible. It also seems easier to code when making the default ndmin=0. I've got another somewhat general question: since it would probably be nice to have a test for this, I found one could simply add something along the lines of assert_equal(a.shape, x.shape) to test_io.py - test_shaped_dtype(self) or should one generally create a new test for such things (might still be better in this case, since test_shaped_dtype does not really test different ndim)? Cheers, Derek It would be nice to see your patch. I uploaded all of mine as mentioned. I'm no testing expert, but I am sure someone else will comment on it. Paul. ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt/savetxt tickets
On Sat, Mar 26, 2011 at 8:53 AM, Paul Anton Letnes paul.anton.let...@gmail.com wrote: Hi Derek! On 26. mars 2011, at 15.48, Derek Homeier wrote: Hi again, On 26 Mar 2011, at 15:20, Derek Homeier wrote: 1562: I attach a possible patch. This could also be the default behavior to my mind, since the function caller can simply call numpy.squeeze if needed. Changing default behavior would probably break old code, Seems the fastest solution unless someone wants to change numpy.squeeze as well. But the present patch does not call np.squeeze any more at all, so I propose to restore that behaviour for X.ndim ndmin to remain really backwards compatible. It also seems easier to code when making the default ndmin=0. I've got another somewhat general question: since it would probably be nice to have a test for this, I found one could simply add something along the lines of assert_equal(a.shape, x.shape) to test_io.py - test_shaped_dtype(self) or should one generally create a new test for such things (might still be better in this case, since test_shaped_dtype does not really test different ndim)? Cheers, Derek It would be nice to see your patch. I uploaded all of mine as mentioned. I'm no testing expert, but I am sure someone else will comment on it. I put all these patches together at https://github.com/charris/numpy/tree/loadtxt-savetxt. Please pull from there to continue work on loadtxt/savetxt so as to avoid conflicts in the patches. One of the numpy tests is failing, I assume from patch conflicts, and more tests for the tickets are needed in any case. Also, new keywords should be added to the end, not put in the middle of existing keywords. I haven't reviewed the patches, just tried to get them organized. Also, I have Derek as the author on all of them, that can be changed if it is decided the credit should go elsewhere ;) Thanks for the work you all have been doing on these tickets. Chuck ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt/savetxt tickets
Hi Paul, having had a look at the other tickets you dug up, My opinions are my own, and in detail, they are: 1752: I attach a possible patch. FWIW, I agree with the request. The patch is written to be compatible with the fix in ticket #1562, but I did not test that yet. Tested, see also my comments on Trac. 1731: This seems like a rather trivial feature enhancement. I attach a possible patch. Agreed. Haven't tested it though. 1616: The suggested patch seems reasonable to me, but I do not have a full list of what objects loadtxt supports today as opposed to what this patch will support. 1562: I attach a possible patch. This could also be the default behavior to my mind, since the function caller can simply call numpy.squeeze if needed. Changing default behavior would probably break old code, however. See comments on Trac as well. 1458: The fix suggested in the ticket seems reasonable, but I have never used record arrays, so I am not sure of this. There were some issues with Python3, and I also had some general reservations as noted on Trac - basically, it makes 'unpack' equivalent to transposing for 2D-arrays, but to splitting into fields for 1D-recarrays. My question was, what's going to happen when you get to 2D-recarrays? Currently this is not an issue since loadtxt can only read 2D regular or 1D structured arrays. But this might change if the data block functionality (see below) were to be implemented - data could then be returned as 3D arrays or 2D structured arrays... Still, it would probably make most sense (or at least give the widest functionality) to have 'unpack=True' always return a list or iterator over columns. 1445: Adding this functionality could break old code, as some old datafiles may have empty lines which are now simply ignored. I do not think the feature is a good idea. It could rather be implemented as a separate function. 1107: I do not see the need for this enhancement. In my eyes, the usecols kwarg does this and more. Perhaps I am misunderstanding something here. Agree about #1445, and the bit about 'usecols' - 'numcols' would just provide a shorter call to e.g. read the first 20 columns of a file (well, not even that much over 'usecols=range(20)'...), don't think that justifies an extra argument. But the 'datablocks' provides something new, that a number of people seem to miss from e.g. gnuplot (including me, actually ;-). And it would also satisfy the request from #1445 without breaking backwards compatibility. I've been wondering if could instead specify the separator lines through the parameter, e.g. blocksep=['None', 'blank','invalid'], not sure if that would make it more useful... 1071: It is not clear to me whether loadtxt is supposed to support missing values in the fashion indicated in the ticket. In principle it should at least allow you to, by the use of converters as described there. The problem is, the default delimiter is described as 'any whitespace', which in the present implementation obviously includes any number of blanks or tabs. These are therefore treated differently from delimiters like ',' or ''. I'd reckon there are too many people actually relying on this behaviour to silently change it (e.g. I know plenty of tables with columns separated by either one or several tabs depending on the length of the previous entry). But the tab is apparently also treated differently if explicitly specified with delimiter='\t' - and in that case using a converter à la {2: lambda s: float(s or 'Nan')} is working for fields in the middle of the line, but not at the end - clearly warrants improvement. I've prepared a patch working for Python3 as well. 1163: 1565: These tickets seem to have the same origin of the problem. I attach one possible patch. The previously suggested patches that I've seen will not correctly convert floats to ints, which I believe my patch will. +1, though I am a bit concerned that prompting to raise a ValueError for every element could impede performance. I'd probably still enclose it into an if issubclass(typ, np.uint64) or issubclass(typ, np.int64): just like in npio.patch. I also thought one might switch to int(float128(x)) in that case, but at least for the given examples float128 cannot convert with more accuracy than float64 (even on PowerPC ;-). There were some dissenting opinions that trying to read a float into an int should generally throw an exception though... And Chuck just beat me... On 26 Mar 2011, at 21:25, Charles R Harris wrote: I put all these patches together at https://github.com/charris/numpy/tree/loadtxt-savetxt . Please pull from there to continue work on loadtxt/savetxt so as to avoid conflicts in the patches. One of the numpy tests is failing, I assume from patch conflicts, and more tests for the tickets are needed in any case. Also,
[Numpy-discussion] loadtxt/savetxt tickets
Hi All, Could someone with an interest in loadtxt/savetxt look through the associated tickets? A search on the tickets using either of those keys will return fairly lengthy lists. Chuck ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt stop
Though, really, it's annoying that numpy.loadtxt needs both the readline function *and* the iterator protocol. If it just used iterators, you could do: def truncator(fh, delimiter='END'): for line in fh: if line.strip() == delimiter: break yield line numpy.loadtxt(truncator(c)) Maybe I'll try to work up a patch for this. http://projects.scipy.org/numpy/ticket/1616 Zach That seemed easy... worth applying? Won't break compatibility, because the previous loadtxt required both fname.readline and fname.__iter__, while this requires only the latter. Index: numpy/lib/npyio.py === --- numpy/lib/npyio.py(revision 8716) +++ numpy/lib/npyio.py(working copy) @@ -597,10 +597,11 @@ fh = bz2.BZ2File(fname) else: fh = open(fname, 'U') -elif hasattr(fname, 'readline'): -fh = fname else: -raise ValueError('fname must be a string or file handle') + try: + fh = iter(fname) + except: + raise ValueError('fname must be a string or file handle') X = [] def flatten_dtype(dt): @@ -633,14 +634,18 @@ # Skip the first `skiprows` lines for i in xrange(skiprows): -fh.readline() +try: +fh.next() +except StopIteration: +raise IOError('End-of-file reached before encountering data.') # Read until we find a line with some values, and use # it to estimate the number of columns, N. first_vals = None while not first_vals: -first_line = fh.readline() -if not first_line: # EOF reached +try: +first_line = fh.next() +except StopIteration: raise IOError('End-of-file reached before encountering data.') first_vals = split_line(first_line) N = len(usecols or first_vals) ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
[Numpy-discussion] loadtxt stop
Hi, I been looking around and could spot anything on this. Quite often I want to read a homogeneous block of data from within a file. The skiprows option is great for missing out the section before the data starts, but if there is anything below then loadtxt will choke. I wondered if there was a possibility to put an endmarker= ? For example, if I want to load text from a large! file that looks like this header line header line 1 2.0 3.0 2 4.5 5.7 ... 500 4.3 5.4 END more headers more headers 1 2.0 3.0 3.14 1.1414 2 4.5 5.7 1.14 3.1459 ... 500 4.3 5.4 0.000 0.001 END Then I can use skiprows=2, but loadtxt will choke when it gets to 'END'. To read t ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt stop
oops, I meant to save my post but I sent it instead - doh! In the end, the question was; is worth adding start= and stop= markers into loadtxt to allow grabbing sections of a file between two known headers? I imagine it's something that people come up against regularly. Thanks, Neil From: Neil Hodgson hodgson.n...@yahoo.co.uk To: numpy-discussion@scipy.org Sent: Fri, 17 September, 2010 14:17:12 Subject: loadtxt stop Hi, I been looking around and could spot anything on this. Quite often I want to read a homogeneous block of data from within a file. The skiprows option is great for missing out the section before the data starts, but if there is anything below then loadtxt will choke. I wondered if there was a possibility to put an endmarker= ? For example, if I want to load text from a large! file that looks like this header line header line 1 2.0 3.0 2 4.5 5.7 ... 500 4.3 5.4 END more headers more headers 1 2.0 3.0 3.14 1.1414 2 4.5 5.7 1.14 3.1459 ... 500 4.3 5.4 0.000 0.001 END Then I can use skiprows=2, but loadtxt will choke when it gets to 'END'. To read t ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt stop
On Sep 17, 2010, at 2:40 PM, Neil Hodgson wrote: oops, I meant to save my post but I sent it instead - doh! In the end, the question was; is worth adding start= and stop= markers into loadtxt to allow grabbing sections of a file between two known headers? I imagine it's something that people come up against regularly. genfromtxt comes with skip_header and skip_footer that do what you want. Earlier this week, I corrected a bug w/ skip_footer on the SVN (now git) version of the sources. Please check it out. Try to be as specific as possible with your input options, that'll make genfromtxt more efficient. ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt stop
Neil Hodgson wrote: In the end, the question was; is worth adding start= and stop= markers into loadtxt to allow grabbing sections of a file between two known headers? I imagine it's something that people come up against regularly. maybe not so regular. However, a common use would be to be able load only n rows, which also does not appear to be supported. That would be nice. -Chris Thanks, Neil *From:* Neil Hodgson hodgson.n...@yahoo.co.uk *To:* numpy-discussion@scipy.org *Sent:* Fri, 17 September, 2010 14:17:12 *Subject:* loadtxt stop Hi, I been looking around and could spot anything on this. Quite often I want to read a homogeneous block of data from within a file. The skiprows option is great for missing out the section before the data starts, but if there is anything below then loadtxt will choke. I wondered if there was a possibility to put an endmarker= ? For example, if I want to load text from a large! file that looks like this header line header line 1 2.0 3.0 2 4.5 5.7 ... 500 4.3 5.4 END more headers more headers 1 2.0 3.0 3.14 1.1414 2 4.5 5.7 1.14 3.1459 ... 500 4.3 5.4 0.000 0.001 END Then I can use skiprows=2, but loadtxt will choke when it gets to 'END'. To read t ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR(206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt stop
In the end, the question was; is worth adding start= and stop= markers into loadtxt to allow grabbing sections of a file between two known headers? I imagine it's something that people come up against regularly. Simple enough to wrap your file in a new file-like object that stops coughing up lines when the delimiter is found, no? class TruncatingFile(object): def __init__(self, fh, delimiter='END'): self.fh = fh self.delimiter=delimiter self.done = False def readline(self): if self.done: return '' line = self.fh.readline() if line.strip() == self.delimiter: self.done = True return '' return line def __iter__(self): return self def next(self): line = self.fh.next() if line.strip() == self.delimiter: self.done = True raise StopIteration() return line from StringIO import StringIO c = StringIO(0 1\n2 3\nEND) numpy.loadtxt(TruncatingFile(c)) Though, really, it's annoying that numpy.loadtxt needs both the readline function *and* the iterator protocol. If it just used iterators, you could do: def truncator(fh, delimiter='END'): for line in fh: if line.strip() == delimiter: break yield line numpy.loadtxt(truncator(c)) Maybe I'll try to work up a patch for this. Zach On Sep 17, 2010, at 2:51 PM, Christopher Barker wrote: Neil Hodgson wrote: In the end, the question was; is worth adding start= and stop= markers into loadtxt to allow grabbing sections of a file between two known headers? I imagine it's something that people come up against regularly. maybe not so regular. However, a common use would be to be able load only n rows, which also does not appear to be supported. That would be nice. -Chris Thanks, Neil *From:* Neil Hodgson hodgson.n...@yahoo.co.uk *To:* numpy-discussion@scipy.org *Sent:* Fri, 17 September, 2010 14:17:12 *Subject:* loadtxt stop Hi, I been looking around and could spot anything on this. Quite often I want to read a homogeneous block of data from within a file. The skiprows option is great for missing out the section before the data starts, but if there is anything below then loadtxt will choke. I wondered if there was a possibility to put an endmarker= ? For example, if I want to load text from a large! file that looks like this header line header line 1 2.0 3.0 2 4.5 5.7 ... 500 4.3 5.4 END more headers more headers 1 2.0 3.0 3.14 1.1414 2 4.5 5.7 1.14 3.1459 ... 500 4.3 5.4 0.000 0.001 END Then I can use skiprows=2, but loadtxt will choke when it gets to 'END'. To read t ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR(206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt stop
Though, really, it's annoying that numpy.loadtxt needs both the readline function *and* the iterator protocol. If it just used iterators, you could do: def truncator(fh, delimiter='END'): for line in fh: if line.strip() == delimiter: break yield line numpy.loadtxt(truncator(c)) Maybe I'll try to work up a patch for this. That seemed easy... worth applying? Won't break compatibility, because the previous loadtxt required both fname.readline and fname.__iter__, while this requires only the latter. Index: numpy/lib/npyio.py === --- numpy/lib/npyio.py (revision 8716) +++ numpy/lib/npyio.py (working copy) @@ -597,10 +597,11 @@ fh = bz2.BZ2File(fname) else: fh = open(fname, 'U') -elif hasattr(fname, 'readline'): -fh = fname else: -raise ValueError('fname must be a string or file handle') + try: + fh = iter(fname) + except: + raise ValueError('fname must be a string or file handle') X = [] def flatten_dtype(dt): @@ -633,14 +634,18 @@ # Skip the first `skiprows` lines for i in xrange(skiprows): -fh.readline() +try: +fh.next() +except StopIteration: +raise IOError('End-of-file reached before encountering data.') # Read until we find a line with some values, and use # it to estimate the number of columns, N. first_vals = None while not first_vals: -first_line = fh.readline() -if not first_line: # EOF reached +try: +first_line = fh.next() +except StopIteration: raise IOError('End-of-file reached before encountering data.') first_vals = split_line(first_line) N = len(usecols or first_vals) ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt stop
On Fri, Sep 17, 2010 at 2:50 PM, Zachary Pincus zachary.pin...@yale.eduwrote: Though, really, it's annoying that numpy.loadtxt needs both the readline function *and* the iterator protocol. If it just used iterators, you could do: def truncator(fh, delimiter='END'): for line in fh: if line.strip() == delimiter: break yield line numpy.loadtxt(truncator(c)) Maybe I'll try to work up a patch for this. That seemed easy... worth applying? Won't break compatibility, because the previous loadtxt required both fname.readline and fname.__iter__, while this requires only the latter. Index: numpy/lib/npyio.py === --- numpy/lib/npyio.py (revision 8716) +++ numpy/lib/npyio.py (working copy) @@ -597,10 +597,11 @@ fh = bz2.BZ2File(fname) else: fh = open(fname, 'U') -elif hasattr(fname, 'readline'): -fh = fname else: -raise ValueError('fname must be a string or file handle') + try: + fh = iter(fname) + except: + raise ValueError('fname must be a string or file handle') X = [] def flatten_dtype(dt): @@ -633,14 +634,18 @@ # Skip the first `skiprows` lines for i in xrange(skiprows): -fh.readline() +try: +fh.next() +except StopIteration: +raise IOError('End-of-file reached before encountering data.') # Read until we find a line with some values, and use # it to estimate the number of columns, N. first_vals = None while not first_vals: -first_line = fh.readline() -if not first_line: # EOF reached +try: +first_line = fh.next() +except StopIteration: raise IOError('End-of-file reached before encountering data.') first_vals = split_line(first_line) N = len(usecols or first_vals) So, this code will still raise an error for an empty file. Personally, I consider that a bug because I would expect to receive an empty array. I could understand raising an error for a non-empty file that does not contain anything useful. For comparison, Matlab returns an empty matrix for loading an emtpy text file. This has been a long-standing annoyance for me, along with the behavior with a single-line data file. Ben Root ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt stop
On Sep 17, 2010, at 3:59 PM, Benjamin Root wrote: So, this code will still raise an error for an empty file. Personally, I consider that a bug because I would expect to receive an empty array. I could understand raising an error for a non-empty file that does not contain anything useful. For comparison, Matlab returns an empty matrix for loading an emtpy text file. This has been a long-standing annoyance for me, along with the behavior with a single-line data file. Agreed... I just wanted to make the patch as identical in behavior to the old version as possible. Though again, simple shims around loadtxt (as in my previous examples) can yield the desired behavior easily enough. ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt stop
On Fri, Sep 17, 2010 at 3:04 PM, Zachary Pincus zachary.pin...@yale.eduwrote: On Sep 17, 2010, at 3:59 PM, Benjamin Root wrote: So, this code will still raise an error for an empty file. Personally, I consider that a bug because I would expect to receive an empty array. I could understand raising an error for a non-empty file that does not contain anything useful. For comparison, Matlab returns an empty matrix for loading an emtpy text file. This has been a long-standing annoyance for me, along with the behavior with a single-line data file. Agreed... I just wanted to make the patch as identical in behavior to the old version as possible. Though again, simple shims around loadtxt (as in my previous examples) can yield the desired behavior easily enough. Fair enough. No need to mix a bugfix with a feature request. Ben Root ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt() behavior on single-line files
On Thu, Jun 24, 2010 at 1:53 PM, Benjamin Root ben.r...@ou.edu wrote: On Thu, Jun 24, 2010 at 1:00 PM, Warren Weckesser warren.weckes...@enthought.com wrote: Benjamin Root wrote: Hi, I was having the hardest time trying to figure out an intermittent bug in one of my programs. Essentially, in some situations, it was throwing an error saying that the array object was not an array. It took me a while, but then I figured out that my program was assuming that the object returned from a loadtxt() call was always a structured array (I was using dtypes). However, if the data file being loaded only had one data record, then all you get back is a structured record. import numpy as np from StringIO import StringIO strData = StringIO(89.23 47.2\n13.2 42.2) a = np.loadtxt(strData, dtype=[('x', float), ('y', float)]) print Length Two print a print a.shape print len(a) strData = StringIO(53.2 49.2) a = np.loadtxt(strData, dtype=[('x', float), ('y', float)]) print \n\nLength One print a print a.shape try : print len(a) except TypeError as err print ERROR:, err Which gets me this output: Length Two [(89.234, 47.203) (13.199, 42.203)] (2,) 2 Length One (53.203, 49.203) () ERROR: len() of unsized object Note that this isn't restricted to structured arrays. For regular ndarrays, loadtxt() appears to mimic the behavior of np.squeeze(): Exactly. The last four lines of the function are: X = np.squeeze(X) if unpack: return X.T else: return X a = np.ones((1, 1, 1)) np.squeeze(a)[0] IndexError: 0-d arrays can't be indexed strData = StringIO(53.2) a = np.loadtxt(strData) a[0] IndexError: 0-d arrays can't be indexed So, if you have multiple lines with multiple columns, you get a 2-D array, as expected. if you have a single line of data with multiple columns, you get a 1-D array. If you have a single column with many lines, you also get a 1-D array (which is probably expected, I guess). If you have a single column with a single line, you get a scalar (actually, a 0-D array). Is this a bug or a feature? I can see the advantages of having loadtxt() returning the lowest # of dimensions that can hold the given data, but it leaves the code vulnerable to certain edge cases. Maybe there is a different way I should be doing this, but I feel that this behavior at the very least should be included in the loadtxt documentation. It would be useful to be able to tell loadtxt to not call squeeze, so a program that reads column-formatted data doesn't have to treat the case of a single line specially. Warren I don't know if that is the best way to solve the problem. In that case, you would always get a 2-D array, right? Is that useful for those who have text data as a single column? Maybe a mindim keyword (with None as default) and apply an appropriate atleast_Nd() call (or maybe have available an .atleast_nd() function?). But, then what would this mean for structured arrays? One might think that they want at least 2-D, but they really want at least 1-D. Ben Root P.S. - Taking this a step further, the functions completely fail in dealing with empty files... In MATLAB, it returns an empty array (matrix?). I am reviving this dead thread to note that I have filed ticket #1562 on the numpy Trac about this issue: http://projects.scipy.org/numpy/ticket/1562 Ben Root ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
[Numpy-discussion] loadtxt() behavior on single-line files
Hi, I was having the hardest time trying to figure out an intermittent bug in one of my programs. Essentially, in some situations, it was throwing an error saying that the array object was not an array. It took me a while, but then I figured out that my program was assuming that the object returned from a loadtxt() call was always a structured array (I was using dtypes). However, if the data file being loaded only had one data record, then all you get back is a structured record. import numpy as np from StringIO import StringIO strData = StringIO(89.23 47.2\n13.2 42.2) a = np.loadtxt(strData, dtype=[('x', float), ('y', float)]) print Length Two print a print a.shape print len(a) strData = StringIO(53.2 49.2) a = np.loadtxt(strData, dtype=[('x', float), ('y', float)]) print \n\nLength One print a print a.shape try : print len(a) except TypeError as err print ERROR:, err Which gets me this output: Length Two [(89.234, 47.203) (13.199, 42.203)] (2,) 2 Length One (53.203, 49.203) () ERROR: len() of unsized object Note that this isn't restricted to structured arrays. For regular ndarrays, loadtxt() appears to mimic the behavior of np.squeeze(): a = np.ones((1, 1, 1)) np.squeeze(a)[0] IndexError: 0-d arrays can't be indexed strData = StringIO(53.2) a = np.loadtxt(strData) a[0] IndexError: 0-d arrays can't be indexed So, if you have multiple lines with multiple columns, you get a 2-D array, as expected. if you have a single line of data with multiple columns, you get a 1-D array. If you have a single column with many lines, you also get a 1-D array (which is probably expected, I guess). If you have a single column with a single line, you get a scalar (actually, a 0-D array). Is this a bug or a feature? I can see the advantages of having loadtxt() returning the lowest # of dimensions that can hold the given data, but it leaves the code vulnerable to certain edge cases. Maybe there is a different way I should be doing this, but I feel that this behavior at the very least should be included in the loadtxt documentation. Ben Root ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt() behavior on single-line files
Benjamin Root wrote: Hi, I was having the hardest time trying to figure out an intermittent bug in one of my programs. Essentially, in some situations, it was throwing an error saying that the array object was not an array. It took me a while, but then I figured out that my program was assuming that the object returned from a loadtxt() call was always a structured array (I was using dtypes). However, if the data file being loaded only had one data record, then all you get back is a structured record. import numpy as np from StringIO import StringIO strData = StringIO(89.23 47.2\n13.2 42.2) a = np.loadtxt(strData, dtype=[('x', float), ('y', float)]) print Length Two print a print a.shape print len(a) strData = StringIO(53.2 49.2) a = np.loadtxt(strData, dtype=[('x', float), ('y', float)]) print \n\nLength One print a print a.shape try : print len(a) except TypeError as err print ERROR:, err Which gets me this output: Length Two [(89.234, 47.203) (13.199, 42.203)] (2,) 2 Length One (53.203, 49.203) () ERROR: len() of unsized object Note that this isn't restricted to structured arrays. For regular ndarrays, loadtxt() appears to mimic the behavior of np.squeeze(): Exactly. The last four lines of the function are: X = np.squeeze(X) if unpack: return X.T else: return X a = np.ones((1, 1, 1)) np.squeeze(a)[0] IndexError: 0-d arrays can't be indexed strData = StringIO(53.2) a = np.loadtxt(strData) a[0] IndexError: 0-d arrays can't be indexed So, if you have multiple lines with multiple columns, you get a 2-D array, as expected. if you have a single line of data with multiple columns, you get a 1-D array. If you have a single column with many lines, you also get a 1-D array (which is probably expected, I guess). If you have a single column with a single line, you get a scalar (actually, a 0-D array). Is this a bug or a feature? I can see the advantages of having loadtxt() returning the lowest # of dimensions that can hold the given data, but it leaves the code vulnerable to certain edge cases. Maybe there is a different way I should be doing this, but I feel that this behavior at the very least should be included in the loadtxt documentation. It would be useful to be able to tell loadtxt to not call squeeze, so a program that reads column-formatted data doesn't have to treat the case of a single line specially. Warren ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt() behavior on single-line files
Warren Weckesser wrote: Benjamin Root wrote: Note that this isn't restricted to structured arrays. For regular ndarrays, loadtxt() appears to mimic the behavior of np.squeeze(): Exactly. The last four lines of the function are: X = np.squeeze(X) if unpack: return X.T else: return X It would be useful to be able to tell loadtxt to not call squeeze, so a program that reads column-formatted data doesn't have to treat the case of a single line specially. I agree -- it seem to me that every time I load data, I know what shape I expect the result to be -- I'd never want it to squeeze. It might be nice if you could specify the dimensionality of the array you want. But for now: can you just do a reshape? In [42]: strData = StringIO(53.2 49.2) In[43]:a=p.loadtxt(strData,dtype=[('x',float),('y',float)]).reshape((-1,)) In [45]: a.shape Out[45]: (1,) -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR(206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt() behavior on single-line files
On Thu, Jun 24, 2010 at 1:00 PM, Warren Weckesser warren.weckes...@enthought.com wrote: Benjamin Root wrote: Hi, I was having the hardest time trying to figure out an intermittent bug in one of my programs. Essentially, in some situations, it was throwing an error saying that the array object was not an array. It took me a while, but then I figured out that my program was assuming that the object returned from a loadtxt() call was always a structured array (I was using dtypes). However, if the data file being loaded only had one data record, then all you get back is a structured record. import numpy as np from StringIO import StringIO strData = StringIO(89.23 47.2\n13.2 42.2) a = np.loadtxt(strData, dtype=[('x', float), ('y', float)]) print Length Two print a print a.shape print len(a) strData = StringIO(53.2 49.2) a = np.loadtxt(strData, dtype=[('x', float), ('y', float)]) print \n\nLength One print a print a.shape try : print len(a) except TypeError as err print ERROR:, err Which gets me this output: Length Two [(89.234, 47.203) (13.199, 42.203)] (2,) 2 Length One (53.203, 49.203) () ERROR: len() of unsized object Note that this isn't restricted to structured arrays. For regular ndarrays, loadtxt() appears to mimic the behavior of np.squeeze(): Exactly. The last four lines of the function are: X = np.squeeze(X) if unpack: return X.T else: return X a = np.ones((1, 1, 1)) np.squeeze(a)[0] IndexError: 0-d arrays can't be indexed strData = StringIO(53.2) a = np.loadtxt(strData) a[0] IndexError: 0-d arrays can't be indexed So, if you have multiple lines with multiple columns, you get a 2-D array, as expected. if you have a single line of data with multiple columns, you get a 1-D array. If you have a single column with many lines, you also get a 1-D array (which is probably expected, I guess). If you have a single column with a single line, you get a scalar (actually, a 0-D array). Is this a bug or a feature? I can see the advantages of having loadtxt() returning the lowest # of dimensions that can hold the given data, but it leaves the code vulnerable to certain edge cases. Maybe there is a different way I should be doing this, but I feel that this behavior at the very least should be included in the loadtxt documentation. It would be useful to be able to tell loadtxt to not call squeeze, so a program that reads column-formatted data doesn't have to treat the case of a single line specially. Warren I don't know if that is the best way to solve the problem. In that case, you would always get a 2-D array, right? Is that useful for those who have text data as a single column? Maybe a mindim keyword (with None as default) and apply an appropriate atleast_Nd() call (or maybe have available an .atleast_nd() function?). But, then what would this mean for structured arrays? One might think that they want at least 2-D, but they really want at least 1-D. Ben Root P.S. - Taking this a step further, the functions completely fail in dealing with empty files... In MATLAB, it returns an empty array (matrix?). ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
[Numpy-discussion] loadtxt raises an exception on empty file
Hello everybody, I'm using numpy V1.3.0 and ran into a case when numpy.loadtxt('foo.txt') raised an exception: import numpy as np np.loadtxt('foo.txt') Traceback (most recent call last): File stdin, line 1, in module File /Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/numpy/lib/io.py, line 456, in loadtxt raise IOError('End-of-file reached before encountering data.') IOError: End-of-file reached before encountering data. if provided file 'foo.txt' is empty. Would anybody happen to know if it's a feature or a bug? I would expect it to return an empty array. numpy.fromfile() handles empty text files: np.fromfile('foo.txt', sep='\t\n ') array([], dtype=float64) Would anybody suggest a graceful way of handling empty files with numpy.loadtxt() (except for catching an IOError exception)? Many thanks, Masha liu...@usc.edu ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt raises an exception on empty file
On Mon, May 24, 2010 at 4:14 PM, Maria Liukis liu...@usc.edu wrote: Hello everybody, I'm using numpy V1.3.0 and ran into a case when numpy.loadtxt('foo.txt') raised an exception: import numpy as np np.loadtxt('foo.txt') Traceback (most recent call last): File stdin, line 1, in module File /Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/numpy/lib/io.py, line 456, in loadtxt raise IOError('End-of-file reached before encountering data.') IOError: End-of-file reached before encountering data. if provided file 'foo.txt' is empty. Would anybody happen to know if it's a feature or a bug? I would expect it to return an empty array. Looking at the source for loadtxt line 591 # Read until we find a line with some values, and use # it to estimate the number of columns, N. first_vals = None while not first_vals: first_line = fh.readline() if first_line == '': # EOF reached raise IOError('End-of-file reached before encountering data.') So it looks like it is not a bug although I am not sure why returning an empty array would not be valid. But then what are you going to do with the empty array? Vincent numpy.fromfile() handles empty text files: np.fromfile('foo.txt', sep='\t\n ') array([], dtype=float64) Would anybody suggest a graceful way of handling empty files with numpy.loadtxt() (except for catching an IOError exception)? Many thanks, Masha liu...@usc.edu ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion *Vincent Davis 720-301-3003 * vinc...@vincentdavis.net my blog http://vincentdavis.net | LinkedInhttp://www.linkedin.com/in/vincentdavis ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt raises an exception on empty file
You can just catch the exception and decide what to do with it: try: data = np.loadtxt('foo.txt') except IOError: data = 0 # Or something similar Nadav -Original Message- From: numpy-discussion-boun...@scipy.org on behalf of Maria Liukis Sent: Tue 25-May-10 01:14 To: numpy-discussion@scipy.org Subject: [Numpy-discussion] loadtxt raises an exception on empty file Hello everybody, I'm using numpy V1.3.0 and ran into a case when numpy.loadtxt('foo.txt') raised an exception: import numpy as np np.loadtxt('foo.txt') Traceback (most recent call last): File stdin, line 1, in module File /Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/numpy/lib/io.py, line 456, in loadtxt raise IOError('End-of-file reached before encountering data.') IOError: End-of-file reached before encountering data. if provided file 'foo.txt' is empty. Would anybody happen to know if it's a feature or a bug? I would expect it to return an empty array. numpy.fromfile() handles empty text files: np.fromfile('foo.txt', sep='\t\n ') array([], dtype=float64) Would anybody suggest a graceful way of handling empty files with numpy.loadtxt() (except for catching an IOError exception)? Many thanks, Masha liu...@usc.edu ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion winmail.dat___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
[Numpy-discussion] loadtxt and genfromtxt
I am new to python/numpy/scipy and new to this list. I recently migrated over from using Octave and am very impressed so far! Recently I needed to load data from a text file and quickly found numpy's loadtxt function. However, there were missing data values, which loadtxt does not handle. After some amount of googling, I did find genfromtxt which does exactly what I need. It would have been helpful if genfromtxt was included in the See Also portion of the docstring for loadtxt. Perhaps this is a simple oversight? I see that genfromtxt does mention loadtxt in its docstring. Let me know if I should submit a bug somewhere, or if it is sufficient to mention this small item on the list. Thanks, Jonathan P.S. My first send did not seem to go through. Trying again; sorry if this is posted twice... ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt and genfromtxt
On Thu, Feb 11, 2010 at 1:36 AM, Jonathan Stickel jjstic...@vcn.com wrote: I am new to python/numpy/scipy and new to this list. I recently migrated over from using Octave and am very impressed so far! Recently I needed to load data from a text file and quickly found numpy's loadtxt function. However, there were missing data values, which loadtxt does not handle. After some amount of googling, I did find genfromtxt which does exactly what I need. It would have been helpful if genfromtxt was included in the See Also portion of the docstring for loadtxt. Perhaps this is a simple oversight? I see that genfromtxt does mention loadtxt in its docstring. Thanks, fixed: http://docs.scipy.org/numpy/docs/numpy.lib.io.loadtxt/ Let me know if I should submit a bug somewhere, or if it is sufficient to mention this small item on the list. If you find more such things, please consider creating an account in the doc wiki I linked above and contributing directly. After account creation you'd need to ask for edit rights on this list. Cheers, Ralf ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
[Numpy-discussion] loadtxt example problem ?
Hello, I'm new to numpy, and considering using loadtxt() to read a data file. As a starter, I tried the example of the doc page ( http://docs.scipy.org/doc/numpy/reference/generated/numpy.loadtxt.html) : from StringIO import StringIO # StringIO behaves like a file object c = StringIO(0 1\n2 3) np.loadtxt(c) I didn't get the expectd answer, but : Traceback (moste recent call last): File(stdin), line 1, in module File C:\Python25\lib\sire-packages\numpy\core\numeric.py, line 725, in loadtxt X = array(X, dtype) ValueError: setting an array element with a sequence. I'm using verison 1.0.4 of numpy). I got the same problem on a Ms-Windows and a Linux Machine. I could run the example by adding a \n at the end of c : c = StringIO(0 1\n2 3\n) Is it the normal and expected behaviour ? Bruno. ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt example problem ?
On Mon, May 4, 2009 at 3:06 PM, bruno Piguet bruno.pig...@gmail.com wrote: Hello, I'm new to numpy, and considering using loadtxt() to read a data file. As a starter, I tried the example of the doc page ( http://docs.scipy.org/doc/numpy/reference/generated/numpy.loadtxt.html) : from StringIO import StringIO # StringIO behaves like a file object c = StringIO(0 1\n2 3) np.loadtxt(c) I didn't get the expectd answer, but : Traceback (moste recent call last): File(stdin), line 1, in module File C:\Python25\lib\sire-packages\numpy\core\numeric.py, line 725, in loadtxt X = array(X, dtype) ValueError: setting an array element with a sequence. I'm using verison 1.0.4 of numpy). I got the same problem on a Ms-Windows and a Linux Machine. I could run the example by adding a \n at the end of c : c = StringIO(0 1\n2 3\n) Is it the normal and expected behaviour ? Bruno. It's a bug that's been fixed. Numpy 1.0.4 is quite a bit out of date, so I'd recommend updating to the latest (1.3). Ryan -- Ryan May Graduate Research Assistant School of Meteorology University of Oklahoma ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt issues
On 2/11/2009 6:40 AM, A B wrote: Hi, How do I write a loadtxt command to read in the following file and store each data point as the appropriate data type: 12|h|34.5|44.5 14552|bbb|34.5|42.5 dt = {'names': ('gender','age','weight','bal'), 'formats': ('i4', 'S4','f4', 'f4')} Does this work for you? dt = {'names': ('gender','age','weight','bal'), 'formats': ('i4','S4','f4', 'f4')} with open('filename.txt', 'rt') as file: linelst = [line.strip('\n').split('|') for line in file] n = len(linelst) data = numpy.zeros(n, dtype=numpy.dtype(dt)) for i,(gender, age, weight, bal) in zip(range(n),linelst): data[i] = (int(gender), age, float(weight), float(bal)) S.M. ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt issues
On 3/4/2009 12:57 PM, Sturla Molden wrote: Does this work for you? Never mind, it seems my e-mail got messed up. I ought to keep them sorted by date... S.M. ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt slow
On Sun, 1 Mar 2009 14:29:54 -0500, Michael Gilbert wrote: i will send the current version to the list tomorrow when i have access to the system that it is on. attached is my current version of loadtxt. like i said, it's slower for small data sets (because it reads through the whole data file twice). the first loop is used to figure out how much memory to allocate, and i can optimize this by intelligently seeking through the file. but like i said, i haven't had the time to implement it. all of the options should work, except for converters (i have never used converters and i couldn't figure out exactly what it does based on a quick read-through of the docs). best wishes, mike myloadtxt Description: Binary data ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
[Numpy-discussion] loadtxt slow
So I have some data sets of about 16 floating point numbers stored in text files. I find that loadtxt is rather slow. Is this to be expected? Would it be faster if it were loading binary data? -gideon ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt slow
On Sun, 1 Mar 2009 16:12:14 -0500 Gideon Simpson wrote: So I have some data sets of about 16 floating point numbers stored in text files. I find that loadtxt is rather slow. Is this to be expected? Would it be faster if it were loading binary data? i have run into this as well. loadtxt uses a python list to allocate memory for the data it reads in, so once you get to about 1/4th of your available memory, it will start allocating the updated list (every time it reads a new value from your data file) in swap instead of main memory, which is rediculously slow (in fact it causes my system to be quite unresponsive and a jumpy cursor). i have rewritten loadtxt to be smarter about allocating memory, but it is slower overall and doesn't support all of the original arguments/options (yet). i have some ideas to make it smarter/more efficient, but have not had the time to work on it recently. i will send the current version to the list tomorrow when i have access to the system that it is on. best wishes, mike ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt slow
On Sun, 1 Mar 2009 14:29:54 -0500 Michael Gilbert wrote: i have rewritten loadtxt to be smarter about allocating memory, but it is slower overall and doesn't support all of the original arguments/options (yet). i had meant to say that my version is slower for smaller data sets (when you aren't close to your main memory limit), but it is orders of magnitude faster for large data sets. ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt slow
On Sun, Mar 1, 2009 at 11:29 AM, Michael Gilbert michael.s.gilb...@gmail.com wrote: On Sun, 1 Mar 2009 16:12:14 -0500 Gideon Simpson wrote: So I have some data sets of about 16 floating point numbers stored in text files. I find that loadtxt is rather slow. Is this to be expected? Would it be faster if it were loading binary data? i have run into this as well. loadtxt uses a python list to allocate memory for the data it reads in, so once you get to about 1/4th of your available memory, it will start allocating the updated list (every time it reads a new value from your data file) in swap instead of main memory, which is rediculously slow (in fact it causes my system to be quite unresponsive and a jumpy cursor). i have rewritten loadtxt to be smarter about allocating memory, but it is slower overall and doesn't support all of the original arguments/options (yet). i have some ideas to make it smarter/more efficient, but have not had the time to work on it recently. i will send the current version to the list tomorrow when i have access to the system that it is on. best wishes, mike ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion to address the slowness, i use wrappers around savetxt/loadtxt that save/load a .npy file along with/instead of the .txt file. -- and the loadtxt wrapper checks if the .npy is up-to-date. code here: http://rafb.net/p/dGBJjg80.html of course it's still slow the first time. i look forward to your speedups. -brentp ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt slow
Gideon Simpson wrote: So I have some data sets of about 16 floating point numbers stored in text files. I find that loadtxt is rather slow. Is this to be expected? Would it be faster if it were loading binary data? Depending on the format you may be able to use numpy.fromfile, which I suspect would be much faster. It only handles very simple ascii formats, though. Eric ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt issues
On Tue, Feb 10, 2009 at 9:52 PM, Brent Pedersen bpede...@gmail.com wrote: On Tue, Feb 10, 2009 at 9:40 PM, A B python6...@gmail.com wrote: Hi, How do I write a loadtxt command to read in the following file and store each data point as the appropriate data type: 12|h|34.5|44.5 14552|bbb|34.5|42.5 Do the strings have to be read in separately from the numbers? Why would anyone use 'S10' instead of 'string'? dt = {'names': ('gender','age','weight','bal'), 'formats': ('i4', 'S4','f4', 'f4')} a = loadtxt(sample_data.txt, dtype=dt) gives ValueError: need more than 1 value to unpack I can do a = loadtxt(sample_data.txt, dtype=string) but can't use 'string' instead of S4 and all my data is read into strings. Seems like all the examples on-line use either numeric or textual input, but not both. Thanks. ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion works for me but not sure i understand the problem, did you try setting the delimiter? import numpy as np from cStringIO import StringIO txt = StringIO(\ 12|h|34.5|44.5 14552|bbb|34.5|42.5) dt = {'names': ('gender','age','weight','bal'), 'formats': ('i4', 'S4','f4', 'f4')} a = np.loadtxt(txt, dtype=dt, delimiter=|) print a.dtype I had tried both with and without the delimiter. In any event, it just worked for me as well. Not sure what I was missing before. Anyway, thank you. ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt issues
On Wed, Feb 11, 2009 at 6:27 PM, A B python6...@gmail.com wrote: On Tue, Feb 10, 2009 at 9:52 PM, Brent Pedersen bpede...@gmail.com wrote: On Tue, Feb 10, 2009 at 9:40 PM, A B python6...@gmail.com wrote: Hi, How do I write a loadtxt command to read in the following file and store each data point as the appropriate data type: 12|h|34.5|44.5 14552|bbb|34.5|42.5 Do the strings have to be read in separately from the numbers? Why would anyone use 'S10' instead of 'string'? dt = {'names': ('gender','age','weight','bal'), 'formats': ('i4', 'S4','f4', 'f4')} a = loadtxt(sample_data.txt, dtype=dt) gives ValueError: need more than 1 value to unpack I can do a = loadtxt(sample_data.txt, dtype=string) but can't use 'string' instead of S4 and all my data is read into strings. Seems like all the examples on-line use either numeric or textual input, but not both. Thanks. ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion works for me but not sure i understand the problem, did you try setting the delimiter? import numpy as np from cStringIO import StringIO txt = StringIO(\ 12|h|34.5|44.5 14552|bbb|34.5|42.5) dt = {'names': ('gender','age','weight','bal'), 'formats': ('i4', 'S4','f4', 'f4')} a = np.loadtxt(txt, dtype=dt, delimiter=|) print a.dtype I had tried both with and without the delimiter. In any event, it just worked for me as well. Not sure what I was missing before. Anyway, thank you. Actually, I was using two different machines and it appears that the version of numpy available on Ubuntu is seriously out of date (1.0.4). Wonder why ... Version 1.2.1 on a RedHat box worked fine. ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt issues
2009/2/12 A B python6...@gmail.com: Actually, I was using two different machines and it appears that the version of numpy available on Ubuntu is seriously out of date (1.0.4). Wonder why ... See the recent post here http://projects.scipy.org/pipermail/numpy-discussion/2009-February/040252.html Cheers, Scott ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
[Numpy-discussion] loadtxt issues
Hi, How do I write a loadtxt command to read in the following file and store each data point as the appropriate data type: 12|h|34.5|44.5 14552|bbb|34.5|42.5 Do the strings have to be read in separately from the numbers? Why would anyone use 'S10' instead of 'string'? dt = {'names': ('gender','age','weight','bal'), 'formats': ('i4', 'S4','f4', 'f4')} a = loadtxt(sample_data.txt, dtype=dt) gives ValueError: need more than 1 value to unpack I can do a = loadtxt(sample_data.txt, dtype=string) but can't use 'string' instead of S4 and all my data is read into strings. Seems like all the examples on-line use either numeric or textual input, but not both. Thanks. ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] loadtxt issues
On Tue, Feb 10, 2009 at 9:40 PM, A B python6...@gmail.com wrote: Hi, How do I write a loadtxt command to read in the following file and store each data point as the appropriate data type: 12|h|34.5|44.5 14552|bbb|34.5|42.5 Do the strings have to be read in separately from the numbers? Why would anyone use 'S10' instead of 'string'? dt = {'names': ('gender','age','weight','bal'), 'formats': ('i4', 'S4','f4', 'f4')} a = loadtxt(sample_data.txt, dtype=dt) gives ValueError: need more than 1 value to unpack I can do a = loadtxt(sample_data.txt, dtype=string) but can't use 'string' instead of S4 and all my data is read into strings. Seems like all the examples on-line use either numeric or textual input, but not both. Thanks. ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion works for me but not sure i understand the problem, did you try setting the delimiter? import numpy as np from cStringIO import StringIO txt = StringIO(\ 12|h|34.5|44.5 14552|bbb|34.5|42.5) dt = {'names': ('gender','age','weight','bal'), 'formats': ('i4', 'S4','f4', 'f4')} a = np.loadtxt(txt, dtype=dt, delimiter=|) print a.dtype ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Loadtxt .bz2 support
Charles R Harris wrote: On Tue, Oct 21, 2008 at 1:30 PM, Ryan May [EMAIL PROTECTED] wrote: Hi, I noticed numpy.loadtxt has support for gzipped text files, but not for bz2'd files. Here's a 3 line patch to add bzip2 support to loadtxt. Ryan -- Ryan May Graduate Research Assistant School of Meteorology University of Oklahoma Index: numpy/lib/io.py === --- numpy/lib/io.py (revision 5953) +++ numpy/lib/io.py (working copy) @@ -320,6 +320,9 @@ if fname.endswith('.gz'): import gzip fh = gzip.open(fname) +elif fname.endswith('.bz2'): +import bz2 +fh = bz2.BZ2File(fname) else: fh = file(fname) elif hasattr(fname, 'seek'): Could you open a ticket for this? Mark it as an enhancement. Done. #940 http://scipy.org/scipy/numpy/ticket/940 Ryan -- Ryan May Graduate Research Assistant School of Meteorology University of Oklahoma ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
[Numpy-discussion] Loadtxt .bz2 support
Hi, I noticed numpy.loadtxt has support for gzipped text files, but not for bz2'd files. Here's a 3 line patch to add bzip2 support to loadtxt. Ryan -- Ryan May Graduate Research Assistant School of Meteorology University of Oklahoma Index: numpy/lib/io.py === --- numpy/lib/io.py (revision 5953) +++ numpy/lib/io.py (working copy) @@ -320,6 +320,9 @@ if fname.endswith('.gz'): import gzip fh = gzip.open(fname) +elif fname.endswith('.bz2'): +import bz2 +fh = bz2.BZ2File(fname) else: fh = file(fname) elif hasattr(fname, 'seek'): ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion