Re: [Numpy-discussion] np.loadtxt : yet a new implementation...
Pierre GM wrote:
>> I think that treating an explicitly-passed-in ' ' delimiter as identical
>> to 'no delimiter' is a bad idea. If I say that ' ' is the delimiter, or
>> '\t' is the delimiter, this should be treated *just* like ',' being the
>> delimiter, where the expected output is: ['1', '2', '3', '4', '', '5']
>
> Valid point. Well, all, stay tuned for yet another "yet another
> implementation"...

Found a problem. If you read the names from the file and also specify usecols, you end up with the first N names read from the file as the fields in your output (where N is the number of entries in usecols), instead of the names of the columns you actually asked for. For instance:

    >>> from StringIO import StringIO
    >>> from genload_proposal import loadtxt
    >>> f = StringIO('stid stnm relh tair\nnrmn 121 45 9.1')
    >>> loadtxt(f, usecols=('stid', 'relh', 'tair'), names=True, dtype=None)
    array(('nrmn', 45, 9.0996),
          dtype=[('stid', '|S4'), ('stnm', 'i8'), ('relh', 'f8')])

What I want to come out is:

    array(('nrmn', 45, 9.0996),
          dtype=[('stid', '|S4'), ('relh', 'i8'), ('tair', 'f8')])

I've attached a version that fixes this by setting a flag internally if the names are read from the file. If this flag is true, the names are filtered down at the end to only the ones given in usecols.

I also have one other thought: is there any way we can make this handle object arrays, or rather, a field containing objects, specifically datetime objects? Right now this does not work, because calling view does not work for object arrays. I'm just looking for a simple way to store date/time in my record array (currently a string field).

Ryan

--
Ryan May
Graduate Research Assistant
School of Meteorology
University of Oklahoma

Proposal: Here's an extension to np.loadtxt, designed to take missing values into account.

    import itertools
    import numpy as np
    import numpy.ma as ma


    def _is_string_like(obj):
        """Check whether obj behaves like a string."""
        try:
            obj + ''
        except (TypeError, ValueError):
            return False
        return True


    def _to_filehandle(fname, flag='r', return_opened=False):
        """
        Returns the filehandle corresponding to a string or a file.
        If the string ends in '.gz', the file is automatically unzipped.

        Parameters
        ----------
        fname : string, filehandle
            Name of the file whose filehandle must be returned.
        flag : string, optional
            Flag indicating the status of the file ('r' for read, 'w' for write).
        return_opened : boolean, optional
            Whether to return the opening status of the file.
        """
        if _is_string_like(fname):
            if fname.endswith('.gz'):
                import gzip
                fhd = gzip.open(fname, flag)
            elif fname.endswith('.bz2'):
                import bz2
                fhd = bz2.BZ2File(fname)
            else:
                fhd = file(fname, flag)
            opened = True
        elif hasattr(fname, 'seek'):
            fhd = fname
            opened = False
        else:
            raise ValueError('fname must be a string or file handle')
        if return_opened:
            return fhd, opened
        return fhd


    def flatten_dtype(ndtype):
        """Unpack a structured data-type."""
        names = ndtype.names
        if names is None:
            return [ndtype]
        else:
            types = []
            for field in names:
                (typ, _) = ndtype.fields[field]
                flat_dt = flatten_dtype(typ)
                types.extend(flat_dt)
            return types


    def nested_masktype(datatype):
        """Construct the dtype of a mask for nested elements."""
        names = datatype.names
        if names:
            descr = []
            for name in names:
                (ndtype, _) = datatype.fields[name]
                descr.append((name, nested_masktype(ndtype)))
            return descr
        # Is this some kind of composite a la (np.float, 2) ?
        elif datatype.subdtype:
            mdescr = list(datatype.subdtype)
            mdescr[0] = np.dtype(bool)
            return tuple(mdescr)
        else:
            return np.bool


    class LineSplitter:
        """
        Defines a function to split a string at a given delimiter or at
        given places.

        Parameters
        ----------
        comments : {'#', string}
            Character used to mark the beginning of a comment.
        delimiter : var
        """
        def __init__(self, delimiter=None, comments='#'):
            self.comments = comments
            # Delimiter is a character
            if delimiter is None:
                self._isfixed = False
                self.delimiter = None
            elif _is_string_like(delimiter):
                self._isfixed = False
                self.delimiter = delimiter.strip() or None
            # Delimiter is a list of field widths
            elif hasattr(delimiter, '__iter__'):
                self._isfixed = True
                idx = np.cumsum([0] + list(delimiter))
                self.slices = [slice(i, j) for (i, j) in zip(idx[:-1], idx[1:])]
            # Delimiter is a single integer
            elif int(delimiter):
                self._isfixed = True
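The usecols/names fix described above (filter the file-read names down to the requested columns) can be sketched in a few lines; `filter_names` is a hypothetical helper for illustration, not the name used in the attached patch:

```python
def filter_names(names, usecols):
    # Keep only the fields listed in usecols, preserving the order in
    # which they appeared in the file, so the output dtype describes
    # the columns that were actually read.
    if usecols is None:
        return list(names)
    return [name for name in names if name in usecols]

header = ['stid', 'stnm', 'relh', 'tair']
print(filter_names(header, usecols=('stid', 'relh', 'tair')))
# ['stid', 'relh', 'tair']
```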
Re: [Numpy-discussion] np.loadtxt : yet a new implementation...
If I know my data is already clean and is handled nicely by the old loadtxt, will I be able to turn off the special handling in order to retain the old load speed?

Alan Isaac

_______________________________________________
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] np.loadtxt : yet a new implementation...
Pierre GM wrote:
> I can try, but in that case, please write me a unittest, so that I have
> a clear and unambiguous idea of what you expect.

fair enough, though I'm not sure when I'll have time to do it.

I do wonder if anyone else thinks it would be useful to have multiple delimiters as an option. I got the idea because with fromfile(), if you specify, say, ',' as the delimiter, it won't use '\n', only a comma, so there is no way to quickly read a whole bunch of comma-delimited data like:

    1,2,3,4
    5,6,7,8

so I'd like to be able to say to use either ',' or '\n' as the delimiter. However, if I understand loadtxt() correctly, it's handling the new lines separately anyway (to get a 2-d array), so this use case isn't an issue. So how likely is it that someone would have:

    1 2 3, 4, 5
    6 7 8, 8, 9

and want to read that into a single 2-d array? I'm not sure I've seen it.

-Chris

--
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/ORR            (206) 526-6959 voice
7600 Sand Point Way NE  (206) 526-6329 fax
Seattle, WA 98115       (206) 526-6317 main reception
[EMAIL PROTECTED]
Re: [Numpy-discussion] np.loadtxt : yet a new implementation...
On Dec 3, 2008, at 12:48 PM, Christopher Barker wrote:
> Pierre GM wrote:
>> I can try, but in that case, please write me a unittest, so that I have
>> a clear and unambiguous idea of what you expect.
> fair enough, though I'm not sure when I'll have time to do it.

Oh, don't worry, nothing too fancy: give me a couple lines of input data and a line with what you expect. Using Ryan's recent example:

    f = StringIO('stid stnm relh tair\nnrmn 121 45 9.1')
    test = loadtxt(f, usecols=('stid', 'relh', 'tair'), names=True, dtype=None)
    control = array(('nrmn', 45, 9.0996),
                    dtype=[('stid', '|S4'), ('relh', 'i8'), ('tair', 'f8')])

That's quite enough for a test.

> I do wonder if anyone else thinks it would be useful to have multiple
> delimiters as an option. I got the idea because with fromfile(), if you
> specify, say, ',' as the delimiter, it won't use '\n', only a comma, so
> there is no way to quickly read a whole bunch of comma-delimited data
> like:
>
>     1,2,3,4
>     5,6,7,8
>
> so I'd like to be able to say to use either ',' or '\n' as the delimiter.

I'm not quite sure I follow you. Do you want two delimiters, one for the fields of a record (','), one for the records ('\n')?

> However, if I understand loadtxt() correctly, it's handling the new
> lines separately anyway (to get a 2-d array), so this use case isn't an
> issue. So how likely is it that someone would have:
>
>     1 2 3, 4, 5
>     6 7 8, 8, 9
>
> and want to read that into a single 2-d array?

With the current behaviour, you're gonna have [('1 2 3', '4', '5'), ('6 7 8', '8', '9')] if you use ',' as a delimiter, [('1','2','3,','4,','5'), ('6','7','8,','8,','9')] if you use ' ' as a delimiter. Mixing delimiters is doable, but I don't think it's that good an idea. I'm in favor of sticking to one and only one field delimiter, and the default line separator for record delimiter. In other terms, not changing anything.
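As a self-contained version of that check: the genloadtxt proposal discussed in this thread is the ancestor of today's np.genfromtxt, so the same input and expectation can be written as a runnable test against it (modern-NumPy sketch, not the code under review):

```python
from io import StringIO

import numpy as np

# Same data and expectation as Ryan's example: read names from the
# header, keep only three columns, infer the dtypes.
f = StringIO('stid stnm relh tair\nnrmn 121 45 9.1')
test = np.genfromtxt(f, usecols=('stid', 'relh', 'tair'),
                     names=True, dtype=None, encoding=None)

assert test['stid'] == 'nrmn'
assert test['relh'] == 45
assert abs(test['tair'] - 9.1) < 1e-6
```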
Re: [Numpy-discussion] np.loadtxt : yet a new implementation...
On Dec 3, 2008, at 12:32 PM, Alan G Isaac wrote:
> If I know my data is already clean and is handled nicely by the old
> loadtxt, will I be able to turn off the special handling in order to
> retain the old load speed?

Hopefully. I'm looking for the best way to do it. Do you have an example you could send me off-list so that I can play with timers? Thx in advance.
P.
Re: [Numpy-discussion] np.loadtxt : yet a new implementation...
by the way, should this work:

    io.loadtxt('junk.dat', delimiter=' ')

for more than one space between numbers, like:

    1  2  3  4  5
    6  7  8  9 10

I get:

    >>> io.loadtxt('junk.dat', delimiter=' ')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/numpy/lib/io.py", line 403, in loadtxt
        X.append(tuple([conv(val) for (conv, val) in zip(converters, vals)]))
    ValueError: empty string for float()

with the current version.

    >>> io.loadtxt('junk.dat', delimiter=None)
    array([[  1.,   2.,   3.,   4.,   5.],
           [  6.,   7.,   8.,   9.,  10.]])

does work.

-Chris
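The failure above comes straight from Python string splitting: an explicit single-space delimiter produces an empty field wherever two spaces are adjacent, and float('') then raises. A quick illustration:

```python
line = '1  2  3  4  5'          # two spaces between numbers

# Explicit single-space delimiter: every doubled space yields an empty field.
parts_explicit = line.split(' ')
# Delimiter None: any run of whitespace counts as one separator.
parts_any = line.split(None)

print(parts_explicit)  # ['1', '', '2', '', '3', '', '4', '', '5']
print(parts_any)       # ['1', '2', '3', '4', '5']

# loadtxt then fails converting the empty field:
try:
    float('')
except ValueError as e:
    print(e)
```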
Re: [Numpy-discussion] np.loadtxt : yet a new implementation...
Pierre GM wrote:
> Oh, don't worry, nothing too fancy: give me a couple lines of input
> data and a line with what you expect.

I just went and looked at the existing tests, and you're right, it's very easy -- my first foray into the new nose tests -- very nice!

>> so I'd like to be able to say to use either ',' or '\n' as the delimiter.
> I'm not quite sure I follow you. Do you want two delimiters, one for the
> fields of a record (','), one for the records ('\n')?

well, in the case of fromfile(), it doesn't do records -- it will only give you a 1-d array, so I want it all as a flat array, and you can re-size it yourself later. Clearly this is more work (and requires more knowledge of your data) than using loadtxt, but sometimes I really want FAST data reading of simple formats. However, this isn't fromfile() we are talking about now, it's loadtxt()...

>> So how likely is it that someone would have:
>>
>>     1 2 3, 4, 5
>>     6 7 8, 8, 9
>>
>> and want to read that into a single 2-d array?
> With the current behaviour, you're gonna have [('1 2 3', '4', '5'),
> ('6 7 8', '8', '9')] if you use ',' as a delimiter, [('1','2','3,',
> '4,','5'), ('6','7','8,','8,','9')] if you use ' ' as a delimiter.

right.

> Mixing delimiters is doable, but I don't think it's that good an idea.

I can't come up with a use case at this point, so...

> I'm in favor of sticking to one and only one field delimiter, and the
> default line separator for record delimiter. In other terms, not
> changing anything.

I agree -- sorry for the noise!

-Chris
Re: [Numpy-discussion] np.loadtxt : yet a new implementation...
Alan G Isaac wrote:
> If I know my data is already clean and is handled nicely by the old
> loadtxt, will I be able to turn off the special handling in order to
> retain the old load speed?

what I'd like to see is a version of loadtxt built on a slightly enhanced fromfile() -- that would be blazingly fast for the easy cases (simple tabular data of one dtype). I don't know if the special-casing should be automatic, or just have it be a separate function. Also, fromfile() needs some work, and it needs to be done in C, which is less fun, so who knows when it will get done.

As I think about it, maybe what I really want is a simple version of loadtxt written in C:

- It would only handle one data type at a time.
- It would support simple comment lines.
- It would only support one delimiter (plus newline).
- It would create a 2-d array from normal, tabular data.
- You could specify: how many numbers you wanted, or how many rows, or read 'till EOF.

Actually, this is a lot like matlab's fscanf().

someday...

-Chris
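The flat-read-then-reshape workflow described here already works for the easy cases with np.fromfile in text mode, since a space separator matches any run of whitespace, newlines included (the file name and values below are made up for illustration):

```python
import os
import tempfile

import numpy as np

# Create a small whitespace-delimited file to read back.
fname = os.path.join(tempfile.mkdtemp(), 'junk.dat')
with open(fname, 'w') as f:
    f.write('1 2 3 4\n5 6 7 8\n')

# fromfile gives a flat 1-d array; rows are not treated specially
# because the ' ' separator also consumes the newlines.
flat = np.fromfile(fname, dtype=float, sep=' ')
arr = flat.reshape(-1, 4)   # re-size it yourself later
print(arr)
```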
Re: [Numpy-discussion] np.loadtxt : yet a new implementation...
On Dec 3, 2008, at 1:00 PM, Christopher Barker wrote:
> by the way, should this work:
>
>     io.loadtxt('junk.dat', delimiter=' ')
>
> for more than one space between numbers, like:
>
>     1  2  3  4  5
>     6  7  8  9 10

On the version I'm working on, both delimiter='' and delimiter=None (default) would give you the expected output. delimiter=' ' (a single space) would fail; delimiter='  ' (two spaces) would work.
Re: [Numpy-discussion] np.loadtxt : yet a new implementation...
Alan G Isaac wrote:
> If I know my data is already clean and is handled nicely by the old
> loadtxt, will I be able to turn off the special handling in order to
> retain the old load speed?
>
> Alan Isaac

Hi all,
that's going in the same direction I was thinking about. When I thought about an improved version of loadtxt, I wished it was fault tolerant without losing too much performance. So my solution was much simpler than the very nice genloadtxt function -- and it works for me. My ansatz is to leave the existing loadtxt function unchanged. I only replaced the default converter calls by a fault-tolerant converter class. I attached a patch against io.py in numpy 1.2.1.

The nice thing is that it not only handles missing values, but for example also columns/fields with non-number characters. It just returns nan in these cases. This is of practical importance for many datafiles of astronomical catalogues, for example the Hipparcos catalogue data.

Regarding the performance, it is a little bit slower than the original loadtxt, but not much: on my machine, 10x reading in a clean testfile with 3 columns and 2 rows I get the following results:

    original loadtxt: ~1.3s
    modified loadtxt: ~1.7s
    new genloadtxt  : ~2.7s

So you see, there is some loss of performance, but not as much as with the new genloadtxt. I hope this solution is of interest ...

Manuel

    237a238,247
    > class _faultsaveconv(object):
    >     def __init__(self, conv):
    >         self._conv = conv
    >     def __call__(self, x):
    >         try:
    >             return self._conv(x)
    >         except:
    >             return np.nan
    241c251
    <         return lambda x: bool(int(x))
    ---
    >         return _faultsaveconv(lambda x: bool(int(x)))
    243c253
    <         return lambda x: int(float(x))
    ---
    >         return _faultsaveconv(lambda x: int(float(x)))
    245c255
    <         return float
    ---
    >         return _faultsaveconv(float)
    247c257
    <         return complex
    ---
    >         return _faultsaveconv(complex)
    249c259
    <         return str
    ---
    >         return _faultsaveconv(str)
Re: [Numpy-discussion] np.loadtxt : yet a new implementation...
Pierre GM wrote:
> On the version I'm working on, both delimiter='' and delimiter=None
> (default) would give you the expected output.

so empty string and None both mean "any white space"? also tabs, etc?

> delimiter=' ' would fail,

so only exactly that delimiter. Is that so things like '\t' will work right? but what about:

    4, 5, 34,123,

In that case, ',' is the delimiter, but whitespace is ignored. or:

    4\t 5\t 34\t 123.

we're ignoring extra whitespace there, too, so I'm not sure why we shouldn't ignore it in the ' ' case also.

> delimiter='  ' would work.

but in my example, there were sometimes two spaces, sometimes three -- so I think it would fail, no?

    >>> '1  2  3  4   5'.split('  ')
    ['1', '2', '3', '4', ' 5']

actually, that would work, but four spaces wouldn't:

    >>> '1  2  3  4    5'.split('  ')
    ['1', '2', '3', '4', '', '5']

I guess the solution is to use delimiter=None in that case, and it does make sense that you can't have ' ' mean "one or more spaces", but '\t' mean only one tab.

-Chris
Re: [Numpy-discussion] np.loadtxt : yet a new implementation...
Manuel Metz wrote:
> [... same message as above, with the first version of the diff attached ...]

Oops, wrong version of the diff file. Wanted to name the class _faulttolerantconv ...
    237a238,247
    > class _faulttolerantconv(object):
    >     def __init__(self, conv):
    >         self._conv = conv
    >     def __call__(self, x):
    >         try:
    >             return self._conv(x)
    >         except:
    >             return np.nan
    241c251
    <         return lambda x: bool(int(x))
    ---
    >         return _faulttolerantconv(lambda x: bool(int(x)))
    243c253
    <         return lambda x: int(float(x))
    ---
    >         return _faulttolerantconv(lambda x: int(float(x)))
    245c255
    <         return float
    ---
    >         return _faulttolerantconv(float)
    247c257
    <         return complex
    ---
    >         return _faulttolerantconv(complex)
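Rendered as plain Python rather than a diff, the patched converter class looks roughly like this (the surrounding io.py context is omitted; the usage lines at the bottom are my illustration, not part of the patch):

```python
import numpy as np

class _faulttolerantconv(object):
    """Wrap a converter so that unparseable fields become nan
    instead of raising."""
    def __init__(self, conv):
        self._conv = conv

    def __call__(self, x):
        try:
            return self._conv(x)
        except Exception:
            return np.nan

# A fault-tolerant float converter, as the patched loadtxt would build it:
conv = _faulttolerantconv(float)
print([conv('1.5'), conv(''), conv('N/A')])  # [1.5, nan, nan]
```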
Re: [Numpy-discussion] np.loadtxt : yet a new implementation...
Manuel,
Looks nice, I'm gonna try to see how I can incorporate yours. Note that returning np.nan by default will not work w/ Python 2.6 if you want an int...
Re: [Numpy-discussion] np.loadtxt : yet a new implementation...
On 12/2/2008 7:21 AM Joris De Ridder apparently wrote:
> As a historical note, we used to have scipy.io.read_array which at the
> time was considered by Travis too slow and too grandiose to be put in
> Numpy. As a consequence, numpy.loadtxt() was created which was simple
> and fast. Now it looks like we're going back to something grandiose.
> But perhaps it can be made grandiose *and* reasonably fast ;-).

I hope this consideration remains prominent in this thread. Is the disappearance of read_array the reason for this change? What happened to it? Note that read_array_demo1.py is still in scipy.io despite the loss of read_array.

Alan Isaac
Re: [Numpy-discussion] np.loadtxt : yet a new implementation...
On 12/2/2008 8:12 AM Alan G Isaac apparently wrote:
> I hope this consideration remains prominent in this thread. Is the
> disappearance of read_array the reason for this change? What happened
> to it?

Apologies; it is only deprecated, not gone.

Alan Isaac
Re: [Numpy-discussion] np.loadtxt : yet a new implementation...
On 1 Dec 2008, at 21:47, Stéfan van der Walt wrote:
> Hi Pierre
>
> 2008/12/1 Pierre GM [EMAIL PROTECTED]:
>> * `genloadtxt` is the base function that does all the work. It outputs
>> 2 arrays, one for the data (missing values being substituted by the
>> appropriate default) and one for the mask. It would go in np.lib.io
>
> I see the code length increased from 200 lines to 800. This made me
> wonder about the execution time: initial benchmarks suggest a 3x
> slow-down. Could this be a problem for loading large text files? If so,
> should we consider keeping both versions around, or by default
> bypassing all the extra hooks?
>
> Regards
> Stéfan

As a historical note, we used to have scipy.io.read_array which at the time was considered by Travis too slow and too grandiose to be put in Numpy. As a consequence, numpy.loadtxt() was created which was simple and fast. Now it looks like we're going back to something grandiose. But perhaps it can be made grandiose *and* reasonably fast ;-).

Cheers,
Joris

P.S. As a reference: http://article.gmane.org/gmane.comp.python.numeric.general/5556/

Disclaimer: http://www.kuleuven.be/cwis/email_disclaimer.htm
Re: [Numpy-discussion] np.loadtxt : yet a new implementation...
Hi Pierre,

I've tested the new loadtxt briefly. Looks good, except that there's a minor bug when trying to use a specific white-space delimiter (e.g. '\t') while still allowing other white-space to be allowed in fields (e.g. spaces). Specifically, on line 115 in LineSplitter, we have:

    self.delimiter = delimiter.strip() or None

so if I pass in, say, '\t' as the delimiter, self.delimiter gets set to None, which then causes the default behavior of any-whitespace-is-delimiter to be used. This makes lines like

    Gene Name\tPubMed ID\tStarting Position

get split wrong, even when I explicitly pass in '\t' as the delimiter!

Similarly, I believe that some of the tests are formulated wrong:

    def test_nodelimiter(self):
        "Test LineSplitter w/o delimiter"
        strg = " 1 2 3 4  5 # test"
        test = LineSplitter(' ')(strg)
        assert_equal(test, ['1', '2', '3', '4', '5'])

I think that treating an explicitly-passed-in ' ' delimiter as identical to 'no delimiter' is a bad idea. If I say that ' ' is the delimiter, or '\t' is the delimiter, this should be treated *just* like ',' being the delimiter, where the expected output is:

    ['1', '2', '3', '4', '', '5']

At least, that's what I would expect. Treating contiguous blocks of whitespace as single delimiters is perfectly reasonable when None is provided as the delimiter, but when an explicit delimiter has been provided, it strikes me that the code shouldn't try to further-interpret it...

Does anyone else have any opinion here?

Zach

On Dec 1, 2008, at 1:21 PM, Pierre GM wrote:
> Well, looks like the attachment is too big, so here's the
> implementation. The tests will come in another message.
> genload_proposal.py
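The bug is reproducible with nothing but string methods: stripping a tab leaves an empty string, which `or None` then turns into the split-on-any-whitespace sentinel (the gene-table line is Zach's example, lightly simplified):

```python
delim = '\t'
line = 'Gene Name\tPubMed ID\tStarting Position'

# This mirrors line 115 of LineSplitter: '\t'.strip() == '', which is
# falsy, so the explicit delimiter collapses to None.
effective = delim.strip() or None
print(effective)              # None

print(line.split(effective))  # any-whitespace split: fields broken apart
print(line.split(delim))      # the split the caller actually wanted
```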
Re: [Numpy-discussion] np.loadtxt : yet a new implementation...
Zachary Pincus wrote:
> Specifically, on line 115 in LineSplitter, we have:
>
>     self.delimiter = delimiter.strip() or None
>
> so if I pass in, say, '\t' as the delimiter, self.delimiter gets set to
> None, which then causes the default behavior of
> any-whitespace-is-delimiter to be used.
> [...]
> I think that treating an explicitly-passed-in ' ' delimiter as
> identical to 'no delimiter' is a bad idea. If I say that ' ' is the
> delimiter, or '\t' is the delimiter, this should be treated *just* like
> ',' being the delimiter, where the expected output is:
> ['1', '2', '3', '4', '', '5']
>
> Does anyone else have any opinion here?

I agree. If the user explicitly passes something as a delimiter, we should use it and not try to be too smart. +1

Ryan
Re: [Numpy-discussion] np.loadtxt : yet a new implementation...
Pierre GM wrote:
> Well, looks like the attachment is too big, so here's the
> implementation. The tests will come in another message.

A couple of quick nitpicks:

1) On line 186 (in the NameValidator class), you use excludelist.append() to append a list to the end of a list. I think you meant to use excludelist.extend().

2) When validating a list of names, why do you insist on lower-casing them? (I'm referring to the call to lower() on line 207.) On one hand, this would seem nicer than all upper case, but on the other hand this can cause confusion for someone who sees certain casing of names in the file and expects that data to be laid out the same.

Other than those, it's working fine for me here.

Ryan
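The append/extend mix-up in point 1 is the classic list pitfall; a short illustration (the list contents here are invented, not NameValidator's actual exclude list):

```python
excludelist = ['return', 'file', 'print']
extras = ['self']

wrong = list(excludelist)
wrong.append(extras)   # nests the whole list as one trailing element

right = list(excludelist)
right.extend(extras)   # splices the elements in, which is what was intended

print(wrong)  # ['return', 'file', 'print', ['self']]
print(right)  # ['return', 'file', 'print', 'self']
```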
Re: [Numpy-discussion] np.loadtxt : yet a new implementation...
On Dec 2, 2008, at 3:12 PM, Ryan May wrote:
> 1) On line 186 (in the NameValidator class), you use
> excludelist.append() to append a list to the end of a list. I think you
> meant to use excludelist.extend()

Good call.

> 2) When validating a list of names, why do you insist on lower-casing
> them? (I'm referring to the call to lower() on line 207.) On one hand,
> this would seem nicer than all upper case, but on the other hand this
> can cause confusion for someone who sees certain casing of names in the
> file and expects that data to be laid out the same.

I recall a life where names were case-insensitive, so 'dates' and 'Dates' and 'DATES' were the same field. It should be easy enough to get rid of that limitation, or add a parameter for case-sensitivity.

On Dec 2, 2008, at 2:47 PM, Zachary Pincus wrote:
> Specifically, on line 115 in LineSplitter, we have:
>
>     self.delimiter = delimiter.strip() or None
>
> so if I pass in, say, '\t' as the delimiter, self.delimiter gets set to
> None, which then causes the default behavior of
> any-whitespace-is-delimiter to be used. This makes lines like
> Gene Name\tPubMed ID\tStarting Position
> get split wrong, even when I explicitly pass in '\t' as the delimiter!

OK, I'll check that.

> I think that treating an explicitly-passed-in ' ' delimiter as
> identical to 'no delimiter' is a bad idea. If I say that ' ' is the
> delimiter, or '\t' is the delimiter, this should be treated *just* like
> ',' being the delimiter, where the expected output is:
> ['1', '2', '3', '4', '', '5']

Valid point. Well, all, stay tuned for yet another "yet another implementation"...

> Other than those, it's working fine for me here.
> Ryan
Re: [Numpy-discussion] np.loadtxt : yet a new implementation...
Pierre GM wrote:
>> I think that treating an explicitly-passed-in ' ' delimiter as
>> identical to 'no delimiter' is a bad idea. If I say that ' ' is the
>> delimiter, or '\t' is the delimiter, this should be treated *just*
>> like ',' being the delimiter, where the expected output is:
>> ['1', '2', '3', '4', '', '5']
> Valid point. Well, all, stay tuned for yet another "yet another
> implementation"...

While we're at it, it might be nice to be able to pass in more than one delimiter: ('\t', ' '). Though maybe the only combination that I'd really want would be something and '\n', which I think is being treated specially already.

-Chris
Re: [Numpy-discussion] np.loadtxt : yet a new implementation...
Chris,
I can try, but in that case, please write me a unittest, so that I have a clear and unambiguous idea of what you expect. ANFSCD, have you tried the missing_values option?

On Dec 2, 2008, at 5:36 PM, Christopher Barker wrote:
> While we're at it, it might be nice to be able to pass in more than one
> delimiter: ('\t', ' '). Though maybe the only combination that I'd
> really want would be something and '\n', which I think is being treated
> specially already.
>
> -Chris
[Numpy-discussion] np.loadtxt : yet a new implementation...
All,
Please find attached to this message another implementation of np.loadtxt, which focuses on missing values. It's basically a combination of John Hunter's et al. mlab.csv2rec, Ryan May's patches, and pieces of code I'd been working on over the last few weeks.

Besides some helper classes (StringConverter to convert a string into something else, NameValidator to check names...), you'll find 3 functions:

* `genloadtxt` is the base function that does all the work. It outputs 2 arrays, one for the data (missing values being substituted by the appropriate default) and one for the mask. It would go in np.lib.io.

* `loadtxt` would replace the current np.loadtxt. It outputs an ndarray, with missing data filled. It would also go in np.lib.io.

* `mloadtxt` would go into np.ma.io (to be created) and be renamed `loadtxt`. Right now, I needed a different name to avoid conflicts. It combines the outputs of `genloadtxt` into a single masked array.

You'll also find several series of tests that you can use as examples. Please give it a try and send me some feedback (bugs, wishes, suggestions). I'd like it to make the 1.3.0 release (I need some of the functionalities to improve the corresponding function in scikits.timeseries, currently fubar...)

P.
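This proposal is the ancestor of today's np.genfromtxt, whose usemask flag mirrors the plain/masked split described above, so the intended behavior can be demonstrated with the modern API (a sketch with current NumPy, not the attached code; the -999 fill value is arbitrary):

```python
from io import StringIO

import numpy as np

# Plain variant: missing fields are filled with a chosen default
# (what the proposal's `loadtxt` replacement does).
filled = np.genfromtxt(StringIO('1,2,3\n4,,6'),
                       delimiter=',', filling_values=-999)
print(filled)

# Masked variant: data plus a mask flagging the missing entry
# (what the proposal calls `mloadtxt`).
masked = np.genfromtxt(StringIO('1,2,3\n4,,6'),
                       delimiter=',', usemask=True)
print(masked.mask)
```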
Re: [Numpy-discussion] np.loadtxt : yet a new implementation...
And now for the tests:

"""
Proposal:
Here's an extension to np.loadtxt, designed to take missing values into account.
"""

from genload_proposal import *
from numpy.ma.testutils import *
import StringIO


class TestLineSplitter(TestCase):
    #
    def test_nodelimiter(self):
        "Test LineSplitter w/o delimiter"
        strg = " 1 2 3 4  5 # test"
        test = LineSplitter(' ')(strg)
        assert_equal(test, ['1', '2', '3', '4', '5'])
        test = LineSplitter()(strg)
        assert_equal(test, ['1', '2', '3', '4', '5'])
    #
    def test_delimiter(self):
        "Test LineSplitter on delimiter"
        strg = "1,2,3,4,,5"
        test = LineSplitter(',')(strg)
        assert_equal(test, ['1', '2', '3', '4', '', '5'])
        #
        strg = " 1,2,3,4,,5 # test"
        test = LineSplitter(',')(strg)
        assert_equal(test, ['1', '2', '3', '4', '', '5'])
        #
        strg = " 1 2 3 4  5 # test"
        test = LineSplitter(' ')(strg)
        assert_equal(test, ['1', '2', '3', '4', '5'])
    #
    def test_fixedwidth(self):
        "Test LineSplitter w/ fixed-width fields"
        strg = "  1  2  3  4     5   # test"
        test = LineSplitter(3)(strg)
        assert_equal(test, ['1', '2', '3', '4', '', '5', ''])
        #
        strg = "  1     3  4  5  6# test"
        test = LineSplitter((3, 6, 6, 3))(strg)
        assert_equal(test, ['1', '3', '4  5', '6'])
        #
        test = LineSplitter((6, 6, 9))(strg)
        assert_equal(test, ['1', '3  4', '5  6'])
        #
        test = LineSplitter(20)(strg)
        assert_equal(test, ['1     3  4  5  6'])
        #
        test = LineSplitter(30)(strg)
        assert_equal(test, ['1     3  4  5  6'])


class TestStringConverter(TestCase):
    "Test StringConverter"
    #
    def test_creation(self):
        "Test creation of a StringConverter"
        converter = StringConverter(int, -9)
        assert_equal(converter._status, 1)
        assert_equal(converter.default, -9)
    #
    def test_upgrade(self):
        "Tests the upgrade method."
        converter = StringConverter()
        assert_equal(converter._status, 0)
        converter.upgrade('0')
        assert_equal(converter._status, 1)
        converter.upgrade('0.')
        assert_equal(converter._status, 2)
        converter.upgrade('0j')
        assert_equal(converter._status, 3)
        converter.upgrade('a')
        assert_equal(converter._status, len(converter._mapper) - 1)
    #
    def test_missing(self):
        "Tests the use of missing values."
        converter = StringConverter(missing_values=('missing', 'missed'))
        converter.upgrade('0')
        assert_equal(converter('0'), 0)
        assert_equal(converter(''), converter.default)
        assert_equal(converter('missing'), converter.default)
        assert_equal(converter('missed'), converter.default)
        try:
            converter('miss')
        except ValueError:
            pass
    #
    def test_upgrademapper(self):
        "Tests upgrade_mapper"
        import dateutil.parser
        import datetime
        dateparser = dateutil.parser.parse
        StringConverter.upgrade_mapper(dateparser, datetime.date(2000, 1, 1))
        convert = StringConverter(dateparser, datetime.date(2000, 1, 1))
        test = convert('2001-01-01')
        assert_equal(test, datetime.datetime(2001, 01, 01, 00, 00, 00))


class TestLoadTxt(TestCase):
    #
    def test_record(self):
        "Test w/ explicit dtype"
        data = StringIO.StringIO('1 2\n3 4')
        #data.seek(0)
        test = loadtxt(data, dtype=[('x', np.int32), ('y', np.int32)])
        control = np.array([(1, 2), (3, 4)], dtype=[('x', 'i4'), ('y', 'i4')])
        assert_equal(test, control)
        #
        data = StringIO.StringIO('M 64.0 75.0\nF 25.0 60.0')
        #data.seek(0)
        descriptor = {'names': ('gender', 'age', 'weight'),
                      'formats': ('S1', 'i4', 'f4')}
        control = np.array([('M', 64.0, 75.0), ('F', 25.0, 60.0)],
                           dtype=descriptor)
        test = loadtxt(data, dtype=descriptor)
        assert_equal(test, control)

    def test_array(self):
        "Test outputting a standard ndarray"
        data = StringIO.StringIO('1 2\n3 4')
        control = np.array([[1, 2], [3, 4]], dtype=int)
        test = loadtxt(data, dtype=int)
        assert_array_equal(test, control)
        #
        data.seek(0)
        control = np.array([[1, 2], [3, 4]], dtype=float)
        test = np.loadtxt(data, dtype=float)
        assert_array_equal(test, control)

    def test_1D(self):
        "Test squeezing to 1D"
        control = np.array([1, 2, 3, 4], int)
        #
        data = StringIO.StringIO('1\n2\n3\n4\n')
        test = loadtxt(data, dtype=int)
        assert_array_equal(test, control)
        #
        data = StringIO.StringIO('1,2,3,4\n')
        test = loadtxt(data, dtype=int, delimiter=',')
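(To make the "upgrade" behaviour exercised by the tests above concrete, here is a minimal, self-contained sketch of the idea: try each converter in a fixed mapper order and escalate to a more general type whenever the current one fails. The class and mapper here are illustrative, not the proposal's actual implementation, which also starts at bool and handles missing values.)

```python
class MiniStringConverter:
    # Illustrative mapper: (conversion function, default for missing values),
    # ordered from most specific to most general.
    _mapper = [(int, -1),
               (float, float('nan')),
               (complex, complex(float('nan'), 0)),
               (str, '???')]

    def __init__(self):
        self._status = 0
        (self.func, self.default) = self._mapper[0]

    def upgrade(self, value):
        # Move down the mapper until the value converts cleanly.
        try:
            self.func(value)
        except ValueError:
            if self._status == len(self._mapper) - 1:
                raise
            self._status += 1
            (self.func, self.default) = self._mapper[self._status]
            self.upgrade(value)

    def __call__(self, value):
        return self.func(value)

conv = MiniStringConverter()
conv.upgrade('0')     # int still works: status stays at 0
conv.upgrade('0.')    # int fails -> upgrade to float (status 1)
conv.upgrade('a')     # float and complex fail -> str (status 3)
```

This is the mechanism that lets a single pass over the file settle on one converter per column.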
Re: [Numpy-discussion] np.loadtxt : yet a new implementation...
2008/12/1 Pierre GM [EMAIL PROTECTED]:
> Please find attached to this message another implementation of

Struggling to comply!

Cheers
Stéfan
Re: [Numpy-discussion] np.loadtxt : yet a new implementation...
Well, looks like the attachment is too big, so here's the implementation. The tests will come in another message.

"""
Proposal:
Here's an extension to np.loadtxt, designed to take missing values into account.
"""

import itertools
import numpy as np
import numpy.ma as ma


def _is_string_like(obj):
    "Check whether obj behaves like a string."
    try:
        obj + ''
    except (TypeError, ValueError):
        return False
    return True


def _to_filehandle(fname, flag='r', return_opened=False):
    """
    Returns the filehandle corresponding to a string or a file.
    If the string ends in '.gz', the file is automatically unzipped.

    Parameters
    ----------
    fname : string, filehandle
        Name of the file whose filehandle must be returned.
    flag : string, optional
        Flag indicating the status of the file ('r' for read, 'w' for write).
    return_opened : boolean, optional
        Whether to return the opening status of the file.
    """
    if _is_string_like(fname):
        if fname.endswith('.gz'):
            import gzip
            fhd = gzip.open(fname, flag)
        elif fname.endswith('.bz2'):
            import bz2
            fhd = bz2.BZ2File(fname)
        else:
            fhd = file(fname, flag)
        opened = True
    elif hasattr(fname, 'seek'):
        fhd = fname
        opened = False
    else:
        raise ValueError('fname must be a string or file handle')
    if return_opened:
        return fhd, opened
    return fhd


def flatten_dtype(ndtype):
    "Unpack a structured data-type."
    names = ndtype.names
    if names is None:
        return [ndtype]
    else:
        types = []
        for field in names:
            (typ, _) = ndtype.fields[field]
            flat_dt = flatten_dtype(typ)
            types.extend(flat_dt)
        return types


def nested_masktype(datatype):
    "Construct the dtype of a mask for nested elements."
    names = datatype.names
    if names:
        descr = []
        for name in names:
            (ndtype, _) = datatype.fields[name]
            descr.append((name, nested_masktype(ndtype)))
        return descr
    # Is this some kind of composite a la (np.float, 2) ?
    elif datatype.subdtype:
        mdescr = list(datatype.subdtype)
        mdescr[0] = np.dtype(bool)
        return tuple(mdescr)
    else:
        return np.bool


class LineSplitter:
    """
    Defines a function to split a string at a given delimiter or at given
    places.

    Parameters
    ----------
    comment : {'#', string}
        Character used to mark the beginning of a comment.
    delimiter : var
    """

    def __init__(self, delimiter=None, comments='#'):
        self.comments = comments
        # Delimiter is a character
        if delimiter is None:
            self._isfixed = False
            self.delimiter = None
        elif _is_string_like(delimiter):
            self._isfixed = False
            self.delimiter = delimiter.strip() or None
        # Delimiter is a list of field widths
        elif hasattr(delimiter, '__iter__'):
            self._isfixed = True
            idx = np.cumsum([0] + list(delimiter))
            self.slices = [slice(i, j) for (i, j) in zip(idx[:-1], idx[1:])]
        # Delimiter is a single integer
        elif int(delimiter):
            self._isfixed = True
            self.slices = None
            self.delimiter = delimiter
        else:
            self._isfixed = False
            self.delimiter = None
    #
    def __call__(self, line):
        "Splits the line at each current delimiter. Comments are stripped beforehand."
        # Strip the comments
        line = line.split(self.comments)[0]
        if not line:
            return []
        # Fixed-width fields
        if self._isfixed:
            # Fields have different widths
            if self.slices is None:
                fixed = self.delimiter
                slices = [slice(i, i + fixed)
                          for i in range(len(line))[::fixed]]
            else:
                slices = self.slices
            return [line[s].strip() for s in slices]
        else:
            return [s.strip() for s in line.split(self.delimiter)]


class NameValidator:
    """
    Validates a list of strings to use as field names.
    The strings are stripped of any non alphanumeric character, and spaces
    are replaced by `_`.

    During instantiation, the user can define a list of names to exclude, as
    well as a list of invalid characters. Names in the exclude list are
    appended a '_' character.

    Once an instance has been created, it can be called with a list of names,
    and a list of valid names will be created. The `__call__` method accepts
    an optional keyword, `default`, that sets the default name in case of
    ambiguity. By default, `default = 'f'`, so that names will default to
    `f0`, `f1`...

    Parameters
    ----------
    excludelist : sequence, optional
        A list of names to
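(The three splitting modes of LineSplitter can be exercised without importing genload_proposal. The following is a standalone, function-based restatement of the same logic, written for illustration only:)

```python
def split_line(line, delimiter=None, comments='#'):
    """Mimic LineSplitter: `delimiter` may be None or a string (split on
    it), a single int (constant field width), or a sequence of widths."""
    # Strip the comments first, as LineSplitter.__call__ does.
    line = line.split(comments)[0]
    if not line:
        return []
    if isinstance(delimiter, int) and not isinstance(delimiter, bool):
        # Constant-width fields.
        slices = [slice(i, i + delimiter)
                  for i in range(0, len(line), delimiter)]
        return [line[s].strip() for s in slices]
    if hasattr(delimiter, '__iter__') and not isinstance(delimiter, str):
        # Variable-width fields: cumulative sum of the widths.
        idx = [0]
        for w in delimiter:
            idx.append(idx[-1] + w)
        slices = [slice(i, j) for (i, j) in zip(idx[:-1], idx[1:])]
        return [line[s].strip() for s in slices]
    if isinstance(delimiter, str):
        # A blank delimiter falls back to whitespace splitting.
        delimiter = delimiter.strip() or None
    return [s.strip() for s in line.split(delimiter)]

split_line("1,2,3,4,,5", ',')   # -> ['1', '2', '3', '4', '', '5']
```

Note the detail debated earlier in the thread: in this version a ' ' delimiter is normalized to whitespace splitting, so empty fields are collapsed.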
Re: [Numpy-discussion] np.loadtxt : yet a new implementation...
On Mon, Dec 1, 2008 at 12:21 PM, Pierre GM [EMAIL PROTECTED] wrote:
> Well, looks like the attachment is too big, so here's the implementation.
> The tests will come in another message.

It looks like I am doing something wrong -- trying to parse a CSV file with dates formatted like '2008-10-14', with::

    import datetime, sys
    import dateutil.parser
    StringConverter.upgrade_mapper(dateutil.parser.parse,
                                   default=datetime.date(1900, 1, 1))
    r = loadtxt(sys.argv[1], delimiter=',', names=True)
    print r.dtype

I get the following::

    Traceback (most recent call last):
      File "genload_proposal.py", line 734, in ?
        r = loadtxt(sys.argv[1], delimiter=',', names=True)
      File "genload_proposal.py", line 711, in loadtxt
        (output, _) = genloadtxt(fname, **kwargs)
      File "genload_proposal.py", line 646, in genloadtxt
        rows[i] = tuple([conv(val) for (conv, val) in zip(converters, vals)])
      File "genload_proposal.py", line 385, in __call__
        raise ValueError("Cannot convert string '%s'" % value)
    ValueError: Cannot convert string '2008-10-14'

In debug mode, I see the following where the error occurs::

    ipdb> vals
    ('2008-10-14', '116.26', '116.40', '103.14', '104.08', '70749800', '104.08')
    ipdb> converters
    [<__main__.StringConverter instance at 0xa35fa6c>,
     <__main__.StringConverter instance at 0xa35ff2c>,
     <__main__.StringConverter instance at 0xa35ff8c>,
     <__main__.StringConverter instance at 0xa35ffec>,
     <__main__.StringConverter instance at 0xa15406c>,
     <__main__.StringConverter instance at 0xa1540cc>,
     <__main__.StringConverter instance at 0xa15412c>]

It looks like my registry of a custom converter isn't working.
Here is what the _mapper looks like::

    In [23]: StringConverter._mapper
    Out[23]:
    [(<type 'numpy.bool_'>, <function str2bool at 0xa2b8bc4>, None),
     (<type 'numpy.integer'>, <type 'int'>, -1),
     (<type 'numpy.floating'>, <type 'float'>, -NaN),
     (<type 'complex'>, <type 'complex'>, (-NaN+0j)),
     (<type 'numpy.object_'>, <function parse at 0x8cf1534>, datetime.date(1900, 1, 1)),
     (<type 'numpy.string_'>, <type 'str'>, '???')]
Re: [Numpy-discussion] np.loadtxt : yet a new implementation...
Hi Pierre

2008/12/1 Pierre GM [EMAIL PROTECTED]:
> * `genloadtxt` is the base function that makes all the work. It outputs 2
> arrays, one for the data (missing values being substituted by the
> appropriate default) and one for the mask. It would go in np.lib.io

I see the code length increased from 200 lines to 800. This made me wonder about the execution time: initial benchmarks suggest a 3x slow-down. Could this be a problem for loading large text files? If so, should we consider keeping both versions around, or bypassing all the extra hooks by default?

Regards
Stéfan
Re: [Numpy-discussion] np.loadtxt : yet a new implementation...
Stéfan van der Walt wrote:
> I see the code length increased from 200 lines to 800. This made me wonder
> about the execution time: initial benchmarks suggest a 3x slow-down. Could
> this be a problem for loading large text files? If so, should we consider
> keeping both versions around, or by default bypassing all the extra hooks?

I've wondered about this being an issue. On one hand, you hate to make existing code noticeably slower. On the other hand, if speed is important to you, why are you using ascii I/O?

I personally am not entirely against having two versions of loadtxt-like functions. However, the idea seems a little odd, seeing as how loadtxt was already supposed to be the swiss army knife of text reading.

I'm seeing a similar slowdown with Pierre's version of the code. The version of loadtxt that I cobbled together with the StringConverter class (and no missing value support) shows about a 50% slowdown, so clearly there's a performance penalty for trying to make a generic function that can be all things to all people. On the other hand, this approach reduces code duplication.

I'm not really opinionated on what the right approach is here. My only opinion is that this functionality *really* needs to be in numpy in some fashion. For my own use case, with the old version, I could read a text file and by hand separate out columns and mask values. Now, I open a file and get a structured array with an automatically detected dtype (names and types!) plus masked values.

My $0.02.

Ryan

--
Ryan May
Graduate Research Assistant
School of Meteorology
University of Oklahoma
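(For anyone wanting to reproduce this kind of comparison today, here is a hypothetical micro-benchmark along the same lines, using the modern np.loadtxt / np.genfromtxt pair; the data size, repeat count, and names are arbitrary, and the ratio will vary by machine and NumPy version:)

```python
import timeit
from io import StringIO

import numpy as np

# A small comma-separated data set, rebuilt as a fresh stream per call
# so each read starts at the beginning of the "file".
rows = "\n".join("1.0,2.0,3.0,4.0" for _ in range(1000))

def bench(reader):
    return timeit.timeit(lambda: reader(StringIO(rows), delimiter=','),
                         number=5)

t_plain = bench(np.loadtxt)      # the simple, fast path
t_generic = bench(np.genfromtxt) # the generic path with all the hooks
print(t_generic / t_plain)       # typically > 1: the extra hooks cost time
```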
Re: [Numpy-discussion] np.loadtxt : yet a new implementation...
2008/12/1 Ryan May [EMAIL PROTECTED]:
> I've wondered about this being an issue. On one hand, you hate to make
> existing code noticeably slower. On the other hand, if speed is important
> to you, why are you using ascii I/O?

More I than O! But I think numpy.fromfile, once fixed up, could fill this niche nicely.

> I personally am not entirely against having two versions of loadtxt-like
> functions. However, the idea seems a little odd, seeing as how loadtxt was
> already supposed to be the swiss army knife of text reading.

I haven't investigated the code in too much detail, but wouldn't it be possible to implement the current set of functionality in a base-class, which is then specialised to add the rest? That way, one could always instantiate TextReader yourself for some added speed.

> I'm not really opinionated on what the right approach is here. My only
> opinion is that this functionality *really* needs to be in numpy in some
> fashion. For my own use case, with the old version, I could read a text
> file and by hand separate out columns and mask values. Now, I open a file
> and get a structured array with an automatically detected dtype (names and
> types!) plus masked values.

That's neat!

Cheers
Stéfan
Re: [Numpy-discussion] np.loadtxt : yet a new implementation...
I agree, genloadtxt is a bit bloated, and it's not a surprise it's slower than the initial one. I think that in order to be fair, comparisons must be performed with matplotlib.mlab.csv2rec, which also implements auto-detection of the dtype. I'm quite in favor of keeping a lite version around.

On Dec 1, 2008, at 4:47 PM, Stéfan van der Walt wrote:
> I haven't investigated the code in too much detail, but wouldn't it be
> possible to implement the current set of functionality in a base-class,
> which is then specialised to add the rest? That way, one could always
> instantiate TextReader yourself for some added speed.

Well, one of the issues is that we need to keep the function compatible w/ urllib.urlretrieve (Ryan, am I right?), which means not being able to go back to the beginning of a file (no call to .seek).

Another issue comes from the possibility to define the dtype automatically: you need to keep track of the converters, then have to do a second loop on the data. Those converters are likely the bottleneck, as you need to check whether each value can be interpreted as missing or not and respond appropriately.

I thought about creating a base class, with a specific subclass taking care of the missing values. I found out it would have duplicated a lot of code. In any case, I think that's secondary: we can always optimize pieces of the code afterwards. I'd like more feedback on corner cases and usage...
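(The two-loop scheme Pierre describes — a first pass that upgrades one converter per column while caching the split rows, then a second pass that applies the settled converters — can be sketched as follows. This is a deliberately simplified illustration, with no missing-value handling and an ad-hoc int -> float -> str ladder:)

```python
def autodetect_read(lines, delimiter=','):
    """Two-pass read: detect one converter per column, then convert."""
    rows = [line.strip().split(delimiter) for line in lines if line.strip()]
    ncols = len(rows[0])
    converters = [int] * ncols
    ladder = {int: float, float: str}   # escalation order
    # First loop: upgrade each column's converter until it handles
    # every value seen in that column.
    for row in rows:
        for j, val in enumerate(row):
            while True:
                try:
                    converters[j](val)
                    break
                except ValueError:
                    converters[j] = ladder[converters[j]]
    # Second loop: apply the final converters to the cached rows.
    return [tuple(conv(val) for (conv, val) in zip(converters, row))
            for row in rows]

autodetect_read(["1,2.5,a", "3,4,b"])  # -> [(1, 2.5, 'a'), (3, 4.0, 'b')]
```

Caching the rows is what makes this work on non-seekable sources such as the streams urllib.urlretrieve hands back; the cost is the second loop Pierre identifies.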
Re: [Numpy-discussion] np.loadtxt : yet a new implementation...
Stéfan van der Walt wrote:
>> important to you, why are you using ascii I/O?

ascii I/O is slow, so that's a reason in itself to want it not to be slower!

> More I than O! But I think numpy.fromfile, once fixed up, could fill this
> niche nicely.

I agree -- for the simple cases, fromfile() could work very well -- perhaps it could even be used to speed up some special cases of loadtxt. But is anyone working on fromfile()?

By the way, I think overloading fromfile() for text files is a bit misleading for users -- I propose we have a fromtextfile() or something instead.

-Chris

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/ORR            (206) 526-6959   voice
7600 Sand Point Way NE  (206) 526-6329   fax
Seattle, WA  98115      (206) 526-6317   main reception

[EMAIL PROTECTED]
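(As a point of reference for the fromfile() discussion: np.fromfile's text mode — a non-empty `sep` argument — already covers the flat, single-dtype case, with none of loadtxt's column handling. A small sketch, using a throwaway temporary file:)

```python
import os
import tempfile

import numpy as np

# Write a flat whitespace-separated stream of floats to a temp file.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as f:
    f.write("1.5 2.5 3.5 4.5")
    path = f.name

# Text mode: sep=' ' reads numbers separated by whitespace. There are no
# names, per-column dtypes, comments, or missing values -- just one dtype.
arr = np.fromfile(path, dtype=float, sep=' ')
os.remove(path)
```

This is why fromfile's text mode can only ever be a fast path for the simplest loadtxt cases, not a replacement.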
Re: [Numpy-discussion] np.loadtxt : yet a new implementation...
Pierre GM wrote:
> Another issue comes from the possibility to define the dtype automatically:

Does all that get bypassed if the dtype(s) is specified? Is it still slow in that case?

-Chris

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/ORR            (206) 526-6959   voice
7600 Sand Point Way NE  (206) 526-6329   fax
Seattle, WA  98115      (206) 526-6317   main reception

[EMAIL PROTECTED]