Re: [Numpy-discussion] More loadtxt() changes
Pierre GM wrote:
Well, if you have a non-space delimiter, it doesn't matter if the fields have a fixed length or not, does it? Each field is stripped anyway.

Yes. It would already be _very_ helpful (without changing loadtxt too much) if the current implementation used a converter like this

    def fval(val):
        try:
            return float(val)
        except ValueError:
            return numpy.nan

instead of float(val) by default.

mm

___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion
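For context, a converter like the one Manuel sketches can already be wired into np.loadtxt through the converters keyword; here is a minimal sketch (the sample data and the choice of applying fval to every column are illustrative, not from the thread):

```python
import numpy as np
from io import StringIO  # the 2008 thread used `from StringIO import StringIO`

def fval(val):
    # Fall back to NaN instead of raising on empty or unparseable fields
    try:
        return float(val)
    except ValueError:
        return np.nan

data = StringIO("1|  123.4| -123.4|  00.0\n"
                "2|       |  234.7|  12.2")
arr = np.loadtxt(data, delimiter='|',
                 converters={i: fval for i in range(4)})
# the blank second field of row 2 comes back as NaN
```

The converter runs on each raw field, so padding spaces are harmless: float(" 123.4") parses fine, while a blank field fails and becomes NaN.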
Re: [Numpy-discussion] More loadtxt() changes
Manuel,
Give me the week-end to come up with something. What you want is already doable with the current implementation of np.loadtxt, through the converters keyword. Support for missing data will be covered in a separate function, most likely to be put in numpy.ma.io eventually.

On Nov 28, 2008, at 5:42 AM, Manuel Metz wrote:
It would already be _very_ helpful (without changing loadtxt too much) if the current implementation used a converter that falls back to numpy.nan instead of float(val) by default.
Re: [Numpy-discussion] More loadtxt() changes
Pierre GM wrote:
* That said, there should be a way to deal with fixed-length fields, probably by taking consecutive slices of the initial string. That way, we should be able to keep track of missing data...

Certainly, yes! Dealing with fixed-length fields would be necessary. The case I had in mind had both -- a separator (|) __and__ fixed-length fields -- and is probably very special in that sense. But such data files exist out there...

mm
Re: [Numpy-discussion] More loadtxt() changes
On Nov 27, 2008, at 3:08 AM, Manuel Metz wrote:
Certainly, yes! Dealing with fixed-length fields would be necessary. The case I had in mind had both -- a separator (|) __and__ fixed-length fields -- and is probably very special in that sense. But such data files exist out there...

Well, if you have a non-space delimiter, it doesn't matter whether the fields have a fixed length or not, does it? Each field is stripped anyway. The real issue is when the delimiter is ' '... I should be able to take care of that over the week-end (which started earlier today over here :)
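Pierre's point can be seen with plain string operations: with an explicit delimiter, fixed-width padding is harmless because each field can be stripped after the split, and a blank fixed-width field still shows up as an empty string (the sample row below is invented):

```python
row = " 2|       |  234.7|  12.2"
fields = [f.strip() for f in row.split('|')]
# fields[1] is '' -- the missing value is still visible after stripping
```

With a whitespace delimiter, by contrast, the split itself swallows the missing field, which is exactly the tricky case Pierre describes.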
Re: [Numpy-discussion] More loadtxt() changes
On Thu, 27 Nov 2008 09:08:41 +0100 Manuel Metz [EMAIL PROTECTED] wrote:
The case I had in mind had both -- a separator (|) __and__ fixed-length fields -- and is probably very special in that sense. But such data files exist out there...

See pages 9-10 (Bulk data input deck):
http://www.zonatech.com/Documentation/zndalusersmanual2.0.pdf

Nils
Re: [Numpy-discussion] More loadtxt() changes
Ryan May wrote:
I'm unsure of how best to pass in both what values indicate missing values and what values to fill in their place. I'd love suggestions.

Hi Ryan,
this would be a great feature to have!!!

One question: I have a datafile in ASCII format that uses a fixed width for each column. If no data is present, the space is left empty (see second row). What is the default behavior of the StringConverter class in this case? Does it ignore the empty entry by default? If so, what is the value in the array in this case? Is it nan?

Example file:

    1|  123.4| -123.4|  00.0
    2|       |  234.7|  12.2

Manuel
Re: [Numpy-discussion] More loadtxt() changes
On Tue, Nov 25, 2008 at 11:23 PM, Ryan May [EMAIL PROTECTED] wrote:
Updated patch attached. This includes:
* Updated docstring
* New tests
* Fixes for previous issues
* Fixes to make new tests actually work
I appreciate any and all feedback.

I'm having trouble applying your patch, so I haven't tested yet, but do you (and do you want to) handle a case like this::

    from StringIO import StringIO
    import matplotlib.mlab as mlab
    f1 = StringIO("""\
    name  age  weight
    John   23  145.
    Harry  43  180.""")
    for line in f1:
        print line.split(' ')

I.e., space delimited but using an irregular number of spaces? One place this comes up a lot is when the output files are actually fixed-width, using spaces to line up the columns. One could count the columns to figure out the fixed widths and work with that, but it is much easier to simply assume space delimiting and handle the irregular number of spaces, treating one or more spaces as the delimiter. In csv2rec, we write a custom file object to handle this case. Apologies if you are already handling this and I missed it...

JDH
Re: [Numpy-discussion] More loadtxt() changes
John Hunter wrote:
Ie, space delimited but using an irregular number of spaces?

I think line.split(None) handles this case, so *in theory* passing delimiter=None would do it. I *am* interested in this case, so I'll have to give it a try when I get a chance. (I sense this is the same case as Manuel just asked about.)

Ryan

--
Ryan May
Graduate Research Assistant
School of Meteorology
University of Oklahoma
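Ryan's suggestion rests on str.split's behavior with a None argument, which collapses any run of whitespace into a single delimiter; compare the two splits on a toy line (invented here):

```python
line = "John   23   145."
naive = line.split(' ')   # keeps an empty string for every extra space
clean = line.split(None)  # same as line.split(): split on whitespace runs
```

This is why delimiter=None handles John's "irregular number of spaces" case without any column counting.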
Re: [Numpy-discussion] More loadtxt() changes
Manuel Metz wrote:
Hi Ryan, this would be a great feature to have!!! One question: I have a datafile in ASCII format that uses a fixed width for each column. If no data is present, the space is left empty. What is the default behavior of the StringConverter class in this case?

Thanks for the support! I don't think this is so much anything to do with StringConverter, but more to do with how to split lines. Maybe we should add an option that, instead of simply specifying characters that delimit the fields, allows one to pass a custom function to split lines. That could either be done by overriding `delimiter` or by adding a new option like `splitter`. I'll have to give that some thought.

Ryan
Re: [Numpy-discussion] More loadtxt() changes
On Nov 26, 2008, at 5:55 PM, Ryan May wrote:
I'm unsure of how best to pass in both what values indicate missing values and what values to fill in their place. I'd love suggestions.

About missing values:

* I don't think missing values should be supported in np.loadtxt. That should go into a specific np.ma.io.loadtxt function, a preview of which I posted earlier. I'll modify it taking Ryan's new function into account, and Christopher's suggestion (defining a dictionary {column name : missing values}).

* StringConverter already defines some default filling values for each dtype. In np.ma.io.loadtxt, these values can be overwritten. Note that you should also be able to define a filling value by specifying a converter (think float(x or 0), for example).

* Missing values in space-separated fields are very tricky to handle: take a line like "a,,,d". With a comma as separator, it's clear that the 2nd and 3rd fields are missing. Now, imagine that the commas are actually spaces ("a   d"): 'd' is now seen as the 2nd field of a 2-field record, not as the 4th field of a 4-field record with 2 missing values. I thought about it, and kicked it into touch.

* That said, there should be a way to deal with fixed-length fields, probably by taking consecutive slices of the initial string. That way, we should be able to keep track of missing data...
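Pierre's last bullet -- consecutive slices of the line -- could look something like the sketch below; the widths tuple and function name are made-up illustrations, not an API that existed:

```python
def fixed_width_splitter(line, widths):
    # Slice the line into fixed-width fields and strip the padding;
    # a blank field survives as '' so missing data stays visible.
    fields, start = [], 0
    for w in widths:
        fields.append(line[start:start + w].strip())
        start += w
    return fields

parts = fixed_width_splitter("2     234.7   12.2", (3, 8, 7))
```

Unlike splitting on whitespace, slicing never loses track of which column a value came from, which is exactly what is needed for the "a   d" ambiguity Pierre describes.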
[Numpy-discussion] More loadtxt() changes
Hi,

I have a couple more changes to loadtxt() that I'd like to code up in time for 1.3, but I thought I should run them by the list before doing too much work. These are already implemented in some fashion in matplotlib.mlab.csv2rec(), but the code bases are different enough that pretty much only the idea can be lifted. All of these changes would be done in a manner that is backwards compatible with the current API.

1) Support for setting the names of fields in the returned structured array without using dtype. This can be a passed-in list of names, or the names can be read from the first line of the file. Many files have a header line that gives a name for each column. Adding this would obviously make loadtxt much more general and allow for more generic code, IMO. My current thinking is to add a *names* keyword parameter that defaults to None, for no support for reading names. Setting it to True would tell loadtxt() to read the names from the first line (after skiprows). The other option would be to set names to a list of strings.

2) Support for automatic dtype inference. Instead of assuming all values are floats, this would try a list of options until one worked. For strings, this would keep track of the longest string within a given field before setting the dtype. This would allow reading of files containing a mixture of types much more easily, without having to go to the trouble of constructing a full dtype by hand. This would work alongside any custom converters one passes in. My current thinking of API would just be to add the option of passing the string 'auto' as the dtype parameter.

3) Better support for missing values. The docstring mentions a way of handling missing values by passing in a converter. The problem with this is that you have to pass in a converter for *every column* that will contain missing values. If you have a text file with 50 columns, writing this dictionary of converters seems like ugly and needless boilerplate. I'm unsure of how best to pass in both what values indicate missing values and what values to fill in their place. I'd love suggestions.

Here's an example of my use case (without 50 columns):

    ID,First Name,Last Name,Homework1,Homework2,Quiz1,Homework3,Final
    1234,Joe,Smith,85,90,,76,
    5678,Jane,Doe,65,99,,78,
    9123,Joe,Plumber,45,90,,92,

Currently reading in this file requires a bit of boilerplate (declaring dtypes, converters). While it's nothing I can't write, it still would be easier to write it once within loadtxt and have it for everyone.

Any support for *any* of these ideas? Any suggestions on how the user should pass in the information?

Thanks,
Ryan
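For what it's worth, the machinery this thread converged on eventually shipped as np.genfromtxt, which handles Ryan's gradebook-style file in one call -- header names read and sanitized, dtypes inferred per column, missing entries handled. The parameter choices below are illustrative:

```python
import numpy as np
from io import StringIO

data = StringIO(
    "ID,First Name,Last Name,Homework1,Homework2,Quiz1,Homework3,Final\n"
    "1234,Joe,Smith,85,90,,76,\n"
    "5678,Jane,Doe,65,99,,78,\n"
    "9123,Joe,Plumber,45,90,,92,\n"
)
tab = np.genfromtxt(data, delimiter=',', names=True, dtype=None,
                    encoding='utf-8')
# Header names are sanitized for field access: 'First Name' -> 'First_Name'
```

Fields then come back by name, e.g. tab['Homework1'], with no hand-written dtype or converter dictionary.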
Re: [Numpy-discussion] More loadtxt() changes
Ryan,
FYI, I've been coding over the last couple of weeks an extension of loadtxt for a better support of masked data, with the option to read column names in a header. Please find an example below (I also have unittests). Most of the work is actually inspired from matplotlib's mlab.csv2rec. It might be worth not duplicating efforts.
Cheers,
P.

    """:mod:`_preview`

    A collection of utilities from incoming versions of numpy.ma
    """
    import itertools
    import numpy as np
    import numpy.ma as ma

    _string_like = np.lib.io._string_like

    def _to_filehandle(fname, flag='r', return_opened=False):
        """
        Returns the filehandle corresponding to a string or a file.
        If the string ends in '.gz', the file is automatically unzipped.

        Parameters
        ----------
        fname : string, filehandle
            Name of the file whose filehandle must be returned.
        flag : string, optional
            Flag indicating the status of the file ('r' for read, 'w' for write).
        return_opened : boolean, optional
            Whether to return the opening status of the file.
        """
        if _string_like(fname):
            if fname.endswith('.gz'):
                import gzip
                fhd = gzip.open(fname, flag)
            else:
                fhd = file(fname, flag)
            opened = True
        elif hasattr(fname, 'seek'):
            fhd = fname
            opened = False
        else:
            raise ValueError('fname must be a string or file handle')
        if return_opened:
            return fhd, opened
        return fhd

    def flatten_dtype(dtp):
        """Unpack a structured data-type."""
        if dtp.names is None:
            return [dtp]
        else:
            types = []
            for field in dtp.names:
                (typ, _) = dtp.fields[field]
                flat_dt = flatten_dtype(typ)
                types.extend(flat_dt)
            return types

    class LineReader:
        """
        File reader that automatically splits each line.
        This reader behaves like an iterator.

        Parameters
        ----------
        fhd : filehandle
            File handle of the underlying file.
        comment : string, optional
            The character used to indicate the start of a comment.
        delimiter : string, optional
            The string used to separate values.  By default, this is any
            whitespace.
        """
        #
        def __init__(self, fhd, comment='#', delimiter=None):
            self.fh = fhd
            self.comment = comment
            self.delimiter = delimiter
            if delimiter == ' ':
                self.delimiter = None
        #
        def close(self):
            """Close the current reader."""
            self.fh.close()
        #
        def seek(self, arg):
            """Moves to a new position in the file.

            See Also
            --------
            file.seek
            """
            self.fh.seek(arg)
        #
        def splitter(self, line):
            """Splits the line at each current delimiter.
            Comments are stripped beforehand."""
            line = line.split(self.comment)[0].strip()
            delimiter = self.delimiter
            if line:
                return line.split(delimiter)
            else:
                return []
        #
        def next(self):
            """Moves to the next line or raises :exc:`StopIteration`."""
            return self.splitter(self.fh.next())
        #
        def __iter__(self):
            for line in self.fh:
                yield self.splitter(line)

        def readline(self):
            """Returns the next line of the file, split at the delimiter
            and stripped of comments."""
            return self.splitter(self.fh.readline())

        def skiprows(self, nbrows=1):
            """Skips `nbrows` rows of the file."""
            for i in range(nbrows):
                self.fh.readline()

        def get_first_valid_row(self):
            """Returns the values in the first valid (uncommented and
            non-empty) line of the file."""
            first_values = None
            while not first_values:
                first_line = self.fh.readline()
                if first_line == '':
                    # EOF reached
                    raise IOError('End-of-file reached before encountering data.')
                first_values = self.splitter(first_line)
            return first_values

    itemdictionary = {'return': 'return_',
                      'file': 'file_',
                      'print': 'print_'}

    def process_header(headers):
        """
        Validates a list of strings to use as field names.
        The strings are stripped of any non-alphanumeric character, and
        spaces are replaced by '_'.
        """
        # Define the characters to delete from the headers
        # (this set was mangled by the list archive's address redaction;
        # the punctuation below follows the matching csv2rec code)
        delete = set("""~!@#$%^&*()-=+~\|]}[{';: /?.,""")
        delete.add('"')
        names = []
        seen = dict()
        for i, item in enumerate(headers):
            item = item.strip().lower().replace(' ', '_')
            item = ''.join([c for c in item if c not in delete])
            if not len(item):
                item = 'column%d' % i
            item = itemdictionary.get(item, item)
            cnt = seen.get(item, 0)
            if cnt > 0:
                names.append(item + '_%d' % cnt)
            else:
                names.append(item)
            seen[item] = cnt + 1
        return names
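The header-validation logic in Pierre's process_header boils down to: normalize case and spaces, strip punctuation, avoid Python keywords, and deduplicate with numeric suffixes. A compact, self-contained restatement of the same idea (regex-based; not Pierre's actual code):

```python
import re

def sanitize_names(headers):
    # lowercase, spaces -> '_', drop non-word chars, dedupe with suffixes
    names, seen = [], {}
    for i, header in enumerate(headers):
        name = re.sub(r'\W', '', header.strip().lower().replace(' ', '_'))
        if not name:
            name = 'column%d' % i   # fall back for all-punctuation headers
        count = seen.get(name, 0)
        names.append(name if count == 0 else '%s_%d' % (name, count))
        seen[name] = count + 1
    return names
```

The deduplication step matters for real files: two columns both labelled "First Name" must not collapse into one field of the structured array.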
Re: [Numpy-discussion] More loadtxt() changes
Pierre GM wrote:
FYI, I've been coding over the last couple of weeks an extension of loadtxt for a better support of masked data, with the option to read column names in a header. Please find an example below.

Great, thanks! This could be very useful to me. Two comments:

missing : string, optional
    A string representing a missing value, irrespective of the column
    where it appears (e.g., ``'missing'`` or ``'unused'``).

It might be nice if missing could be a sequence of strings, for when there is more than one missing-value marker that is not clearly mapped to a particular field.

missing_values : {None, dictionary}, optional
    A dictionary mapping a column number to a string indicating whether
    the corresponding field should be masked.

Would it be possible to specify a column header, rather than a number, here?

-Chris

--
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/ORR            (206) 526-6959 voice
7600 Sand Point Way NE  (206) 526-6329 fax
Seattle, WA 98115       (206) 526-6317 main reception
[EMAIL PROTECTED]
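Chris's first request -- several missing-value strings not tied to any particular column -- is easy to emulate with a single converter; the token list and function name below are hypothetical:

```python
import numpy as np

MISSING = ('missing', 'unused', 'n/a', '')

def fmissing(val, missing=MISSING):
    # Treat any of the listed tokens (or a blank field) as NaN
    v = val.strip().lower()
    if v in missing:
        return np.nan
    return float(v)
```

Applied to every column, this gives column-independent missing-value handling without a per-column dictionary.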
Re: [Numpy-discussion] More loadtxt() changes
On Nov 25, 2008, at 12:30 PM, Christopher Barker wrote:
It might be nice if missing could be a sequence of strings, for when there is more than one missing-value marker that is not clearly mapped to a particular field.

OK, easy enough.

Christopher Barker wrote:
Would it be possible to specify a column header, rather than a number, here?

A la mlab.csv2rec? It could work with a bit more tweaking, basically following John Hunter's et al. path. What happens when the column names are unknown (read from the header) or wrong? Actually, I'd like John to comment on that, hence the CC. More generally, wouldn't it be useful to push the recarray-manipulating functions from matplotlib.mlab to numpy?
Re: [Numpy-discussion] More loadtxt() changes
Pierre GM wrote:
A la mlab.csv2rec?

I'll have to take a look at that.

Pierre GM wrote:
What happens when the column names are unknown (read from the header) or wrong?

Well, my use case is that I don't know column numbers, but I do know column headers, and what missing value is associated with a given header. You have to know something! If the header is wrong, you get an error, though we may need to decide what "wrong" means. In my case, I'm dealing with data that has pre-specified headers (and, I think, missing values that go with them), but in any given file I don't know which of those columns is there. I want to read it in, and be able to query the result for what data it has.

Pierre GM wrote:
Actually, I'd like John to comment on that, hence the CC.

I don't see a CC, but yes, it would be nice to get his input.

Pierre GM wrote:
More generally, wouldn't it be useful to push the recarray-manipulating functions from matplotlib.mlab to numpy?

I think so -- or scipy. I'd really like MPL to be about plotting, and only plotting.

-Chris
Re: [Numpy-discussion] More loadtxt() changes
Pierre GM wrote:
FYI, I've been coding over the last couple of weeks an extension of loadtxt for a better support of masked data, with the option to read column names in a header. It might be worth not duplicating efforts.

Absolutely! Definitely don't want to duplicate effort here. What I see here meets a lot of what I was looking for. Here are some questions:

1) It looks like the function returns a structured array rather than a rec array, so that fields are obtained by doing a dictionary access. Since it's a dictionary access, is there any reason that the header needs to be munged to replace characters and reserved names? IIUC, csv2rec changes names because it returns a rec array, which uses attribute lookup and hence all names need to be valid Python identifiers. This is not the case for a structured array.

2) Can we avoid the use of seek() in here? I just posted a patch to change the check to readline, which was the only file function used previously. This allowed the direct use of a file-like object returned by urllib2.urlopen().

3) In order to avoid breaking backwards compatibility, can we change the default for dtype to be float32, and instead use some kind of special value ('auto'?) for the automatic dtype determination?

I'm currently cooking up some of these changes myself, but thought I would see what you thought first.

Ryan
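The 'auto' dtype idea in Ryan's third question amounts to trying candidate types per column, from most to least specific, and falling back to string. A toy version of that inference step (not the actual StringConverter implementation):

```python
def infer_type(values):
    # Try candidate casts in order of specificity; fall back to str.
    # A single value that fails a cast demotes the whole column.
    for cast in (int, float):
        try:
            for v in values:
                cast(v)
            return cast
        except ValueError:
            continue
    return str
```

This is also why auto-detection needs a second pass (or a rewind, per the seek() discussion): the right dtype for a column isn't known until every value has been seen.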
Re: [Numpy-discussion] More loadtxt() changes
On Nov 25, 2008, at 2:06 PM, Ryan May wrote:
1) It looks like the function returns a structured array rather than a rec array, so that fields are obtained by doing a dictionary access. Since it's a dictionary access, is there any reason that the header needs to be munged to replace characters and reserved names?

Personally, I prefer flexible ndarrays to recarrays, hence the output. However, I still think that names should be as clean as possible to avoid bad surprises down the road.

Ryan May wrote:
2) Can we avoid the use of seek() in here?

I coded that a couple of weeks ago, before you posted your patch, and I didn't have time to check it. Yes, we could try getting rid of seek. However, we need to find a way to rewind to the beginning of the file if the dtypes are not given in input (as we parse the whole file to find the best converter in that case).

Ryan May wrote:
3) In order to avoid breaking backwards compatibility, can we change the default for dtype to be float32?

I'm not especially concerned with backwards compatibility, because we're supporting masked values (something that np.loadtxt shouldn't have to worry about). Initially, I needed a replacement for the fromfile function in the scikits.timeseries.trecords package. I figured it'd be easier and more portable to write a function for generic masked arrays that could be adapted afterwards to timeseries. In any case, I was considering the functions I sent you to be part of some numpy.ma.io module rather than a replacement for np.loadtxt. I tried to get the syntax as close as possible to np.loadtxt and mlab.csv2rec, but there'll always be some differences. So, yes, we could try to use a default dtype=float, and yes, we could have an extra parameter 'auto'. But is it really that useful? I'm not sure (well, no, I'm sure it's not...)
Re: [Numpy-discussion] More loadtxt() changes
On Tue, Nov 25, 2008 at 12:16 PM, Pierre GM [EMAIL PROTECTED] wrote: A la mlab.csv2rec ? It could work with a bit more tweaking, basically following John Hunter's et al. path. What happens when the column names are unknown (read from the header) or wrong ? Actually, I'd like John to comment on that, hence the CC. More generally, wouldn't be useful to push the recarray manipulating functions from matplotlib.mlab to numpy ? Yes, I've said on a number of occasions I'd like to see these functions in numpy, since a number of them make more sense as numpy methods than as stand alone functions. What happens when the column names are unknown (read from the header) or wrong ? I'm not quite sure what you are looking for here. Either the user will have to know the correct column name or the column number or you should raise an error. I think supporting column names everywhere they make sense is critical since this is how most people think about these CSV-like files with column headers. One other thing that is essential for me is that date support is included. Virtually every CSV file I work with has date data in it, in a variety of formats, and I depend on csv2rec (via dateutil.parser.parse which mpl ships) to be able to handle it w/o any extra cognitive overhead, albeit at the expense of some performance overhead, but my files aren't too big. I'm not sure how numpy would handle the date parsing aspect, but this came up in the date datatype PEP discussion I think. For me, having to manually specify a date converter with the proper format string every time I load a CSV file is probably not viable. Another feature that is critical to me is to be able to get a np.recarray back instead of a record array. I use these all day long, and the convenience of r.date over r['date'] is too much for me to give up. 
Feel free to ignore these suggestions if they are too burdensome or not appropriate for numpy -- I'm just letting you know some of the things I need to see before I personally would stop using mlab.csv2rec and use numpy.loadtxt instead.

One last thing, I consider the masked array support in csv2rec somewhat broken, because when using a masked array you cannot get at the data (eg datetime methods or string methods) directly using the same interface that regular recarrays use. Pierre, last I brought this up you asked for some example code and indicated a willingness to work on it, but I fell behind and never posted it. The code illustrating the problem is below. I'm really not sure what the right solution is, but the current implementation -- sometimes returning a plain-vanilla rec array, sometimes returning a masked record array -- with different interfaces is not good. Perhaps the best solution is to force the user to ask for masked support, and then always return a masked array whether any of the data is masked or not. csv2rec conditionally returns a masked array only if some of the data are masked, which makes it difficult to use.

JDH

Here is the problem I referred to above -- in f1 none of the rows are masked and so I can access the object attributes from the rows directly. In the 2nd example, row 3 has some missing data so I get an mrecords recarray back, which does not allow me to directly access the valid data methods.

from StringIO import StringIO
import matplotlib.mlab as mlab

f1 = StringIO("""\
date,name,age,weight
2008-10-12,'Bill',22,125.
2008-10-13,'Tom',23,135.
2008-10-14,'Sally',23,145.
""")
r1 = mlab.csv2rec(f1)
row0 = r1[0]
print row0.date.year, row0.name.upper()

f2 = StringIO("""\
date,name,age,weight
2008-10-12,'Bill',22,125.
2008-10-13,'Tom',23,135.
2008-10-14,'',,145.
""")
r2 = mlab.csv2rec(f2)
row0 = r2[0]
print row0.date.year, row0.name.upper()
Re: [Numpy-discussion] More loadtxt() changes
Pierre GM wrote: On Nov 25, 2008, at 2:06 PM, Ryan May wrote: 1) It looks like the function returns a structured array rather than a rec array, so that fields are obtained by doing a dictionary access. Since it's a dictionary access, is there any reason that the header needs to be munged to replace characters and reserved names? IIUC, csv2rec changes names b/c it returns a rec array, which uses attribute lookup and hence all names need to be valid python identifiers. This is not the case for a structured array. Personally, I prefer flexible ndarrays to recarrays, hence the output. However, I still think that names should be as clean as possible to avoid bad surprises down the road.

Ok, I'm not really partial to this, I just thought it would simplify. Your point is valid.

2) Can we avoid the use of seek() in here? I just posted a patch to change the check to readline, which was the only file function used previously. This allowed the direct use of a file-like object returned by urllib2.urlopen(). I coded that a couple of weeks ago, before you posted your patch, and I didn't have time to check it. Yes, we could try getting rid of seek. However, we need to find a way to rewind to the beginning of the file if the dtypes are not given in input (as we parsed the whole file to find the best converter in that case).

What about doing the parsing and type inference in a loop and holding onto the already split lines? Then loop through the lines with the converters that were finally chosen? In addition to making my usecase work, this has the benefit of not doing the I/O twice.

3) In order to avoid breaking backwards compatibility, can we change the default for dtype to be float32, and instead use some kind of special value ('auto'?) to use the automatic dtype determination? I'm not especially concerned w/ backwards compatibility, because we're supporting masked values (something that np.loadtxt shouldn't have to worry about). Initially, I needed a replacement for the fromfile function in the scikits.timeseries.trecords package. I figured it'd be easier and more portable to get a function for generic masked arrays, that could be adapted afterwards to timeseries. In any case, I was more considering the functions I sent you to be part of some numpy.ma.io module than a replacement for np.loadtxt. I tried to get the syntax as close as possible to np.loadtxt and mlab.csv2rec, but there'll always be some differences. So, yes, we could try to use a default dtype=float and yes, we could have an extra parameter 'auto'. But is it really that useful? I'm not sure (well, no, I'm sure it's not...)

I understand you're not concerned with backwards compatibility, but with the exception of missing-value handling, which is probably specific to masked arrays, I was hoping to just add functionality to loadtxt(). Numpy doesn't need a separate text reader for most of this, and breaking API for any of this is likely a non-starter. So while, yes, having float be the default dtype is probably not the most useful, leaving it also doesn't break existing code.

-- Ryan May Graduate Research Assistant School of Meteorology University of Oklahoma
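Ryan's single-pass proposal -- hold on to the split lines, pick the best per-column converter, then convert once from the stored rows -- can be sketched roughly like this (hypothetical illustration, not the actual patch; the real code builds on StringConverter):

```python
def read_once(lines, delimiter=None):
    """Split every line once, infer a converter per column, convert in place."""
    rows = [line.split(delimiter) for line in lines if line.strip()]

    def best_converter(col):
        # Naive "upgrade" path: try int, then float, then fall back to str.
        for conv in (int, float):
            try:
                for row in rows:
                    conv(row[col])
                return conv
            except ValueError:
                continue
        return str

    converters = [best_converter(c) for c in range(len(rows[0]))]
    # Reuse the already-split rows: no second pass over the file itself.
    return [[conv(field) for conv, field in zip(converters, row)]
            for row in rows]
```

As Ryan notes, no new temporaries appear beyond the list of lists that is built anyway before the final array is created; only an extra in-memory loop is added.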
Re: [Numpy-discussion] More loadtxt() changes
On Nov 25, 2008, at 2:26 PM, John Hunter wrote: Yes, I've said on a number of occasions I'd like to see these functions in numpy, since a number of them make more sense as numpy methods than as stand alone functions.

Great. Could we think about getting that in for 1.3.x, would you have time? Or should we wait till early Jan.?

One other thing that is essential for me is that date support is included.

As I mentioned in an earlier post, I needed to get a replacement for a function in scikits.timeseries, where we do need dates, but I also needed something not too specific for numpy.ma. So I thought about extracting the conversion methods from the bulk of the function and creating this new object, StringConverter, that takes care of the conversion. If you need to add date support, the simplest is to extend your StringConverter to take the date/datetime functions just after you import _preview (or numpy.ma.io if we go that path):

dateparser = dateutil.parser.parse
# Update the StringConverter mapper, so that date-like columns are
# automatically converted
_preview.StringConverter.mapper.insert(-1, (dateparser, datetime.date(2000, 1, 1)))

That way, if a date is found in one of the columns, it'll be converted appropriately. Seems to work pretty well for scikits.timeseries; I'll try to post that in the next couple of weeks (once I've ironed out some of the numpy.ma bugs...)

Another feature that is critical to me is to be able to get a np.recarray back instead of a record array. I use these all day long, and the convenience of r.date over r['date'] is too much for me to give up.

No problem: just take a view once you've got your output. I thought about adding yet another parameter that'd take care of that directly, but then we end up with far too many keywords...

One last thing, I consider the masked array support in csv2rec somewhat broken because when using a masked array you cannot get at the data (eg datetime methods or string methods) directly using the same interface that regular recarrays use.

Well, it's more mrecords which is broken. I committed some fixes a little while back, but they might not be very robust. I need to check that w/ your example.

Perhaps the best solution is to force the user to ask for masked support, and then always return a masked array whether any of the data is masked or not. csv2rec conditionally returns a masked array only if some of the data are masked, which makes it difficult to use.

Forcing to a flexible masked array would make quite sense if we pushed that function into numpy.ma.io. I don't think we should overload np.loadtxt too much anyway...

On Nov 25, 2008, at 2:37 PM, Ryan May wrote: What about doing the parsing and type inference in a loop and holding onto the already split lines? Then loop through the lines with the converters that were finally chosen? In addition to making my usecase work, this has the benefit of not doing the I/O twice.

You mean, filling a list and relooping on it if we need to? Sounds like a plan, but doesn't it create some extra temporaries we may not want?

I understand you're not concerned with backwards compatibility, but with the exception of missing-value handling, which is probably specific to masked arrays, I was hoping to just add functionality to loadtxt(). Numpy doesn't need a separate text reader for most of this, and breaking API for any of this is likely a non-starter. So while, yes, having float be the default dtype is probably not the most useful, leaving it also doesn't break existing code.

Depends on how we do it. We could have a modified np.loadtxt that takes some of the ideas of the file I sent you (the StringConverter, for example), then I could have a numpy.ma.io that would take care of the missing data. And something in scikits.timeseries for the dates... The new np.loadtxt could use the defaults of the initial one, or we could create yet another function (np.loadfromtxt) that would match what I was suggesting, and np.loadtxt would be a special stripped-down case with dtype=float by default. Thoughts?
Re: [Numpy-discussion] More loadtxt() changes
On Nov 25, 2008, at 2:37 PM, Ryan May wrote: What about doing the parsing and type inference in a loop and holding onto the already split lines? Then loop through the lines with the converters that were finally chosen? In addition to making my usecase work, this has the benefit of not doing the I/O twice.

You mean, filling a list and relooping on it if we need to? Sounds like a plan, but doesn't it create some extra temporaries we may not want?

It shouldn't create any *extra* temporaries since we already make a list of lists before creating the final array. It just introduces an extra looping step. (I'd reuse the existing list of lists).

Depends on how we do it. We could have a modified np.loadtxt that takes some of the ideas of the file I sent you (the StringConverter, for example), then I could have a numpy.ma.io that would take care of the missing data. And something in scikits.timeseries for the dates... The new np.loadtxt could use the defaults of the initial one, or we could create yet another function (np.loadfromtxt) that would match what I was suggesting, and np.loadtxt would be a special stripped-down case with dtype=float by default. Thoughts?

My personal opinion is that if it doesn't make loadtxt too unwieldy, to just add a few of the options to loadtxt() itself. I'm working on tweaking loadtxt() to add the auto dtype and the names, relying heavily on your StringConverter class (nice code btw.). If my understanding of StringConverter is correct, tweaking the new loadtxt for ma or timeseries would only require passing in modified versions of StringConverter. I'll post that when I'm done and we can see if it looks like too much functionality stapled together or not.

Ryan

-- Ryan May Graduate Research Assistant School of Meteorology University of Oklahoma
Re: [Numpy-discussion] More loadtxt() changes
It shouldn't create any *extra* temporaries since we already make a list of lists before creating the final array. It just introduces an extra looping step. (I'd reuse the existing list of lists).

Cool then, go for it.

If my understanding of StringConverter is correct, tweaking the new loadtxt for ma or timeseries would only require passing in modified versions of StringConverter.

Nope, we still need to double check whether there's any missing data in any field of the line we process, independently of the conversion. So there must be some extra loop involved, and I'd need a special function in numpy.ma to take care of that. So our options are:
* create a new function in numpy.ma and leave np.loadtxt as it is;
* write a new np.loadtxt incorporating most of the ideas of the code I sent, but I'd still need to adapt it to support masked values.

I'll post that when I'm done and we can see if it looks like too much functionality stapled together or not.

Sounds like a plan. Wouldn't mind getting more feedback from fellow users before we get too deep, however...
Re: [Numpy-discussion] More loadtxt() changes
Pierre GM wrote: Nope, we still need to double check whether there's any missing data in any field of the line we process, independently of the conversion. So there must be some extra loop involved, and I'd need a special function in numpy.ma to take care of that. So our options are:
* create a new function in numpy.ma and leave np.loadtxt as it is;
* write a new np.loadtxt incorporating most of the ideas of the code I sent, but I'd still need to adapt it to support masked values.

You couldn't run this loop on the array returned by np.loadtxt() (by masking on the appropriate fill value)?

I'll post that when I'm done and we can see if it looks like too much functionality stapled together or not. Sounds like a plan. Wouldn't mind getting more feedback from fellow users before we get too deep, however...

Agreed. Anyone?

-- Ryan May Graduate Research Assistant School of Meteorology University of Oklahoma
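The post-hoc masking Ryan suggests could look like the following sketch, assuming the float converter substitutes np.nan for unparseable fields (as with the fval converter mentioned elsewhere in the thread):

```python
import numpy as np

# Stand-in for what loadtxt() would return once a forgiving float converter
# has turned missing/unparseable fields into np.nan.
data = np.array([[0.0, 1.0],
                 [2.0, np.nan]])

# The extra loop: mask every invalid (nan/inf) entry after the fact.
masked = np.ma.masked_invalid(data)
print(masked)
```

For a sentinel other than NaN, np.ma.masked_values(data, fill_value) would play the same role.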
Re: [Numpy-discussion] More loadtxt() changes
On Nov 25, 2008, at 3:33 PM, Ryan May wrote: You couldn't run this loop on the array returned by np.loadtxt() (by masking on the appropriate fill value)?

Yet an extra loop... Doable, yes... But meh.
Re: [Numpy-discussion] More loadtxt() changes
On Tue, Nov 25, 2008 at 2:01 PM, Pierre GM [EMAIL PROTECTED] wrote: On Nov 25, 2008, at 2:26 PM, John Hunter wrote: Yes, I've said on a number of occasions I'd like to see these functions in numpy, since a number of them make more sense as numpy methods than as stand alone functions. Great. Could we think about getting that in for 1.3.x, would you have time? Or should we wait till early Jan.?

I wasn't volunteering to do it, just that I support the migration if someone else wants to do it. I'm fully committed with mpl already...

JDH
Re: [Numpy-discussion] More loadtxt() changes
OK then, I'll take care of that over the next few weeks...

On Nov 25, 2008, at 4:56 PM, John Hunter wrote: On Tue, Nov 25, 2008 at 2:01 PM, Pierre GM [EMAIL PROTECTED] wrote: On Nov 25, 2008, at 2:26 PM, John Hunter wrote: Yes, I've said on a number of occasions I'd like to see these functions in numpy, since a number of them make more sense as numpy methods than as stand alone functions. Great. Could we think about getting that in for 1.3.x, would you have time? Or should we wait till early Jan.? I wasn't volunteering to do it, just that I support the migration if someone else wants to do it. I'm fully committed with mpl already... JDH
Re: [Numpy-discussion] More loadtxt() changes
John Hunter wrote: On Tue, Nov 25, 2008 at 12:16 PM, Pierre GM [EMAIL PROTECTED] wrote: A la mlab.csv2rec ? It could work with a bit more tweaking, basically following John Hunter's et al. path. What happens when the column names are unknown (read from the header) or wrong ? Actually, I'd like John to comment on that, hence the CC. More generally, wouldn't it be useful to push the recarray manipulating functions from matplotlib.mlab to numpy ? Yes, I've said on a number of occasions I'd like to see these functions in numpy, since a number of them make more sense as numpy methods than as stand alone functions.

John and I are in agreement here. The issue has remained somebody stepping up and doing the conversions (and fielding the questions and the resulting discussion) for the various routines that probably ought to go into NumPy. This would be a great place to get involved if there is a lurker looking for a project.

-Travis
Re: [Numpy-discussion] More loadtxt() changes
Pierre GM wrote: OK then, I'll take care of that over the next few weeks...

Thanks Pierre.

-Travis
Re: [Numpy-discussion] More loadtxt() changes
Oh don't mention... However, I'd be quite grateful if you could give an eye to the pb of mixing np.scalars and 0d subclasses of ndarray: looks like it's a C pb, quite out of my league...
http://scipy.org/scipy/numpy/ticket/826
http://article.gmane.org/gmane.comp.python.numeric.general/26354/match=priority+rules
http://article.gmane.org/gmane.comp.python.numeric.general/25670/match=priority+rules

On Nov 25, 2008, at 5:24 PM, Travis E. Oliphant wrote: Pierre GM wrote: OK then, I'll take care of that over the next few weeks... Thanks Pierre. -Travis
Re: [Numpy-discussion] More loadtxt() changes
Pierre GM wrote: Sounds like a plan. Wouldn't mind getting more feedback from fellow users before we get too deep, however...

Ok, I've attached, as a first cut, a diff against SVN HEAD that does (I think) what I'm looking for. It passes all of the old tests and passes my own quick test. A more rigorous test suite will follow, but I want this out the door before I need to leave for the day.

What this changeset essentially does is just add support for automatic dtypes along with supplying/reading names for flexible dtypes. It leverages StringConverter heavily, using a few tweaks so that old behavior is kept. This is by no means a final version. Probably the biggest change from what I mentioned earlier is that instead of dtype='auto', I've used dtype=None to signal the detection code, since dtype=='auto' causes problems. I welcome any and all suggestions here, both on the code and on the original idea of adding these capabilities to loadtxt().

Ryan

-- Ryan May Graduate Research Assistant School of Meteorology University of Oklahoma

Index: lib/io.py
===================================================================
--- lib/io.py	(revision 6099)
+++ lib/io.py	(working copy)
@@ -233,29 +233,138 @@
         for name in todel:
             os.remove(name)
 
-# Adapted from matplotlib
+def _string_like(obj):
+    try: obj + ''
+    except (TypeError, ValueError): return False
+    return True
 
-def _getconv(dtype):
-    typ = dtype.type
-    if issubclass(typ, np.bool_):
-        return lambda x: bool(int(x))
-    if issubclass(typ, np.integer):
-        return lambda x: int(float(x))
-    elif issubclass(typ, np.floating):
-        return float
-    elif issubclass(typ, np.complex):
-        return complex
+def str2bool(value):
+    """
+    Tries to transform a string supposed to represent a boolean to a boolean.
+
+    Raises
+    ------
+    ValueError
+        If the string is not 'True' or 'False' (case independent)
+    """
+    value = value.upper()
+    if value == 'TRUE':
+        return True
+    elif value == 'FALSE':
+        return False
     else:
-        return str
+        return int(bool(value))
 
+class StringConverter(object):
+    """
+    Factory class for function transforming a string into another object (int,
+    float).
-def _string_like(obj):
-    try: obj + ''
-    except (TypeError, ValueError): return 0
-    return 1
+
+    After initialization, an instance can be called to transform a string
+    into another object. If the string is recognized as representing a missing
+    value, a default value is returned.
+
+    Parameters
+    ----------
+    dtype : dtype, optional
+        Input data type, used to define a basic function and a default value
+        for missing data. For example, when `dtype` is float, the :attr:`func`
+        attribute is set to ``float`` and the default value to `np.nan`.
+    missing_values : sequence, optional
+        Sequence of strings indicating a missing value.
+
+    Attributes
+    ----------
+    func : function
+        Function used for the conversion
+    default : var
+        Default value to return when the input corresponds to a missing value.
+    mapper : sequence of tuples
+        Sequence of tuples (function, default value) to evaluate in order.
+    """
+
+    from numpy.core import nan  # To avoid circular import
+    mapper = [(str2bool, None),
+              (lambda x: int(float(x)), -1),
+              (float, nan),
+              (complex, nan+0j),
+              (str, '???')]
+
+    def __init__(self, dtype=None, missing_values=None):
+        if dtype is None:
+            self.func = str2bool
+            self.default = None
+            self._status = 0
+        else:
+            dtype = np.dtype(dtype).type
+            self.func, self.default, self._status = self._get_from_dtype(dtype)
+        # Store the list of strings corresponding to missing values.
+        if missing_values is None:
+            self.missing_values = []
+        else:
+            self.missing_values = set(list(missing_values) + [''])
+
+    def __call__(self, value):
+        if value in self.missing_values:
+            return self.default
+        return self.func(value)
+
+    def upgrade(self, value):
+        """
+        Tries to find the best converter for `value`, by testing different
+        converters in order.
+        The order in which the converters are tested is read from the
+        :attr:`_status` attribute of the instance.
+        """
+        try:
+            self.__call__(value)
+        except ValueError:
+            _statusmax = len(self.mapper)
+            if self._status == _statusmax:
+                raise ValueError("Could not find a valid conversion function")
+            elif self._status < _statusmax - 1:
+                self._status += 1
+            (self.func, self.default) = self.mapper[self._status]
+            self.upgrade(value)
+
+    def _get_from_dtype(self, dtype):
+        """
+        Sets the :attr:`func`
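To make the upgrade mechanism in the patch concrete, here is a simplified, self-contained rendition (not the patched numpy code: the mapper below starts at plain int instead of str2bool/int(float(x)), so each failure visibly bumps the status):

```python
class MiniStringConverter:
    """Try converters in order; on failure, upgrade to a more general one."""
    mapper = [(int, -1),                          # status 0: integers
              (float, float('nan')),              # status 1: floats
              (complex, complex(float('nan'))),   # status 2: complex
              (str, '???')]                       # status 3: anything

    def __init__(self):
        self._status = 0
        self.func, self.default = self.mapper[self._status]

    def upgrade(self, value):
        try:
            return self.func(value)
        except ValueError:
            if self._status >= len(self.mapper) - 1:
                raise ValueError("no valid conversion for %r" % (value,))
            # Move to the next, more general (function, default) pair.
            self._status += 1
            self.func, self.default = self.mapper[self._status]
            return self.upgrade(value)
```

Feeding it '1', then '1.5', then '2j', then 'abc' walks the status from 0 up to 3, which is exactly the progression Pierre's test_upgrade unittest checks.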
Re: [Numpy-discussion] More loadtxt() changes
Ryan, Quick comments:

* I already have some unittests for StringConverter, check the file I attach.

* Your str2bool will probably mess things up in upgrade compared to the one JDH had written (the one I sent you): you don't wanna use int(bool(value)), as it'll always give you 0 or 1 when you might need a ValueError.

* Your locked version of update won't probably work either, as you force the converter to output a string (you set the status to the largest possible, that's the one that outputs strings). Why don't you set the status to the current one (make a tmp one if needed).

* I'd probably get rid of StringConverter._get_from_dtype, as it is not needed outside the __init__. You may wanna stick to the original __init__.

# pylint disable-msg=E1101, W0212, W0621
import os
import tempfile

import numpy as np
import numpy.ma as ma
from numpy.ma.testutils import *
from StringIO import StringIO

from _preview import *


class TestStringConverter(TestCase):
    "Test StringConverter"
    #
    def test_upgrade(self):
        "Tests the upgrade method."
        converter = StringConverter()
        assert_equal(converter._status, 0)
        converter.upgrade('0')
        assert_equal(converter._status, 1)
        converter.upgrade('0.')
        assert_equal(converter._status, 2)
        converter.upgrade('0j')
        assert_equal(converter._status, 3)
        converter.upgrade('a')
        assert_equal(converter._status, len(converter.mapper) - 1)
    #
    def test_missing(self):
        "Tests the use of missing values."
        converter = StringConverter(missing_values=('missing', 'missed'))
        converter.upgrade('0')
        assert_equal(converter('0'), 0)
        assert_equal(converter(''), converter.default)
        assert_equal(converter('missing'), converter.default)
        assert_equal(converter('missed'), converter.default)
        try:
            converter('miss')
        except ValueError:
            pass


class TestLineReader(TestCase):
    "Tests the LineReader class"
    #
    def test_spacedelimiter(self):
        "Tests the use of space as delimiter."
        data = StringIO("0 1\n2 3\n4 5 6")
        reader = LineReader(data)
        nbfields = [len(line) for line in reader]
        assert_equal(nbfields, [2, 2, 3])
    #
    def test_get_first_row(self):
        "Tests the access of the first row."
        data = StringIO("0 1\n2 3\n4 5 6")
        reader = LineReader(data)
        assert_equal(reader.get_first_valid_row(), ['0', '1'])


class TestLoadTxt(TestCase):
    "Test the `loadtxt` function."
    #
    def setUp(self):
        "Pre-processing and initialization."
        data = "0 1\n2 3"
        (self.fhdw, self.fhnw) = tempfile.mkstemp()
        (self.fhdwo, self.fhnwo) = tempfile.mkstemp()
        os.write(self.fhdw, "A B\n")
        os.write(self.fhdwo, data)
        os.write(self.fhdw, data)
        os.close(self.fhdw)
        os.close(self.fhdwo)
    #
    def tearDown(self):
        "Post-processing."
        os.remove(self.fhnw)
        os.remove(self.fhnwo)
    #
    def test_noheader(self):
        "Tests loadtxt in absence of an header."
        data = self.fhnwo
        # No dtype
        test = loadtxt(data)
        assert_equal(test.shape, (1,))
        assert_equal(test.item(), (2, 3))
        assert_equal(test.dtype.names, ['0', '1'])
        # w/ basic dtype
        test = loadtxt(data, dtype=np.float)
        control = ma.array([[0, 1], [2, 3]], mask=False)
        assert_equal(test, control)
        # w/ flexible dtype
        dtype = [('A', np.int), ('B', np.float)]
        test = loadtxt(data, dtype=dtype)
        control = ma.array([(0, 1), (2, 3)], mask=(False, False), dtype=dtype)
        assert_equal(test, control)
        # w/ descriptor
        descriptor = {'names': ('A', 'B'), 'formats': (np.int, np.float)}
        test = loadtxt(data, dtype=descriptor)
        control = ma.array([(0, 1), (2, 3)], mask=(False, False), dtype=dtype)
        assert_equal(test, control)
        # w/ names
        test = loadtxt(data, names="a,b")
        dtype = [('a', np.int), ('b', np.int)]
        assert_equal(test, np.array([(0, 1), (2, 3)], dtype=dtype))
        assert_equal(test['a'].dtype, np.dtype(np.int))
    #
    def test_with_noheader_with_missing(self):
        "Tests `loadtxt` on a file w/o header, but w/ missing values."
        data = StringIO("0 1\n2 ")
        test = loadtxt(data, dtype=float)
        assert_equal(test, [[0, 1], [2, 3]])
        assert_equal(test.mask, [[0, 0], [0, 1]])
    #
    def test_with_header(self):
        "Tests `loadtxt` on a file w/ header."
        data = self.fhnw
        control = ma.array([(0, 1), (2, 3)],
                           dtype=[('a', np.int), ('b', np.int)])
        # No dtype
        test = loadtxt(data)
        assert_equal(test.dtype.names, ['a', 'b'])
        assert_equal(test, control)
        # W/ dtype: should fail, as there's already a header
        dtype = [('A', np.float), ('B', np.int)]
        try:
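Pierre's second bullet above can be seen with a toy comparison (function names hypothetical): a fallback of int(bool(value)) maps any non-empty string to 1 and never raises, so an upgrade-on-ValueError scheme would stay stuck on the boolean converter forever:

```python
def str2bool_loose(value):
    # Fallback variant: never raises; any other non-empty string becomes 1.
    value = value.upper()
    if value == 'TRUE':
        return True
    elif value == 'FALSE':
        return False
    return int(bool(value))

def str2bool_strict(value):
    # Strict variant (JDH-style): raising ValueError lets the caller upgrade
    # to the next converter in the mapper.
    value = value.upper()
    if value == 'TRUE':
        return True
    elif value == 'FALSE':
        return False
    raise ValueError("not a boolean: %r" % (value,))
```

With the loose variant, a numeric field like '3.14' silently becomes 1 instead of triggering an upgrade to the float converter.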
Re: [Numpy-discussion] More loadtxt() changes
On Tue, Nov 25, 2008 at 5:00 PM, Pierre GM [EMAIL PROTECTED] wrote: snip All, another question: What's the best way to have some kind of sandbox for code like the one Ryan is writing ? So that we can try it, modify it, without committing anything to SVN yet ?

Probably make a branch and do commits there. If you don't want to hassle with a merge, just copy the file over to the trunk when you are done and commit it from there, then remove the branch. Instructions on making branches are at http://projects.scipy.org/scipy/numpy/wiki/MakingBranches . snip

Chuck
Re: [Numpy-discussion] More loadtxt() changes
Pierre GM wrote: Ryan, Quick comments: * I already have some unittests for StringConverter, check the file I attach.

Ok, great.

* Your str2bool will probably mess things up in upgrade compared to the one JDH had written (the one I sent you): you don't wanna use int(bool(value)), as it'll always give you 0 or 1 when you might need a ValueError.

Ok, I wasn't sure. I was trying to merge what the old code used with the new str2bool you supplied. That's probably not all that necessary.

* Your locked version of update won't probably work either, as you force the converter to output a string (you set the status to the largest possible, that's the one that outputs strings). Why don't you set the status to the current one (make a tmp one if needed).

Looking at the code, it looks like mapper is only used in the upgrade() method. My goal by setting status to the largest possible is to lock the converter to the supplied function. That way for the user-supplied converters, the StringConverter doesn't try to upgrade away from it. My thinking was that if the user-supplied converter function fails, the user should know. (Though I got this wrong the first time.)

* I'd probably get rid of StringConverter._get_from_dtype, as it is not needed outside the __init__. You may wanna stick to the original __init__.

Done.

Ryan

-- Ryan May Graduate Research Assistant School of Meteorology University of Oklahoma
Re: [Numpy-discussion] More loadtxt() changes
On Nov 25, 2008, at 10:02 PM, Ryan May wrote: Pierre GM wrote: * Your locked version of update won't probably work either, as you force the converter to output a string (you set the status to the largest possible, that's the one that outputs strings). Why don't you set the status to the current one (make a tmp one if needed). Looking at the code, it looks like mapper is only used in the upgrade() method. My goal by setting status to the largest possible is to lock the converter to the supplied function. That way for the user-supplied converters, the StringConverter doesn't try to upgrade away from it. My thinking was that if the user-supplied converter function fails, the user should know. (Though I got this wrong the first time.)

Then, define a _locked attribute in StringConverter, and prevent upgrade from running if self._locked is True.
Re: [Numpy-discussion] More loadtxt() changes
Pierre GM wrote: On Nov 25, 2008, at 10:02 PM, Ryan May wrote: Pierre GM wrote: * Your locked version of update probably won't work either, as you force the converter to output a string (you set the status to the largest possible, which is the one that outputs strings). Why don't you set the status to the current one (make a tmp one if needed)?

Looking at the code, it looks like mapper is only used in the upgrade() method. My goal in setting status to the largest possible is to lock the converter to the supplied function. That way, for user-supplied converters, the StringConverter doesn't try to upgrade away from them. My thinking was that if the user-supplied converter function fails, the user should know. (Though I got this wrong the first time.)

Then, define a _locked attribute in StringConverter, and prevent upgrade from running if self._locked is True.

Sure, if you're into logic and sound design. I was going more for hackish and obtuse. (No, seriously, I don't know why I didn't think of that.)

Ryan
--
Ryan May
Graduate Research Assistant
School of Meteorology
University of Oklahoma
Re: [Numpy-discussion] More loadtxt() changes
Pierre GM wrote: On Nov 25, 2008, at 10:02 PM, Ryan May wrote: Pierre GM wrote: * Your locked version of update probably won't work either, as you force the converter to output a string (you set the status to the largest possible, which is the one that outputs strings). Why don't you set the status to the current one (make a tmp one if needed)?

Looking at the code, it looks like mapper is only used in the upgrade() method. My goal in setting status to the largest possible is to lock the converter to the supplied function. That way, for user-supplied converters, the StringConverter doesn't try to upgrade away from them. My thinking was that if the user-supplied converter function fails, the user should know. (Though I got this wrong the first time.)

Updated patch attached. This includes:
* Updated docstring
* New tests
* Fixes for previous issues
* Fixes to make the new tests actually work

I appreciate any and all feedback.

Ryan
--
Ryan May
Graduate Research Assistant
School of Meteorology
University of Oklahoma

Index: numpy/lib/io.py
===================================================================
--- numpy/lib/io.py	(revision 6107)
+++ numpy/lib/io.py	(working copy)
@@ -233,29 +233,136 @@
         for name in todel:
             os.remove(name)
 
-# Adapted from matplotlib
+def _string_like(obj):
+    try: obj + ''
+    except (TypeError, ValueError): return False
+    return True
 
-def _getconv(dtype):
-    typ = dtype.type
-    if issubclass(typ, np.bool_):
-        return lambda x: bool(int(x))
-    if issubclass(typ, np.integer):
-        return lambda x: int(float(x))
-    elif issubclass(typ, np.floating):
-        return float
-    elif issubclass(typ, np.complex):
-        return complex
+def str2bool(value):
+    """
+    Tries to transform a string supposed to represent a boolean to a boolean.
+
+    Raises
+    ------
+    ValueError
+        If the string is not 'True' or 'False' (case independent)
+    """
+    value = value.upper()
+    if value == 'TRUE':
+        return True
+    elif value == 'FALSE':
+        return False
     else:
-        return str
+        raise ValueError("Invalid boolean")
 
+class StringConverter(object):
+    """
+    Factory class for functions transforming a string into another object
+    (int, float).
 
-def _string_like(obj):
-    try: obj + ''
-    except (TypeError, ValueError): return 0
-    return 1
+    After initialization, an instance can be called to transform a string
+    into another object. If the string is recognized as representing a
+    missing value, a default value is returned.
+
+    Parameters
+    ----------
+    dtype : dtype, optional
+        Input data type, used to define a basic function and a default value
+        for missing data. For example, when `dtype` is float, the
+        :attr:`func` attribute is set to ``float`` and the default value
+        to `np.nan`.
+    missing_values : sequence, optional
+        Sequence of strings indicating a missing value.
+
+    Attributes
+    ----------
+    func : function
+        Function used for the conversion.
+    default : var
+        Default value to return when the input corresponds to a missing
+        value.
+    mapper : sequence of tuples
+        Sequence of tuples (function, default value) to evaluate in order.
+    """
+
+    from numpy.core import nan  # To avoid circular import
+    mapper = [(str2bool, None),
+              (int, -1),  # Needs to be int so that it can fail and promote
+                          # to float
+              (float, nan),
+              (complex, nan+0j),
+              (str, '???')]
+
+    def __init__(self, dtype=None, missing_values=None):
+        self._locked = False
+        if dtype is None:
+            self.func = str2bool
+            self.default = None
+            self._status = 0
+        else:
+            dtype = np.dtype(dtype).type
+            if issubclass(dtype, np.bool_):
+                (self.func, self.default, self._status) = (str2bool, 0, 0)
+            elif issubclass(dtype, np.integer):
+                # Needs to be int(float(x)) so that floating point values
+                # will be coerced to int when specified by dtype
+                (self.func, self.default, self._status) = \
+                    (lambda x: int(float(x)), -1, 1)
+            elif issubclass(dtype, np.floating):
+                (self.func, self.default, self._status) = (float, np.nan, 2)
+            elif issubclass(dtype, np.complex):
+                (self.func, self.default, self._status) = \
+                    (complex, np.nan + 0j, 3)
+            else:
+                (self.func, self.default, self._status) = (str, '???', -1)
+
+        # Store the list of strings corresponding to missing values.
+        if missing_values is None:
+            self.missing_values = []
+        else:
+            self.missing_values = set(list(missing_values) + [''])
+
+    def __call__(self, value):
+        if value in self.missing_values:
+            return self.default
+        return