Re: [Numpy-discussion] More loadtxt() changes

2008-11-28 Thread Manuel Metz
Pierre GM wrote:
 On Nov 27, 2008, at 3:08 AM, Manuel Metz wrote:
 Certainly, yes! Dealing with fixed-length fields would be necessary.  
 The
 case I had in mind had both -- a separator (|) __and__ fixed-length
 fields -- and is probably very special in that sense. But such
 data files exist out there...
 
 Well, if you have a non-space delimiter, it doesn't matter if the  
 fields have a fixed length or not, does it? Each field is stripped  
 anyway.

Yes. It would already be _very_ helpful (without changing loadtxt too 
much) used a converter like this

def fval(val):
    try:
        return float(val)
    except:
        return numpy.nan

instead of float(val) by default.

mm

 The real issue is when the delimiter is ' '... I should be able to  
 take care of that over the week-end (which started earlier today over  
 here :) 


Re: [Numpy-discussion] More loadtxt() changes

2008-11-28 Thread Pierre GM
Manuel,

Give me the week-end to come up with something. What you want is  
already doable with the current implementation of np.loadtxt, through  
the converter keyword. Support for missing data will be covered in a  
separate function, most likely to end up in numpy.ma.io eventually.
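
In the meantime, something along these lines should already work with the
current np.loadtxt (just a sketch -- the file name and the four-column
layout are made up for illustration):

import numpy as np

def fval(val):
    # Fall back to nan when a field cannot be parsed as a float.
    try:
        return float(val)
    except ValueError:
        return np.nan

# Apply the same converter to every column.
data = np.loadtxt('data.txt', delimiter='|',
                  converters=dict((i, fval) for i in range(4)))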



On Nov 28, 2008, at 5:42 AM, Manuel Metz wrote:

 Pierre GM wrote:
 On Nov 27, 2008, at 3:08 AM, Manuel Metz wrote:
 Certainly, yes! Dealing with fixed-length fields would be necessary.
 The
 case I had in mind had both -- a separator (|) __and__ fixed- 
 length
 fields -- and is probably very special in that sense. But such
 data files exist out there...

 Well, if you have a non-space delimiter, it doesn't matter if the
 fields have a fixed length or not, does it? Each field is stripped
 anyway.

 Yes. It would already be _very_ helpful (without changing loadtxt too
 much) used a converter like this

 def fval(val):
     try:
         return float(val)
     except:
         return numpy.nan

 instead of float(val) by default.

 mm

 The real issue is when the delimiter is ' '... I should be able to
 take care of that over the week-end (which started earlier today over
 here :)


Re: [Numpy-discussion] More loadtxt() changes

2008-11-27 Thread Manuel Metz
Pierre GM wrote:
 On Nov 26, 2008, at 5:55 PM, Ryan May wrote:
 
 Manuel Metz wrote:
 Ryan May wrote:
 3) Better support for missing values.  The docstring mentions a  
 way of
 handling missing values by passing in a converter.  The problem  
 with this is
 that you have to pass in a converter for *every column* that will  
 contain
 missing values.  If you have a text file with 50 columns, writing  
 this
 dictionary of converters seems like ugly and needless  
 boilerplate.  I'm
 unsure of how best to pass in both what values indicate missing  
 values and
 what values to fill in their place.  I'd love suggestions
 Hi Ryan,
   this would be a great feature to have !!!
 
 About missing values:
 
 * I don't think missing values should be supported in np.loadtxt. That  
 should go into a specific np.ma.io.loadtxt function, a preview of  
 which I posted earlier. I'll modify it taking Ryan's new function into  
 account, and Christopher's suggestion (defining a dictionary {column  
 name : missing values}).
 
 * StringConverter already defines some default filling values for each  
 dtype. In  np.ma.io.loadtxt, these values can be overwritten. Note  
 that you should also be able to define a filling value by specifying a  
 converter (think float(x or 0) for example)
 
 * Missing values on space-separated fields are very tricky to handle:
 take a line like "a,,,d". With a comma as separator, it's clear that  
 the 2nd and 3rd fields are missing.
 Now, imagine that commas are actually spaces ("a   d"): 'd' is now  
 seen as the 2nd field of a 2-field record, not as the 4th field of a  
 4-field record with 2 missing values. I thought about it, and kicked  
 it into touch.
 
 * That said, there should be a way to deal with fixed-length fields,  
 probably by taking consecutive slices of the initial string. That way,  
 we should be able to keep track of missing data...

Certainly, yes! Dealing with fixed-length fields would be necessary. The 
case I had in mind had both -- a separator (|) __and__ fixed-length 
fields -- and is probably very special in that sense. But such 
data files exist out there...

mm


Re: [Numpy-discussion] More loadtxt() changes

2008-11-27 Thread Pierre GM

On Nov 27, 2008, at 3:08 AM, Manuel Metz wrote:


 Certainly, yes! Dealing with fixed-length fields would be necessary.  
 The
 case I had in mind had both -- a separator (|) __and__ fixed-length
 fields -- and is probably very special in that sense. But such
 data files exist out there...

Well, if you have a non-space delimiter, it doesn't matter if the  
fields have a fixed length or not, does it? Each field is stripped  
anyway.
The real issue is when the delimiter is ' '... I should be able to  
take care of that over the week-end (which started earlier today over  
here :) 


Re: [Numpy-discussion] More loadtxt() changes

2008-11-27 Thread Nils Wagner
On Thu, 27 Nov 2008 09:08:41 +0100
  Manuel Metz [EMAIL PROTECTED] wrote:
 Pierre GM wrote:
 On Nov 26, 2008, at 5:55 PM, Ryan May wrote:
 
 Manuel Metz wrote:
 Ryan May wrote:
 3) Better support for missing values.  The docstring mentions a way of
 handling missing values by passing in a converter.  The problem with
 this is that you have to pass in a converter for *every column* that
 will contain missing values.  If you have a text file with 50 columns,
 writing this dictionary of converters seems like ugly and needless
 boilerplate.  I'm unsure of how best to pass in both what values
 indicate missing values and what values to fill in their place.  I'd
 love suggestions
 Hi Ryan,
   this would be a great feature to have !!!
 
 About missing values:
 
 * I don't think missing values should be supported in np.loadtxt. That
 should go into a specific np.ma.io.loadtxt function, a preview of
 which I posted earlier. I'll modify it taking Ryan's new function into
 account, and Christopher's suggestion (defining a dictionary {column
 name : missing values}).
 
 * StringConverter already defines some default filling values for each
 dtype. In np.ma.io.loadtxt, these values can be overwritten. Note
 that you should also be able to define a filling value by specifying a
 converter (think float(x or 0) for example)
 
 * Missing values on space-separated fields are very tricky to handle:
 take a line like "a,,,d". With a comma as separator, it's clear that
 the 2nd and 3rd fields are missing.
 Now, imagine that commas are actually spaces ("a   d"): 'd' is now
 seen as the 2nd field of a 2-field record, not as the 4th field of a
 4-field record with 2 missing values. I thought about it, and kicked
 it into touch.
 
 * That said, there should be a way to deal with fixed-length fields,
 probably by taking consecutive slices of the initial string. That way,
 we should be able to keep track of missing data...
 
 Certainly, yes! Dealing with fixed-length fields would be necessary.
 The case I had in mind had both -- a separator (|) __and__ fixed-length
 fields -- and is probably very special in that sense. But such
 data files exist out there...
 
See pages 9 and 10 (Bulk data input deck):
http://www.zonatech.com/Documentation/zndalusersmanual2.0.pdf

Nils


Re: [Numpy-discussion] More loadtxt() changes

2008-11-26 Thread Manuel Metz
Ryan May wrote:
 Hi,
 
 I have a couple more changes to loadtxt() that I'd like to code up in time
 for 1.3, but I thought I should run them by the list before doing too much
 work.  These are already implemented in some fashion in
 matplotlib.mlab.csv2rec(), but the code bases are different enough, that
 pretty much only the idea can be lifted.  All of these changes would be done
 in a manner that is backwards compatible with the current API.
 
 1) Support for setting the names of fields in the returned structured array
 without using dtype.  This can be a passed in list of names or reading the
 names of fields from the first line of the file.  Many files have a header
 line that gives a name for each column.  Adding this would obviously make
 loadtxt much more general and allow for more generic code, IMO. My current
 thinking is to add a *name* keyword parameter that defaults to None, for no
 support for reading names.  Setting it to True would tell loadtxt() to read
 the names from the first line (after skiprows).  The other option would be
 to set names to a list of strings.
 
 2) Support for automatic dtype inference.  Instead of assuming all values
 are floats, this would try a list of options until one worked.  For strings,
 this would keep track of the longest string within a given field before
 setting the dtype.  This would allow reading of files containing a mixture
 of types much more easily, without having to go to the trouble of
 constructing a full dtype by hand.  This would work alongside any custom
 converters one passes in.  My current thinking of API would just be to add
 the option of passing the string 'auto' as the dtype parameter.
 
 3) Better support for missing values.  The docstring mentions a way of
 handling missing values by passing in a converter.  The problem with this is
 that you have to pass in a converter for *every column* that will contain
 missing values.  If you have a text file with 50 columns, writing this
 dictionary of converters seems like ugly and needless boilerplate.  I'm
 unsure of how best to pass in both what values indicate missing values and
 what values to fill in their place.  I'd love suggestions

Hi Ryan,
   this would be a great feature to have !!!

One question: I have a datafile in ASCII format that uses a fixed width 
for each column. If no data is present, the space is left empty (see 
second row). What is the default behavior of the StringConverter class 
in this case? Does it ignore the empty entry by default? If so, what is 
the value in the array in this case? Is it nan?

Example file:

   1| 123.4| -123.4| 00.0
   2|      |  234.7| 12.2

Manuel

 Here's an example of my use case (without 50 columns):
 
 ID,First Name,Last Name,Homework1,Homework2,Quiz1,Homework3,Final
 1234,Joe,Smith,85,90,,76,
 5678,Jane,Doe,65,99,,78,
 9123,Joe,Plumber,45,90,,92,
 
 Currently reading in this file requires a bit of boilerplate (declaring
 dtypes, converters).  While it's nothing I can't write, it still would be
 easier to write it once within loadtxt and have it for everyone.
 
 Any support for *any* of these ideas?  Any suggestions on how the user
 should pass in the information?
 
 Thanks,
 
 Ryan
 
 
 
 
 


Re: [Numpy-discussion] More loadtxt() changes

2008-11-26 Thread John Hunter
On Tue, Nov 25, 2008 at 11:23 PM, Ryan May [EMAIL PROTECTED] wrote:

 Updated patch attached.  This includes:
  * Updated docstring
  * New tests
  * Fixes for previous issues
  * Fixes to make new tests actually work

 I appreciate any and all feedback.

I'm having trouble applying your patch, so I haven't tested yet, but
do you (and do you want to) handle a case like this::

from StringIO import StringIO
import matplotlib.mlab as mlab
f1 = StringIO("""\
name   age  weight
John   23   145.
Harry  43   180.""")

for line in f1:
    print line.split(' ')


I.e., space delimited but using an irregular number of spaces?   One
place this comes up a lot is when the output files are actually
fixed-width using spaces to line up the columns.  One could count the
columns to figure out the fixed widths and work with that, but it is
much easier to simply assume space delimiting and handle the irregular
number of spaces assuming one or more spaces is the delimiter.  In
csv2rec, we write a custom file object to handle this case.

Apologies if you are already handling this and I missed it...

JDH


Re: [Numpy-discussion] More loadtxt() changes

2008-11-26 Thread Ryan May
John Hunter wrote:
 On Tue, Nov 25, 2008 at 11:23 PM, Ryan May [EMAIL PROTECTED] wrote:
 
 Updated patch attached.  This includes:
  * Updated docstring
  * New tests
  * Fixes for previous issues
  * Fixes to make new tests actually work

 I appreciate any and all feedback.
 
 I'm having trouble applying your patch, so I haven't tested yet, but
 do you (and do you want to) handle a case like this::
 
 from StringIO import StringIO
 import matplotlib.mlab as mlab
 f1 = StringIO("""\
 name   age  weight
 John   23   145.
 Harry  43   180.""")
 
 for line in f1:
     print line.split(' ')
 
 
 I.e., space delimited but using an irregular number of spaces?   One
 place this comes up a lot is when the output files are actually
 fixed-width using spaces to line up the columns.  One could count the
 columns to figure out the fixed widths and work with that, but it is
 much easier to simply assume space delimiting and handle the irregular
 number of spaces assuming one or more spaces is the delimiter.  In
 csv2rec, we write a custom file object to handle this case.
 
 Apologies if you are already handling this and I missed it...

I think line.split(None) handles this case, so *in theory* passing 
delimiter=None would do it.  I *am* interested in this case, so I'll 
have to give it a try when I get a chance. (I sense this is the same 
case as Manuel just asked about.)
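
For illustration, here is what the two flavors of split return on one of
those lines:

line = 'John   23   145.'
print line.split(' ')   # ['John', '', '', '23', '', '', '145.']
print line.split(None)  # ['John', '23', '145.'] -- what delimiter=None gives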

Ryan

-- 
Ryan May
Graduate Research Assistant
School of Meteorology
University of Oklahoma


Re: [Numpy-discussion] More loadtxt() changes

2008-11-26 Thread Ryan May
Manuel Metz wrote:
 Ryan May wrote:
 3) Better support for missing values.  The docstring mentions a way of
 handling missing values by passing in a converter.  The problem with this is
 that you have to pass in a converter for *every column* that will contain
 missing values.  If you have a text file with 50 columns, writing this
 dictionary of converters seems like ugly and needless boilerplate.  I'm
 unsure of how best to pass in both what values indicate missing values and
 what values to fill in their place.  I'd love suggestions
 
 Hi Ryan,
this would be a great feature to have !!!

Thanks for the support!

 One question: I have a datafile in ASCII format that uses a fixed width 
 for each column. If no data is present, the space is left empty (see 
 second row). What is the default behavior of the StringConverter class 
 in this case? Does it ignore the empty entry by default? If so, what is 
 the value in the array in this case? Is it nan?
 
 Example file:
 
    1| 123.4| -123.4| 00.0
    2|      |  234.7| 12.2
 

I don't think this is so much anything to do with StringConverter, but 
more to do with how to split lines.  Maybe we should add an option that, 
instead of simply specifying characters that delimit the fields, allows 
one to pass a custom function to split lines?  That could either be done 
by overriding `delimiter` or by adding a new option like `splitter`.

I'll have to give that some thought.
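
Purely as a sketch of what I mean (the `splitter` keyword below is
hypothetical -- nothing like it exists yet):

def pipe_splitter(line):
    # Split on the '|' delimiter, then strip the fixed-width padding.
    return [field.strip() for field in line.split('|')]

# Hypothetical usage:
# data = np.loadtxt('data.txt', splitter=pipe_splitter)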

Ryan

-- 
Ryan May
Graduate Research Assistant
School of Meteorology
University of Oklahoma


Re: [Numpy-discussion] More loadtxt() changes

2008-11-26 Thread Pierre GM

On Nov 26, 2008, at 5:55 PM, Ryan May wrote:

 Manuel Metz wrote:
 Ryan May wrote:
 3) Better support for missing values.  The docstring mentions a  
 way of
 handling missing values by passing in a converter.  The problem  
 with this is
 that you have to pass in a converter for *every column* that will  
 contain
 missing values.  If you have a text file with 50 columns, writing  
 this
 dictionary of converters seems like ugly and needless  
 boilerplate.  I'm
 unsure of how best to pass in both what values indicate missing  
 values and
 what values to fill in their place.  I'd love suggestions

 Hi Ryan,
   this would be a great feature to have !!!

About missing values:

* I don't think missing values should be supported in np.loadtxt. That  
should go into a specific np.ma.io.loadtxt function, a preview of  
which I posted earlier. I'll modify it taking Ryan's new function into  
account, and Christopher's suggestion (defining a dictionary {column  
name : missing values}).

* StringConverter already defines some default filling values for each  
dtype. In  np.ma.io.loadtxt, these values can be overwritten. Note  
that you should also be able to define a filling value by specifying a  
converter (think float(x or 0) for example)
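
For instance, assuming a comma-delimited file f whose second column may be
empty:

converters = {1: lambda x: float(x or 0)}  # '' -> 0.0
data = np.loadtxt(f, delimiter=',', converters=converters)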

* Missing values on space-separated fields are very tricky to handle:
take a line like "a,,,d". With a comma as separator, it's clear that  
the 2nd and 3rd fields are missing.
Now, imagine that commas are actually spaces ("a   d"): 'd' is now  
seen as the 2nd field of a 2-field record, not as the 4th field of a  
4-field record with 2 missing values. I thought about it, and kicked  
it into touch.

* That said, there should be a way to deal with fixed-length fields,  
probably by taking consecutive slices of the initial string. That way,  
we should be able to keep track of missing data...




[Numpy-discussion] More loadtxt() changes

2008-11-25 Thread Ryan May
Hi,

I have a couple more changes to loadtxt() that I'd like to code up in time
for 1.3, but I thought I should run them by the list before doing too much
work.  These are already implemented in some fashion in
matplotlib.mlab.csv2rec(), but the code bases are different enough, that
pretty much only the idea can be lifted.  All of these changes would be done
in a manner that is backwards compatible with the current API.

1) Support for setting the names of fields in the returned structured array
without using dtype.  This can be a passed in list of names or reading the
names of fields from the first line of the file.  Many files have a header
line that gives a name for each column.  Adding this would obviously make
loadtxt much more general and allow for more generic code, IMO. My current
thinking is to add a *name* keyword parameter that defaults to None, for no
support for reading names.  Setting it to True would tell loadtxt() to read
the names from the first line (after skiprows).  The other option would be
to set names to a list of strings.

2) Support for automatic dtype inference.  Instead of assuming all values
are floats, this would try a list of options until one worked.  For strings,
this would keep track of the longest string within a given field before
setting the dtype.  This would allow reading of files containing a mixture
of types much more easily, without having to go to the trouble of
constructing a full dtype by hand.  This would work alongside any custom
converters one passes in.  My current thinking of API would just be to add
the option of passing the string 'auto' as the dtype parameter.
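
To make the proposed API concrete, usage would look something like this
(both `names` and `'auto'` are the proposals above, not existing options):

# Read the field names from the first line of the file:
data = np.loadtxt('grades.csv', delimiter=',', names=True, dtype='auto')

# Or supply the names explicitly:
data = np.loadtxt('grades.csv', delimiter=',', dtype='auto',
                  names=['id', 'first', 'last', 'hw1', 'hw2',
                         'quiz1', 'hw3', 'final'])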

3) Better support for missing values.  The docstring mentions a way of
handling missing values by passing in a converter.  The problem with this is
that you have to pass in a converter for *every column* that will contain
missing values.  If you have a text file with 50 columns, writing this
dictionary of converters seems like ugly and needless boilerplate.  I'm
unsure of how best to pass in both what values indicate missing values and
what values to fill in their place.  I'd love suggestions

Here's an example of my use case (without 50 columns):

ID,First Name,Last Name,Homework1,Homework2,Quiz1,Homework3,Final
1234,Joe,Smith,85,90,,76,
5678,Jane,Doe,65,99,,78,
9123,Joe,Plumber,45,90,,92,

Currently reading in this file requires a bit of boilerplate (declaring
dtypes, converters).  While it's nothing I can't write, it still would be
easier to write it once within loadtxt and have it for everyone.
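
For the record, here is roughly the boilerplate I mean for the file above
(the fill value of -1 is an arbitrary choice):

fill = lambda s: float(s or -1)
conv = dict((col, fill) for col in (3, 4, 5, 6, 7))
data = np.loadtxt('grades.csv', delimiter=',', skiprows=1, converters=conv,
                  dtype='i4,S10,S10,f4,f4,f4,f4,f4')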

Any support for *any* of these ideas?  Any suggestions on how the user
should pass in the information?

Thanks,

Ryan

-- 
Ryan May
Graduate Research Assistant
School of Meteorology
University of Oklahoma


Re: [Numpy-discussion] More loadtxt() changes

2008-11-25 Thread Pierre GM

Ryan,
FYI,  I've been coding over the last couple of weeks an extension of  
loadtxt for a better support of masked data, with the option to read  
column names in a header. Please find an example below (I also have  
unittests). Most of the work is actually inspired by matplotlib's  
mlab.csv2rec. It might be worth not duplicating efforts.

Cheers,
P.




"""
:mod:`_preview`
===============

A collection of utilities from incoming versions of numpy.ma
"""





import itertools
import numpy as np
import numpy.ma as ma



_string_like = np.lib.io._string_like

def _to_filehandle(fname, flag='r', return_opened=False):
    """
    Returns the filehandle corresponding to a string or a file.
    If the string ends in '.gz', the file is automatically unzipped.

    Parameters
    ----------
    fname : string, filehandle
        Name of the file whose filehandle must be returned.
    flag : string, optional
        Flag indicating the status of the file ('r' for read, 'w' for write).
    return_opened : boolean, optional
        Whether to return the opening status of the file.

    """
    if _string_like(fname):
        if fname.endswith('.gz'):
            import gzip
            fhd = gzip.open(fname, flag)
        else:
            fhd = file(fname, flag)
        opened = True
    elif hasattr(fname, 'seek'):
        fhd = fname
        opened = False
    else:
        raise ValueError('fname must be a string or file handle')
    if return_opened:
        return fhd, opened
    return fhd


def flatten_dtype(dtp):
    """
    Unpack a structured data-type.

    """
    if dtp.names is None:
        return [dtp]
    else:
        types = []
        for field in dtp.names:
            (typ, _) = dtp.fields[field]
            flat_dt = flatten_dtype(typ)
            types.extend(flat_dt)
        return types



class LineReader:
    """
    File reader that automatically splits each line. This reader behaves like
    an iterator.

    Parameters
    ----------
    fhd : filehandle
        File handle of the underlying file.
    comment : string, optional
        The character used to indicate the start of a comment.
    delimiter : string, optional
        The string used to separate values.  By default, this is any
        whitespace.

    """
    #
    def __init__(self, fhd, comment='#', delimiter=None):
        self.fh = fhd
        self.comment = comment
        self.delimiter = delimiter
        if delimiter == ' ':
            self.delimiter = None
    #
    def close(self):
        "Close the current reader."
        self.fh.close()
    #
    def seek(self, arg):
        """
        Moves to a new position in the file.

        See Also
        --------
        file.seek

        """
        self.fh.seek(arg)
    #
    def splitter(self, line):
        """
        Splits the line at each current delimiter.
        Comments are stripped beforehand.
        """
        line = line.split(self.comment)[0].strip()
        delimiter = self.delimiter
        if line:
            return line.split(delimiter)
        else:
            return []
    #
    def next(self):
        """
        Moves to the next line or raises :exc:`StopIteration`.
        """
        return self.splitter(self.fh.next())
    #
    def __iter__(self):
        for line in self.fh:
            yield self.splitter(line)

    def readline(self):
        """
        Returns the next line of the file, split at the delimiter and stripped
        of comments.
        """
        return self.splitter(self.fh.readline())

    def skiprows(self, nbrows=1):
        """
        Skips `nbrows` from the file.
        """
        for i in range(nbrows):
            self.fh.readline()

    def get_first_valid_row(self):
        """
        Returns the values in the first valid (uncommented and not empty) line
        of the file.
        """
        first_values = None
        while not first_values:
            first_line = self.fh.readline()
            if first_line == '':  # EOF reached
                raise IOError('End-of-file reached before encountering data.')
            first_values = self.splitter(first_line)
        return first_values



itemdictionary = {'return': 'return_',
                  'file': 'file_',
                  'print': 'print_'
                  }


def process_header(headers):
    """
    Validates a list of strings to use as field names.
    The strings are stripped of any non-alphanumeric character, and spaces
    are replaced by `_`.
    """
    #
    # Define the characters to delete from the headers
    delete = set("""~!@#$%^&*()-=+~\|]}[{';: /?.>,<""")
    delete.add('"')

    names = []
    seen = dict()
    for i, item in enumerate(headers):
        item = item.strip().lower().replace(' ', '_')
        item = ''.join([c for c in item if c not in delete])
        if not len(item):
            item = 'column%d' % i

        item = itemdictionary.get(item, item)
        cnt = seen.get(item, 0)
        if cnt > 0:
            names.append(item + '_%d' % cnt)
        else:
            names.append(item)
        seen[item] = cnt + 1
    return names

Re: [Numpy-discussion] More loadtxt() changes

2008-11-25 Thread Christopher Barker
Pierre GM wrote:
 FYI,  I've been coding over the last couple of weeks an extension of 
 loadtxt for a better support of masked data, with the option to read 
 column names in a header. Please find an example below

great, thanks! this could be very useful to me.

Two comments:


missing : string, optional
    A string representing a missing value, irrespective of the
    column where it appears (e.g., ``'missing'`` or ``'unused'``).


It might be nice if `missing` could be a sequence of strings, for when 
there is more than one value indicating missing data that is not clearly 
mapped to a particular field.



missing_values : {None, dictionary}, optional
    A dictionary mapping a column number to a string indicating
    whether the corresponding field should be masked.


would it be possible to specify a column header, rather than a number here?


-Chris







-- 
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/ORR             (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

[EMAIL PROTECTED]


Re: [Numpy-discussion] More loadtxt() changes

2008-11-25 Thread Pierre GM

On Nov 25, 2008, at 12:30 PM, Christopher Barker wrote:

 
 missing : string, optional
     A string representing a missing value, irrespective of the
     column where it appears (e.g., ``'missing'`` or ``'unused'``).
 

 It might be nice if missing could be a sequence of strings, if there
 is more than one value for missing values, that are not clearly mapped
 to a particular field.

OK, easy enough.

 
 missing_values : {None, dictionary}, optional
     A dictionary mapping a column number to a string indicating
     whether the corresponding field should be masked.
 

 would it be possible to specify a column header, rather than a number here?

A la mlab.csv2rec ? It could work with a bit more tweaking, basically  
following John Hunter's et al. path. What happens when the column  
names are unknown (read from the header) or wrong ?

Actually, I'd like John to comment on that, hence the CC. More  
generally, wouldn't it be useful to push the recarray manipulating  
functions from matplotlib.mlab to numpy ?


Re: [Numpy-discussion] More loadtxt() changes

2008-11-25 Thread Christopher Barker
Pierre GM wrote:
 would it be possible to specify a column header, rather than a number here?
 
 A la mlab.csv2rec ?

I'll have to take a look at that.

 following John Hunter's et al. path. What happens when the column  
 names are unknown (read from the header) or wrong ?

well, my use case is that I don't know column numbers, but I do know 
column headers, and what missing value is associated with a given 
header. You have to know something! If the header is wrong, you get an 
error, though we may need to decide what "wrong" means.

In my case, I'm dealing with data that has pre-specified headers (and I 
think missing values that go with them), but in any given file I don't 
know which of those columns is there. I want to read it in, and be able 
to query the result for what data it has.


 Actually, I'd like John to comment on that, hence the CC.

I don't see a CC, but yes, it would be nice to get his input.

 More  
 generally, wouldn't it be useful to push the recarray manipulating  
 functions from matplotlib.mlab to numpy ?

I think so -- or scipy. I'd really like MPL to be about plotting, and 
only plotting.

-Chris



-- 
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/ORR             (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

[EMAIL PROTECTED]


Re: [Numpy-discussion] More loadtxt() changes

2008-11-25 Thread Ryan May
Pierre GM wrote:
 Ryan,
 FYI,  I've been coding over the last couple of weeks an extension of 
 loadtxt for a better support of masked data, with the option to read 
 column names in a header. Please find an example below (I also have 
 unittests). Most of the work is actually inspired by matplotlib's 
 mlab.csv2rec. It might be worth not duplicating efforts.
 Cheers,
 P.

Absolutely!  Definitely don't want to duplicate effort here.  What I see 
here meets a lot of what I was looking for.  Here are some questions:

1) It looks like the function returns a structured array rather than a 
rec array, so that fields are obtained by doing a dictionary access. 
Since it's a dictionary access, is there any reason that the header 
needs to be munged to replace characters and reserved names?  IIUC, 
csv2rec changes names b/c it returns a rec array, which uses attribute 
lookup and hence all names need to be valid python identifiers.  This is 
not the case for a structured array.
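
To illustrate the distinction:

import numpy as np
a = np.array([(1, 2.5)], dtype=[('id', 'i4'), ('val', 'f8')])
print a['val']   # structured array: dictionary-style access
r = a.view(np.recarray)
print r.val      # recarray: attribute access, so names must be identifiers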

2) Can we avoid the use of seek() in here?  I just posted a patch to 
change the check to readline, which was the only file function used 
previously.  This allowed the direct use of a file-like object returned 
by urllib2.urlopen().
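
That is, with the readline-based check this works directly (URL made up):

import urllib2
import numpy as np
data = np.loadtxt(urllib2.urlopen('http://example.com/data.txt'))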

3) In order to avoid breaking backwards compatibility, can we change to 
default for dtype to be float32, and instead use some kind of special 
value ('auto' ?) to use the automatic dtype determination?

I'm currently cooking up some of these changes myself, but thought I 
would see what you thought first.

Ryan

-- 
Ryan May
Graduate Research Assistant
School of Meteorology
University of Oklahoma


Re: [Numpy-discussion] More loadtxt() changes

2008-11-25 Thread Pierre GM

On Nov 25, 2008, at 2:06 PM, Ryan May wrote:

 1) It looks like the function returns a structured array rather than a
 rec array, so that fields are obtained by doing a dictionary access.
 Since it's a dictionary access, is there any reason that the header
 needs to be munged to replace characters and reserved names?  IIUC,
 csv2rec changes names b/c it returns a rec array, which uses attribute
 lookup and hence all names need to be valid python identifiers.   
 This is
 not the case for a structured array.

Personally, I prefer flexible ndarrays to recarrays, hence the output.  
However, I still think that names should be as clean as possible to  
avoid bad surprises down the road.


 2) Can we avoid the use of seek() in here?  I just posted a patch to
 change the check to readline, which was the only file function used
 previously.  This allowed the direct use of a file-like object  
 returned
 by urllib2.urlopen().

I coded that a couple of weeks ago, before you posted your patch and I  
didn't have time to check it. Yes, we could try getting rid of seek.  
However, we need to find a way to rewind to the beginning of the file  
if the dtypes are not given in input (as we parsed the whole file to  
find the best converter in that case).


 3) In order to avoid breaking backwards compatibility, can we change  
 to
 default for dtype to be float32, and instead use some kind of special
 value ('auto' ?) to use the automatic dtype determination?

I'm not especially concerned w/ backwards compatibility, because we're  
supporting masked values (something that np.loadtxt shouldn't have to  
worry about). Initially, I needed a replacement to the fromfile  
function in the scikits.timeseries.trecords package. I figured it'd be  
easier and more portable to get a function for generic masked arrays,  
that could be adapted afterwards to timeseries. In any case, I was  
considering the functions I sent you to be part of some  
numpy.ma.io module rather than a replacement for np.loadtxt. I tried to get  
the syntax as close as possible to np.loadtxt and mlab.csv2rec, but  
there'll always be some differences.

So, yes, we could try to use a default dtype=float and yes, we could  
have an extra parameter 'auto'. But is it really that useful ? I'm not  
sure (well, no, I'm sure it's not...)

 I'm currently cooking up some of these changes myself, but thought I
 would see what you thought first.



Re: [Numpy-discussion] More loadtxt() changes

2008-11-25 Thread John Hunter
On Tue, Nov 25, 2008 at 12:16 PM, Pierre GM [EMAIL PROTECTED] wrote:

 A la mlab.csv2rec ? It could work with a bit more tweaking, basically
 following John Hunter's et al. path. What happens when the column names are
 unknown (read from the header) or wrong ?

 Actually, I'd like John to comment on that, hence the CC. More generally,
 wouldn't it be useful to push the recarray manipulating functions from
 matplotlib.mlab to numpy ?

Yes, I've said on a number of occasions I'd like to see these
functions in numpy, since a number of them make more sense as numpy
methods than as stand alone functions.

 What happens when the column names are unknown (read from the header) or 
 wrong ?

I'm not quite sure what you are looking for here.  Either the user
will have to know the correct column name or the column number or you
should raise an error.  I think supporting column names everywhere
they make sense is critical since this is how most people think about
these CSV-like files with column headers.

One other thing that is essential for me is that date support is
included.  Virtually every CSV file I work with has date data in it,
in a variety of formats, and I depend on csv2rec (via
dateutil.parser.parse which mpl ships) to be able to handle it w/o any
extra cognitive overhead, albeit at the expense of some performance
overhead, but my files aren't too big.  I'm not sure how numpy would
handle the date parsing aspect, but this came up in the date datatype
PEP discussion I think.  For me, having to manually specify a date
converter with the proper format string every time I load a CSV file
is probably not viable.

Another feature that is critical to me is to be able to get a
np.recarray back instead of a plain structured array.  I use these all day long,
and the convenience of r.date over r['date'] is too much for me to
give up.

Feel free to ignore these suggestions if they are too burdensome or
not appropriate for numpy -- I'm just letting you know some of the
things I need to see before I personally would stop using mlab.csv2rec
 and use numpy.loadtxt instead.

One last thing, I consider the masked array support in csv2rec
somewhat broken because when using a masked array you cannot get at
the data (eg datetime methods or string methods) directly using the
same interface that regular recarrays use.  Pierre, last I brought
this up you asked for some example code and indicated a willingness to
work on it but I fell behind and never posted it.  The code
illustrating the problem is below.  I'm really not sure what the right
solution is, but the current implementation -- sometimes returning a
plain-vanilla rec array, sometimes returning a masked record array --
with different interfaces is not good.

Perhaps the best solution is to force the user to ask for masked
support, and then always return a masked array whether any of the data
is masked or not.  csv2rec conditionally returns a masked array only
if some of the data are masked, which makes it difficult to use.

JDH

Here is the problem I referred to above -- in f1 none of the rows are
masked and so I can access the object attributes from the rows
directly.  In the 2nd example, row 3 has some missing data so I get an
mrecords recarray back, which does not allow me to directly access the
valid data methods.

from StringIO import StringIO
import matplotlib.mlab as mlab
f1 = StringIO("""\
date,name,age,weight
2008-10-12,'Bill',22,125.
2008-10-13,'Tom',23,135.
2008-10-14,'Sally',23,145.
""")

r1 = mlab.csv2rec(f1)
row0 = r1[0]
print row0.date.year, row0.name.upper()

f2 = StringIO("""\
date,name,age,weight
2008-10-12,'Bill',22,125.
2008-10-13,'Tom',23,135.
2008-10-14,'',,145.
""")

r2 = mlab.csv2rec(f2)
row0 = r2[0]
print row0.date.year, row0.name.upper()


Re: [Numpy-discussion] More loadtxt() changes

2008-11-25 Thread Ryan May
Pierre GM wrote:
 On Nov 25, 2008, at 2:06 PM, Ryan May wrote:
 1) It looks like the function returns a structured array rather than a
 rec array, so that fields are obtained by doing a dictionary access.
 Since it's a dictionary access, is there any reason that the header
 needs to be munged to replace characters and reserved names?  IIUC,
 csv2rec changes names b/c it returns a rec array, which uses attribute
 lookup and hence all names need to be valid python identifiers.   
 This is
 not the case for a structured array.
 
 Personally, I prefer flexible ndarrays to recarrays, hence the output.  
 However, I still think that names should be as clean as possible to  
 avoid bad surprises down the road.

Ok, I'm not really partial to this; I just thought it would simplify things. 
Your point is valid.

 2) Can we avoid the use of seek() in here?  I just posted a patch to
 change the check to readline, which was the only file function used
 previously.  This allowed the direct use of a file-like object  
 returned
 by urllib2.urlopen().
 
 I coded that a couple of weeks ago, before you posted your patch and I  
 didn't have time to check it. Yes, we could try getting rid of seek.  
 However, we need to find a way to rewind to the beginning of the file  
 if the dtypes are not given in input (as we parsed the whole file to  
 find the best converter in that case).

What about doing the parsing and type inference in a loop and holding 
onto the already split lines?  Then loop through the lines with the 
converters that were finally chosen?  In addition to making my use case 
work, this has the benefit of not doing the I/O twice.
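
As a sketch of the idea (it leans on the upgrade() method of your
StringConverter; the names are illustrative):

rows = [line.split(delimiter) for line in fh if line.strip()]
converters = [StringConverter() for _ in rows[0]]
# First pass: upgrade each column's converter in memory, no re-reading.
for row in rows:
    for conv, item in zip(converters, row):
        conv.upgrade(item)
# Second pass: reuse the already split lines with the final converters.
data = [[conv(item) for conv, item in zip(converters, row)] for row in rows]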

 3) In order to avoid breaking backwards compatibility, can we change  
 to
 default for dtype to be float32, and instead use some kind of special
 value ('auto' ?) to use the automatic dtype determination?
 
 I'm not especially concerned w/ backwards compatibility, because we're  
 supporting masked values (something that np.loadtxt shouldn't have to  
 worry about). Initially, I needed a replacement to the fromfile  
 function in the scikits.timeseries.trecords package. I figured it'd be  
 easier and more portable to get a function for generic masked arrays,  
 that could be adapted afterwards to timeseries. In any case, I was  
 considering the functions I sent you to be part of some  
 numpy.ma.io module rather than a replacement for np.loadtxt. I tried to get  
 the syntax as close as possible to np.loadtxt and mlab.csv2rec, but  
 there'll always be some differences.
 
 So, yes, we could try to use a default dtype=float and yes, we could  
 have an extra parameter 'auto'. But is it really that useful ? I'm not  
 sure (well, no, I'm sure it's not...)

I understand you're not concerned with backwards compatibility, but with 
the exception of missing handling, which is probably specific to masked 
arrays, I was hoping to just add functionality to loadtxt().  Numpy 
doesn't need a separate text reader for most of this and breaking API 
for any of this is likely a non-starter.  So while, yes, having float be 
the default dtype is probably not the most useful, leaving it also 
doesn't break existing code.

-- 
Ryan May
Graduate Research Assistant
School of Meteorology
University of Oklahoma


Re: [Numpy-discussion] More loadtxt() changes

2008-11-25 Thread Pierre GM

On Nov 25, 2008, at 2:26 PM, John Hunter wrote:

 Yes, I've said on a number of occasions I'd like to see these
 functions in numpy, since a number of them make more sense as numpy
 methods than as stand alone functions.

Great. Could we think about getting that in for 1.3.x, would you have  
time ? Or should we wait till early jan. ?

 One other thing that is essential for me is that date support is
 included.

As I mentioned in an earlier post, I needed to get a replacement for a  
function in scikits.timeseries, where we do need dates, but I also  
needed something not too specific for numpy.ma. So I thought about  
extracting the conversion methods from the bulk of the function and  
creating this new object, StringConverter, that takes care of the  
conversion. If you need to add date support, the simplest thing is to extend  
your StringConverter to take the date/datetime functions just after  
you import _preview (or numpy.ma.io if we go that path):

  dateparser = dateutil.parser.parse
  # Update the StringConverter mapper, so that date-like columns are
  # automatically converted
  _preview.StringConverter.mapper.insert(-1, (dateparser,
                                              datetime.date(2000, 1, 1)))
That way, if a date is found in one of the columns, it'll be converted  
appropriately. Seems to work pretty well for scikits.timeseries, I'll  
try to post that in the next couple of weeks (once I've ironed out some  
of the numpy.ma bugs...)

 Another feature that is critical to me is to be able to get a
 np.recarray back instead of a plain structured array.  I use these all day long,
 and the convenience of r.date over r['date'] is too much for me to
 give up.

No problem: just take a view once you've got your output. I thought about  
adding yet another parameter that'd take care of that directly, but  
then we end up with far too many keywords...

 One last thing, I consider the masked array support in csv2rec
 somewhat broken because when using a masked array you cannot get at
 the data (eg datetime methods or string methods) directly using the
 same interface that regular recarrays use.

Well, it's more mrecords which is broken. I committed some fix a  
little while back, but it might not be very robust. I need to check  
that w/ your example.

 Perhaps the best solution is to force the user to ask for masked
 support, and then always return a masked array whether any of the data
 is masked or not.  csv2rec conditionally returns a masked array only
 if some of the data are masked, which makes it difficult to use.


Forcing to a flexible masked array would make good sense if we pushed  
that function into numpy.ma.io. I don't think we should overload  
np.loadtxt too much anyway...


On Nov 25, 2008, at 2:37 PM, Ryan May wrote:

 What about doing the parsing and type inference in a loop and holding
 onto the already split lines?  Then loop through the lines with the
 converters that were finally chosen?  In addition to making my use case
 work, this has the benefit of not doing the I/O twice.

You mean, filling a list and relooping on it if we need to ? Sounds  
like a plan, but doesn't it create some extra temporaries we may not  
want ?

 I understand you're not concerned with backwards compatibility, but  
 with
 the exception of missing handling, which is probably specific to  
 masked
 arrays, I was hoping to just add functionality to loadtxt().  Numpy
 doesn't need a separate text reader for most of this and breaking API
 for any of this is likely a non-starter.  So while, yes, having  
 float be
 the default dtype is probably not the most useful, leaving it also
 doesn't break existing code.

Depends on how we do it. We could have a modified np.loadtxt that  
takes some of the ideas of the file I sent you (the StringConverter,  
for example), then I could have a numpy.ma.io that would take care of  
the missing data. And something in scikits.timeseries for the dates...

The new np.loadtxt could use the default of the initial one, or we  
could create yet another function (np.loadfromtxt) that would match  
what I was suggesting, and np.loadtxt would be a special stripped-down  
case with dtype=float by default.

thoughts?







Re: [Numpy-discussion] More loadtxt() changes

2008-11-25 Thread Ryan May
 On Nov 25, 2008, at 2:37 PM, Ryan May wrote:
 What about doing the parsing and type inference in a loop and holding
 onto the already split lines?  Then loop through the lines with the
 converters that were finally chosen?  In addition to making my usecase
 work, this has the benefit of not doing the I/O twice.
 
 You mean, filling a list and relooping on it if we need to ? Sounds  
 like a plan, but doesn't it create some extra temporaries we may not  
 want ?

It shouldn't create any *extra* temporaries since we already make a list 
of lists before creating the final array.  It just introduces an extra 
looping step. (I'd reuse the existing list of lists).

 Depends on how we do it. We could have a modified np.loadtxt that  
 takes some of the ideas of the file I sent you (the StringConverter,  
 for example), then I could have a numpy.ma.io that would take care of  
 the missing data. And something in scikits.timeseries for the dates...
 
 The new np.loadtxt could use the default of the initial one, or we  
 could create yet another function (np.loadfromtxt) that would match  
 what I was suggesting, and np.loadtxt would be a special stripped-down  
 case with dtype=float by default.
 
 thoughts?

My personal opinion is that, if it doesn't make loadtxt too unwieldy, we 
should just add a few of the options to loadtxt() itself.  I'm working on 
tweaking loadtxt() to add the auto dtype and the names, relying heavily 
on your StringConverter class (nice code btw.).  If my understanding of 
StringConverter is correct, tweaking the new loadtxt for ma or 
timeseries would only require passing in modified versions of 
StringConverter.

I'll post that when I'm done and we can see if it looks like too much 
functionality stapled together or not.

Ryan

-- 
Ryan May
Graduate Research Assistant
School of Meteorology
University of Oklahoma


Re: [Numpy-discussion] More loadtxt() changes

2008-11-25 Thread Pierre GM

 It shouldn't create any *extra* temporaries since we already make a  
 list
 of lists before creating the final array.  It just introduces an extra
 looping step. (I'd reuse the existing list of lists).

Cool then, go for it.

  If my understanding of
 StringConverter is correct, tweaking the new loadtxt for ma or
 timeseries would only require passing in modified versions of
 StringConverter.

Nope, we still need to double check whether there's any missing data  
in any field of the line we process, independently of the conversion.  
So there must be some extra loop involved, and I'd need a special  
function in numpy.ma to take care of that. So our options are
* create a new function in numpy.ma and leave np.loadtxt like that
* write a new np.loadtxt incorporating most of the ideas of the code I  
sent, but I'd still need to adapt it to support masked values.


 I'll post that when I'm done and we can see if it looks like too much
 functionality stapled together or not.

Sounds like a plan. Wouldn't mind getting more feedback from fellow  
users before we get too deep, however...


Re: [Numpy-discussion] More loadtxt() changes

2008-11-25 Thread Ryan May
Pierre GM wrote:
 Nope, we still need to double check whether there's any missing data  
 in any field of the line we process, independently of the conversion.  
 So there must be some extra loop involved, and I'd need a special  
 function in numpy.ma to take care of that. So our options are
 * create a new function in numpy.ma and leave np.loadtxt like that
 * write a new np.loadtxt incorporating most of the ideas of the code I  
 sent, but I'd still need to adapt it to support masked values.

You couldn't run this loop on the array returned by np.loadtxt() (by 
masking on the appropriate fill value)?
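
Something like this, with -1 as the agreed-on fill value:

import numpy as np
import numpy.ma as ma
data = ma.masked_values(np.loadtxt('data.txt'), -1)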

 I'll post that when I'm done and we can see if it looks like too much
 functionality stapled together or not.
 
 Sounds like a plan. Wouldn't mind getting more feedback from fellow  
 users before we get too deep, however...

Agreed.  Anyone?

-- 
Ryan May
Graduate Research Assistant
School of Meteorology
University of Oklahoma


Re: [Numpy-discussion] More loadtxt() changes

2008-11-25 Thread Pierre GM

On Nov 25, 2008, at 3:33 PM, Ryan May wrote:

 You couldn't run this loop on the array returned by np.loadtxt() (by
 masking on the appropriate fill value)?

Yet an extra loop... Doable, yes... But meh.



Re: [Numpy-discussion] More loadtxt() changes

2008-11-25 Thread John Hunter
On Tue, Nov 25, 2008 at 2:01 PM, Pierre GM [EMAIL PROTECTED] wrote:

 On Nov 25, 2008, at 2:26 PM, John Hunter wrote:

 Yes, I've said on a number of occasions I'd like to see these
 functions in numpy, since a number of them make more sense as numpy
 methods than as stand alone functions.

 Great. Could we think about getting that on for 1.3x, would you have
 time ? Or should we wait till early jan. ?

I wasn't volunteering to do it, just that I support the migration if
someone else wants to do it.

I'm fully committed with mpl already...

JDH


Re: [Numpy-discussion] More loadtxt() changes

2008-11-25 Thread Pierre GM
OK then, I'll take care of that over the next few weeks...


On Nov 25, 2008, at 4:56 PM, John Hunter wrote:

 On Tue, Nov 25, 2008 at 2:01 PM, Pierre GM [EMAIL PROTECTED]  
 wrote:

 On Nov 25, 2008, at 2:26 PM, John Hunter wrote:

 Yes, I've said on a number of occasions I'd like to see these
 functions in numpy, since a number of them make more sense as numpy
 methods than as stand alone functions.

 Great. Could we think about getting that on for 1.3x, would you have
 time ? Or should we wait till early jan. ?

 I wasn't volunteering to do it, just that I support the migration if
 someone else wants to do it.

 I'm fully committed with mpl already...

 JDH


Re: [Numpy-discussion] More loadtxt() changes

2008-11-25 Thread Travis E. Oliphant
John Hunter wrote:
 On Tue, Nov 25, 2008 at 12:16 PM, Pierre GM [EMAIL PROTECTED] wrote:

   
 A la mlab.csv2rec ? It could work with a bit more tweaking, basically
 following John Hunter's et al. path. What happens when the column names are
 unknown (read from the header) or wrong ?

 Actually, I'd like John to comment on that, hence the CC. More generally,
 wouldn't be useful to push the recarray manipulating functions from
 matplotlib.mlab to numpy ?
 

 Yes, I've said on a number of occasions I'd like to see these
 functions in numpy, since a number of them make more sense as numpy
 methods than as stand alone functions.

   
John and I are in agreement here.   The issue has remained somebody 
stepping up and doing the conversions (and fielding the questions and 
the resulting discussion) for the various routines that probably ought 
to go into NumPy.

This would be a great place to get involved if there is a lurker looking 
for a project.

-Travis



Re: [Numpy-discussion] More loadtxt() changes

2008-11-25 Thread Travis E. Oliphant
Pierre GM wrote:
 OK then, I'll take care of that over the next few weeks...

   
Thanks, Pierre.

-Travis




Re: [Numpy-discussion] More loadtxt() changes

2008-11-25 Thread Pierre GM
Oh don't mention...
However, I'd be quite grateful if you could take a look at the problem of  
mixing np.scalars and 0d subclasses of ndarray: looks like it's a C  
problem, quite out of my league...

http://scipy.org/scipy/numpy/ticket/826
http://article.gmane.org/gmane.comp.python.numeric.general/26354/match=priority+rules
http://article.gmane.org/gmane.comp.python.numeric.general/25670/match=priority+rules



On Nov 25, 2008, at 5:24 PM, Travis E. Oliphant wrote:

 Pierre GM wrote:
 OK then, I'll take care of that over the next few weeks...


 Thanks  Pierre.

 -Travis




Re: [Numpy-discussion] More loadtxt() changes

2008-11-25 Thread Ryan May

Pierre GM wrote:
Sounds like a plan. Wouldn't mind getting more feedback from fellow  
users before we get too deep, however...


Ok, I've attached, as a first cut, a diff against SVN HEAD that does (I 
think) what I'm looking for.  It passes all of the old tests and passes 
my own quick test.  A more rigorous test suite will follow, but I want 
this out the door before I need to leave for the day.


What this changeset essentially does is just add support for automatic 
dtypes along with supplying/reading names for flexible dtypes.  It 
leverages StringConverter heavily, using a few tweaks so that old 
behavior is kept.  This is by no means a final version.


Probably the biggest change from what I mentioned earlier is that 
instead of dtype='auto', I've used dtype=None to signal the detection 
code, since dtype=='auto' causes problems.


I welcome any and all suggestions here, both on the code and on the 
original idea of adding these capabilities to loadtxt().


Ryan

--
Ryan May
Graduate Research Assistant
School of Meteorology
University of Oklahoma
Index: lib/io.py
===================================================================
--- lib/io.py   (revision 6099)
+++ lib/io.py   (working copy)
@@ -233,29 +233,138 @@
         for name in todel:
             os.remove(name)
 
-# Adapted from matplotlib
+def _string_like(obj):
+    try: obj + ''
+    except (TypeError, ValueError): return False
+    return True
 
-def _getconv(dtype):
-    typ = dtype.type
-    if issubclass(typ, np.bool_):
-        return lambda x: bool(int(x))
-    if issubclass(typ, np.integer):
-        return lambda x: int(float(x))
-    elif issubclass(typ, np.floating):
-        return float
-    elif issubclass(typ, np.complex):
-        return complex
+def str2bool(value):
+    """
+    Tries to transform a string supposed to represent a boolean to a boolean.
+
+    Raises
+    ------
+    ValueError
+        If the string is not 'True' or 'False' (case independent)
+    """
+    value = value.upper()
+    if value == 'TRUE':
+        return True
+    elif value == 'FALSE':
+        return False
     else:
-        return str
+        return int(bool(value))
 
+class StringConverter(object):
+    """
+    Factory class for function transforming a string into another object (int,
+    float).
 
-def _string_like(obj):
-    try: obj + ''
-    except (TypeError, ValueError): return 0
-    return 1
+    After initialization, an instance can be called to transform a string
+    into another object. If the string is recognized as representing a missing
+    value, a default value is returned.
 
+    Parameters
+    ----------
+    dtype : dtype, optional
+        Input data type, used to define a basic function and a default value
+        for missing data. For example, when `dtype` is float, the :attr:`func`
+        attribute is set to ``float`` and the default value to `np.nan`.
+    missing_values : sequence, optional
+        Sequence of strings indicating a missing value.
+
+    Attributes
+    ----------
+    func : function
+        Function used for the conversion
+    default : var
+        Default value to return when the input corresponds to a missing value.
+    mapper : sequence of tuples
+        Sequence of tuples (function, default value) to evaluate in order.
+    """
+
+    from numpy.core import nan # To avoid circular import
+    mapper = [(str2bool, None),
+              (lambda x: int(float(x)), -1),
+              (float, nan),
+              (complex, nan+0j),
+              (str, '???')]
+
+    def __init__(self, dtype=None, missing_values=None):
+        if dtype is None:
+            self.func = str2bool
+            self.default = None
+            self._status = 0
+        else:
+            dtype = np.dtype(dtype).type
+            self.func, self.default, self._status = self._get_from_dtype(dtype)
+
+        # Store the list of strings corresponding to missing values.
+        if missing_values is None:
+            self.missing_values = []
+        else:
+            self.missing_values = set(list(missing_values) + [''])
+
+    def __call__(self, value):
+        if value in self.missing_values:
+            return self.default
+        return self.func(value)
+
+    def upgrade(self, value):
+        """
+        Tries to find the best converter for `value`, by testing different
+        converters in order.
+        The order in which the converters are tested is read from the
+        :attr:`_status` attribute of the instance.
+        """
+        try:
+            self.__call__(value)
+        except ValueError:
+            _statusmax = len(self.mapper)
+            if self._status == _statusmax:
+                raise ValueError("Could not find a valid conversion function")
+            elif self._status < _statusmax - 1:
+                self._status += 1
+            (self.func, self.default) = self.mapper[self._status]
+            self.upgrade(value)
+
+    def _get_from_dtype(self, dtype):
+        """
+        Sets the :attr:`func` 

Re: [Numpy-discussion] More loadtxt() changes

2008-11-25 Thread Pierre GM

Ryan,
Quick comments:

* I already have some unittests for StringConverter; check the  
attached file.


* Your str2bool will probably mess things up in upgrade compared to  
the one JDH had written (the one I sent you): you don't wanna use  
int(bool(value)), as it'll always give you 0 or 1 when you might need  
a ValueError.
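
The failure mode is easy to see in isolation: bool() is True for any  
non-empty string, so this version can never raise, and therefore never  
triggers an upgrade:

print int(bool('2.5'))   # 1 -- silently "succeeds" on a float string
print int(bool('abc'))   # 1 -- and on arbitrary text
print int(bool(''))      # 0 -- the empty string is the only falsy input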


* Your locked version of update probably won't work either, as you  
force the converter to output a string (you set the status to the largest  
possible, which is the one that outputs strings). Why don't you set the  
status to the current one (make a tmp one if needed)?


* I'd probably get rid of StringConverter._get_from_dtype, as it is  
not needed outside the __init__. You may wanna stick to the original  
__init__.



# pylint: disable-msg=E1101, W0212, W0621

import os
import tempfile
import numpy as np
import numpy.ma as ma

from numpy.ma.testutils import *

from StringIO import StringIO

from _preview import *


class TestStringConverter(TestCase):
    "Test StringConverter"
    #
    def test_upgrade(self):
        "Tests the upgrade method."
        converter = StringConverter()
        assert_equal(converter._status, 0)
        converter.upgrade('0')
        assert_equal(converter._status, 1)
        converter.upgrade('0.')
        assert_equal(converter._status, 2)
        converter.upgrade('0j')
        assert_equal(converter._status, 3)
        converter.upgrade('a')
        assert_equal(converter._status, len(converter.mapper) - 1)
    #
    def test_missing(self):
        "Tests the use of missing values."
        converter = StringConverter(missing_values=('missing', 'missed'))
        converter.upgrade('0')
        assert_equal(converter('0'), 0)
        assert_equal(converter(''), converter.default)
        assert_equal(converter('missing'), converter.default)
        assert_equal(converter('missed'), converter.default)
        try:
            converter('miss')
        except ValueError:
            pass


class TestLineReader(TestCase):
    "Tests the LineReader class"
    #
    def test_spacedelimiter(self):
        "Tests the use of space as delimiter."
        data = StringIO("0 1\n2   3\n4 5   6")
        reader = LineReader(data)
        nbfields = [len(line) for line in reader]
        assert_equal(nbfields, [2, 2, 3])
    #
    def test_get_first_row(self):
        "Tests the access of the first row."
        data = StringIO("0 1\n2   3\n4 5   6")
        reader = LineReader(data)
        assert_equal(reader.get_first_valid_row(), ['0', '1'])


class TestLoadTxt(TestCase):
    "Test the `loadtxt` function."
    #
    def setUp(self):
        "Pre-processing and initialization."
        data = "0 1\n2 3"
        (self.fhdw, self.fhnw) = tempfile.mkstemp()
        (self.fhdwo, self.fhnwo) = tempfile.mkstemp()
        os.write(self.fhdw, "A B\n")
        os.write(self.fhdwo, data)
        os.write(self.fhdw, data)
        os.close(self.fhdw)
        os.close(self.fhdwo)
    #
    def tearDown(self):
        "Post-processing."
        os.remove(self.fhnw)
        os.remove(self.fhnwo)
    #
    def test_noheader(self):
        "Tests loadtxt in the absence of a header."
        data = self.fhnwo
        # No dtype
        test = loadtxt(data)
        assert_equal(test.shape, (1,))
        assert_equal(test.item(), (2, 3))
        assert_equal(test.dtype.names, ['0', '1'])
        # w/ basic dtype
        test = loadtxt(data, dtype=np.float)
        control = ma.array([[0, 1], [2, 3]], mask=False)
        assert_equal(test, control)
        # w/ flexible dtype
        dtype = [('A', np.int), ('B', np.float)]
        test = loadtxt(data, dtype=dtype)
        control = ma.array([(0, 1), (2, 3)], mask=(False, False), dtype=dtype)
        assert_equal(test, control)
        # w/ descriptor
        descriptor = {'names': ('A', 'B'), 'formats': (np.int, np.float)}
        test = loadtxt(data, dtype=descriptor)
        control = ma.array([(0, 1), (2, 3)], mask=(False, False), dtype=dtype)
        assert_equal(test, control)
        # w/ names
        test = loadtxt(data, names="a,b")
        dtype = [('a', np.int), ('b', np.int)]
        assert_equal(test, np.array([(0, 1), (2, 3)], dtype=dtype))
        assert_equal(test['a'].dtype, np.dtype(np.int))
    #
    def test_with_noheader_with_missing(self):
        "Tests `loadtxt` on a file w/o header, but w/ missing values."
        data = StringIO("0 1\n2  ")
        test = loadtxt(data, dtype=float)
        assert_equal(test, [[0, 1], [2, 3]])
        assert_equal(test.mask, [[0, 0], [0, 1]])
    #
    def test_with_header(self):
        "Tests `loadtxt` on a file w/ header."
        data = self.fhnw
        control = ma.array([(0, 1), (2, 3)],
                           dtype=[('a', np.int), ('b', np.int)])
        # No dtype
        test = loadtxt(data)
        assert_equal(test.dtype.names, ['a', 'b'])
        assert_equal(test, control)
        # w/ dtype: should fail, as there's already a header
        dtype = [('A', np.float), ('B', np.int)]
        try:

Re: [Numpy-discussion] More loadtxt() changes

2008-11-25 Thread Charles R Harris
On Tue, Nov 25, 2008 at 5:00 PM, Pierre GM [EMAIL PROTECTED] wrote:
snip

 All, another question:
 What's the best way to have some kind of sandbox for code like the one Ryan
 is writing? So that we can try it and modify it without committing anything
 to SVN yet?


Probably make a branch and do commits there. If you don't want to hassle
with a merge, just copy the file over to the trunk when you are done and
commit it from there, then remove the branch. Instructions on making
branches are at http://projects.scipy.org/scipy/numpy/wiki/MakingBranches .

snip

Chuck


Re: [Numpy-discussion] More loadtxt() changes

2008-11-25 Thread Ryan May
Pierre GM wrote:
 Ryan,
 Quick comments:
 
 * I already have some unittests for StringConverter, check the file I 
 attach.

Ok, great.

 * Your str2bool will probably mess things up in upgrade compared to the 
 one JDH had written (the one I sent you): you don't wanna use 
 int(bool(value)), as it'll always give you 0 or 1 when you might need a 
 ValueError.

Ok, I wasn't sure.  I was trying to merge what the old code used with 
the new str2bool you supplied.  That's probably not all that necessary.

 * Your locked version of update probably won't work either, as you force 
 the converter to output a string (you set the status to the largest 
 possible, which is the one that outputs strings). Why don't you set the 
 status to the current one (make a tmp one if needed)?

Looking at the code, it looks like mapper is only used in the upgrade() 
method. My goal in setting status to the largest possible is to lock the 
converter to the supplied function.  That way, for user-supplied 
converters, the StringConverter doesn't try to upgrade away from it.  My 
thinking was that if the user-supplied converter function fails, the 
user should know. (Though I got this wrong the first time.)

 * I'd probably get rid of StringConverter._get_from_dtype, as it is not 
 needed outside the __init__. You may wanna stick to the original __init__.

Done.

Ryan

-- 
Ryan May
Graduate Research Assistant
School of Meteorology
University of Oklahoma


Re: [Numpy-discussion] More loadtxt() changes

2008-11-25 Thread Pierre GM

On Nov 25, 2008, at 10:02 PM, Ryan May wrote:
 Pierre GM wrote:

 * Your locked version of update probably won't work either, as you force
 the converter to output a string (you set the status to the largest
 possible, which is the one that outputs strings). Why don't you set the
 status to the current one (make a tmp one if needed)?

 Looking at the code, it looks like mapper is only used in the upgrade()
 method. My goal in setting status to the largest possible is to lock the
 converter to the supplied function.  That way, for user-supplied
 converters, the StringConverter doesn't try to upgrade away from it.  My
 thinking was that if the user-supplied converter function fails, the
 user should know. (Though I got this wrong the first time.)


Then, define a _locked attribute in StringConverter, and prevent  
upgrade from running if self._locked is True.
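
A minimal sketch of that suggestion (standalone, not the patch; names  
chosen to mirror it):

class LockedConverter(object):
    "Illustration of the _locked guard (hypothetical class)."
    def __init__(self, func, default, locked=False):
        self.func = func
        self.default = default
        self._locked = locked   # True when the user supplied func explicitly

    def upgrade(self, value):
        if self._locked:
            # Never silently replace a user-supplied converter.
            raise ValueError("Converter is locked and cannot be upgraded")
        # ...the normal mapper-based promotion would run here...

converter = LockedConverter(float, None, locked=True)
try:
    converter.upgrade('abc')
except ValueError, e:
    print e   # Converter is locked and cannot be upgraded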




Re: [Numpy-discussion] More loadtxt() changes

2008-11-25 Thread Ryan May
Pierre GM wrote:
 On Nov 25, 2008, at 10:02 PM, Ryan May wrote:
 Pierre GM wrote:
 * Your locked version of update probably won't work either, as you force
 the converter to output a string (you set the status to the largest
 possible, which is the one that outputs strings). Why don't you set the
 status to the current one (make a tmp one if needed)?
 Looking at the code, it looks like mapper is only used in the upgrade()
 method. My goal in setting status to the largest possible is to lock the
 converter to the supplied function.  That way, for user-supplied
 converters, the StringConverter doesn't try to upgrade away from it.  My
 thinking was that if the user-supplied converter function fails, the
 user should know. (Though I got this wrong the first time.)

 
 Then, define a _locked attribute in StringConverter, and prevent
 upgrade from running if self._locked is True.

Sure, if you're into logic and sound design.  I was going more for 
hackish and obtuse.

(No seriously, I don't know why I didn't think of that.)

Ryan

-- 
Ryan May
Graduate Research Assistant
School of Meteorology
University of Oklahoma


Re: [Numpy-discussion] More loadtxt() changes

2008-11-25 Thread Ryan May

Pierre GM wrote:

On Nov 25, 2008, at 10:02 PM, Ryan May wrote:

Pierre GM wrote:
* Your locked version of update probably won't work either, as you force
the converter to output a string (you set the status to the largest
possible, which is the one that outputs strings). Why don't you set the
status to the current one (make a tmp one if needed)?

Looking at the code, it looks like mapper is only used in the upgrade()
method. My goal in setting status to the largest possible is to lock the
converter to the supplied function.  That way, for user-supplied
converters, the StringConverter doesn't try to upgrade away from it.  My
thinking was that if the user-supplied converter function fails, the
user should know. (Though I got this wrong the first time.)


Updated patch attached.  This includes:
 * Updated docstring
 * New tests
 * Fixes for previous issues
 * Fixes to make new tests actually work

I appreciate any and all feedback.

Ryan

--
Ryan May
Graduate Research Assistant
School of Meteorology
University of Oklahoma
Index: numpy/lib/io.py
===================================================================
--- numpy/lib/io.py (revision 6107)
+++ numpy/lib/io.py (working copy)
@@ -233,29 +233,136 @@
     for name in todel:
         os.remove(name)
 
-# Adapted from matplotlib
+def _string_like(obj):
+    try: obj + ''
+    except (TypeError, ValueError): return False
+    return True
 
-def _getconv(dtype):
-    typ = dtype.type
-    if issubclass(typ, np.bool_):
-        return lambda x: bool(int(x))
-    if issubclass(typ, np.integer):
-        return lambda x: int(float(x))
-    elif issubclass(typ, np.floating):
-        return float
-    elif issubclass(typ, np.complex):
-        return complex
+def str2bool(value):
+    """
+    Tries to transform a string supposed to represent a boolean to a boolean.
+
+    Raises
+    ------
+    ValueError
+        If the string is not 'True' or 'False' (case independent)
+    """
+    value = value.upper()
+    if value == 'TRUE':
+        return True
+    elif value == 'FALSE':
+        return False
     else:
-        return str
+        raise ValueError("Invalid boolean")
 
+class StringConverter(object):
+    """
+    Factory class for functions transforming a string into another object
+    (int, float).
 
-def _string_like(obj):
-    try: obj + ''
-    except (TypeError, ValueError): return 0
-    return 1
+    After initialization, an instance can be called to transform a string
+    into another object. If the string is recognized as representing a missing
+    value, a default value is returned.
 
+    Parameters
+    ----------
+    dtype : dtype, optional
+        Input data type, used to define a basic function and a default value
+        for missing data. For example, when `dtype` is float, the :attr:`func`
+        attribute is set to ``float`` and the default value to `np.nan`.
+    missing_values : sequence, optional
+        Sequence of strings indicating a missing value.
+
+    Attributes
+    ----------
+    func : function
+        Function used for the conversion.
+    default : var
+        Default value to return when the input corresponds to a missing value.
+    mapper : sequence of tuples
+        Sequence of tuples (function, default value) to evaluate in order.
+    """
+
+    from numpy.core import nan # To avoid circular import
+    mapper = [(str2bool, None),
+              (int, -1),       # Needs to be int so that it can fail and
+                               # promote to float
+              (float, nan),
+              (complex, nan+0j),
+              (str, '???')]
+
+    def __init__(self, dtype=None, missing_values=None):
+        self._locked = False
+        if dtype is None:
+            self.func = str2bool
+            self.default = None
+            self._status = 0
+        else:
+            dtype = np.dtype(dtype).type
+            if issubclass(dtype, np.bool_):
+                (self.func, self.default, self._status) = (str2bool, 0, 0)
+            elif issubclass(dtype, np.integer):
+                # Needs to be int(float(x)) so that floating point values will
+                # be coerced to int when specified by dtype
+                (self.func, self.default, self._status) = \
+                    (lambda x: int(float(x)), -1, 1)
+            elif issubclass(dtype, np.floating):
+                (self.func, self.default, self._status) = (float, np.nan, 2)
+            elif issubclass(dtype, np.complex):
+                (self.func, self.default, self._status) = \
+                    (complex, np.nan + 0j, 3)
+            else:
+                (self.func, self.default, self._status) = (str, '???', -1)
+
+        # Store the list of strings corresponding to missing values.
+        if missing_values is None:
+            self.missing_values = []
+        else:
+            self.missing_values = set(list(missing_values) + [''])
+
+    def __call__(self, value):
+        if value in self.missing_values:
+            return self.default
+        return self.func(value)