Re: [Numpy-discussion] genloadtxt : last call

2008-12-17 Thread Ryan May
Pierre GM wrote:
> Ryan,
> OK, I'll look into that. I won't have time to address it before next
> week, however. Option #2 looks like the best.

No hurry -- I just want to make sure I raise any issues I see while the
design is still up for change.

> In other news, I was considering renaming genloadtxt to genfromtxt,
> and using ndfromtxt, mafromtxt, recfromtxt, recfromcsv for the
> function names. That way, loadtxt is untouched.

+1
I know I've changed my tune here, but at this point it seems like there's
so much more functionality here that calling it loadtxt would be a
disservice to how much the new function can do (and how much work you've
done).

Ryan

-- 
Ryan May
Graduate Research Assistant
School of Meteorology
University of Oklahoma


Re: [Numpy-discussion] genloadtxt : last call

2008-12-16 Thread Ryan May
Pierre GM wrote:
> All,
> Here's the latest version of genloadtxt, with some recent corrections.
> With just a couple of tweaks, we end up with some decent speed: it's
> still slower than np.loadtxt, but only by about 15% according to the
> test at the end of the package.

I have one more usage issue that you may or may not want to fix. My
problem is that missing values are specified by their string
representation, so a value in the file that is numerically equal to the
specified missing value may still not compare equal as a string. For
instance, if you specify that -999.0 represents a missing value, but the
value written to the file is -999.00, the -999.00 data point won't end up
masked. I'm sure a test case will help here:

def test_withmissing_float(self):
    data = StringIO.StringIO('A,B\n0,1.5\n2,-999.00')
    test = mloadtxt(data, dtype=None, delimiter=',', missing='-999.0',
                    names=True)
    control = ma.array([(0, 1.5), (2, -1.)],
                       mask=[(False, False), (False, True)],
                       dtype=[('A', np.int), ('B', np.float)])
    print control
    print test
    assert_equal(test, control)
    assert_equal(test.mask, control.mask)

Right now this fails with the latest version of genloadtxt. I've worked
around it by specifying a whole bunch of string representations of the
values, but I wasn't sure if you knew of a better way this could be
handled within genloadtxt. I can only think of two ways, though I'm not
thrilled with either:

1) Call the converter on the string form of the missing value and compare
against the converted value from the file to determine if it's missing.
(Probably very slow.)

2) Add a list of objects (ints, floats, etc.) to compare against after
conversion to determine whether they're missing (a rough sketch of this
idea is included below). This might needlessly complicate the function,
which I know you've already taken pains to optimize.

If there's no good way to do it, I'm content to live with a workaround.
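
For concreteness, here's a minimal sketch of what option #2 might look
like at the conversion step; the names (missing_strings, missing_values,
convert_field) are purely illustrative and not part of genloadtxt:

  # Sketch of option #2: after converting a field, also compare the
  # converted value against a list of numeric "missing" values, in
  # addition to the current string-based check.
  missing_strings = ['-999.0']    # string representations, as today
  missing_values = [-999.0]       # numeric values to match post-conversion

  def convert_field(raw, converter=float):
      raw = raw.strip()
      if raw in missing_strings:   # current behaviour: string match
          return None              # None stands in for "masked" here
      value = converter(raw)
      if value in missing_values:  # proposed: match after conversion
          return None
      return value

  # '-999.00' is not in missing_strings, but converts to -999.0 and is
  # caught by the numeric check.
  assert convert_field('-999.00') is None
  assert convert_field('1.5') == 1.5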

Ryan

-- 
Ryan May
Graduate Research Assistant
School of Meteorology
University of Oklahoma


Re: [Numpy-discussion] genloadtxt : last call

2008-12-16 Thread Pierre GM
Ryan,
OK, I'll look into that. I won't have time to address it before next
week, however. Option #2 looks like the best.

In other news, I was considering renaming genloadtxt to genfromtxt, and
using ndfromtxt, mafromtxt, recfromtxt, recfromcsv for the function
names. That way, loadtxt is untouched.
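
For illustration, a rough sketch of how those wrappers could be layered
on top of the renamed function; the usemask keyword and the recarray view
are assumptions based on this discussion, not a settled API:

  # Sketch only: assumes the renamed genfromtxt takes a 'usemask' keyword
  # controlling masked-array output; none of this is a settled API.
  import numpy as np
  from numpy import genfromtxt   # i.e. the renamed genloadtxt, once merged

  def ndfromtxt(fname, **kwargs):
      """Always return a plain ndarray (no masking)."""
      kwargs['usemask'] = False
      return genfromtxt(fname, **kwargs)

  def mafromtxt(fname, **kwargs):
      """Always return a masked array."""
      kwargs['usemask'] = True
      return genfromtxt(fname, **kwargs)

  def recfromtxt(fname, **kwargs):
      """Return a recarray, with fields accessible as attributes."""
      kwargs.setdefault('dtype', None)
      return ndfromtxt(fname, **kwargs).view(np.recarray)

  def recfromcsv(fname, **kwargs):
      """recfromtxt preset for comma-separated files with a header line."""
      kwargs.setdefault('delimiter', ',')
      kwargs.setdefault('names', True)
      return recfromtxt(fname, **kwargs)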





Re: [Numpy-discussion] genloadtxt : last call

2008-12-10 Thread Christopher Barker
Pierre GM wrote:
>>> in the same place in NumPy; and all the SciPy IO code to be in the
>>> same place in SciPy.
>> +1
>
> So, no problem w/ importing numpy.ma and numpy.records in numpy.lib.io?

As long as numpy.ma and numpy.records are, and will remain, part of the 
standard numpy distribution, this is fine.

This is a key issue -- what is core numpy and what is not -- but I know
I'd like to see a lot of things built on both ma and records, so I think
they do belong in core.

-Chris

-- 
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/ORR             (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

[EMAIL PROTECTED]


Re: [Numpy-discussion] genloadtxt : last call

2008-12-09 Thread Jarrod Millman
On Fri, Dec 5, 2008 at 3:59 PM, Pierre GM <[EMAIL PROTECTED]> wrote:
> All,
> Here's the latest version of genloadtxt, with some recent corrections.
> With just a couple of tweaks, we end up with some decent speed: it's
> still slower than np.loadtxt, but only by about 15% according to the
> test at the end of the package.
>
> And so, now what? Should I put the module in numpy.lib.io? Elsewhere?

Thanks for working on this.  I think that having simple, easy-to-use,
flexible, and fast IO code is extremely important; so I really
appreciate this work.

I have a few general comments about the IO code and where I would like
to see it going:

Where should IO code go?
------------------------

From the user's perspective, I would like all the NumPy IO code to be
in the same place in NumPy; and all the SciPy IO code to be in the
same place in SciPy.  So, for instance, the user shouldn't get
`mloadtxt` from `numpy.ma.io`.  Another way of saying this is that in
IPython, I should be able to see all NumPy IO functions by
tab-completing once.

Slightly less important to me is that I would like to be able to do:
  from numpy import io as npio
  from scipy import io as spio

What is the difference between NumPy and SciPy IO?


It was decided last year that numpy io should provide simple, generic,
core io functionality, while scipy io would provide more domain- or
application-specific io code (e.g., Matlab IO, WAV IO, etc.). My vision
for scipy io, which I know isn't shared, is for it to be more or less
all-inclusive (e.g., all image, sound, and data formats). (That is a
different discussion; I just wanted to be clear about where I stand.)

For numpy io, that means it should include:
 - generic helper routines for data io (e.g., datasource)
 - a standard, supported binary format (i.e., npy/npz; see the short
   example below)
 - generic ascii file support (e.g., loadtxt)
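
As a reminder of what the npy/npz round trip looks like today (the file
names here are purely illustrative):

  # npy/npz round trip; the file names are purely illustrative.
  import numpy as np

  a = np.arange(10, dtype=float)
  np.save('a.npy', a)                        # one array -> .npy
  assert (np.load('a.npy') == a).all()

  np.savez('arrays.npz', a=a, twice=2 * a)   # several arrays -> .npz
  archive = np.load('arrays.npz')
  assert (archive['twice'] == 2 * a).all()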

What about AstroAsciiData?
--------------------------

I sent an email asking about AstroAsciiData last week. The only response
I got was from Manuel Metz, saying that he was switching to
AstroAsciiData since it did exactly what he needed. In my mind, I
would prefer that numpy io had the best ascii data handling.  So I
wonder if it would make sense to incorporate AstroAsciiData?

As far as I know, it is pure Python with a BSD license.  Maybe the
authors would be willing to help integrate the code and continue
maintaining it in numpy.  If others are supportive of this general
approach, I would be happy to approach them.  It is possible that we
won't want all their functionality, but it would be good to avoid
duplicating effort.

I realize that this may not be persuasive to everyone, but I really
feel that IO code is special and that it is an area where numpy/scipy
should devote some effort to consolidating the community on some
standard packages and approaches.

3. What about data source?

On a related note, I wanted to point out datasource. DataSource is a
file interface for handling local and remote data files:
http://projects.scipy.org/scipy/numpy/browser/trunk/numpy/lib/_datasource.py

It was originally developed by Jonathan Taylor and then modified by
Brian Hawthorne and Chris Burns. It is fairly well documented and
tested, so it would be easier to take a look at it than for me to
re-explain it here. The basic idea is to have a drop-in replacement for
file handling, which abstracts away whether the file is remote or local,
compressed or not, etc. The hope was that it would allow us to simplify
support for remote file access and compressed files by merely using a
datasource instead of a filename (a short usage sketch follows below):
  def loadtxt(fname, ...
vs.
  def loadtxt(datasource, ...
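
Here is a rough illustration of that drop-in usage with the existing
module; the file name and URL below are made up:

  # Rough illustration of the existing datasource module; the file name
  # and URL are made up.  DataSource.open() transparently handles local
  # files, remote URLs (downloaded under the destination path), and
  # gzip/bz2-compressed files.
  from numpy.lib._datasource import DataSource

  ds = DataSource()                    # downloads land in the current dir
  fh = ds.open('measurements.txt.gz')  # local file, decompressed on the fly
  # fh = ds.open('http://example.com/measurements.txt')  # or a remote file
  first_line = fh.readline()
  fh.close()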

I would appreciate hearing whether this seems doable or useful.
Should we remove datasource?  Start using it more?  Does it need to be
slightly or dramatically improved/overhauled?  Renamed `datafile` or
paired with a `datadestination`?  Support
versioning/checksumming/provenance tracking (a tad ambitious ;) )?  Is
anyone interested in picking up where we left off and improving it?

Thoughts? Suggestions?

Documentation
-------------

The main reason that I am so interested in the IO code is that it seems
like one of the first areas users will look at. ("I have heard about
this Python for scientific programming thing and wonder what all the
fuss is about. Let me try NumPy; this seems pretty good. Now let's see
how to load in some of my data...")

I just took a quick look through the documentation: I couldn't find
anything on IO in the User Guide, and this is the main IO page in the
reference manual:
  http://docs.scipy.org/doc/numpy/reference/routines.io.html

I would like to see a section on data IO in the user guide and have a
more prominent mention of IO code in the reference manual (i.e.,
http://docs.scipy.org/doc/numpy/reference/io.html ?).

Unfortunately, I don't have time to help out; but since it looks like

Re: [Numpy-discussion] genloadtxt : last call

2008-12-09 Thread Christopher Barker
Jarrod Millman wrote:
> From the user's perspective, I would like all the NumPy IO code to be
> in the same place in NumPy; and all the SciPy IO code to be in the
> same place in SciPy.

+1

> So I wonder if it would make sense to incorporate AstroAsciiData?

Doesn't it overlap a lot with genloadtxt? If so, that's a bit confusing 
to new users.

> What about data source?
>
> Should we remove datasource?  Start using it more?

Start using it more -- it sounds very handy.

> Does it need to be slightly or dramatically improved/overhauled?

No comment here -- I have no idea.

> Documentation
> -------------
> ("Let me try NumPy; this seems pretty good. Now let's see how to load
> in some of my data...")

Totally key -- I have a colleague who has used Matlab a fair bit in the
past and is starting a new project -- he asked me what to use. I, of
course, suggested python+numpy+scipy. His first question was: can I
load data in from Excel?


One more comment -- for fast reading of lots of ascii data, fromfile() 
needs some help -- I wish I had more time for it -- maybe some day.

-Chris



-- 
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/ORR             (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

[EMAIL PROTECTED]


Re: [Numpy-discussion] genloadtxt : last call

2008-12-09 Thread Pierre GM

On Dec 9, 2008, at 12:59 PM, Christopher Barker wrote:

> Jarrod Millman wrote:
>
>> From the user's perspective, I would like all the NumPy IO code to be
>> in the same place in NumPy; and all the SciPy IO code to be in the
>> same place in SciPy.
>
> +1

So, no problem w/ importing numpy.ma and numpy.records in numpy.lib.io?




>> So I wonder if it would make sense to incorporate AstroAsciiData?
>
> Doesn't it overlap a lot with genloadtxt? If so, that's a bit confusing
> to new users.

From what little I browsed, do we need it? We could get the same thing
with record arrays...


>> What about data source?
>>
>> Should we remove datasource? Start using it more?
>
> Start using it more -- it sounds very handy.

Didn't know it was around. I'll adapt genloadtxt to use it.

>> Documentation
>> -------------
>> ("Let me try NumPy; this seems pretty good. Now let's see how to load
>> in some of my data...")
>
> Totally key -- I have a colleague who has used Matlab a fair bit in the
> past and is starting a new project -- he asked me what to use. I, of
> course, suggested python+numpy+scipy. His first question was: can I
> load data in from Excel?

So that would go in scipy.io?



> One more comment -- for fast reading of lots of ascii data, fromfile()
> needs some help -- I wish I had more time for it -- maybe some day.

I'm afraid you'd have to count me out on this one: I don't speak C  
(yet), and don't foresee learning it soon enough to be of any help...


Re: [Numpy-discussion] genloadtxt : last call

2008-12-09 Thread Gael Varoquaux
On Tue, Dec 09, 2008 at 01:34:29AM -0800, Jarrod Millman wrote:
> It was decided last year that numpy io should provide simple, generic,
> core io functionality, while scipy io would provide more domain- or
> application-specific io code (e.g., Matlab IO, WAV IO, etc.). My vision
> for scipy io, which I know isn't shared, is for it to be more or less
> all-inclusive (e.g., all image, sound, and data formats). (That is a
> different discussion; I just wanted to be clear about where I stand.)

Can we get Matthew Brett's nifti reader in there? Please! Pretty please.
That way I can do neuroimaging without compiled code outside of a
standard scientific Python install.

Gaël


Re: [Numpy-discussion] genloadtxt : last call

2008-12-08 Thread Ryan May
Pierre GM wrote:
> All,
> Here's the latest version of genloadtxt, with some recent corrections.
> With just a couple of tweaks, we end up with some decent speed: it's
> still slower than np.loadtxt, but only by about 15% according to the
> test at the end of the package.
>
> And so, now what? Should I put the module in numpy.lib.io? Elsewhere?
>
> Thx for any comments and suggestions.

Current version works out of the box for me.

Thanks for running point on this.

Ryan

-- 
Ryan May
Graduate Research Assistant
School of Meteorology
University of Oklahoma


Re: [Numpy-discussion] genloadtxt : last call

2008-12-06 Thread Gael Varoquaux
On Fri, Dec 05, 2008 at 06:59:25PM -0500, Pierre GM wrote:
> Here's the latest version of genloadtxt, with some recent corrections.
> With just a couple of tweaks, we end up with some decent speed: it's
> still slower than np.loadtxt, but only by about 15% according to the
> test at the end of the package.

A 15% slow-down is acceptable, IMHO. There is fromfile for the fast and
well-understood use case.
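
(For reference, a minimal sketch of that simple use case -- the file
name is made up:)

  # Minimal sketch of the simple case np.fromfile covers; the file name
  # is made up.  With sep=' ', fromfile reads whitespace-separated ASCII
  # numbers into a flat array -- fast, but with none of genloadtxt's
  # handling of names, converters, or missing values.
  import numpy as np

  with open('plain_numbers.txt', 'w') as f:
      f.write('1.0 2.5 3.75 4.0\n')

  data = np.fromfile('plain_numbers.txt', sep=' ')
  assert data.shape == (4,)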

Thanks for doing all this work.

Gaël