Re: [Numpy-discussion] genloadtxt : last call
Pierre GM wrote:
> Ryan,
> OK, I'll look into that. I won't have time to address it before next week, however. Option #2 looks like the best.

No hurry, I just want to make sure I raise any issues I see while the design is still up for change.

> In other news, I was considering renaming genloadtxt to genfromtxt, and using ndfromtxt, mafromtxt, recfromtxt, recfromcsv for the function names. That way, loadtxt is untouched.

+1. I know I've changed my tune here, but at this point it seems like there's so much more functionality here that calling it loadtxt would be a disservice to how much the new function can do (and how much work you've done).

Ryan

--
Ryan May
Graduate Research Assistant
School of Meteorology
University of Oklahoma
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] genloadtxt : last call
Pierre GM wrote:
> All,
> Here's the latest version of genloadtxt, with some recent corrections. With just a couple of tweaks, we end up with some decent speed: it's still slower than np.loadtxt, but only 15% so according to the test at the end of the package.

I have one more usage issue that you may or may not want to fix. My problem is that missing values are specified by their string representation, so a value in the file can be numerically equal to the missing value and still fail to match it as a string. For instance, if you specify that -999.0 represents a missing value, but the value written to the file is -999.00, you won't end up masking the -999.00 data point. I'm sure a test case will help here:

    def test_withmissing_float(self):
        data = StringIO.StringIO('A,B\n0,1.5\n2,-999.00')
        test = mloadtxt(data, dtype=None, delimiter=',', missing='-999.0',
                        names=True)
        control = ma.array([(0, 1.5), (2, -1.)],
                           mask=[(False, False), (False, True)],
                           dtype=[('A', np.int), ('B', np.float)])
        print control
        print test
        assert_equal(test, control)
        assert_equal(test.mask, control.mask)

Right now this fails with the latest version of genloadtxt. I've worked around this by specifying a whole bunch of string representations of the values, but I wasn't sure if you knew of a better way that this could be handled within genloadtxt. I can only think of two ways, though I'm not thrilled with either:

1) Call the converter on the string form of the missing value and compare against the converted value from the file to determine if missing. (Probably very slow)

2) Add a list of objects (ints, floats, etc.) to compare against after conversion to determine if they're missing. This might needlessly complicate the function, which I know you've already taken pains to optimize.

If there's no good way to do it, I'm content to live with a workaround.

Ryan
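The mismatch described in this message, and the "compare after conversion" idea behind option 2, can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names (missing_by_string, missing_by_value, parse_rows) are hypothetical and are not part of genloadtxt or NumPy, and the toy parser handles only all-float columns.

```python
# Illustrative only: these helpers are hypothetical and are not part of
# genloadtxt or NumPy.

def missing_by_string(field, missing_strings):
    """Behaviour described in the message: compare the raw text."""
    return field.strip() in missing_strings


def missing_by_value(field, missing_values, converter=float):
    """Option 2 flavour: convert first, then compare against objects."""
    try:
        value = converter(field)
    except ValueError:
        # Unparseable fields (e.g. empty strings) are not flagged here.
        return False
    return value in missing_values


def parse_rows(text, delimiter=",", missing_values=(-999.0,)):
    """Parse float rows that follow a header line into (values, mask) pairs."""
    lines = text.strip().splitlines()
    names = lines[0].split(delimiter)
    rows = []
    for line in lines[1:]:
        fields = line.split(delimiter)
        values = [float(f) for f in fields]
        mask = [missing_by_value(f, missing_values) for f in fields]
        rows.append((values, mask))
    return names, rows


sample = "A,B\n0,1.5\n2,-999.00"
names, rows = parse_rows(sample)
print(missing_by_string("-999.00", {"-999.0"}))  # the two strings differ
print(missing_by_value("-999.00", {-999.0}))     # the two floats are equal
print(rows[-1])
```

The cost concern raised for option 1 shows up here as well: the value-based check runs the converter on every field, so a real implementation would presumably reuse the value it has already converted rather than converting twice.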
Re: [Numpy-discussion] genloadtxt : last call
Ryan,

OK, I'll look into that. I won't have time to address it before next week, however. Option #2 looks like the best.

In other news, I was considering renaming genloadtxt to genfromtxt, and using ndfromtxt, mafromtxt, recfromtxt, recfromcsv for the function names. That way, loadtxt is untouched.

On Dec 16, 2008, at 6:07 PM, Ryan May wrote:
> I can only think of two ways, though I'm not thrilled with either:
>
> 1) Call the converter on the string form of the missing value and compare against the converted value from the file to determine if missing. (Probably very slow)
>
> 2) Add a list of objects (ints, floats, etc.) to compare against after conversion to determine if they're missing. This might needlessly complicate the function, which I know you've already taken pains to optimize.
>
> If there's no good way to do it, I'm content to live with a workaround.
Re: [Numpy-discussion] genloadtxt : last call
Pierre GM wrote:
>>> in the same place in NumPy; and all the SciPy IO code to be in the same place in SciPy.
>> +1
> So, no problem w/ importing numpy.ma and numpy.records in numpy.lib.io ?

As long as numpy.ma and numpy.records are, and will remain, part of the standard numpy distribution, this is fine.

This is a key issue -- what is core numpy and what is not, but I know I'd like to see a lot of things built on ma and records, both, so I think they do belong in core.

-Chris

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/ORR            (206) 526-6959   voice
7600 Sand Point Way NE  (206) 526-6329   fax
Seattle, WA 98115       (206) 526-6317   main reception
Re: [Numpy-discussion] genloadtxt : last call
On Fri, Dec 5, 2008 at 3:59 PM, Pierre GM wrote:
> All,
> Here's the latest version of genloadtxt, with some recent corrections. With just a couple of tweaks, we end up with some decent speed: it's still slower than np.loadtxt, but only 15% so according to the test at the end of the package.
>
> And so, now what ? Should I put the module in numpy.lib.io ? Elsewhere ?

Thanks for working on this. I think that having simple, easy-to-use, flexible, and fast IO code is extremely important; so I really appreciate this work. I have a few general comments about the IO code and where I would like to see it going:

Where should IO code go?

From the user's perspective, I would like all the NumPy IO code to be in the same place in NumPy; and all the SciPy IO code to be in the same place in SciPy. So, for instance, the user shouldn't get `mloadtxt` from `numpy.ma.io`. Another way of saying this is that in IPython, I should be able to see all NumPy IO functions by tab-completing once.

Slightly less important to me is that I would like to be able to do:

    from numpy import io as npio
    from scipy import io as spio

What is the difference between NumPy and SciPy IO?

It was decided last year that numpy io should provide simple, generic, core io functionality, while scipy io would provide more domain- or application-specific io code (e.g., Matlab IO, WAV IO, etc.). My vision for scipy io, which I know isn't shared, is for it to be more or less all-inclusive (e.g., all image, sound, and data formats). (That is a different discussion; just wanted it to be clear where I stand.)

For numpy io, it should include:
- generic helper routines for data io (e.g., datasource, etc.)
- a standard, supported binary format (i.e., npy/npz)
- generic ascii file support (e.g., loadtxt, etc.)

What about AstroAsciiData?

I sent an email asking about AstroAsciiData last week. The only response I got was from Manuel Metz saying that he was switching to AstroAsciiData since it did exactly what he needed. In my mind, I would prefer that numpy io had the best ascii data handling. So I wonder if it would make sense to incorporate AstroAsciiData? As far as I know, it is pure Python with a BSD license. Maybe the authors would be willing to help integrate the code and continue maintaining it in numpy. If others are supportive of this general approach, I would be happy to approach them. It is possible that we won't want all their functionality, but it would be good to avoid duplicating effort.

I realize that this may not be persuasive to everyone, but I really feel that IO code is special and that it is an area where numpy/scipy should devote some effort at consolidating the community on some standard packages and approaches.

3. What about data source?

On a related note, I wanted to point out datasource. Data source is a file interface for handling local and remote data files:
http://projects.scipy.org/scipy/numpy/browser/trunk/numpy/lib/_datasource.py

It was originally developed by Jonathan Taylor and then modified by Brian Hawthorne and Chris Burns. It is fairly well-documented and tested, so it would be easier to take a look at it than for me to re-explain it here. The basic idea is to have a drop-in replacement for file handling, which would abstract away whether the file was remote or local, compressed or not, etc. The hope was that it would allow us to simplify support for remote file access and handling compressed files by merely using a datasource instead of a filename:

    def loadtxt(fname
vs.
    def loadtxt(datasource

I would appreciate hearing whether this seems doable or useful. Should we remove datasource? Start using it more? Does it need to be slightly or dramatically improved/overhauled? Renamed `datafile` or paired with a `datadestination`? Support versioning/checksumming/provenance tracking (a tad ambitious;))? Is anyone interested in picking up where we left off and improving it?

Thoughts? Suggestions?

Documentation

The main reason that I am so interested in the IO code is that it seems like it is one of the first areas that users will look. (I have heard about this Python for scientific programming thing and I wonder what all the fuss is about? Let me try NumPy; this seems pretty good. Now let's see how to load in some of my data.)

I just took a quick look through the documentation. I couldn't find any IO documentation in the User Guide, and this is the main IO page in the reference manual:
http://docs.scipy.org/doc/numpy/reference/routines.io.html

I would like to see a section on data IO in the user guide and have a more prominent mention of IO code in the reference manual (i.e., http://docs.scipy.org/doc/numpy/reference/io.html ?).

Unfortunately, I don't have time to help out; but since it looks like
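The "datasource instead of a filename" idea from the message above can be illustrated with a small stdlib-only sketch. It is an assumption-laden toy, not the API of numpy/lib/_datasource.py: open_source and load_rows are hypothetical names, remote content is assumed to be uncompressed UTF-8 text, and only the .gz and .bz2 extensions are recognised.

```python
import bz2
import gzip
import io
import os
import tempfile
import urllib.request

# Hypothetical helper for illustration; this is NOT the API of
# numpy/lib/_datasource.py.
_OPENERS = {".gz": gzip.open, ".bz2": bz2.open}


def open_source(source, mode="rt"):
    """Return a text file object for a path, URL, or open file-like object."""
    if hasattr(source, "read"):
        return source                       # already file-like: pass through
    if "://" in source:
        # Remote file: fetch it fully into memory and wrap it as text.
        with urllib.request.urlopen(source) as response:
            return io.StringIO(response.read().decode("utf-8"))
    extension = os.path.splitext(source)[1]
    opener = _OPENERS.get(extension, open)  # plain open() for anything else
    return opener(source, mode)


def load_rows(source, delimiter=","):
    """Toy loader that accepts anything open_source() understands."""
    with open_source(source) as handle:
        return [[float(x) for x in line.split(delimiter)]
                for line in handle if line.strip()]


# Usage: the same loader reads a compressed file and an in-memory buffer.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.csv.gz")
    with gzip.open(path, "wt") as f:
        f.write("1,2\n3,4\n")
    from_gzip = load_rows(path)

from_buffer = load_rows(io.StringIO("1,2\n3,4\n"))
print(from_gzip == from_buffer)
```

The point of the design is that load_rows never inspects what it was given; all decisions about location and compression live in one place. One caveat of this particular sketch is that load_rows also closes a file object the caller passed in, which a real interface would probably want to avoid.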
Re: [Numpy-discussion] genloadtxt : last call
Jarrod Millman wrote:
> From the user's perspective, I would like all the NumPy IO code to be in the same place in NumPy; and all the SciPy IO code to be in the same place in SciPy.

+1

> So I wonder if it would make sense to incorporate AstroAsciiData?

Doesn't it overlap a lot with genloadtxt? If so, that's a bit confusing to new users.

> 3. What about data source?
> Should we remove datasource? Start using it more?

start using it more -- it sounds very handy.

> Does it need to be slightly or dramatically improved/overhauled?

no comment here - I have no idea.

> Documentation
> Let me try NumPy; this seems pretty good. Now let's see how to load in some of my data)

totally key -- I have a colleague who has used Matlab a fair bit in the past and is starting a new project -- he asked me what to use. I, of course, suggested python+numpy+scipy. His first question was -- can I load data in from excel?

One more comment -- for fast reading of lots of ascii data, fromfile() needs some help -- I wish I had more time for it -- maybe some day.

-Chris
Re: [Numpy-discussion] genloadtxt : last call
On Dec 9, 2008, at 12:59 PM, Christopher Barker wrote:
> Jarrod Millman wrote:
>> From the user's perspective, I would like all the NumPy IO code to be in the same place in NumPy; and all the SciPy IO code to be in the same place in SciPy.
>
> +1

So, no problem w/ importing numpy.ma and numpy.records in numpy.lib.io ?

>> So I wonder if it would make sense to incorporate AstroAsciiData?
>
> Doesn't it overlap a lot with genloadtxt? If so, that's a bit confusing to new users.

For the little I browsed, do we need it ? We could get the same thing with record arrays...

>> 3. What about data source?
>> Should we remove datasource? Start using it more?
>
> start using it more -- it sounds very handy.

Didn't know it was around. I'll adapt genloadtxt to use it.

>> Documentation
>> Let me try NumPy; this seems pretty good. Now let's see how to load in some of my data)
>
> totally key -- I have a colleague who has used Matlab a fair bit in the past and is starting a new project -- he asked me what to use. I, of course, suggested python+numpy+scipy. His first question was -- can I load data in from excel?

So that would go in scipy.io ?

> One more comment -- for fast reading of lots of ascii data, fromfile() needs some help -- I wish I had more time for it -- maybe some day.

I'm afraid you'd have to count me out on this one: I don't speak C (yet), and don't foresee learning it soon enough to be of any help...
Re: [Numpy-discussion] genloadtxt : last call
On Tue, Dec 09, 2008 at 01:34:29AM -0800, Jarrod Millman wrote:
> It was decided last year that numpy io should provide simple, generic, core io functionality, while scipy io would provide more domain- or application-specific io code (e.g., Matlab IO, WAV IO, etc.). My vision for scipy io, which I know isn't shared, is for it to be more or less all-inclusive (e.g., all image, sound, and data formats). (That is a different discussion; just wanted it to be clear where I stand.)

Can we get Matthew Brett's nifti reader in there? Please! Pretty please. That way I can do neuroimaging without compiled code outside of a standard scientific Python install.

Gaël
Re: [Numpy-discussion] genloadtxt : last call
Pierre GM wrote:
> All,
> Here's the latest version of genloadtxt, with some recent corrections. With just a couple of tweaks, we end up with some decent speed: it's still slower than np.loadtxt, but only 15% so according to the test at the end of the package.
>
> And so, now what ? Should I put the module in numpy.lib.io ? Elsewhere ?
>
> Thx for any comments and suggestions.

Current version works out of the box for me. Thanks for running point on this.

Ryan
Re: [Numpy-discussion] genloadtxt : last call
On Fri, Dec 05, 2008 at 06:59:25PM -0500, Pierre GM wrote:
> Here's the latest version of genloadtxt, with some recent corrections. With just a couple of tweaks, we end up with some decent speed: it's still slower than np.loadtxt, but only 15% so according to the test at the end of the package.

A 15% slow-down is acceptable, IMHO. There is fromfile for the fast and well-understood use case.

Thanks for doing all this work.

Gaël