Re: [Numpy-discussion] np.loadtxt : yet a new implementation...

2008-12-03 Thread Ryan May

Pierre GM wrote:

I think that treating an explicitly-passed-in ' ' delimiter as
identical to 'no delimiter' is a bad idea. If I say that ' ' is the
delimiter, or '\t' is the delimiter, this should be treated *just*
like ',' being the delimiter, where the expected output is:
['1', '2', '3', '4', '', '5']



Valid point.
Well, all, stay tuned for yet another yet another implementation...



Found a problem.  If you read the names from the file and specify 
usecols, you end up with the first N names read from the file as the 
fields in your output (where N is the number of entries in usecols), 
instead of having the names of the columns you asked for.


For instance:

>>> from StringIO import StringIO
>>> from genload_proposal import loadtxt
>>> f = StringIO('stid stnm relh tair\nnrmn 121 45 9.1')
>>> loadtxt(f, usecols=('stid', 'relh', 'tair'), names=True, dtype=None)
array(('nrmn', 45, 9.0996),
      dtype=[('stid', '|S4'), ('stnm', 'i8'), ('relh', 'f8')])

What I want to come out is:

array(('nrmn', 45, 9.0996),
  dtype=[('stid', '|S4'), ('relh', 'i8'), ('tair', 'f8')])

I've attached a version that fixes this by setting a flag internally if 
the names are read from the file.  If this flag is true, at the end the 
names are filtered down to only the ones that are given in usecols.
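The filtering step described above can be sketched roughly like this (`filter_names` is a hypothetical helper, not part of the posted genload_proposal code):

```python
def filter_names(names, usecols):
    # Keep only the selected columns, preserving the order in which
    # they appear in the file; usecols may mix names and integer indices.
    selected = []
    for (i, name) in enumerate(names):
        if (name in usecols) or (i in usecols):
            selected.append(name)
    return selected

names = ['stid', 'stnm', 'relh', 'tair']
print(filter_names(names, ('stid', 'relh', 'tair')))
```

With the flag set, the dtype is then built from the filtered names rather than from the first N names in the file.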


I also have one other thought.  Is there any way we can make this handle 
object arrays, or rather, a field containing objects, specifically 
datetime objects?  Right now, this does not work because calling view 
does not work for object arrays.  I'm just looking for a simple way to 
store date/time in my record array (currently a string field).
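One workaround, sketched here under the assumption that the dates sit in their own column, is to parse the lines by hand and store the parsed dates in an object-dtype field, which sidesteps the .view() limitation entirely (the two-column format below is made up for illustration):

```python
import numpy as np
from datetime import datetime

# Parse each line manually and store datetimes in an object field,
# since view() cannot be used to produce object arrays.
rows = ["2008-12-03 9.1", "2008-12-04 10.2"]
parse = lambda s: datetime.strptime(s, "%Y-%m-%d")
data = [(parse(d), float(t)) for (d, t) in (r.split() for r in rows)]
arr = np.array(data, dtype=[('date', object), ('tair', float)])
print(arr['date'][0], arr['tair'][0])
```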


Ryan

--
Ryan May
Graduate Research Assistant
School of Meteorology
University of Oklahoma

Proposal : 
Here's an extension to np.loadtxt, designed to take missing values into account.





import itertools
import numpy as np
import numpy.ma as ma


def _is_string_like(obj):
    """
    Check whether obj behaves like a string.
    """
    try:
        obj + ''
    except (TypeError, ValueError):
        return False
    return True

def _to_filehandle(fname, flag='r', return_opened=False):
    """
    Returns the filehandle corresponding to a string or a file.
    If the string ends in '.gz', the file is automatically unzipped.

    Parameters
    ----------
    fname : string, filehandle
        Name of the file whose filehandle must be returned.
    flag : string, optional
        Flag indicating the status of the file ('r' for read, 'w' for write).
    return_opened : boolean, optional
        Whether to return the opening status of the file.
    """
    if _is_string_like(fname):
        if fname.endswith('.gz'):
            import gzip
            fhd = gzip.open(fname, flag)
        elif fname.endswith('.bz2'):
            import bz2
            fhd = bz2.BZ2File(fname)
        else:
            fhd = file(fname, flag)
        opened = True
    elif hasattr(fname, 'seek'):
        fhd = fname
        opened = False
    else:
        raise ValueError('fname must be a string or file handle')
    if return_opened:
        return fhd, opened
    return fhd


def flatten_dtype(ndtype):
    """
    Unpack a structured data-type.
    """
    names = ndtype.names
    if names is None:
        return [ndtype]
    else:
        types = []
        for field in names:
            (typ, _) = ndtype.fields[field]
            flat_dt = flatten_dtype(typ)
            types.extend(flat_dt)
        return types


def nested_masktype(datatype):
    """
    Construct the dtype of a mask for nested elements.
    """
    names = datatype.names
    if names:
        descr = []
        for name in names:
            (ndtype, _) = datatype.fields[name]
            descr.append((name, nested_masktype(ndtype)))
        return descr
    # Is this some kind of composite a la (np.float, 2) ?
    elif datatype.subdtype:
        mdescr = list(datatype.subdtype)
        mdescr[0] = np.dtype(bool)
        return tuple(mdescr)
    else:
        return np.bool


class LineSplitter:
    """
    Defines a function to split a string at a given delimiter or at given places.

    Parameters
    ----------
    comment : {'#', string}
        Character used to mark the beginning of a comment.
    delimiter : var
    """

    def __init__(self, delimiter=None, comments='#'):
        self.comments = comments
        # Delimiter is a character
        if delimiter is None:
            self._isfixed = False
            self.delimiter = None
        elif _is_string_like(delimiter):
            self._isfixed = False
            self.delimiter = delimiter.strip() or None
        # Delimiter is a list of field widths
        elif hasattr(delimiter, '__iter__'):
            self._isfixed = True
            idx = np.cumsum([0] + list(delimiter))
            self.slices = [slice(i, j) for (i, j) in zip(idx[:-1], idx[1:])]
        # Delimiter is a single integer
        elif int(delimiter):
            self._isfixed = True
            # (remainder of the class truncated in the original message)

Re: [Numpy-discussion] np.loadtxt : yet a new implementation...

2008-12-03 Thread Alan G Isaac
If I know my data is already clean
and is handled nicely by the
old loadtxt, will I be able to turn
off the special handling in
order to retain the old load speed?

Alan Isaac

___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] np.loadtxt : yet a new implementation...

2008-12-03 Thread Christopher Barker
Pierre GM wrote:
 I can try, but in that case, please write me a unittest, so that I  
 have a clear and unambiguous idea of what you expect.

fair enough, though I'm not sure when I'll have time to do it.

I do wonder if anyone else thinks it would be useful to have multiple 
delimiters as an option. I got the idea because with fromfile(), if you 
specify, say ',' as the delimiter, it won't use '\n', only  a comma, so 
there is no way to quickly read a whole bunch of comma delimited data like:

1,2,3,4
5,6,7,8


so I'd like to be able to say to use either ',' or '\n' as the delimiter.

However, if I understand loadtxt() correctly, it's handling the new 
lines separately anyway (to get a 2-d array), so this use case isn't an 
issue. So how likely is it that someone would have:

1  2  3, 4, 5
6  7  8, 8, 9

and want to read that into a single 2-d array?

I'm not sure I've seen it.
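For what it's worth, the either-','-or-'\n' reading Chris describes can already be approximated outside fromfile() with a regular expression split; this is just a sketch of the idea, not anything in numpy:

```python
import re

# Treat either a comma or a newline as the delimiter, which fromfile()
# with sep=',' cannot do; drop the empty token from the trailing newline.
buf = "1,2,3,4\n5,6,7,8\n"
tokens = [tok for tok in re.split(r"[,\n]", buf) if tok]
flat = [float(tok) for tok in tokens]
print(flat)
```

The result is the flat 1-d sequence that could then be reshaped by the caller.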

-Chris


-- 
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/ORR(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

[EMAIL PROTECTED]


Re: [Numpy-discussion] np.loadtxt : yet a new implementation...

2008-12-03 Thread Pierre GM

On Dec 3, 2008, at 12:48 PM, Christopher Barker wrote:

 Pierre GM wrote:
 I can try, but in that case, please write me a unittest, so that I
 have a clear and unambiguous idea of what you expect.

 fair enough, though I'm not sure when I'll have time to do it.

Oh, don't worry, nothing too fancy: give me a couple lines of input
data and a line with what you expect. Using Ryan's recent example:

>>> f = StringIO('stid stnm relh tair\nnrmn 121 45 9.1')
>>> test = loadtxt(f, usecols=('stid', 'relh', 'tair'), names=True,
...                dtype=None)
>>> control = array(('nrmn', 45, 9.0996),
...                 dtype=[('stid', '|S4'), ('relh', 'i8'), ('tair', 'f8')])

That's quite enough for a test.

 I do wonder if anyone else thinks it would be useful to have multiple
 delimiters as an option. I got the idea because with fromfile(), if  
 you
 specify, say ',' as the delimiter, it won't use '\n', only  a comma,  
 so
 there is no way to quickly read a whole bunch of comma delimited  
 data like:

 1,2,3,4
 5,6,7,8
 

 so I'd like to be able to say to use either ',' or '\n' as the  
 delimiter.

I'm not quite sure I follow you.
Do you want two delimiters, one for the fields of a record (','), one
for the records ('\n') ?




 However, if I understand loadtxt() correctly, it's handling the new
 lines separately anyway (to get a 2-d array), so this use case isn't  
 an
 issue. So how likely is it that someone would have:

 1  2  3, 4, 5
 6  7  8, 8, 9

 and want to read that into a single 2-d array?

With the current behaviour, you're going to have
[('1 2 3', '4', '5'), ('6 7 8', '8', '9')] if you use ',' as a delimiter,
[('1', '2', '3', '', '4', '', '5'), ('6', '7', '8', '', '8', '', '9')] if you use ' ' as a delimiter.

Mixing delimiters is doable, but I don't think it's that good an idea.
I'm in favor of sticking to one and only one field delimiter, and the
default line separator for the record delimiter. In other terms, not
changing anything.



Re: [Numpy-discussion] np.loadtxt : yet a new implementation...

2008-12-03 Thread Pierre GM

On Dec 3, 2008, at 12:32 PM, Alan G Isaac wrote:

 If I know my data is already clean
 and is handled nicely by the
 old loadtxt, will I be able to turn
 off the special handling in
 order to retain the old load speed?

Hopefully. I'm looking for the best way to do it. Do you have an  
example you could send me off-list so that I can play with timers ?  
Thx in advance.
P.



Re: [Numpy-discussion] np.loadtxt : yet a new implementation...

2008-12-03 Thread Christopher Barker
by the way, should this work:

io.loadtxt('junk.dat', delimiter=' ')

for more than one space between numbers, like:

1  2  3  4   5
6  7  8  9  10

I get:

>>> io.loadtxt('junk.dat', delimiter=' ')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/numpy/lib/io.py", line 403, in loadtxt
    X.append(tuple([conv(val) for (conv, val) in zip(converters, vals)]))
ValueError: empty string for float()

with the current version.

>>> io.loadtxt('junk.dat', delimiter=None)
array([[  1.,   2.,   3.,   4.,   5.],
       [  6.,   7.,   8.,   9.,  10.]])

does work.
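The difference comes straight from Python's str.split semantics, which loadtxt uses line by line; a quick sketch of why the explicit ' ' delimiter blows up while None succeeds:

```python
line = "1  2  3  4   5"
pieces = line.split(' ')   # explicit single-space delimiter
fields = line.split(None)  # None collapses runs of whitespace

# Repeated spaces yield empty strings, which is exactly what
# float() chokes on in the traceback above.
try:
    values = [float(p) for p in pieces]
except ValueError:
    values = [float(f) for f in fields]
print(values)
```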





Re: [Numpy-discussion] np.loadtxt : yet a new implementation...

2008-12-03 Thread Christopher Barker
Pierre GM wrote:
 Oh, don't worry, nothing too fancy: give me a couple lines of input
 data and a line with what you expect.

I just went and looked at the existing tests, and you're right, it's 
very easy -- my first foray into the new nose tests -- very nice!

 specify, say ',' as the delimiter, it won't use '\n', only  a comma,  
 so
 there is no way to quickly read a whole bunch of comma delimited  
 data like:

 1,2,3,4
 5,6,7,8
 

 so I'd like to be able to say to use either ',' or '\n' as the  
 delimiter.
 
 I'm not quite sure I follow you.
 Do you want to delimiters, one for the field of a record (','), one  
 for the records (\n) ?

well, in the case of fromfile(), it doesn't do records -- it will only 
give you a 1-d array, so I want it all as a flat array, and you can 
re-size it yourself later. Clearly this is more work (and requires more 
knowledge of your data) than using loadtxt, but sometimes I really want 
FAST data reading of simple formats.

However, this isn't fromfile() we are talking about now, it's loadtxt()...

 So how likely is it that someone would have:

 1  2  3, 4, 5
 6  7  8, 8, 9

 and want to read that into a single 2-d array?
 
 With the current behaviour, you're going to have
 [('1 2 3', '4', '5'), ('6 7 8', '8', '9')] if you use ',' as a delimiter,
 [('1', '2', '3', '', '4', '', '5'), ('6', '7', '8', '', '8', '', '9')] if you use ' ' as a delimiter.

right.

 Mixing delimiters is doable, but I don't think it's that good an idea.

I can't come up with a use case at this point, so..

 I'm in favor of sticking to one and only one field delimiter, and the
 default line separator for the record delimiter. In other terms, not
 changing anything.

I agree -- sorry for the noise!

-Chris




Re: [Numpy-discussion] np.loadtxt : yet a new implementation...

2008-12-03 Thread Christopher Barker
Alan G Isaac wrote:
 If I know my data is already clean
 and is handled nicely by the
 old loadtxt, will I be able to turn
 off the special handling in
 order to retain the old load speed?

what I'd like to see is a version of loadtxt built on a slightly 
enhanced fromfile() -- that would be blazingly fast for the easy cases 
(simple tabular data of one dtype).

I don't know if the special-casing should be automatic, or just have it 
be a separate function.

Also, fromfile() needs some work, and it needs to be done in C, which is 
less fun, so who knows when it will get done.

As I think about it, maybe what I really want is a simple version of 
loadtxt written in C:

  It would only handle one data type at a time.
  It would support simple comment lines.
  It would only support one delimiter (plus newline).
  It would create a 2-d array from normal, tabular data.
  You could specify:
 how many numbers you wanted,
 or how many rows,
 or read 'till EOF

Actually, this is a lot like matlab's fscanf()

someday
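Much of that wish list is already reachable today through fromfile()'s text mode, as long as the data really is one dtype with one delimiter; a minimal sketch of the fast flat-array path (the scratch file is created just for the demonstration):

```python
import os
import tempfile
import numpy as np

# Round-trip simple tabular text through fromfile(): one dtype, one
# delimiter, a flat 1-d result that the caller reshapes afterwards.
fd, path = tempfile.mkstemp()
os.write(fd, b"1 2 3 4 5\n6 7 8 9 10\n")
os.close(fd)
flat = np.fromfile(path, dtype=float, sep=' ')
os.remove(path)
table = flat.reshape(2, 5)
print(table)
```

A single space in sep matches any run of whitespace, including the newlines, which is what makes the flat read work here.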

-Chris





Re: [Numpy-discussion] np.loadtxt : yet a new implementation...

2008-12-03 Thread Pierre GM

On Dec 3, 2008, at 1:00 PM, Christopher Barker wrote:

 by the way, should this work:

 io.loadtxt('junk.dat', delimiter=' ')

 for more than one space between numbers, like:

 1  2  3  4   5
 6  7  8  9  10


On the version I'm working on, both delimiter='' and delimiter=None  
(default) would give you the expected output. delimiter=' ' would  
fail, delimiter='  ' would work.


Re: [Numpy-discussion] np.loadtxt : yet a new implementation...

2008-12-03 Thread Manuel Metz
Alan G Isaac wrote:
 If I know my data is already clean
 and is handled nicely by the
 old loadtxt, will I be able to turn
 off the special handling in
 order to retain the old load speed?
 
 Alan Isaac
 

Hi all,
  that's going in the same direction I was thinking about.
When I thought about an improved version of loadtxt, I wished it was
fault tolerant without losing too much performance.
  So my solution was much simpler than the very nice genloadtxt function
-- and it works for me.

My ansatz is to leave the existing loadtxt function unchanged. I only
replaced the default converter calls by a fault tolerant converter
class. I attached a patch against io.py in numpy 1.2.1

The nice thing is that it not only handles missing values, but for
example also columns/fields with non-number characters. It just returns
nan in these cases. This is of practical importance for many datafiles
of astronomical catalogues, for example the Hipparcos catalogue data.

Regarding the performance, it is a little bit slower than the original
loadtxt, but not much: on my machine, 10x reading in a clean testfile
with 3 columns and 2 rows I get the following results:

original loadtxt:  ~1.3s
modified loadtxt:  ~1.7s
new genloadtxt  :  ~2.7s

So you see, there is some loss of performance, but not as much as with
the new genloadtxt.

I hope this solution is of interest ...

Manuel
237a238,247
> class _faultsaveconv(object):
>     def __init__(self, conv):
>         self._conv = conv
>
>     def __call__(self, x):
>         try:
>             return self._conv(x)
>         except:
>             return np.nan
>
241c251
<         return lambda x: bool(int(x))
---
>         return _faultsaveconv(lambda x: bool(int(x)))
243c253
<         return lambda x: int(float(x))
---
>         return _faultsaveconv(lambda x: int(float(x)))
245c255
<         return float
---
>         return _faultsaveconv(float)
247c257
<         return complex
---
>         return _faultsaveconv(complex)


Re: [Numpy-discussion] np.loadtxt : yet a new implementation...

2008-12-03 Thread Christopher Barker
Pierre GM wrote:
 On Dec 3, 2008, at 1:00 PM, Christopher Barker wrote:

 for more than one space between numbers, like:

 1  2  3  4   5
 6  7  8  9  10
 
 
 On the version I'm working on, both delimiter='' and delimiter=None  
 (default) would give you the expected output.

So empty string and None both mean any whitespace? Also tabs, etc.?

 delimiter=' ' would  fail,

So only exactly that delimiter.

Is that so things like '\t' will work right?

but what about:

4,  5, 34,123, 

In that case, ',' is the delimiter, but whitespace is ignored.
or

4\t  5\t 34\t  123.

we're ignoring extra whitespace there, too, so I'm not sure why we 
shouldn't ignore it in the ' ' case also.

  delimiter='  ' would work.

but in my example, there were sometimes two spaces, sometimes three -- 
so I think it would fail, no?
>>> "1  2  3  4   5".split('  ')
['1', '2', '3', '4', ' 5']

actually, that would work, but four spaces wouldn't.

>>> "1  2  3  4    5".split('  ')
['1', '2', '3', '4', '', '5']

I guess the solution is to use delimiter=None in that case, and it does
make sense that you can't have ' ' mean one or more spaces, but '\t'
mean only one tab.

-Chris





Re: [Numpy-discussion] np.loadtxt : yet a new implementation...

2008-12-03 Thread Manuel Metz
Manuel Metz wrote:
 Alan G Isaac wrote:
 If I know my data is already clean
 and is handled nicely by the
 old loadtxt, will I be able to turn
 off the special handling in
 order to retain the old load speed?

 Alan Isaac

 
 Hi all,
   that's going in the same direction I was thinking about.
 When I thought about an improved version of loadtxt, I wished it was
 fault tolerant without losing too much performance.
   So my solution was much simpler than the very nice genloadtxt function
 -- and it works for me.
 
 My ansatz is to leave the existing loadtxt function unchanged. I only
 replaced the default converter calls by a fault tolerant converter
 class. I attached a patch against io.py in numpy 1.2.1
 
 The nice thing is that it not only handles missing values, but for
 example also columns/fields with non-number characters. It just returns
 nan in these cases. This is of practical importance for many datafiles
 of astronomical catalogues, for example the Hipparcos catalogue data.
 
 Regarding the performance, it is a little bit slower than the original
 loadtxt, but not much: on my machine, 10x reading in a clean testfile
 with 3 columns and 2 rows I get the following results:
 
 original loadtxt:  ~1.3s
 modified loadtxt:  ~1.7s
 new genloadtxt  :  ~2.7s
 
 So you see, there is some loss of performance, but not as much as with
 the new genloadtxt.
 
 I hope this solution is of interest ...
 
 Manuel


Oops, wrong version of the diff file. Wanted to name the class
_faulttolerantconv ...



237a238,247
> class _faulttolerantconv(object):
>     def __init__(self, conv):
>         self._conv = conv
>
>     def __call__(self, x):
>         try:
>             return self._conv(x)
>         except:
>             return np.nan
>
241c251
<         return lambda x: bool(int(x))
---
>         return _faulttolerantconv(lambda x: bool(int(x)))
243c253
<         return lambda x: int(float(x))
---
>         return _faulttolerantconv(lambda x: int(float(x)))
245c255
<         return float
---
>         return _faulttolerantconv(float)
247c257
<         return complex
---
>         return _faulttolerantconv(complex)
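Pulled out of diff form, the wrapper is this small; the sketch below narrows the patch's bare except to the exceptions a failed conversion actually raises, which is an editorial choice rather than part of Manuel's patch:

```python
import numpy as np

class _faulttolerantconv(object):
    # Wrap a converter so that unparseable tokens come back as nan
    # instead of raising, e.g. for non-number fields in catalogue data.
    def __init__(self, conv):
        self._conv = conv

    def __call__(self, x):
        try:
            return self._conv(x)
        except (ValueError, TypeError):
            return np.nan

conv = _faulttolerantconv(float)
values = [conv(tok) for tok in ["1.5", "N/A", "2.0"]]
print(values)
```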


Re: [Numpy-discussion] np.loadtxt : yet a new implementation...

2008-12-03 Thread Pierre GM
Manuel,
Looks nice, I'm going to try to see how I can incorporate yours. Note that
returning np.nan by default will not work w/ Python 2.6 if you want an
int...
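The problem Pierre alludes to is easy to see directly: nan is a float, and coercing it to a Python int raises, so it cannot be the fallback value for an integer column.

```python
import numpy as np

# np.nan is a float; Python refuses to coerce it to an int, so nan
# cannot serve as the missing-value filler for an integer column.
try:
    int(np.nan)
    coerced = True
except ValueError:
    coerced = False
print(coerced)
```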




Re: [Numpy-discussion] np.loadtxt : yet a new implementation...

2008-12-02 Thread Alan G Isaac
On 12/2/2008 7:21 AM Joris De Ridder apparently wrote:
 As a historical note, we used to have scipy.io.read_array which at the  
 time was considered by Travis too slow and too grandiose to be put  
 in Numpy. As a consequence, numpy.loadtxt() was created which was  
 simple and fast.  Now it looks like we're going back to something  
 grandiose.   But perhaps it can be made grandiose *and* reasonably  
 fast ;-).

I hope this consideration remains prominent
in this thread. Is the disappearance of
read_array the reason for this change?
What happened to it?

Note that read_array_demo1.py is still
in scipy.io despite the loss of
read_array.

Alan Isaac



Re: [Numpy-discussion] np.loadtxt : yet a new implementation...

2008-12-02 Thread Alan G Isaac
On 12/2/2008 8:12 AM Alan G Isaac apparently wrote:
 I hope this consideration remains prominent
 in this thread. Is the disappearance of
 read_array the reason for this change?
 What happened to it?

Apologies; it is only deprecated, not gone.
Alan Isaac



Re: [Numpy-discussion] np.loadtxt : yet a new implementation...

2008-12-02 Thread Joris De Ridder

On 1 Dec 2008, at 21:47 , Stéfan van der Walt wrote:

 Hi Pierre

 2008/12/1 Pierre GM [EMAIL PROTECTED]:
 * `genloadtxt` is the base function that makes all the work. It
 outputs 2 arrays, one for the data (missing values being substituted
 by the appropriate default) and one for the mask. It would go in
 np.lib.io

 I see the code length increased from 200 lines to 800.  This made me
 wonder about the execution time: initial benchmarks suggest a 3x
 slow-down.  Could this be a problem for loading large text files?  If
 so, should we consider keeping both versions around, or by default
 bypassing all the extra hooks?

 Regards
 Stéfan


As a historical note, we used to have scipy.io.read_array which at the  
time was considered by Travis too slow and too grandiose to be put  
in Numpy. As a consequence, numpy.loadtxt() was created which was  
simple and fast.  Now it looks like we're going back to something  
grandiose.   But perhaps it can be made grandiose *and* reasonably  
fast ;-).

Cheers,
Joris

P.S. As a reference: 
http://article.gmane.org/gmane.comp.python.numeric.general/5556/





Re: [Numpy-discussion] np.loadtxt : yet a new implementation...

2008-12-02 Thread Zachary Pincus
Hi Pierre,

I've tested the new loadtxt briefly. Looks good, except that there's a  
minor bug when trying to use a specific white-space delimiter (e.g.  
\t) while still allowing other white-space to be allowed in fields  
(e.g. spaces).

Specifically, on line 115 in LineSplitter, we have:
 self.delimiter = delimiter.strip() or None
so if I pass in, say, '\t' as the delimiter, self.delimiter gets set  
to None, which then causes the default behavior of any-whitespace-is- 
delimiter to be used. This makes lines like 'Gene Name\tPubMed ID\tStarting Position' get split wrong, even when I explicitly pass in
'\t' as the delimiter!

Similarly, I believe that some of the tests are formulated wrong:
    def test_nodelimiter(self):
        "Test LineSplitter w/o delimiter"
        strg = " 1 2 3 4  5 # test"
        test = LineSplitter(' ')(strg)
        assert_equal(test, ['1', '2', '3', '4', '5'])

I think that treating an explicitly-passed-in ' ' delimiter as  
identical to 'no delimiter' is a bad idea. If I say that ' ' is the  
delimiter, or '\t' is the delimiter, this should be treated *just*  
like ',' being the delimiter, where the expected output is:
['1', '2', '3', '4', '', '5']

At least, that's what I would expect. Treating contiguous blocks of  
whitespace as single delimiters is perfectly reasonable when None is  
provided as the delimiter, but when an explicit delimiter has been  
provided, it strikes me that the code shouldn't try to further- 
interpret it...
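Zach's point can be stated as a two-line comparison: with an explicit delimiter, ' ' should behave exactly as ',' does, empty fields and all.

```python
line = "1 2 3 4  5"
# With an explicit delimiter, the double space yields an empty field,
# exactly as a double comma would with ',' as the delimiter.
by_space = line.split(' ')
by_comma = "1,2,3,4,,5".split(',')
print(by_space)
print(by_comma)
```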

Does anyone else have any opinion here?

Zach


On Dec 1, 2008, at 1:21 PM, Pierre GM wrote:

 Well, looks like the attachment is too big, so here's the  
 implementation. The tests will come in another message.

 genload_proposal.py



Re: [Numpy-discussion] np.loadtxt : yet a new implementation...

2008-12-02 Thread Ryan May
Zachary Pincus wrote:
 Specifically, on line 115 in LineSplitter, we have:
  self.delimiter = delimiter.strip() or None
 so if I pass in, say, '\t' as the delimiter, self.delimiter gets set  
 to None, which then causes the default behavior of any-whitespace-is- 
 delimiter to be used. This makes lines like 'Gene Name\tPubMed ID\tStarting Position' get split wrong, even when I explicitly pass in
 '\t' as the delimiter!
 
 Similarly, I believe that some of the tests are formulated wrong:
     def test_nodelimiter(self):
         "Test LineSplitter w/o delimiter"
         strg = " 1 2 3 4  5 # test"
         test = LineSplitter(' ')(strg)
         assert_equal(test, ['1', '2', '3', '4', '5'])
 
 I think that treating an explicitly-passed-in ' ' delimiter as  
 identical to 'no delimiter' is a bad idea. If I say that ' ' is the  
 delimiter, or '\t' is the delimiter, this should be treated *just*  
 like ',' being the delimiter, where the expected output is:
 ['1', '2', '3', '4', '', '5']
 
 At least, that's what I would expect. Treating contiguous blocks of  
 whitespace as single delimiters is perfectly reasonable when None is  
 provided as the delimiter, but when an explicit delimiter has been  
 provided, it strikes me that the code shouldn't try to further- 
 interpret it...
 
 Does anyone else have any opinion here?

I agree.  If the user explicity passes something as a delimiter, we 
should use it and not try to be too smart.

+1

Ryan

-- 
Ryan May
Graduate Research Assistant
School of Meteorology
University of Oklahoma


Re: [Numpy-discussion] np.loadtxt : yet a new implementation...

2008-12-02 Thread Ryan May
Pierre GM wrote:
 Well, looks like the attachment is too big, so here's the 
 implementation. The tests will come in another message.

A couple of quick nitpicks:

1) On line 186 (in the NameValidator class), you use 
excludelist.append() to append a list to the end of a list.  I think you 
meant to use excludelist.extend()

2) When validating a list of names, why do you insist on lower casing 
them? (I'm referring to the call to lower() on line 207).  On one hand, 
this would seem nicer than all upper case, but on the other hand this 
can cause confusion for someone who sees certain casing of names in the 
file and expects that data to be laid out the same.

Other than those, it's working fine for me here.

Ryan

-- 
Ryan May
Graduate Research Assistant
School of Meteorology
University of Oklahoma


Re: [Numpy-discussion] np.loadtxt : yet a new implementation...

2008-12-02 Thread Pierre GM

On Dec 2, 2008, at 3:12 PM, Ryan May wrote:

 Pierre GM wrote:
 Well, looks like the attachment is too big, so here's the
 implementation. The tests will come in another message.

 A couple of quick nitpicks:

 1) On line 186 (in the NameValidator class), you use
 excludelist.append() to append a list to the end of a list.  I think  
 you
 meant to use excludelist.extend()

Good call.

 2) When validating a list of names, why do you insist on lower casing
 them? (I'm referring to the call to lower() on line 207).  On one  
 hand,
 this would seem nicer than all upper case, but on the other hand this
 can cause confusion for someone who sees certain casing of names in  
 the
 file and expects that data to be laid out the same.

I recall a life where names were case-insensitive, so 'dates' and
'Dates' and 'DATES' were the same field. It should be easy enough to
get rid of that limitation, or add a parameter for case-sensitivity.


On Dec 2, 2008, at 2:47 PM, Zachary Pincus wrote:

 Specifically, on line 115 in LineSplitter, we have:
 self.delimiter = delimiter.strip() or None
 so if I pass in, say, '\t' as the delimiter, self.delimiter gets set
 to None, which then causes the default behavior of any-whitespace-is-
 delimiter to be used. This makes lines like 'Gene Name\tPubMed ID\tStarting Position' get split wrong, even when I explicitly pass in
 '\t' as the delimiter!

OK, I'll check that.


 I think that treating an explicitly-passed-in ' ' delimiter as
 identical to 'no delimiter' is a bad idea. If I say that ' ' is the
 delimiter, or '\t' is the delimiter, this should be treated *just*
 like ',' being the delimiter, where the expected output is:
 ['1', '2', '3', '4', '', '5']


Valid point.
Well, all, stay tuned for yet another yet another implementation...






 Other than those, it's working fine for me here.

 Ryan



Re: [Numpy-discussion] np.loadtxt : yet a new implementation...

2008-12-02 Thread Christopher Barker
Pierre GM wrote:
 I think that treating an explicitly-passed-in ' ' delimiter as
 identical to 'no delimiter' is a bad idea. If I say that ' ' is the
 delimiter, or '\t' is the delimiter, this should be treated *just*
 like ',' being the delimiter, where the expected output is:
 ['1', '2', '3', '4', '', '5']

 
 Valid point.
 Well, all, stay tuned for yet another yet another implementation...

While we're at it, it might be nice to be able to pass in more than one
delimiter: ('\t', ' '). Though maybe the only combination I'd really
want would be something and '\n', which I think is being treated
specially already.

-Chris






Re: [Numpy-discussion] np.loadtxt : yet a new implementation...

2008-12-02 Thread Pierre GM
Chris,
I can try, but in that case, please write me a unittest, so that I  
have a clear and unambiguous idea of what you expect.
ANFSCD, have you tried the missing_values option ?


On Dec 2, 2008, at 5:36 PM, Christopher Barker wrote:

 Pierre GM wrote:
 I think that treating an explicitly-passed-in ' ' delimiter as
 identical to 'no delimiter' is a bad idea. If I say that ' ' is the
 delimiter, or '\t' is the delimiter, this should be treated *just*
 like ',' being the delimiter, where the expected output is:
 ['1', '2', '3', '4', '', '5']


 Valid point.
 Well, all, stay tuned for yet another yet another implementation...

 While we're at it, it might be nice to be able to pass in more than
 one delimiter: ('\t', ' '). Though maybe the only combination I'd
 really want would be something and '\n', which I think is being
 treated specially already.

 -Chris




 -- 
 Christopher Barker, Ph.D.
 Oceanographer

 Emergency Response Division
 NOAA/NOS/ORR(206) 526-6959   voice
 7600 Sand Point Way NE   (206) 526-6329   fax
 Seattle, WA  98115   (206) 526-6317   main reception

 [EMAIL PROTECTED]


[Numpy-discussion] np.loadtxt : yet a new implementation...

2008-12-01 Thread Pierre GM
All,

Please find attached to this message another implementation of  
np.loadtxt, which focuses on missing values. It's basically a  
combination of John Hunter's et al mlab.csv2rec, Ryan May's patches  
and pieces of code I'd been working on over the last few weeks.
Besides some helper classes (StringConverter to convert a string into  
something else, NameValidator to check names...), you'll find 3  
functions:

* `genloadtxt` is the base function that does all the work. It  
outputs 2 arrays, one for the data (missing values being substituted  
by the appropriate default) and one for the mask. It would go in  
np.lib.io

* `loadtxt` would replace the current np.loadtxt. It outputs an  
ndarray, with missing data filled in. It would also go in np.lib.io

* `mloadtxt` would go into np.ma.io (to be created) and renamed  
`loadtxt`. Right now, I needed a different name to avoid conflicts. It  
combines the outputs of `genloadtxt` into a single masked array.
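As a rough illustration of what `mloadtxt` is described as doing (combining a data array and a mask array into one masked array), here is a minimal sketch using numpy.ma directly; the two arrays below are made-up stand-ins for `genloadtxt` output, not code from the attachment:

```python
import numpy as np
import numpy.ma as ma

# Stand-ins for the two arrays genloadtxt is said to return: the data
# (missing values already replaced by defaults) and the boolean mask.
data = np.array([(1, 2.0), (3, -1.0)], dtype=[('a', int), ('b', float)])
mask = np.array([(False, False), (False, True)],
                dtype=[('a', bool), ('b', bool)])

# Combining them field-wise into a single masked array.
marr = ma.array(data, mask=mask)
print(marr['b'])  # the second value is masked
```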

You'll also find several series of tests that you can use as examples.

Please give it a try and send me some feedback (bugs, wishes,  
suggestions). I'd like it to make the 1.3.0 release (I need some of  
the functionalities to improve the corresponding function in  
scikits.timeseries, currently fubar...)

P.



Re: [Numpy-discussion] np.loadtxt : yet a new implementation...

2008-12-01 Thread Pierre GM

And now for the tests:


Proposal : 
Here's an extension to np.loadtxt, designed to take missing values into account.



from genload_proposal import *
from numpy.ma.testutils import *
import StringIO

class TestLineSplitter(TestCase):
    #
    def test_nodelimiter(self):
        "Test LineSplitter w/o delimiter"
        strg = " 1 2 3 4  5 # test"
        test = LineSplitter(' ')(strg)
        assert_equal(test, ['1', '2', '3', '4', '5'])
        test = LineSplitter()(strg)
        assert_equal(test, ['1', '2', '3', '4', '5'])

    #
    def test_delimiter(self):
        "Test LineSplitter on delimiter"
        strg = "1,2,3,4,,5"
        test = LineSplitter(',')(strg)
        assert_equal(test, ['1', '2', '3', '4', '', '5'])
        #
        strg = " 1,2,3,4,,5 # test"
        test = LineSplitter(',')(strg)
        assert_equal(test, ['1', '2', '3', '4', '', '5'])
        #
        strg = " 1 2 3 4  5 # test"
        test = LineSplitter(' ')(strg)
        assert_equal(test, ['1', '2', '3', '4', '5'])

    #
    def test_fixedwidth(self):
        "Test LineSplitter w/ fixed-width fields"
        strg = "  1  2  3  4     5   # test"
        test = LineSplitter(3)(strg)
        assert_equal(test, ['1', '2', '3', '4', '', '5', ''])
        #
        strg = "  1     3  4  5  6# test"
        test = LineSplitter((3, 6, 6, 3))(strg)
        assert_equal(test, ['1', '3', '4  5', '6'])
        #
        strg = "  1     3  4  5  6# test"
        test = LineSplitter((6, 6, 9))(strg)
        assert_equal(test, ['1', '3  4', '5  6'])
        #
        strg = "  1     3  4  5  6# test"
        test = LineSplitter(20)(strg)
        assert_equal(test, ['1     3  4  5  6'])
        #
        strg = "  1     3  4  5  6# test"
        test = LineSplitter(30)(strg)
        assert_equal(test, ['1     3  4  5  6'])


class TestStringConverter(TestCase):
    "Test StringConverter"
    #
    def test_creation(self):
        "Test creation of a StringConverter"
        converter = StringConverter(int, -9)
        assert_equal(converter._status, 1)
        assert_equal(converter.default, -9)
    #
    def test_upgrade(self):
        "Tests the upgrade method."
        converter = StringConverter()
        assert_equal(converter._status, 0)
        converter.upgrade('0')
        assert_equal(converter._status, 1)
        converter.upgrade('0.')
        assert_equal(converter._status, 2)
        converter.upgrade('0j')
        assert_equal(converter._status, 3)
        converter.upgrade('a')
        assert_equal(converter._status, len(converter._mapper) - 1)
    #
    def test_missing(self):
        "Tests the use of missing values."
        converter = StringConverter(missing_values=('missing', 'missed'))
        converter.upgrade('0')
        assert_equal(converter('0'), 0)
        assert_equal(converter(''), converter.default)
        assert_equal(converter('missing'), converter.default)
        assert_equal(converter('missed'), converter.default)
        try:
            converter('miss')
        except ValueError:
            pass
        else:
            self.fail("converting 'miss' should have raised ValueError")
    #
    def test_upgrademapper(self):
        "Tests upgrade_mapper"
        import dateutil.parser
        import datetime
        dateparser = dateutil.parser.parse
        StringConverter.upgrade_mapper(dateparser, datetime.date(2000, 1, 1))
        convert = StringConverter(dateparser, datetime.date(2000, 1, 1))
        test = convert('2001-01-01')
        assert_equal(test, datetime.datetime(2001, 1, 1, 0, 0, 0))



class TestLoadTxt(TestCase):
    #
    def test_record(self):
        "Test w/ explicit dtype"
        data = StringIO.StringIO('1 2\n3 4')
        #data.seek(0)
        test = loadtxt(data, dtype=[('x', np.int32), ('y', np.int32)])
        control = np.array([(1, 2), (3, 4)], dtype=[('x', 'i4'), ('y', 'i4')])
        assert_equal(test, control)
        #
        data = StringIO.StringIO('M 64.0 75.0\nF 25.0 60.0')
        #data.seek(0)
        descriptor = {'names': ('gender', 'age', 'weight'),
                      'formats': ('S1', 'i4', 'f4')}
        control = np.array([('M', 64.0, 75.0), ('F', 25.0, 60.0)],
                           dtype=descriptor)
        test = loadtxt(data, dtype=descriptor)
        assert_equal(test, control)

    def test_array(self):
        "Test outputting a standard ndarray"
        data = StringIO.StringIO('1 2\n3 4')
        control = np.array([[1, 2], [3, 4]], dtype=int)
        test = loadtxt(data, dtype=int)
        assert_array_equal(test, control)
        #
        data.seek(0)
        control = np.array([[1, 2], [3, 4]], dtype=float)
        test = np.loadtxt(data, dtype=float)
        assert_array_equal(test, control)

    def test_1D(self):
        "Test squeezing to 1D"
        control = np.array([1, 2, 3, 4], int)
        #
        data = StringIO.StringIO('1\n2\n3\n4\n')
        test = loadtxt(data, dtype=int)
        assert_array_equal(test, control)
        #
        data = StringIO.StringIO('1,2,3,4\n')
        test = loadtxt(data, dtype=int, delimiter=',')
        assert_array_equal(test, control)

Re: [Numpy-discussion] np.loadtxt : yet a new implementation...

2008-12-01 Thread Stéfan van der Walt
2008/12/1 Pierre GM [EMAIL PROTECTED]:
 Please find attached to this message another implementation of

Struggling to comply!

Cheers
Stéfan


Re: [Numpy-discussion] np.loadtxt : yet a new implementation...

2008-12-01 Thread Pierre GM
Well, looks like the attachment is too big, so here's the  
implementation. The tests will come in another message.



Proposal : 
Here's an extension to np.loadtxt, designed to take missing values into account.





import itertools
import numpy as np
import numpy.ma as ma


def _is_string_like(obj):
    """
    Check whether obj behaves like a string.
    """
    try:
        obj + ''
    except (TypeError, ValueError):
        return False
    return True

def _to_filehandle(fname, flag='r', return_opened=False):
    """
    Returns the filehandle corresponding to a string or a file.
    If the string ends in '.gz', the file is automatically unzipped.

    Parameters
    ----------
    fname : string, filehandle
        Name of the file whose filehandle must be returned.
    flag : string, optional
        Flag indicating the status of the file ('r' for read, 'w' for write).
    return_opened : boolean, optional
        Whether to return the opening status of the file.
    """
    if _is_string_like(fname):
        if fname.endswith('.gz'):
            import gzip
            fhd = gzip.open(fname, flag)
        elif fname.endswith('.bz2'):
            import bz2
            fhd = bz2.BZ2File(fname)
        else:
            fhd = file(fname, flag)
        opened = True
    elif hasattr(fname, 'seek'):
        fhd = fname
        opened = False
    else:
        raise ValueError('fname must be a string or file handle')
    if return_opened:
        return fhd, opened
    return fhd


def flatten_dtype(ndtype):
    """
    Unpack a structured data-type.
    """
    names = ndtype.names
    if names is None:
        return [ndtype]
    else:
        types = []
        for field in names:
            (typ, _) = ndtype.fields[field]
            flat_dt = flatten_dtype(typ)
            types.extend(flat_dt)
        return types


def nested_masktype(datatype):
    """
    Construct the dtype of a mask for nested elements.
    """
    names = datatype.names
    if names:
        descr = []
        for name in names:
            (ndtype, _) = datatype.fields[name]
            descr.append((name, nested_masktype(ndtype)))
        return descr
    # Is this some kind of composite a la (np.float, 2) ?
    elif datatype.subdtype:
        mdescr = list(datatype.subdtype)
        mdescr[0] = np.dtype(bool)
        return tuple(mdescr)
    else:
        return np.bool


class LineSplitter:
    """
    Defines a function to split a string at a given delimiter or at given places.

    Parameters
    ----------
    comment : {'#', string}
        Character used to mark the beginning of a comment.
    delimiter : var
    """

    def __init__(self, delimiter=None, comments='#'):
        self.comments = comments
        # Delimiter is a character
        if delimiter is None:
            self._isfixed = False
            self.delimiter = None
        elif _is_string_like(delimiter):
            self._isfixed = False
            self.delimiter = delimiter.strip() or None
        # Delimiter is a list of field widths
        elif hasattr(delimiter, '__iter__'):
            self._isfixed = True
            idx = np.cumsum([0] + list(delimiter))
            self.slices = [slice(i, j) for (i, j) in zip(idx[:-1], idx[1:])]
        # Delimiter is a single integer
        elif int(delimiter):
            self._isfixed = True
            self.slices = None
            self.delimiter = delimiter
        else:
            self._isfixed = False
            self.delimiter = None
    #
    def __call__(self, line):
        """
        Splits the line at each current delimiter.
        Comments are stripped beforehand.
        """
        # Strip the comments
        line = line.split(self.comments)[0]
        if not line:
            return []
        # Fixed-width fields
        if self._isfixed:
            # Fields have different widths
            if self.slices is None:
                fixed = self.delimiter
                slices = [slice(i, i + fixed)
                          for i in range(len(line))[::fixed]]
            else:
                slices = self.slices
            return [line[s].strip() for s in slices]
        else:
            return [s.strip() for s in line.split(self.delimiter)]

class NameValidator:
    """
    Validates a list of strings to use as field names.
    The strings are stripped of any non-alphanumeric character, and spaces
    are replaced by `_`.

    During instantiation, the user can define a list of names to exclude, as
    well as a list of invalid characters. Names in the exclude list are
    appended a '_' character.

    Once an instance has been created, it can be called with a list of names,
    and a list of valid names will be created.
    The `__call__` method accepts an optional keyword, `default`, that sets
    the default name in case of ambiguity. By default, `default = 'f'`, so
    that names will default to `f0`, `f1`...

    Parameters
    ----------
    excludelist : sequence, optional
        A list of names to 

Re: [Numpy-discussion] np.loadtxt : yet a new implementation...

2008-12-01 Thread John Hunter
On Mon, Dec 1, 2008 at 12:21 PM, Pierre GM [EMAIL PROTECTED] wrote:
 Well, looks like the attachment is too big, so here's the implementation.
 The tests will come in another message.


It looks like I am doing something wrong -- trying to parse a CSV file
with dates formatted like '2008-10-14', with::

import datetime, sys
import dateutil.parser
StringConverter.upgrade_mapper(dateutil.parser.parse,
                               default=datetime.date(1900, 1, 1))
r = loadtxt(sys.argv[1], delimiter=',', names=True)
print r.dtype

I get the following::

Traceback (most recent call last):
  File "genload_proposal.py", line 734, in ?
    r = loadtxt(sys.argv[1], delimiter=',', names=True)
  File "genload_proposal.py", line 711, in loadtxt
    (output, _) = genloadtxt(fname, **kwargs)
  File "genload_proposal.py", line 646, in genloadtxt
    rows[i] = tuple([conv(val) for (conv, val) in zip(converters, vals)])
  File "genload_proposal.py", line 385, in __call__
    raise ValueError("Cannot convert string '%s'" % value)
ValueError: Cannot convert string '2008-10-14'

In debug mode, I see the following where the error occurs

ipdb> vals
('2008-10-14', '116.26', '116.40', '103.14', '104.08', '70749800', '104.08')
ipdb> converters
[<__main__.StringConverter instance at 0xa35fa6c>,
 <__main__.StringConverter instance at 0xa35ff2c>,
 <__main__.StringConverter instance at 0xa35ff8c>,
 <__main__.StringConverter instance at 0xa35ffec>,
 <__main__.StringConverter instance at 0xa15406c>,
 <__main__.StringConverter instance at 0xa1540cc>,
 <__main__.StringConverter instance at 0xa15412c>]

It looks like my registry of a custom converter isn't working.  Here
is what the _mapper looks like::

In [23]: StringConverter._mapper
Out[23]:
[(<type 'numpy.bool_'>, <function str2bool at 0xa2b8bc4>, None),
 (<type 'numpy.integer'>, <type 'int'>, -1),
 (<type 'numpy.floating'>, <type 'float'>, -NaN),
 (<type 'complex'>, <type 'complex'>, (-NaN+0j)),
 (<type 'numpy.object_'>,
  <function parse at 0x8cf1534>,
  datetime.date(1900, 1, 1)),
 (<type 'numpy.string_'>, <type 'str'>, '???')]


Re: [Numpy-discussion] np.loadtxt : yet a new implementation...

2008-12-01 Thread Stéfan van der Walt
Hi Pierre

2008/12/1 Pierre GM [EMAIL PROTECTED]:
 * `genloadtxt` is the base function that makes all the work. It
 outputs 2 arrays, one for the data (missing values being substituted
 by the appropriate default) and one for the mask. It would go in
 np.lib.io

I see the code length increased from 200 lines to 800.  This made me
wonder about the execution time: initial benchmarks suggest a 3x
slow-down.  Could this be a problem for loading large text files?  If
so, should we consider keeping both versions around, or by default
bypassing all the extra hooks?
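For readers who want to reproduce such numbers, a rough timing harness might look like the sketch below (illustrative only: absolute times depend on the machine and the file's shape, and the Python 2 code of this era would use `StringIO.StringIO` rather than `io.StringIO`):

```python
import timeit
from io import StringIO

import numpy as np

# Build a small in-memory "file" and time the current np.loadtxt on it;
# the same harness can be pointed at a candidate implementation.
text = '\n'.join('%d %d %d' % (i, i + 1, i + 2) for i in range(1000))

def run_current():
    np.loadtxt(StringIO(text))

elapsed = timeit.timeit(run_current, number=10)
print('np.loadtxt: %.3f s for 10 runs' % elapsed)
```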

Regards
Stéfan


Re: [Numpy-discussion] np.loadtxt : yet a new implementation...

2008-12-01 Thread Ryan May
Stéfan van der Walt wrote:
 Hi Pierre
 
 2008/12/1 Pierre GM [EMAIL PROTECTED]:
 * `genloadtxt` is the base function that makes all the work. It
 outputs 2 arrays, one for the data (missing values being substituted
 by the appropriate default) and one for the mask. It would go in
 np.lib.io
 
 I see the code length increased from 200 lines to 800.  This made me
 wonder about the execution time: initial benchmarks suggest a 3x
 slow-down.  Could this be a problem for loading large text files?  If
 so, should we consider keeping both versions around, or by default
 bypassing all the extra hooks?

I've wondered about this being an issue.  On one hand, you hate to make 
existing code noticeably slower.  On the other hand, if speed is 
important to you, why are you using ascii I/O?

I personally am not entirely against having two versions of loadtxt-like 
functions.  However, the idea seems a little odd, seeing as how loadtxt 
was already supposed to be the swiss army knife of text reading.

I'm seeing a similar slowdown with Pierre's version of the code.  The 
version of loadtxt that I cobbled together with the StringConverter 
class (and no missing value support) shows about a 50% slowdown, so 
clearly there's a performance penalty for trying to make a generic 
function that can be all things to all people.  On the other hand, this 
approach reduces code duplication.

I'm not really opinionated on what the right approach is here.  My only 
opinion is that this functionality *really* needs to be in numpy in some 
fashion.  For my own use case, with the old version, I could read a text 
file and by hand separate out columns and mask values.  Now, I open a 
file and get a structured array with an automatically detected dtype 
(names and types!) plus masked values.

My $0.02.

Ryan

-- 
Ryan May
Graduate Research Assistant
School of Meteorology
University of Oklahoma


Re: [Numpy-discussion] np.loadtxt : yet a new implementation...

2008-12-01 Thread Stéfan van der Walt
2008/12/1 Ryan May [EMAIL PROTECTED]:
 I've wondered about this being an issue.  On one hand, you hate to make
 existing code noticeably slower.  On the other hand, if speed is
 important to you, why are you using ascii I/O?

More I than O!  But I think numpy.fromfile, once fixed up, could
fill this niche nicely.

 I personally am not entirely against having two versions of loadtxt-like
 functions.  However, the idea seems a little odd, seeing as how loadtxt
 was already supposed to be the swiss army knife of text reading.

I haven't investigated the code in too much detail, but wouldn't it be
possible to implement the current set of functionality in a
base-class, which is then specialised to add the rest?  That way, one
could always instantiate TextReader yourself for some added speed.
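A minimal sketch of that base-class idea (all names here, `TextReader` aside, are hypothetical; this is not code from the proposal):

```python
class TextReader:
    """Fast path: plain splitting, no missing-value bookkeeping."""
    def __init__(self, delimiter=None):
        self.delimiter = delimiter
    def split(self, line):
        return [s.strip() for s in line.split(self.delimiter)]

class MaskedTextReader(TextReader):
    """Specialisation that layers missing-value detection on top."""
    def __init__(self, delimiter=None, missing_values=('',)):
        TextReader.__init__(self, delimiter)
        self.missing = frozenset(missing_values)
    def split(self, line):
        fields = TextReader.split(self, line)
        mask = [f in self.missing for f in fields]
        return fields, mask

print(MaskedTextReader(',').split('1,,3'))  # (['1', '', '3'], [False, True, False])
```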

 I'm not really opinionated on what the right approach is here.  My only
 opinion is that this functionality *really* needs to be in numpy in some
 fashion.  For my own use case, with the old version, I could read a text
 file and by hand separate out columns and mask values.  Now, I open a
 file and get a structured array with an automatically detected dtype
 (names and types!) plus masked values.

That's neat!

Cheers
Stéfan


Re: [Numpy-discussion] np.loadtxt : yet a new implementation...

2008-12-01 Thread Pierre GM
I agree, genloadtxt is a bit bloated, and it's no surprise it's  
slower than the initial one. I think that in order to be fair,  
comparisons must be performed with matplotlib.mlab.csv2rec, which  
also implements autodetection of the dtype. I'm quite in favor  
of keeping a lite version around.



On Dec 1, 2008, at 4:47 PM, Stéfan van der Walt wrote:

 I haven't investigated the code in too much detail, but wouldn't it be
 possible to implement the current set of functionality in a
 base-class, which is then specialised to add the rest?  That way, one
 could always instantiate TextReader yourself for some added speed.

Well, one of the issues is that we need to keep the function  
compatible w/ urllib.urlretrieve (Ryan, am I right?), which means not  
being able to go back to the beginning of a file (no call to .seek).  
Another issue comes from the possibility to define the dtype  
automatically: you need to keep track of the converters, then have to  
do a second loop on the data. Those converters are likely the  
bottleneck, as you need to check whether each value can be interpreted  
as missing or not and respond appropriately.
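The per-value cost described above can be sketched as follows (a simplified, hypothetical converter, not the StringConverter from the attachment; `make_converter` and its defaults are made up for illustration):

```python
def make_converter(func, default, missing_values=('', 'N/A')):
    # Every field is first checked against the missing-value set before
    # conversion; this extra test on each value is where the overhead
    # accumulates on large files.
    missing = frozenset(missing_values)
    def convert(value):
        if value.strip() in missing:
            return default  # substitute the default (and flag for the mask)
        return func(value)
    return convert

to_float = make_converter(float, -9999.0)
print([to_float(v) for v in ['1.5', '', 'N/A', '3.0']])
# [1.5, -9999.0, -9999.0, 3.0]
```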

I thought about creating a base class, with a specific subclass taking  
care of the missing values. I found out it would have duplicated a lot  
of code...


In any case, I think that's secondary: we can always optimize pieces  
of the code afterwards. I'd like more feedback on corner cases and  
usage...


Re: [Numpy-discussion] np.loadtxt : yet a new implementation...

2008-12-01 Thread Christopher Barker
Stéfan van der Walt wrote:
 important to you, why are you using ascii I/O?

ascii I/O is slow, so that's a reason in itself to want it not to be slower!

 More I than O!  But I think numpy.fromfile, once fixed up, could
 fill this niche nicely.

I agree -- for the simple cases, fromfile() could work very well -- 
perhaps it could even be used to speed up some special cases of loadtxt.

But is anyone working on fromfile()?

By the way, I think overloading fromfile() for text files is a bit 
misleading for users -- I propose we have a fromtextfile() or something 
instead.

-Chris


-- 
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/ORR(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

[EMAIL PROTECTED]


Re: [Numpy-discussion] np.loadtxt : yet a new implementation...

2008-12-01 Thread Christopher Barker
Pierre GM wrote:
 Another issue comes from the possibility to define the dtype  
 automatically:

Does all that get bypassed if the dtype(s) is specified? Is it still 
slow in that case?

-Chris



-- 
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/ORR(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

[EMAIL PROTECTED]