Re: [Numpy-discussion] Add `nrows` to `genfromtxt`

2014-11-03 Thread Chris Barker
On Sat, Nov 1, 2014 at 7:31 AM, Warren Weckesser warren.weckes...@gmail.com
 wrote:

 (2) Multiple arrays in a single file:

 ...


 The file contains multiple arrays. Each array is
 preceded by a line containing the number of rows
 and columns in that array. The `max_rows` argument
 would make it easy to read this file with genfromtxt:


+inf on this one -- this is a use case I've been looking for support for
ages!

-Chris


-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/ORR(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Add `nrows` to `genfromtxt`

2014-11-02 Thread Warren Weckesser
On Sat, Nov 1, 2014 at 4:41 PM, Alexander Belopolsky ndar...@mac.com
wrote:


 On Sat, Nov 1, 2014 at 3:15 PM, Warren Weckesser 
 warren.weckes...@gmail.com wrote:

 Is there wider interest in such an argument to `genfromtxt`?  For my
 use-cases, `max_rows` is sufficient.  I can't recall ever needing the full
 generality of a slice for pulling apart a text file.  Does anyone have
 compelling use-cases that are not handled by `max_rows`?


 It is occasionally useful to be able to skip rows after the header.  Maybe
 we should de-deprecate skip_rows and give it the meaning different from
 skip_header in case of names = None?  For example,

 genfromtxt(fname,  skip_header= 3, skip_rows = 1, max_rows = 100)

 would mean skip 3 lines, read column names from the 4-th, skip 5-th,
 process up to 100 more lines.  This may be useful if the file contains some
 meta-data about the column below the header line.  For example, it is
 common to put units of measurement below the column names.



Or you could just call genfromtxt() once with `max_rows=1` to skip a row.
(I'm assuming that the first argument to genfromtxt is the open file
object--or some other iterator--and not the filename.)




 Another application could be processing a large text file in chunks, which
 again can be covered nicely by  skip_rows/max_rows.



You don't really need `skip_rows` for this.  In a previous email (and in
https://github.com/numpy/numpy/pull/5103) I gave an example of using
`max_rows` for handling a file that doesn't have a header.  If the file has
a header, you could process the file in batches using something like the
following example, where the dtype determined in the first batch is used
when reading the subsequent batches:

In [12]: !cat foo.dat
  a    b    c
1.0  2.0  -9.0
3.0  4.0  -7.6
5.0  6.0  -1.0
7.0  8.0  -3.3
9.0  0.0  -3.4

In [13]: f = open('foo.dat', 'r')

In [14]: batch1 = genfromtxt(f, dtype=None, names=True, max_rows=2)

In [15]: batch1
Out[15]:
array([(1.0, 2.0, -9.0), (3.0, 4.0, -7.6)],
  dtype=[('a', 'f8'), ('b', 'f8'), ('c', 'f8')])

In [16]: batch2 = genfromtxt(f, dtype=batch1.dtype, max_rows=2)

In [17]: batch2
Out[17]:
array([(5.0, 6.0, -1.0), (7.0, 8.0, -3.3)],
  dtype=[('a', 'f8'), ('b', 'f8'), ('c', 'f8')])

In [18]: batch3 = genfromtxt(f, dtype=batch1.dtype, max_rows=2)

In [19]: batch3
Out[19]:
array((9.0, 0.0, -3.4),
  dtype=[('a', 'f8'), ('b', 'f8'), ('c', 'f8')])



Warren





 I cannot think of a situation where I would need more generality such as
 reading every 3rd row or rows with the given numbers.  Such processing is
 normally done after the text data is loaded into an array.



Re: [Numpy-discussion] Add `nrows` to `genfromtxt`

2014-11-02 Thread Alexander Belopolsky
On Sun, Nov 2, 2014 at 1:56 PM, Warren Weckesser warren.weckes...@gmail.com
 wrote:

 Or you could just call genfromtxt() once with `max_rows=1` to skip a row.
 (I'm assuming that the first argument to genfromtxt is the open file
 object--or some other iterator--and not the filename.)


That's hackish.  If I have to resort to something like this, I would just
call next() on the open file object or iterator.

Still, the case of dtype=None, name=None is problematic.   Suppose I want
genfromtxt()  to detect the column names from the 1-st row and data types
from the 3-rd.  How would you do that?


Re: [Numpy-discussion] Add `nrows` to `genfromtxt`

2014-11-02 Thread Alexander Belopolsky
Sorry, I meant names=True, not name=None.

On Sun, Nov 2, 2014 at 2:18 PM, Alexander Belopolsky ndar...@mac.com
wrote:


 On Sun, Nov 2, 2014 at 1:56 PM, Warren Weckesser 
 warren.weckes...@gmail.com wrote:

 Or you could just call genfromtxt() once with `max_rows=1` to skip a
 row.  (I'm assuming that the first argument to genfromtxt is the open file
 object--or some other iterator--and not the filename.)


 That's hackish.  If I have to resort to something like this, I would just
 call next() on the open file object or iterator.

 Still, the case of dtype=None, name=None is problematic.   Suppose I want
 genfromtxt()  to detect the column names from the 1-st row and data types
 from the 3-rd.  How would you do that?



Re: [Numpy-discussion] Add `nrows` to `genfromtxt`

2014-11-02 Thread Warren Weckesser
On Sun, Nov 2, 2014 at 2:18 PM, Alexander Belopolsky ndar...@mac.com
wrote:


 On Sun, Nov 2, 2014 at 1:56 PM, Warren Weckesser 
 warren.weckes...@gmail.com wrote:

 Or you could just call genfromtxt() once with `max_rows=1` to skip a
 row.  (I'm assuming that the first argument to genfromtxt is the open file
 object--or some other iterator--and not the filename.)


 That's hackish.  If I have to resort to something like this, I would just
 call next() on the open file object or iterator.



I agree, calling genfromtxt to skip a line is silly.  Calling next() makes
much more sense.




 Still, the case of dtype=None, name=None is problematic.   Suppose I want
 genfromtxt()  to detect the column names from the 1-st row and data types
 from the 3-rd.  How would you do that?



This may sound like a cop out, but at some point, I stop trying to make
genfromtxt() handle every possible case, and instead I would write a custom
header reader to handle this.
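
A minimal sketch of such a custom header reader, assuming a whitespace-delimited file with column names on the first line and units on the second. The function name `read_with_units_row` and the sample data are hypothetical; only `genfromtxt` itself comes from NumPy.

```python
import io

import numpy as np


def read_with_units_row(f):
    """Take names from line 1, units from line 2, and infer dtypes from the data.

    Hypothetical helper for illustration; `f` must be an open file or
    another iterator of text lines, not a filename.
    """
    names = next(f).split()   # column names from the first line
    units = next(f).split()   # units row, consumed here so genfromtxt never sees it
    # dtype=None asks genfromtxt to infer each column's type from the data rows.
    data = np.genfromtxt(f, dtype=None, names=names)
    return data, dict(zip(names, units))


# Made-up sample input for the sketch.
text = """t  temp  n
s  K  -
0.0  273.1  3
1.0  274.5  5
"""
data, units = read_with_units_row(io.StringIO(text))
print(data.dtype.names)
print(units)
```

Because the header lines are consumed with next() before genfromtxt is called, no extra keyword is needed for this particular layout.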

Warren





Re: [Numpy-discussion] Add `nrows` to `genfromtxt`

2014-11-02 Thread Alexander Belopolsky
On Sun, Nov 2, 2014 at 2:32 PM, Warren Weckesser warren.weckes...@gmail.com
 wrote:


 Still, the case of dtype=None, name=None is problematic.   Suppose I want
 genfromtxt()  to detect the column names from the 1-st row and data types
 from the 3-rd.  How would you do that?



 This may sound like a cop out, but at some point, I stop trying to make
 genfromtxt() handle every possible case, and instead I would write a custom
 header reader to handle this.


In the abstract, I would agree with you.  It is often the case that 2-3
lines of clear Python code is better than a terse function call with half a
dozen non-obvious options.  Specifically, I would be against the proposed
slice_rows because it is either equivalent to  genfromtxt(islice(..), ..)
or hard to specify.

On the other hand, skip_rows is different for two reasons:

1. It is not a new option.  It is currently a deprecated alias to
skip_header, so a change is expected - either removal or redefinition.
2. The intended use-case - inferring column names and type information from
a file where data is separated from the column names is hard to code
explicitly.  (Try it!)


Re: [Numpy-discussion] Add `nrows` to `genfromtxt`

2014-11-02 Thread Warren Weckesser
On 11/2/14, Alexander Belopolsky ndar...@mac.com wrote:
 On Sun, Nov 2, 2014 at 2:32 PM, Warren Weckesser
 warren.weckes...@gmail.com
 wrote:


 Still, the case of dtype=None, name=None is problematic.   Suppose I
 want
 genfromtxt()  to detect the column names from the 1-st row and data
 types
 from the 3-rd.  How would you do that?



 This may sound like a cop out, but at some point, I stop trying to make
 genfromtxt() handle every possible case, and instead I would write a
 custom
 header reader to handle this.


 In the abstract, I would agree with you.  It is often the case that 2-3
 lines of clear Python code is better than a terse function call with half a
 dozen non-obvious options.  Specifically, I would be against the proposed
 slice_rows because it is either equivalent to  genfromtxt(islice(..), ..)
 or hard to specify.


I don't have much more to add to the API discussion at the moment, but
I want to make sure one aspect is clear. (Sorry for the noise if the
following is obvious.)

In an earlier email, I gave my interpretation of the semantics of
`slice_rows` (and `max_rows`), which is that `genfromtxt(f, ...,
slice_rows=arg)` produces the same result as `genfromtxt(f,
...)[arg]`. (The difference is that it only consumes items from the
input iterator f as required by `arg`).  This isn't the same as
`genfromtxt(islice(f, slice args), ...)`, because `genfromtxt` skips
comments and blank lines.  (It also skips invalid lines if the
argument `invalid_raise=False` is used.)  So if the input file was

-
 1  10
# A comment.
 2  20

 3  30
 4  40
 5  50
-

Then `genfromtxt(f, dtype=int, slice_rows=slice(4))` would produce
`array([[1, 10], [2, 20], [3, 30], [4, 40]])`, while
`genfromtxt(islice(f, 4), dtype=int)` would produce `array([[1, 10],
[2, 20]])`.

That's my interpretation of how `max_rows` or `slice_rows` should
work.  If that is not what other folks expect, then that should also
be part of the discussion.
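
The difference can be checked with a short script. This is a sketch that uses `max_rows` (the keyword from the PR) in place of the proposed `slice_rows`, and an in-memory StringIO in place of a real file; both substitutions are assumptions made so the example is self-contained.

```python
import io
from itertools import islice

import numpy as np

# The sample file from the message, held in memory.
text = """ 1  10
# A comment.
 2  20

 3  30
 4  40
 5  50
"""

# max_rows counts rows of data: the comment and the blank line are skipped.
by_rows = np.genfromtxt(io.StringIO(text), dtype=int, max_rows=4)

# islice counts raw lines: the comment and the blank line use up two of the four.
by_lines = np.genfromtxt(islice(io.StringIO(text), 4), dtype=int)

print(by_rows)
print(by_lines)
```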

Warren




 On the other hand, skip_rows is different for two reasons:

 1. It is not a new option.  It is currently a deprecated alias to
 skip_header, so a change is expected - either removal or redefinition.
 2. The intended use-case - inferring column names and type information from
 a file where data is separated from the column names is hard to code
 explicitly.  (Try it!)



Re: [Numpy-discussion] Add `nrows` to `genfromtxt`

2014-11-01 Thread Warren Weckesser
On 9/24/14, Alan G Isaac alan.is...@gmail.com wrote:
 On 9/24/2014 2:52 PM, Jaime Fernández del Río wrote:
 There is a PR in github that adds a new keyword to the genfromtxt
 function, to limit the number of rows that actually get read in:
 https://github.com/numpy/numpy/pull/5103

 Sorry to come late to this party, but it seems to me that
 more versatile than an `nrows` keyword for the number of rows
 would be a rows keyword for a slice argument.

 fwiw,
 Alan Isaac


I've continued the PR for the addition of the `nrows` (now
`max_rows`) argument to `genfromtxt` here:
https://github.com/numpy/numpy/pull/5253

Alan's suggestion to use a slice is interesting, but I'd like to
see a more concrete proposal for the API.  For example, how does
it interact with `skip_header` and `skip_footer`?  How would one
use it to read a file in batches?

The following are a couple use-cases for `max_rows` (originally
added as comments at https://github.com/numpy/numpy/pull/5103):


(1) Read a file in batches:

Suppose the file a.csv contains:

 0 10
 1 11
 2 12
 3 13
 4 14
 5 15
 6 16
 7 17
 8 18
 9 19

With `max_rows`, the file can be read in batches of, say, 4:

In [31]: f = open('a.csv', 'r')

In [32]: genfromtxt(f, dtype=None, max_rows=4)
Out[32]:
array([[ 0, 10],
       [ 1, 11],
       [ 2, 12],
       [ 3, 13]])

In [33]: genfromtxt(f, dtype=None, max_rows=4)
Out[33]:
array([[ 4, 14],
       [ 5, 15],
       [ 6, 16],
       [ 7, 17]])

In [34]: genfromtxt(f, dtype=None, max_rows=4)
Out[34]:
array([[ 8, 18],
       [ 9, 19]])


(2) Multiple arrays in a single file:

I've seen a file format of the form

3 5
1.0 1.5 2.1 2.5 4.8
3.5 1.0 8.7 6.0 2.0
4.2 0.7 4.4 5.3 2.0
2 3
89.1 66.3 42.1
12.3 19.0 56.6

The file contains multiple arrays. Each array is
preceded by a line containing the number of rows
and columns in that array. The `max_rows` argument
would make it easy to read this file with genfromtxt:

In [7]: f = open('b.dat', 'r')

In [8]: nrows, ncols = genfromtxt(f, dtype=None, max_rows=1)

In [9]: A = genfromtxt(f, max_rows=nrows)

In [10]: nrows, ncols = genfromtxt(f, dtype=None, max_rows=1)

In [11]: B = genfromtxt(f, max_rows=nrows)

In [12]: A
Out[12]:
array([[ 1. ,  1.5,  2.1,  2.5,  4.8],
   [ 3.5,  1. ,  8.7,  6. ,  2. ],
   [ 4.2,  0.7,  4.4,  5.3,  2. ]])

In [13]: B
Out[13]:
array([[ 89.1,  66.3,  42.1],
   [ 12.3,  19. ,  56.6]])
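
The same idea can be wrapped in a loop that reads every array in the file. This is only a sketch: the helper name `read_arrays` is hypothetical, the shape line is parsed with plain string splitting rather than a second genfromtxt call, and the file is replaced by an in-memory StringIO.

```python
import io

import numpy as np


def read_arrays(f):
    """Read all '<nrows> <ncols>'-prefixed arrays from an open text file.

    Hypothetical helper for illustration only.
    """
    arrays = []
    while True:
        header = f.readline()
        if not header:            # empty string means end of file
            break
        if not header.strip():    # tolerate blank lines between blocks
            continue
        nrows, ncols = (int(tok) for tok in header.split())
        # max_rows stops genfromtxt after this block, leaving the rest unread.
        block = np.genfromtxt(f, max_rows=nrows)
        # reshape restores the 2-d shape for single-row blocks and raises
        # if the data disagrees with the declared shape.
        arrays.append(block.reshape(nrows, ncols))
    return arrays


text = """3 5
1.0 1.5 2.1 2.5 4.8
3.5 1.0 8.7 6.0 2.0
4.2 0.7 4.4 5.3 2.0
2 3
89.1 66.3 42.1
12.3 19.0 56.6
"""
A, B = read_arrays(io.StringIO(text))
print(A.shape, B.shape)
```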


Warren




Re: [Numpy-discussion] Add `nrows` to `genfromtxt`

2014-11-01 Thread Alan G Isaac
On 11/1/2014 10:31 AM, Warren Weckesser wrote:
 Alan's suggestion to use a slice is interesting, but I'd like to
 see a more concrete proposal for the API.  For example, how does
 it interact with `skip_header` and `skip_footer`?  How would one
 use it to read a file in batches?


I'm probably just not understanding the question, but the initial
answer I will give is, just like the proposal for `max_rows`.

That is, skip_header and skip_footer are honored, and the remainder
of the file is sliced. For the equivalent of say `max_rows=500`,
one would say `slice_rows=slice(500)`.

Perhaps you could provide an example illustrating the issues this
reply overlooks.

Cheers,
Alan



Re: [Numpy-discussion] Add `nrows` to `genfromtxt`

2014-11-01 Thread Warren Weckesser
On Sat, Nov 1, 2014 at 10:54 AM, Alan G Isaac alan.is...@gmail.com wrote:

 On 11/1/2014 10:31 AM, Warren Weckesser wrote:
  Alan's suggestion to use a slice is interesting, but I'd like to
  see a more concrete proposal for the API.  For example, how does
  it interact with `skip_header` and `skip_footer`?  How would one
  use it to read a file in batches?


 I'm probably just not understanding the question, but the initial
 answer I will give is, just like the proposal for `max_rows`.

 That is, skip_header and skip_footer are honored, and the remainder
 of the file is sliced. For the equivalent of say `max_rows=500`,
 one would say `slice_rows=slice(500)`.

 Perhaps you could provide an example illustrating the issues this
 reply overlooks.

 Cheers,
 Alan



OK, so `slice_rows=slice(n)` should behave the same as `max_rows=n`.
Here's my take on how `slice_rows` could be handled.

I intended the result of `genfromtxt(..., max_rows=n)` to produce the same
array as produced by `genfromtxt(...)[:n]`.  So a reasonable way to define
the behavior of `slice_rows` is that `genfromtxt(..., slice_rows=arg)`
returns the same array as `genfromtxt(...)[arg]`.   With that
specification, it is natural for `slice_rows` to accept any object that is
valid for indexing, e.g. `slice_rows=[0,2,3]` or `slice_rows=10`. (But that
wouldn't necessarily have to be implemented.)

The two differences between `genfromtxt(..., slice_rows=arg)` and
`genfromtxt(...)[arg]` are (1) the former is more efficient--it can simply
ignore the rows that won't be part of the final result; and (2) the former
doesn't consume the input iterator beyond what is requested by `arg`.  For
example, `slice_rows=slice(2,10,2)` would consume 10 items from the input (or
fewer, if there aren't 10 items in the input). Note that the actual indices
for that slice are [2, 4, 6, 8]; even though index 9 is not included in the
result, the corresponding item is consumed from the input iterator.
(That's how I would interpret it, anyway.)
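
To make the consumption rule concrete, here is a rough emulation built on `max_rows`. The function `genfromtxt_slice_rows` is hypothetical (NumPy has no `slice_rows` keyword), and the sketch only covers slices with a positive `stop` on input that yields at least two rows and two columns.

```python
import io

import numpy as np


def genfromtxt_slice_rows(f, rows, **kwargs):
    """Emulate the proposed slice_rows for a slice with a positive stop.

    Hypothetical helper: it reads `rows.stop` rows with max_rows and then
    applies the slice, so `f` is left just past the last row consumed.
    """
    if rows.stop is None or rows.stop < 1:
        raise ValueError("this sketch needs a positive stop")
    head = np.genfromtxt(f, max_rows=rows.stop, **kwargs)
    return head[rows]


# Twelve rows of made-up data: "0 0", "1 10", ..., "11 110".
f = io.StringIO("".join(f"{i} {i * 10}\n" for i in range(12)))
picked = genfromtxt_slice_rows(f, slice(2, 10, 2), dtype=int)
print(picked[:, 0])       # row indices 2, 4, 6, 8
leftover = next(f)        # the first line that was not consumed
print(leftover.split())
```

Unlike a real implementation, this version still parses the rows that the slice discards, so it shows only the second difference described above (how much of the iterator is consumed), not the efficiency gain.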

Because the input argument to `genfromtxt` can be an arbitrary iterator,
the use of `slice_rows=slice(n)` is not compatible with the use of
`skip_footer=m`.  Handling `skip_footer=m` requires looking ahead in the
iterator to see if the end of the input is within `m` items, but in
general, looking ahead is not possible without consuming the items. (The
`max_rows` argument has the same problem.  In the current PR, a ValueError
is raised if both `skip_footer` and `max_rows` are given.)
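
A quick check of that restriction, assuming a NumPy release that includes `max_rows` and keeps the behavior described for the PR:

```python
import io

import numpy as np

text = "1 2\n3 4\n5 6\n"

# Combining max_rows with skip_footer is rejected up front.
try:
    np.genfromtxt(io.StringIO(text), max_rows=2, skip_footer=1)
except ValueError as exc:
    error = exc
else:
    error = None

print(type(error).__name__, error)
```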

Related to this is how to handle `slice_rows=slice(-3)`.   Either this is
not allowed (for the same reason that `slice_rows=slice(n), skip_footer=m`
is disallowed), or it results in the entire iterator being consumed (and it
is explained in the docstring that this is the effect of a negative `stop`
value in a slice).

Is there wider interest in such an argument to `genfromtxt`?  For my
use-cases, `max_rows` is sufficient.  I can't recall ever needing the full
generality of a slice for pulling apart a text file.  Does anyone have
compelling use-cases that are not handled by `max_rows`?

Warren







Re: [Numpy-discussion] Add `nrows` to `genfromtxt`

2014-11-01 Thread Alexander Belopolsky
On Sat, Nov 1, 2014 at 3:15 PM, Warren Weckesser warren.weckes...@gmail.com
 wrote:

 Is there wider interest in such an argument to `genfromtxt`?  For my
 use-cases, `max_rows` is sufficient.  I can't recall ever needing the full
 generality of a slice for pulling apart a text file.  Does anyone have
 compelling use-cases that are not handled by `max_rows`?


It is occasionally useful to be able to skip rows after the header.  Maybe
we should de-deprecate skip_rows and give it the meaning different from
skip_header in case of names = None?  For example,

genfromtxt(fname,  skip_header= 3, skip_rows = 1, max_rows = 100)

would mean skip 3 lines, read column names from the 4-th, skip 5-th,
process up to 100 more lines.  This may be useful if the file contains some
meta-data about the column below the header line.  For example, it is
common to put units of measurement below the column names.

Another application could be processing a large text file in chunks, which
again can be covered nicely by  skip_rows/max_rows.

I cannot think of a situation where I would need more generality such as
reading every 3rd row or rows with the given numbers.  Such processing is
normally done after the text data is loaded into an array.


Re: [Numpy-discussion] Add `nrows` to `genfromtxt`

2014-11-01 Thread Alan G Isaac
On 11/1/2014 4:41 PM, Alexander Belopolsky wrote:
 I cannot think of a situation where I would need more generality such as 
 reading every 3rd row or rows with the given numbers.  Such processing is
 normally done after the text data is loaded into an array.


I have done this as cheaper than random selection for a quick and dirty
look at large data sets.   Setting maxrows can be very different if the
data has been stored in some structured manner.
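
For reference, this kind of strided sampling can already be approximated by slicing the line iterator before it reaches genfromtxt. A sketch with made-up data; note that islice counts raw lines, so comment or blank lines in the file would shift which rows are picked.

```python
import io
from itertools import islice

import numpy as np

# Ten rows of made-up data: "0 0", "1 1", "2 4", ...
text = "".join(f"{i} {i * i}\n" for i in range(10))

# Pass every third line (0, 3, 6, 9) to genfromtxt; the other lines are
# read from the file but never parsed.
sample = np.genfromtxt(islice(io.StringIO(text), 0, None, 3), dtype=int)
print(sample)
```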

I suppose my view is something like this.  We are considering adding a keyword.
If we can get greater functionality at about the same cost, why not?
In that case, it is not really useful to speculate about use cases.
If the costs are substantially greater, then that should be stated.
Cost is a good reason not to do something.

fwiw,
Alan Isaac



Re: [Numpy-discussion] Add `nrows` to `genfromtxt`

2014-11-01 Thread Alan G Isaac
On 11/1/2014 3:15 PM, Warren Weckesser wrote:
 I intended the result of `genfromtxt(..., max_rows=n)` to produce the same 
 array as produced by `genfromtxt(...)[:n]`.

I find that counterintuitive.
I would first honor skip_header.
Cheers,
Alan



Re: [Numpy-discussion] Add `nrows` to `genfromtxt`

2014-11-01 Thread Warren Weckesser
On 11/1/14, Alan G Isaac alan.is...@gmail.com wrote:
 On 11/1/2014 4:41 PM, Alexander Belopolsky wrote:
 I cannot think of a situation where I would need more generality such as
 reading every 3rd row or rows with the given numbers.  Such processing is
 normally done after the text data is loaded into an array.


 I have done this as cheaper than random selection for a quick and dirty
 look at large data sets.   Setting maxrows can be very different if the
 data has been stored in some structured manner.

 I suppose my view is something like this.  We are considering adding a
 keyword.
 If we can get greater functionality at about the same cost, why not?
 In that case, it is not really useful to speculate about use cases.
 If the costs are substantially greater, then that should be stated.
 Cost is a good reason not to do something.



`slice_rows` is a generalization of `max_rows`.  It will probably take
a bit more code to implement, and it will require more tests and more
documentation.  So the cost isn't really the same.  But if it solves
real problems for users, the cost may be worth it.

Warren


 fwiw,
 Alan Isaac



Re: [Numpy-discussion] Add `nrows` to `genfromtxt`

2014-11-01 Thread Warren Weckesser
On 11/1/14, Alan G Isaac alan.is...@gmail.com wrote:
 On 11/1/2014 3:15 PM, Warren Weckesser wrote:
 I intended the result of `genfromtxt(..., max_rows=n)` to produce the same
 array as produced by `genfromtxt(...)[:n]`.

 I find that counterintuitive.
 I would first honor skip_header.


Sorry for the terse explanation.  I meant for `...` to indicate any
other arguments, including skip_header.

Warren


 Cheers,
 Alan



Re: [Numpy-discussion] Add `nrows` to `genfromtxt`

2014-09-24 Thread Alan G Isaac
On 9/24/2014 2:52 PM, Jaime Fernández del Río wrote:
 There is a PR in github that adds a new keyword to the genfromtxt function, 
 to limit the number of rows that actually get read in:
 https://github.com/numpy/numpy/pull/5103

Sorry to come late to this party, but it seems to me that
more versatile than an `nrows` keyword for the number of rows
would be a rows keyword for a slice argument.

fwiw,
Alan Isaac
