Re: [Numpy-discussion] Proposed change in genfromtxt(..., comments='#', names=True) behaviour

2012-07-18 Thread Travis Vaught

On Jul 17, 2012, at 10:47 AM, Nathaniel Smith wrote:

 
 Not that there are likely to be people using skip_header=True as an
 alias for skip_header=1, but if they were it would currently work.

confession
I write messy code like that all the time.
/confession

Best,

Travis



___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Proposed change in genfromtxt(..., comments='#', names=True) behaviour

2012-07-17 Thread Nathaniel Smith
On Mon, Jul 16, 2012 at 10:39 PM, Pierre GM pgmdevl...@gmail.com wrote:
 I don't really have any deep issue with `skip_header=True`, besides not
 really liking having an argument whose type can vary. But that's only a
 matter of personal taste. And yes, we could always check the type…

I guess I still have a small preference for skip_header=comments
over skip_header=True, since the latter is more opaque for no purpose.
Also it makes me slightly antsy since skip_header is normally an
integer, and True is, in fact, just an integer with a special
__repr__:

In [2]: isinstance(True, int)
Out[2]: True

In [3]: True + True
Out[3]: 2

Not that there are likely to be people using skip_header=True as an
alias for skip_header=1, but if they were it would currently work.

 Pierre, for a line # A B C #1 #2 #3 the user gets six columns 'A',
 'B', 'C', '#1', '#2', '#3', which is messy but what they deserve for
 using such messy input :)

 OK, we're on the same page.


 Also, if you look closely, the use of index()
 you propose is equivalent to my current code, just more verbose.

 I'm not convinced by line 1353: unless you change it to
 asbyte(comment).join(first_line.split(comments)[1:])
 you gonna lose the '#', aren't you ? With the 'index' way, we just pick the
 first one, as intended. But it's late and multitasking isn't really working
 for me now.

I think you guys are looking for .split(comments, 1).

-n
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Proposed change in genfromtxt(..., comments='#', names=True) behaviour

2012-07-17 Thread Pierre GM
On Tuesday, July 17, 2012 at 17:47 , Nathaniel Smith wrote:

I guess I still have a small preference for skip_header=comments

over skip_header=True, since the latter is more opaque for no purpose.

Also it makes me slightly antsy since skip_header is normally an

integer, and True is, in fact, just an integer with a special

__repr__:


Nathaniel, that's basically my problem with `skip_header=True`. I still
prefer my -1 to your comments, but whatever, personal taste again.



I'm not convinced by line 1353: unless you change it to

asbyte(comment).join(first_line.split(comments)[1:])

you gonna lose the '#', aren't you ? With the 'index' way, we just pick the

first one, as intended. But it's late and multitasking isn't really working

for me now.

I think you guys are looking for .split(comments, 1).

-n


Tadah! Thanks a lot!
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Proposed change in genfromtxt(..., comments='#', names=True) behaviour

2012-07-16 Thread Pierre GM
Hello,
I'm siding w/ Tom, Nathaniel and Travis. I don't think the change as it is
is advisable. It's a regression, and breaking=bad.
Now, I can understand your frustration, so, what about a trade-off? The
first line w/ a comment after the first 'skip_header' ones should be parsed
for column titles (and we call it 'first_commented_line'). We split it
along the comment character, say, #. If there's some non-space character
before the #, we keep this part of 'first_commented_line' as titles: that
should work for your case. If the first non-space character was #, then
what comes after are the titles (that's Tom's case and the current default).
I'm not looking forward to introducing yet another keyword, genfromtxt is
enough of a mess as it is (unless we add a 'need_coffee' one).
What y'all think?
On Jul 13, 2012 7:29 PM, Paul Natsuo Kishimoto m...@paul.kishimoto.name
wrote:

 On Fri, 2012-07-13 at 12:13 -0400, Tom Aldcroft wrote:
  On Fri, Jul 13, 2012 at 11:15 AM, Paul Natsuo Kishimoto
  m...@paul.kishimoto.name wrote:
   Hello everyone,
  
   I am a longtime NumPy user, and I just filed my first
 contribution to
   the code as pull request to fix what I felt was a bug in the behaviour
   of genfromtxt() https://github.com/numpy/numpy/pull/351
   It turns out this alters existing behaviour that some people may depend
   on, so I was encouraged to raise the issue on this list to see what the
   consensus was.
  
   This behaviour happens in the specific situation where:
 * Comments are used in the file (the default comment character is
   '#', which I'll use here), AND
 * The kwarg names=True is given. In this case, genfromtxt() is
   supposed to read an initial row containing the names of the
   columns and return an array with a structured dtype.
  
   Currently, these options work with a file like (Example #1):
  
   # gender age weight
   M   21 72.10
   F   35  58.33
   M   33  21.99
  
   …but NOT with a file like (Example #2):
  
   # here is a general file comment
   # it is spread over multiple lines
   gender age weight
   M   21 72.10
   F   35  58.33
   M   33  21.99
  
   …genfromtxt() believes the column names are 'here', 'is', 'a', etc.,
 and
   thinks all of the columns are strings because 'gender', 'age' and
   'weight' are not numbers.
  
   This is because genfromtxt() (after skipping a number of lines
 as
   specified in the optional kwarg skip_header) will use the *first* line
   it encounters to produce column names. If that line contains a comment
   character, genfromtxt() discards everything *up to and including* the
   comment character, and tries to use the content *after* the comment
   character as headers (Example 3):
  
   gender age weight # wrong column names
   M   21  72.10
   F   35  58.33
   M   33  21.99
  
   …the resulting column names are 'wrong', 'column' and 'names'.
  
   My proposed change was that, if the first (or any subsequent) line
   contains a comment character, it should be treated as an *actual
   comment*, and discarded along with anything that follows it on the
 line.
  
   In Example 2, the result would be that the first two lines
 appear empty
   (no text before '#'), and the third line (gender age weight) is used
   for column names.
  
   In Example 3, the result would be that gender age weight is
 used for
   column names while # wrong column names is ignored.
  
   BUT!
  
   In Example 1, the result would be that the first line appears
 empty,
   and M   21  72.10 are used for column names.
  
   In other words, this change would do away with the previous behaviour
   where the very first commented line was (magically?) treated not as a
   comment but instead as column headers. This might break some existing
   code. On the positive side, it would allow the user to be more liberal
   with the format of input files (Example 4):
  
   # here is a general file comment
   # the columns in this table are
   gender age weight # here is a comment on the header line
   # following this line are the data
   M   21  72.10
   F   35  58.33 # here is a comment on a data line
   M   33  21.99
  
   I feel that this is a better/more flexible behaviour for genfromtxt(),
   but—as stated—I am interested in your thoughts.
  
   Cheers,
   --
   Paul Natsuo Kishimoto
  
   SM candidate, Technology  Policy Program (2012)
   Research assistant,  http://globalchange.mit.edu
   https://paul.kishimoto.name  +1 617 302 6105
  
   ___
   NumPy-Discussion mailing list
   NumPy-Discussion@scipy.org
   http://mail.scipy.org/mailman/listinfo/numpy-discussion
  
 
  Hi Paul,
 
  At least in astronomy tabular files with the column definitions in the
  

Re: [Numpy-discussion] Proposed change in genfromtxt(..., comments='#', names=True) behaviour

2012-07-16 Thread Travis Oliphant

On Jul 16, 2012, at 1:52 AM, Pierre GM wrote:

 Hello,
 I'm siding w/ Tom, Nathaniel and Travis. I don't think the change as it is is 
 advisable. It's a regression, and breaking=bad.
 Now, I can understand your frustration, so, what about a trade-off? The first 
 line w/ a comment after the first 'skip_header' ones should be parsed for 
 column titles (and we call it 'first_commented_line'). We split it along the 
 comment character, say, #. If there's some non-space character before the #, 
 we keep this part of 'first_commented_line' as titles: that should work for 
 your case. If the first non-space character was #, then what comes after are 
 the titles (that's Tom's case and the current default).
 I'm not looking forward to introducing yet another keyword, genfromtxt is 
 enough of a mess as it is (unless we add a 'need_coffee' one).
 What y'all think?
 

That seems like an acceptable proposal --- it is consistent with current 
behavior but also satisfies the use-case (without another keyword which is a 
bonus). 

So, 

+1 from me.

-Travis





 On Jul 13, 2012 7:29 PM, Paul Natsuo Kishimoto m...@paul.kishimoto.name 
 wrote:
 On Fri, 2012-07-13 at 12:13 -0400, Tom Aldcroft wrote:
  On Fri, Jul 13, 2012 at 11:15 AM, Paul Natsuo Kishimoto
  m...@paul.kishimoto.name wrote:
   Hello everyone,
  
   I am a longtime NumPy user, and I just filed my first 
   contribution to
   the code as pull request to fix what I felt was a bug in the behaviour
   of genfromtxt() https://github.com/numpy/numpy/pull/351
   It turns out this alters existing behaviour that some people may depend
   on, so I was encouraged to raise the issue on this list to see what the
   consensus was.
  
   This behaviour happens in the specific situation where:
 * Comments are used in the file (the default comment character is
   '#', which I'll use here), AND
 * The kwarg names=True is given. In this case, genfromtxt() is
   supposed to read an initial row containing the names of the
   columns and return an array with a structured dtype.
  
   Currently, these options work with a file like (Example #1):
  
   # gender age weight
   M   21 72.10
   F   35  58.33
   M   33  21.99
  
   …but NOT with a file like (Example #2):
  
   # here is a general file comment
   # it is spread over multiple lines
   gender age weight
   M   21 72.10
   F   35  58.33
   M   33  21.99
  
   …genfromtxt() believes the column names are 'here', 'is', 'a', etc., and
   thinks all of the columns are strings because 'gender', 'age' and
   'weight' are not numbers.
  
   This is because genfromtxt() (after skipping a number of lines as
   specified in the optional kwarg skip_header) will use the *first* line
   it encounters to produce column names. If that line contains a comment
   character, genfromtxt() discards everything *up to and including* the
   comment character, and tries to use the content *after* the comment
   character as headers (Example 3):
  
   gender age weight # wrong column names
   M   21  72.10
   F   35  58.33
   M   33  21.99
  
   …the resulting column names are 'wrong', 'column' and 'names'.
  
   My proposed change was that, if the first (or any subsequent) line
   contains a comment character, it should be treated as an *actual
   comment*, and discarded along with anything that follows it on the line.
  
   In Example 2, the result would be that the first two lines appear 
   empty
   (no text before '#'), and the third line (gender age weight) is used
   for column names.
  
   In Example 3, the result would be that gender age weight is 
   used for
   column names while # wrong column names is ignored.
  
   BUT!
  
   In Example 1, the result would be that the first line appears 
   empty,
   and M   21  72.10 are used for column names.
  
   In other words, this change would do away with the previous behaviour
   where the very first commented line was (magically?) treated not as a
   comment but instead as column headers. This might break some existing
   code. On the positive side, it would allow the user to be more liberal
   with the format of input files (Example 4):
  
   # here is a general file comment
   # the columns in this table are
   gender age weight # here is a comment on the header line
   # following this line are the data
   M   21  72.10
   F   35  58.33 # here is a comment on a data line
   M   33  21.99
  
   I feel that this is a better/more flexible behaviour for genfromtxt(),
   but—as stated—I am interested in your thoughts.
  
   Cheers,
   --
   Paul Natsuo Kishimoto
  
   SM candidate, Technology  Policy Program (2012)
   Research assistant,  http://globalchange.mit.edu
   https://paul.kishimoto.name  +1 617 

Re: [Numpy-discussion] Proposed change in genfromtxt(..., comments='#', names=True) behaviour

2012-07-16 Thread Paul Natsuo Kishimoto
Hi Pierre,

On Mon, 2012-07-16 at 01:54 -0500, Travis Oliphant wrote:
 On Jul 16, 2012, at 1:52 AM, Pierre GM wrote:
 
  Hello,
  I'm siding w/ Tom, Nathaniel and Travis. I don't think the change as
  it is is advisable. It's a regression, and breaking=bad.
  Now, I can understand your frustration, so, what about a trade-off?
  The first line w/ a comment after the first 'skip_header' ones
  should be parsed for column titles (and we call it
  'first_commented_line'). We split it along the comment character,
  say, #. If there's some non-space character before the #, we keep
  this part of 'first_commented_line' as titles: that should work for
  your case. If the first non-space character was #, then what comes
  after are the titles (that's Tom's case and the current default).
  I'm not looking forward to introducing yet another keyword,
  genfromtxt is enough of a mess as it is (unless we add a
  'need_coffee' one).
  What y'all think?

 That seems like an acceptable proposal --- it is consistent with
 current behavior but also satisfies the use-case (without another
 keyword which is a bonus). 

 So, 

 +1 from me.

 -Travis
 
Thanks for jumping in, and for offering a compromise solution. I agree
that genfromtxt() has too many kwargs—it took me several minutes of
reading the docs to realize why it wasn't behaving as expected!

To be ultra clear (since I want to code this), you are suggesting that
'first_commented_line' be a *new* accepted value for the kwarg 'names',
to invoke the behaviour you suggest?

---

If this IS what you mean, I'd counter-propose something in the same
spirit, but a bit simpler…we let the kwarg 'skip_header' take some
additional value, say int(0), int(-1), str('auto'), or True. In this
case, instead of skipping a fixed number of lines, it will skip any
number of consecutive empty OR commented lines; THEN apply the behaviour
you describe.

The semantics of this are more intuitive, because this is what I am
really after: to *skip* a commented *header* of arbitrary length. So my
four examples below could be parsed with:

 1. genfromtxt(..., names=True)
 2. genfromtxt(..., names=True, skip_header=True)
 3. genfromtxt(..., names=True)
 4. genfromtxt(..., names=True, skip_header=True)

…crucially #1 avoids the regression.

Does this seem good to everyone?

---

But if this is NOT what you mean, then what you say does not actually
work with the simple use-case of my Example #2 below. The first
commented line is # here is a... with # as the first non-space
character, so the part after becomes the names 'here', 'is', 'a' etc.

In short, the code can't resolve the ambiguity without some extra
information from the user.
 
  On Jul 13, 2012 7:29 PM, Paul Natsuo Kishimoto
  m...@paul.kishimoto.name wrote:
  On Fri, 2012-07-13 at 12:13 -0400, Tom Aldcroft wrote:
   On Fri, Jul 13, 2012 at 11:15 AM, Paul Natsuo Kishimoto
   m...@paul.kishimoto.name wrote:
Hello everyone,
   
I am a longtime NumPy user, and I just filed my
  first contribution to
the code as pull request to fix what I felt was a bug in
  the behaviour
of genfromtxt() https://github.com/numpy/numpy/pull/351
It turns out this alters existing behaviour that some
  people may depend
on, so I was encouraged to raise the issue on this list
  to see what the
consensus was.
   
This behaviour happens in the specific situation where:
  * Comments are used in the file (the default
  comment character is
'#', which I'll use here), AND
  * The kwarg names=True is given. In this case,
  genfromtxt() is
supposed to read an initial row containing the
  names of the
columns and return an array with a structured
  dtype.
   
Currently, these options work with a file like (Example
  #1):
   
# gender age weight
M   21 72.10
F   35  58.33
M   33  21.99
   
…but NOT with a file like (Example #2):
   
# here is a general file comment
# it is spread over multiple lines
gender age weight
M   21 72.10
F   35  58.33
M   33  21.99
   
…genfromtxt() believes the column names are 'here',
  'is', 'a', etc., and
thinks all of the columns are strings because 'gender',
  'age' and
'weight' are not numbers.
   
This is because genfromtxt() (after skipping a
  number of lines as
specified in the optional kwarg skip_header) will use
  the *first* line

Re: [Numpy-discussion] Proposed change in genfromtxt(..., comments='#', names=True) behaviour

2012-07-16 Thread Pierre GM
To be ultra clear (since I want to code this), you are suggesting that
'first_commented_line' be a *new* accepted value for the kwarg 'names', to
invoke the behaviour you suggest?

 Nope, I was just referring to some hypothetical variable name. I meant
that:

first_values = None
try:
while not first_values:
first_line = fhd.next()
if names is True:
parsed = [m for m in first_line.split(comments) if m.strip()]
if parsed:
first_value = split_line(parsed[0])
else:
...

(it's not tested, I'm writing it as it comes. And I didn't even use the
`first_commented_line` name, sorry)

If this IS what you mean, I'd counter-propose something in the same spirit,
but a bit simpler…we let the kwarg 'skip_header' take some additional
value, say int(0), int(-1), str('auto'), or True.


 In this case, instead of skipping a fixed number of lines, it will skip
any number of consecutive empty OR commented lines;


I really like the idea of having `skip_header=-1` skip all the empty or
commented lines (that is, lines whose first non-space character is the
`comments` character). That'd be rather convenient.



The semantics of this are more intuitive, because this is what I am
really after: to *skip* a commented *header* of arbitrary length. So my
four examples below could be parsed with:

1. genfromtxt(..., names=True)
2. genfromtxt(..., names=True, skip_header=True)
3. genfromtxt(..., names=True)
4. genfromtxt(..., names=True, skip_header=True)

…crucially #1 avoids the regression.

Does this seem good to everyone?


Sounds good w/ `skip_header=-1`

But if this is NOT what you mean, then what you say does not actually work
with the simple use-case of my Example #2 below. The first commented line
is # here is a... with # as the first non-space character, so the part
after becomes the names 'here', 'is', 'a' etc.


In that case, you could always use `skip_header=2`

In short, the code can't resolve the ambiguity without some extra
information from the user.


It's always best not to let the code guess too much anyway...

Well, no regression, and you have a nice plan. I'm for it.
Anybody else?
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Proposed change in genfromtxt(..., comments='#', names=True) behaviour

2012-07-16 Thread Paul Natsuo Kishimoto
I've implemented this feature with skip_header=-1 as suggested by
Pierre, and in doing so removed the regression. TravisBot seems to like
it: https://github.com/numpy/numpy/pull/351

On Mon, 2012-07-16 at 16:12 +0200, Pierre GM wrote:
 To be ultra clear (since I want to code this), you are
 suggesting that
 'first_commented_line' be a *new* accepted value for the kwarg
 'names', to invoke the behaviour you suggest?
 
 
 
 Nope, I was just referring to some hypothetical variable name. I meant
 that:
 
 first_values = None
 try:
 while not first_values:
 first_line = fhd.next()
 if names is True:
 parsed = [m for m in first_line.split(comments) if
 m.strip()]
 if parsed:
 first_value = split_line(parsed[0])
 else:
 ...
 
 (it's not tested, I'm writing it as it comes. And I didn't even use
 the `first_commented_line` name, sorry)
 
 
 If this IS what you mean, I'd counter-propose something in the
 same spirit, but a bit simpler…we let the kwarg 'skip_header'
 take some additional value, say int(0), int(-1), str('auto'),
 or True.
 
 
 
 
 In this case, instead of skipping a fixed number of lines, it
 will skip any number of consecutive empty OR commented lines;
 
 
 
 
 I really like the idea of having `skip_header=-1` skip all the empty
 or commented lines (that is, lines whose first non-space character is
 the `comments` character). That'd be rather convenient.
 
  
 
 
 The semantics of this are more intuitive, because this is what
 I am
 really after: to *skip* a commented *header* of arbitrary
 length. So my four examples below could be parsed with:
 
 1. genfromtxt(..., names=True)
 2. genfromtxt(..., names=True, skip_header=True)
 3. genfromtxt(..., names=True)
 4. genfromtxt(..., names=True, skip_header=True)
 
 …crucially #1 avoids the regression.
 
 
 Does this seem good to everyone?
 
 
 
 
 Sounds good w/ `skip_header=-1`
 
 
 But if this is NOT what you mean, then what you say does not
 actually work with the simple use-case of my Example #2 below.
 The first commented line is # here is a... with # as the
 first non-space character, so the part after becomes the names
 'here', 'is', 'a' etc.
 
 
 
 
 In that case, you could always use `skip_header=2`
 
 In short, the code can't resolve the ambiguity without some
 extra
 information from the user.
 
 
 
 
 It's always best not to let the code guess too much anyway...
 
 Well, no regression, and you have a nice plan. I'm for it.
 Anybody else?
 
 
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

-- 
Paul Natsuo Kishimoto

SM candidate, Technology  Policy Program (2012)
Research assistant,  http://globalchange.mit.edu
https://paul.kishimoto.name  +1 617 302 6105


signature.asc
Description: This is a digitally signed message part
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Proposed change in genfromtxt(..., comments='#', names=True) behaviour

2012-07-16 Thread Nathaniel Smith
On Mon, Jul 16, 2012 at 8:06 PM, Paul Natsuo Kishimoto
m...@paul.kishimoto.name wrote:
 I've implemented this feature with skip_header=-1 as suggested by
 Pierre, and in doing so removed the regression. TravisBot seems to like
 it: https://github.com/numpy/numpy/pull/351

Can we please not use weird magic values like this? This isn't C.
skip_header=comments would work just as well and be far more
self-explanatory...

-n
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Proposed change in genfromtxt(..., comments='#', names=True) behaviour

2012-07-16 Thread Tom Aldcroft
On Mon, Jul 16, 2012 at 3:06 PM, Paul Natsuo Kishimoto
m...@paul.kishimoto.name wrote:
 I've implemented this feature with skip_header=-1 as suggested by
 Pierre, and in doing so removed the regression. TravisBot seems to like
 it: https://github.com/numpy/numpy/pull/351

 On Mon, 2012-07-16 at 16:12 +0200, Pierre GM wrote:
 To be ultra clear (since I want to code this), you are
 suggesting that
 'first_commented_line' be a *new* accepted value for the kwarg
 'names', to invoke the behaviour you suggest?



 Nope, I was just referring to some hypothetical variable name. I meant
 that:

 first_values = None
 try:
 while not first_values:
 first_line = fhd.next()
 if names is True:
 parsed = [m for m in first_line.split(comments) if
 m.strip()]
 if parsed:
 first_value = split_line(parsed[0])
 else:
 ...

 (it's not tested, I'm writing it as it comes. And I didn't even use
 the `first_commented_line` name, sorry)


 If this IS what you mean, I'd counter-propose something in the
 same spirit, but a bit simpler…we let the kwarg 'skip_header'
 take some additional value, say int(0), int(-1), str('auto'),
 or True.




 In this case, instead of skipping a fixed number of lines, it
 will skip any number of consecutive empty OR commented lines;




 I really like the idea of having `skip_header=-1` skip all the empty
 or commented lines (that is, lines whose first non-space character is
 the `comments` character). That'd be rather convenient.




 The semantics of this are more intuitive, because this is what
 I am
 really after: to *skip* a commented *header* of arbitrary
 length. So my four examples below could be parsed with:

 1. genfromtxt(..., names=True)
 2. genfromtxt(..., names=True, skip_header=True)
 3. genfromtxt(..., names=True)
 4. genfromtxt(..., names=True, skip_header=True)

 …crucially #1 avoids the regression.


 Does this seem good to everyone?




 Sounds good w/ `skip_header=-1`


 But if this is NOT what you mean, then what you say does not
 actually work with the simple use-case of my Example #2 below.
 The first commented line is # here is a... with # as the
 first non-space character, so the part after becomes the names
 'here', 'is', 'a' etc.




 In that case, you could always use `skip_header=2`

 In short, the code can't resolve the ambiguity without some
 extra
 information from the user.




 It's always best not to let the code guess too much anyway...

 Well, no regression, and you have a nice plan. I'm for it.
 Anybody else?


 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

 --
 Paul Natsuo Kishimoto

 SM candidate, Technology  Policy Program (2012)
 Research assistant,  http://globalchange.mit.edu
 https://paul.kishimoto.name  +1 617 302 6105

 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


I think that the proposed solution is OK, but it does make it even
trickier for the average user to predict the behavior of genfromtxt()
for different situations.  Perhaps as part of this pull request Paul
should also update the documentation to include a section describing
this behavior and usage with examples 1 to 4.

- Tom
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Proposed change in genfromtxt(..., comments='#', names=True) behaviour

2012-07-16 Thread Pierre GM
Well, as `skip_header` is a number of lines, I don't really see anything 
particular magical about a `skip_header=-1`. Plus, range(-1) == [], while 
range(comments) raises a TypeError. And then you'd have to figure why the 
exception was raised.

-- 
Pierre GM


On Monday, July 16, 2012 at 21:56 , Nathaniel Smith wrote:

 On Mon, Jul 16, 2012 at 8:06 PM, Paul Natsuo Kishimoto
 m...@paul.kishimoto.name (mailto:m...@paul.kishimoto.name) wrote:
  I've implemented this feature with skip_header=-1 as suggested by
  Pierre, and in doing so removed the regression. TravisBot seems to like
  it: https://github.com/numpy/numpy/pull/351
  
 
 
 Can we please not use weird magic values like this? This isn't C.
 skip_header=comments would work just as well and be far more
 self-explanatory...
 
 -n
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org (mailto:NumPy-Discussion@scipy.org)
 http://mail.scipy.org/mailman/listinfo/numpy-discussion
 
 


___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Proposed change in genfromtxt(..., comments='#', names=True) behaviour

2012-07-16 Thread Nathaniel Smith
On Mon, Jul 16, 2012 at 9:01 PM, Pierre GM pgmdevl...@gmail.com wrote:
 Well, as `skip_header` is a number of lines, I don't really see anything
 particular magical about a `skip_header=-1`.

The logic here is:
- if names=True, then genfromtext expects the names to be given in the
first line, and they may or may not be commented out
- BUT, if skip_header=some special value, then any all-comment lines
will be skipped before looking for names, i.e. the names are not
expected to be commented out and comments are given their original
meaning again.

I have no idea how one could derive this understanding by looking at
skip_header=-1. Ah, that -1 is a number of lines, everyone knows that
skipping -1 lines is equivalent to toggling our expectation for
whether the column names will appear inside a comment? The API is
pretty convoluted at this point and I'm not convinced we wouldn't be
better off with adding a new argument like
names_in_comment=False/True, but skip_header=comments at least gives
the reader a fighting chance...

-n
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Proposed change in genfromtxt(..., comments='#', names=True) behaviour

2012-07-16 Thread Pierre GM
Tom, I agree that the documentation should be updated (both the doctoring and 
the relevant parts of the user manual), and specific unit-tests added. Paul, 
that's a direct nudge ;) (I'm sure you don't mind).

I was also considering the weird case
 first_line = # A B C #1 #2 #3
How many columns in that case ? 6 ? 3 ?  
So, instead of using a `split`, maybe we should just check  
 index=first_line.index(comment)
and take `first_line[:index]` (or `first_line[index+1:]` after depending on the 
case).

But then again, it's a weird case.  



--  
Pierre GM


On Monday, July 16, 2012 at 22:00 , Tom Aldcroft wrote:

 On Mon, Jul 16, 2012 at 3:06 PM, Paul Natsuo Kishimoto
 m...@paul.kishimoto.name (mailto:m...@paul.kishimoto.name) wrote:
  I've implemented this feature with skip_header=-1 as suggested by
  Pierre, and in doing so removed the regression. TravisBot seems to like
  it: https://github.com/numpy/numpy/pull/351
   
  On Mon, 2012-07-16 at 16:12 +0200, Pierre GM wrote:
   To be ultra clear (since I want to code this), you are
   suggesting that
   'first_commented_line' be a *new* accepted value for the kwarg
   'names', to invoke the behaviour you suggest?



   Nope, I was just referring to some hypothetical variable name. I meant
   that:

   first_values = None
   try:
   while not first_values:
   first_line = fhd.next()
   if names is True:
   parsed = [m for m in first_line.split(comments) if
   m.strip()]
   if parsed:
   first_value = split_line(parsed[0])
   else:
   ...

   (it's not tested, I'm writing it as it comes. And I didn't even use
   the `first_commented_line` name, sorry)


   If this IS what you mean, I'd counter-propose something in the
   same spirit, but a bit simpler…we let the kwarg 'skip_header'
   take some additional value, say int(0), int(-1), str('auto'),
   or True.




   In this case, instead of skipping a fixed number of lines, it
   will skip any number of consecutive empty OR commented lines;




   I really like the idea of having `skip_header=-1` skip all the empty
   or commented lines (that is, lines whose first non-space character is
   the `comments` character). That'd be rather convenient.




   The semantics of this are more intuitive, because this is what
   I am
   really after: to *skip* a commented *header* of arbitrary
   length. So my four examples below could be parsed with:

   1. genfromtxt(..., names=True)
   2. genfromtxt(..., names=True, skip_header=True)
   3. genfromtxt(..., names=True)
   4. genfromtxt(..., names=True, skip_header=True)

   …crucially #1 avoids the regression.


   Does this seem good to everyone?




   Sounds good w/ `skip_header=-1`


   But if this is NOT what you mean, then what you say does not
   actually work with the simple use-case of my Example #2 below.
   The first commented line is # here is a... with # as the
   first non-space character, so the part after becomes the names
   'here', 'is', 'a' etc.




   In that case, you could always use `skip_header=2`

   In short, the code can't resolve the ambiguity without some
   extra
   information from the user.




   It's always best not to let the code guess too much anyway...

   Well, no regression, and you have a nice plan. I'm for it.
   Anybody else?


   ___
   NumPy-Discussion mailing list
   NumPy-Discussion@scipy.org (mailto:NumPy-Discussion@scipy.org)
   http://mail.scipy.org/mailman/listinfo/numpy-discussion

   
   
  --
  Paul Natsuo Kishimoto
   
  SM candidate, Technology  Policy Program (2012)
  Research assistant, http://globalchange.mit.edu
  https://paul.kishimoto.name +1 617 302 6105
   
  ___
  NumPy-Discussion mailing list
  NumPy-Discussion@scipy.org (mailto:NumPy-Discussion@scipy.org)
  http://mail.scipy.org/mailman/listinfo/numpy-discussion
   
  
  
 I think that the proposed solution is OK, but it does make it even
 trickier for the average user to predict the behavior of genfromtxt()
 for different situations. Perhaps as part of this pull request Paul
 should also update the documentation to include a section describing
 this behavior and usage with examples 1 to 4.
  
 - Tom
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org (mailto:NumPy-Discussion@scipy.org)
 http://mail.scipy.org/mailman/listinfo/numpy-discussion
  
  


___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Proposed change in genfromtxt(..., comments='#', names=True) behaviour

2012-07-16 Thread Paul Natsuo Kishimoto
On Mon, 2012-07-16 at 21:14 +0100, Nathaniel Smith wrote:
 On Mon, Jul 16, 2012 at 9:01 PM, Pierre GM pgmdevl...@gmail.com wrote:
  Well, as `skip_header` is a number of lines, I don't really see anything
  particular magical about a `skip_header=-1`.
 
 The logic here is:
 - if names=True, then genfromtext expects the names to be given in the
 first line, and they may or may not be commented out
 - BUT, if skip_header=some special value, then any all-comment lines
 will be skipped before looking for names, i.e. the names are not
 expected to be commented out and comments are given their original
 meaning again.
 
 I have no idea how one could derive this understanding by looking at
 skip_header=-1. Ah, that -1 is a number of lines, everyone knows that
 skipping -1 lines is equivalent to toggling our expectation for
 whether the column names will appear inside a comment? The API is
 pretty convoluted at this point and I'm not convinced we wouldn't be
 better off with adding a new argument like
 names_in_comment=False/True, but skip_header=comments at least gives
 the reader a fighting chance...
 
 -n

Another option is to use skip_header=True. The internal monologue
accompanying this is Ah, do I want it to skip the header? Yes, true, I
do, with no thought needed on the number of lines involved. Pierre,
checking the type of the argument is trivial. Nathaniel, is this less
weird/magical? Anyone else?

I don't care what the value is and it's easy to change, but I'll await
some agreement on this point so I don't have to change it yet again in
response to more objections.

Pierre, for a line # A B C #1 #2 #3 the user gets six columns 'A',
'B', 'C', '#1', '#2', '#3', which is messy but what they deserve for
using such messy input :) Also, if you look closely, the use of index()
you propose is equivalent to my current code, just more verbose.

Tom, in my branch I rewrote the documentation for the `names` kwarg in
an attempt to be more clear, but I agree a documentation example of the
non-legacy use would go a long way. I've also realized I neglected to
update the documentation for `skip_header`. I'll do these once there is
consensus on the value to use.

If there was willingness to tolerate a backwards-incompatible change,
the resulting behaviour would be quite simple and intuitive overall, but
that's out of my hands. At the moment I'm just concerned with making an
intuitive behaviour *possible*.

Thanks everyone for your input,
-- 
Paul Natsuo Kishimoto

SM candidate, Technology  Policy Program (2012)
Research assistant,  http://globalchange.mit.edu
https://paul.kishimoto.name  +1 617 302 6105


signature.asc
Description: This is a digitally signed message part
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] Proposed change in genfromtxt(..., comments='#', names=True) behaviour

2012-07-13 Thread Paul Natsuo Kishimoto
Hello everyone,

I am a longtime NumPy user, and I just filed my first contribution to
the code as pull request to fix what I felt was a bug in the behaviour
of genfromtxt() https://github.com/numpy/numpy/pull/351
It turns out this alters existing behaviour that some people may depend
on, so I was encouraged to raise the issue on this list to see what the
consensus was.

This behaviour happens in the specific situation where:
  * Comments are used in the file (the default comment character is
'#', which I'll use here), AND
  * The kwarg names=True is given. In this case, genfromtxt() is
supposed to read an initial row containing the names of the
columns and return an array with a structured dtype.

Currently, these options work with a file like (Example #1):

# gender age weight
M   21  72.10
F   35  58.33
M   33  21.99

…but NOT with a file like (Example #2):

# here is a general file comment
# it is spread over multiple lines
gender age weight
M   21  72.10
F   35  58.33
M   33  21.99

…genfromtxt() believes the column names are 'here', 'is', 'a', etc., and
thinks all of the columns are strings because 'gender', 'age' and
'weight' are not numbers.

This is because genfromtxt() (after skipping a number of lines as
specified in the optional kwarg skip_header) will use the *first* line
it encounters to produce column names. If that line contains a comment
character, genfromtxt() discards everything *up to and including* the
comment character, and tries to use the content *after* the comment
character as headers (Example 3):

gender age weight # wrong column names
M   21  72.10
F   35  58.33
M   33  21.99

…the resulting column names are 'wrong', 'column' and 'names'.

My proposed change was that, if the first (or any subsequent) line
contains a comment character, it should be treated as an *actual
comment*, and discarded along with anything that follows it on the line.

In Example 2, the result would be that the first two lines appear empty
(no text before '#'), and the third line (gender age weight) is used
for column names.

In Example 3, the result would be that gender age weight is used for
column names while # wrong column names is ignored.

BUT!

In Example 1, the result would be that the first line appears empty,
and M   21  72.10 are used for column names.

In other words, this change would do away with the previous behaviour
where the very first commented line was (magically?) treated not as a
comment but instead as column headers. This might break some existing
code. On the positive side, it would allow the user to be more liberal
with the format of input files (Example 4):

# here is a general file comment
# the columns in this table are
gender age weight # here is a comment on the header line
# following this line are the data
M   21  72.10
F   35  58.33 # here is a comment on a data line
M   33  21.99

I feel that this is a better/more flexible behaviour for genfromtxt(),
but—as stated—I am interested in your thoughts.

Cheers,
-- 
Paul Natsuo Kishimoto

SM candidate, Technology  Policy Program (2012)
Research assistant,  http://globalchange.mit.edu
https://paul.kishimoto.name  +1 617 302 6105


signature.asc
Description: This is a digitally signed message part
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Proposed change in genfromtxt(..., comments='#', names=True) behaviour

2012-07-13 Thread Tom Aldcroft
On Fri, Jul 13, 2012 at 11:15 AM, Paul Natsuo Kishimoto
m...@paul.kishimoto.name wrote:
 Hello everyone,

 I am a longtime NumPy user, and I just filed my first contribution to
 the code as pull request to fix what I felt was a bug in the behaviour
 of genfromtxt() https://github.com/numpy/numpy/pull/351
 It turns out this alters existing behaviour that some people may depend
 on, so I was encouraged to raise the issue on this list to see what the
 consensus was.

 This behaviour happens in the specific situation where:
   * Comments are used in the file (the default comment character is
 '#', which I'll use here), AND
   * The kwarg names=True is given. In this case, genfromtxt() is
 supposed to read an initial row containing the names of the
 columns and return an array with a structured dtype.

 Currently, these options work with a file like (Example #1):

 # gender age weight
 M   21 72.10
 F   35  58.33
 M   33  21.99

 …but NOT with a file like (Example #2):

 # here is a general file comment
 # it is spread over multiple lines
 gender age weight
 M   21 72.10
 F   35  58.33
 M   33  21.99

 …genfromtxt() believes the column names are 'here', 'is', 'a', etc., and
 thinks all of the columns are strings because 'gender', 'age' and
 'weight' are not numbers.

 This is because genfromtxt() (after skipping a number of lines as
 specified in the optional kwarg skip_header) will use the *first* line
 it encounters to produce column names. If that line contains a comment
 character, genfromtxt() discards everything *up to and including* the
 comment character, and tries to use the content *after* the comment
 character as headers (Example 3):

 gender age weight # wrong column names
 M   21  72.10
 F   35  58.33
 M   33  21.99

 …the resulting column names are 'wrong', 'column' and 'names'.

 My proposed change was that, if the first (or any subsequent) line
 contains a comment character, it should be treated as an *actual
 comment*, and discarded along with anything that follows it on the line.

 In Example 2, the result would be that the first two lines appear 
 empty
 (no text before '#'), and the third line (gender age weight) is used
 for column names.

 In Example 3, the result would be that gender age weight is used for
 column names while # wrong column names is ignored.

 BUT!

 In Example 1, the result would be that the first line appears empty,
 and M   21  72.10 are used for column names.

 In other words, this change would do away with the previous behaviour
 where the very first commented line was (magically?) treated not as a
 comment but instead as column headers. This might break some existing
 code. On the positive side, it would allow the user to be more liberal
 with the format of input files (Example 4):

 # here is a general file comment
 # the columns in this table are
 gender age weight # here is a comment on the header line
 # following this line are the data
 M   21  72.10
 F   35  58.33 # here is a comment on a data line
 M   33  21.99

 I feel that this is a better/more flexible behaviour for genfromtxt(),
 but—as stated—I am interested in your thoughts.

 Cheers,
 --
 Paul Natsuo Kishimoto

 SM candidate, Technology  Policy Program (2012)
 Research assistant,  http://globalchange.mit.edu
 https://paul.kishimoto.name  +1 617 302 6105

 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


Hi Paul,

At least in astronomy tabular files with the column definitions in the
first commented line are reasonably common.  This is driven in part by
wide use of legacy packages like supermongo etc that don't have
intelligent table readers, so users document the column names as a
comment line.  I think making this break might be unfortunate for
users in astronomy.

Dealing with commented header definitions is annoying.  Not that it
matters specifically for your genfromtext() proposal, but in the
asciitable reader this case is handled with a particular reader class
that expects the first comment line to contain the column definitions:

 http://cxc.harvard.edu/contrib/asciitable/#asciitable.CommentedHeader

Cheers,
Tom
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Proposed change in genfromtxt(..., comments='#', names=True) behaviour

2012-07-13 Thread Paul Natsuo Kishimoto
On Fri, 2012-07-13 at 12:13 -0400, Tom Aldcroft wrote:
 On Fri, Jul 13, 2012 at 11:15 AM, Paul Natsuo Kishimoto
 m...@paul.kishimoto.name wrote:
  Hello everyone,
 
  I am a longtime NumPy user, and I just filed my first contribution 
  to
  the code as pull request to fix what I felt was a bug in the behaviour
  of genfromtxt() https://github.com/numpy/numpy/pull/351
  It turns out this alters existing behaviour that some people may depend
  on, so I was encouraged to raise the issue on this list to see what the
  consensus was.
 
  This behaviour happens in the specific situation where:
* Comments are used in the file (the default comment character is
  '#', which I'll use here), AND
* The kwarg names=True is given. In this case, genfromtxt() is
  supposed to read an initial row containing the names of the
  columns and return an array with a structured dtype.
 
  Currently, these options work with a file like (Example #1):
 
  # gender age weight
  M   21 72.10
  F   35  58.33
  M   33  21.99
 
  …but NOT with a file like (Example #2):
 
  # here is a general file comment
  # it is spread over multiple lines
  gender age weight
  M   21 72.10
  F   35  58.33
  M   33  21.99
 
  …genfromtxt() believes the column names are 'here', 'is', 'a', etc., and
  thinks all of the columns are strings because 'gender', 'age' and
  'weight' are not numbers.
 
  This is because genfromtxt() (after skipping a number of lines as
  specified in the optional kwarg skip_header) will use the *first* line
  it encounters to produce column names. If that line contains a comment
  character, genfromtxt() discards everything *up to and including* the
  comment character, and tries to use the content *after* the comment
  character as headers (Example 3):
 
  gender age weight # wrong column names
  M   21  72.10
  F   35  58.33
  M   33  21.99
 
  …the resulting column names are 'wrong', 'column' and 'names'.
 
  My proposed change was that, if the first (or any subsequent) line
  contains a comment character, it should be treated as an *actual
  comment*, and discarded along with anything that follows it on the line.
 
  In Example 2, the result would be that the first two lines appear 
  empty
  (no text before '#'), and the third line (gender age weight) is used
  for column names.
 
  In Example 3, the result would be that gender age weight is used 
  for
  column names while # wrong column names is ignored.
 
  BUT!
 
  In Example 1, the result would be that the first line appears empty,
  and M   21  72.10 are used for column names.
 
  In other words, this change would do away with the previous behaviour
  where the very first commented line was (magically?) treated not as a
  comment but instead as column headers. This might break some existing
  code. On the positive side, it would allow the user to be more liberal
  with the format of input files (Example 4):
 
  # here is a general file comment
  # the columns in this table are
  gender age weight # here is a comment on the header line
  # following this line are the data
  M   21  72.10
  F   35  58.33 # here is a comment on a data line
  M   33  21.99
 
  I feel that this is a better/more flexible behaviour for genfromtxt(),
  but—as stated—I am interested in your thoughts.
 
  Cheers,
  --
  Paul Natsuo Kishimoto
 
  SM candidate, Technology  Policy Program (2012)
  Research assistant,  http://globalchange.mit.edu
  https://paul.kishimoto.name  +1 617 302 6105
 
  ___
  NumPy-Discussion mailing list
  NumPy-Discussion@scipy.org
  http://mail.scipy.org/mailman/listinfo/numpy-discussion
 
 
 Hi Paul,
 
 At least in astronomy tabular files with the column definitions in the
 first commented line are reasonably common.  This is driven in part by
 wide use of legacy packages like supermongo etc that don't have
 intelligent table readers, so users document the column names as a
 comment line.  I think making this break might be unfortunate for
 users in astronomy.
 
 Dealing with commented header definitions is annoying.  Not that it
 matters specifically for your genfromtext() proposal, but in the
 asciitable reader this case is handled with a particular reader class
 that expects the first comment line to contain the column definitions:
 
  http://cxc.harvard.edu/contrib/asciitable/#asciitable.CommentedHeader
 
 Cheers,
 Tom

Tom,

Thanks for this information. In thinking about how people would work
around this, I figured it would be fairly easy to discard a comment
character that occurred as the very first character in a file, e.g.:

raw = StringIO(open('example.txt').read()[1:])
data = numpy.genfromtxt(raw, comment='#',