Re: [Numpy-discussion] Proposed change in genfromtxt(..., comments='#', names=True) behaviour
On Jul 17, 2012, at 10:47 AM, Nathaniel Smith wrote: Not that there are likely to be people using skip_header=True as an alias for skip_header=1, but if they were it would currently work. confession I write messy code like that all the time. /confession Best, Travis ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Proposed change in genfromtxt(..., comments='#', names=True) behaviour
On Mon, Jul 16, 2012 at 10:39 PM, Pierre GM pgmdevl...@gmail.com wrote: I don't really have any deep issue with `skip_header=True`, besides not really liking having an argument whose type can vary. But that's only a matter of personal taste. And yes, we could always check the type… I guess I still have a small preference for skip_header=comments over skip_header=True, since the latter is more opaque for no purpose. Also it makes me slightly antsy since skip_header is normally an integer, and True is, in fact, just an integer with a special __repr__: In [2]: isinstance(True, int) Out[2]: True In [3]: True + True Out[3]: 2 Not that there are likely to be people using skip_header=True as an alias for skip_header=1, but if they were it would currently work. Pierre, for a line # A B C #1 #2 #3 the user gets six columns 'A', 'B', 'C', '#1', '#2', '#3', which is messy but what they deserve for using such messy input :) OK, we're on the same page. Also, if you look closely, the use of index() you propose is equivalent to my current code, just more verbose. I'm not convinced by line 1353: unless you change it to asbyte(comment).join(first_line.split(comments)[1:]) you gonna lose the '#', aren't you ? With the 'index' way, we just pick the first one, as intended. But it's late and multitasking isn't really working for me now. I think you guys are looking for .split(comments, 1). -n ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Proposed change in genfromtxt(..., comments='#', names=True) behaviour
On Tuesday, July 17, 2012 at 17:47 , Nathaniel Smith wrote: I guess I still have a small preference for skip_header=comments over skip_header=True, since the latter is more opaque for no purpose. Also it makes me slightly antsy since skip_header is normally an integer, and True is, in fact, just an integer with a special __repr__: Nathaniel, that's basically my problem with `skip_header=True`. I still prefer my -1 to your comments, but whatever, personal taste again. I'm not convinced by line 1353: unless you change it to asbyte(comment).join(first_line.split(comments)[1:]) you gonna lose the '#', aren't you ? With the 'index' way, we just pick the first one, as intended. But it's late and multitasking isn't really working for me now. I think you guys are looking for .split(comments, 1). -n Tadah! Thanks a lot! ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Proposed change in genfromtxt(..., comments='#', names=True) behaviour
Hello, I'm siding w/ Tom, Nathaniel and Travis. I don't think the change as it is is advisable. It's a regression, and breaking=bad. Now, I can understand your frustration, so, what about a trade-off? The first line w/ a comment after the first 'skip_header' ones should be parsed for column titles (and we call it 'first_commented_line'). We split it along the comment character, say, #. If there's some non-space character before the #, we keep this part of 'first_commented_line' as titles: that should work for your case. If the first non-space character was #, then what comes after are the titles (that's Tom's case and the current default). I'm not looking forward to introducing yet another keyword, genfromtxt is enough of a mess as it is (unless we add a 'need_coffee' one). What y'all think? On Jul 13, 2012 7:29 PM, Paul Natsuo Kishimoto m...@paul.kishimoto.name wrote: On Fri, 2012-07-13 at 12:13 -0400, Tom Aldcroft wrote: On Fri, Jul 13, 2012 at 11:15 AM, Paul Natsuo Kishimoto m...@paul.kishimoto.name wrote: Hello everyone, I am a longtime NumPy user, and I just filed my first contribution to the code as pull request to fix what I felt was a bug in the behaviour of genfromtxt() https://github.com/numpy/numpy/pull/351 It turns out this alters existing behaviour that some people may depend on, so I was encouraged to raise the issue on this list to see what the consensus was. This behaviour happens in the specific situation where: * Comments are used in the file (the default comment character is '#', which I'll use here), AND * The kwarg names=True is given. In this case, genfromtxt() is supposed to read an initial row containing the names of the columns and return an array with a structured dtype. Currently, these options work with a file like (Example #1): # gender age weight M 21 72.10 F 35 58.33 M 33 21.99 …but NOT with a file like (Example #2): # here is a general file comment # it is spread over multiple lines gender age weight M 21 72.10 F 35 58.33 M 33 21.99 …genfromtxt() believes the column names are 'here', 'is', 'a', etc., and thinks all of the columns are strings because 'gender', 'age' and 'weight' are not numbers. This is because genfromtxt() (after skipping a number of lines as specified in the optional kwarg skip_header) will use the *first* line it encounters to produce column names. If that line contains a comment character, genfromtxt() discards everything *up to and including* the comment character, and tries to use the content *after* the comment character as headers (Example 3): gender age weight # wrong column names M 21 72.10 F 35 58.33 M 33 21.99 …the resulting column names are 'wrong', 'column' and 'names'. My proposed change was that, if the first (or any subsequent) line contains a comment character, it should be treated as an *actual comment*, and discarded along with anything that follows it on the line. In Example 2, the result would be that the first two lines appear empty (no text before '#'), and the third line (gender age weight) is used for column names. In Example 3, the result would be that gender age weight is used for column names while # wrong column names is ignored. BUT! In Example 1, the result would be that the first line appears empty, and M 21 72.10 are used for column names. In other words, this change would do away with the previous behaviour where the very first commented line was (magically?) treated not as a comment but instead as column headers. This might break some existing code. On the positive side, it would allow the user to be more liberal with the format of input files (Example 4): # here is a general file comment # the columns in this table are gender age weight # here is a comment on the header line # following this line are the data M 21 72.10 F 35 58.33 # here is a comment on a data line M 33 21.99 I feel that this is a better/more flexible behaviour for genfromtxt(), but—as stated—I am interested in your thoughts. Cheers, -- Paul Natsuo Kishimoto SM candidate, Technology Policy Program (2012) Research assistant, http://globalchange.mit.edu https://paul.kishimoto.name +1 617 302 6105 ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion Hi Paul, At least in astronomy tabular files with the column definitions in the
Re: [Numpy-discussion] Proposed change in genfromtxt(..., comments='#', names=True) behaviour
On Jul 16, 2012, at 1:52 AM, Pierre GM wrote: Hello, I'm siding w/ Tom, Nathaniel and Travis. I don't think the change as it is is advisable. It's a regression, and breaking=bad. Now, I can understand your frustration, so, what about a trade-off? The first line w/ a comment after the first 'skip_header' ones should be parsed for column titles (and we call it 'first_commented_line'). We split it along the comment character, say, #. If there's some non-space character before the #, we keep this part of 'first_commented_line' as titles: that should work for your case. If the first non-space character was #, then what comes after are the titles (that's Tom's case and the current default). I'm not looking forward to introducing yet another keyword, genfromtxt is enough of a mess as it is (unless we add a 'need_coffee' one). What y'all think? That seems like an acceptable proposal --- it is consistent with current behavior but also satisfies the use-case (without another keyword which is a bonus). So, +1 from me. -Travis On Jul 13, 2012 7:29 PM, Paul Natsuo Kishimoto m...@paul.kishimoto.name wrote: On Fri, 2012-07-13 at 12:13 -0400, Tom Aldcroft wrote: On Fri, Jul 13, 2012 at 11:15 AM, Paul Natsuo Kishimoto m...@paul.kishimoto.name wrote: Hello everyone, I am a longtime NumPy user, and I just filed my first contribution to the code as pull request to fix what I felt was a bug in the behaviour of genfromtxt() https://github.com/numpy/numpy/pull/351 It turns out this alters existing behaviour that some people may depend on, so I was encouraged to raise the issue on this list to see what the consensus was. This behaviour happens in the specific situation where: * Comments are used in the file (the default comment character is '#', which I'll use here), AND * The kwarg names=True is given. In this case, genfromtxt() is supposed to read an initial row containing the names of the columns and return an array with a structured dtype. Currently, these options work with a file like (Example #1): # gender age weight M 21 72.10 F 35 58.33 M 33 21.99 …but NOT with a file like (Example #2): # here is a general file comment # it is spread over multiple lines gender age weight M 21 72.10 F 35 58.33 M 33 21.99 …genfromtxt() believes the column names are 'here', 'is', 'a', etc., and thinks all of the columns are strings because 'gender', 'age' and 'weight' are not numbers. This is because genfromtxt() (after skipping a number of lines as specified in the optional kwarg skip_header) will use the *first* line it encounters to produce column names. If that line contains a comment character, genfromtxt() discards everything *up to and including* the comment character, and tries to use the content *after* the comment character as headers (Example 3): gender age weight # wrong column names M 21 72.10 F 35 58.33 M 33 21.99 …the resulting column names are 'wrong', 'column' and 'names'. My proposed change was that, if the first (or any subsequent) line contains a comment character, it should be treated as an *actual comment*, and discarded along with anything that follows it on the line. In Example 2, the result would be that the first two lines appear empty (no text before '#'), and the third line (gender age weight) is used for column names. In Example 3, the result would be that gender age weight is used for column names while # wrong column names is ignored. BUT! In Example 1, the result would be that the first line appears empty, and M 21 72.10 are used for column names. In other words, this change would do away with the previous behaviour where the very first commented line was (magically?) treated not as a comment but instead as column headers. This might break some existing code. On the positive side, it would allow the user to be more liberal with the format of input files (Example 4): # here is a general file comment # the columns in this table are gender age weight # here is a comment on the header line # following this line are the data M 21 72.10 F 35 58.33 # here is a comment on a data line M 33 21.99 I feel that this is a better/more flexible behaviour for genfromtxt(), but—as stated—I am interested in your thoughts. Cheers, -- Paul Natsuo Kishimoto SM candidate, Technology Policy Program (2012) Research assistant, http://globalchange.mit.edu https://paul.kishimoto.name +1 617
Re: [Numpy-discussion] Proposed change in genfromtxt(..., comments='#', names=True) behaviour
Hi Pierre, On Mon, 2012-07-16 at 01:54 -0500, Travis Oliphant wrote: On Jul 16, 2012, at 1:52 AM, Pierre GM wrote: Hello, I'm siding w/ Tom, Nathaniel and Travis. I don't think the change as it is is advisable. It's a regression, and breaking=bad. Now, I can understand your frustration, so, what about a trade-off? The first line w/ a comment after the first 'skip_header' ones should be parsed for column titles (and we call it 'first_commented_line'). We split it along the comment character, say, #. If there's some non-space character before the #, we keep this part of 'first_commented_line' as titles: that should work for your case. If the first non-space character was #, then what comes after are the titles (that's Tom's case and the current default). I'm not looking forward to introducing yet another keyword, genfromtxt is enough of a mess as it is (unless we add a 'need_coffee' one). What y'all think? That seems like an acceptable proposal --- it is consistent with current behavior but also satisfies the use-case (without another keyword which is a bonus). So, +1 from me. -Travis Thanks for jumping in, and for offering a compromise solution. I agree that genfromtxt() has too many kwargs—it took me several minutes of reading the docs to realize why it wasn't behaving as expected! To be ultra clear (since I want to code this), you are suggesting that 'first_commented_line' be a *new* accepted value for the kwarg 'names', to invoke the behaviour you suggest? --- If this IS what you mean, I'd counter-propose something in the same spirit, but a bit simpler…we let the kwarg 'skip_header' take some additional value, say int(0), int(-1), str('auto'), or True. In this case, instead of skipping a fixed number of lines, it will skip any number of consecutive empty OR commented lines; THEN apply the behaviour you describe. The semantics of this are more intuitive, because this is what I am really after: to *skip* a commented *header* of arbitrary length. So my four examples below could be parsed with: 1. genfromtxt(..., names=True) 2. genfromtxt(..., names=True, skip_header=True) 3. genfromtxt(..., names=True) 4. genfromtxt(..., names=True, skip_header=True) …crucially #1 avoids the regression. Does this seem good to everyone? --- But if this is NOT what you mean, then what you say does not actually work with the simple use-case of my Example #2 below. The first commented line is # here is a... with # as the first non-space character, so the part after becomes the names 'here', 'is', 'a' etc. In short, the code can't resolve the ambiguity without some extra information from the user. On Jul 13, 2012 7:29 PM, Paul Natsuo Kishimoto m...@paul.kishimoto.name wrote: On Fri, 2012-07-13 at 12:13 -0400, Tom Aldcroft wrote: On Fri, Jul 13, 2012 at 11:15 AM, Paul Natsuo Kishimoto m...@paul.kishimoto.name wrote: Hello everyone, I am a longtime NumPy user, and I just filed my first contribution to the code as pull request to fix what I felt was a bug in the behaviour of genfromtxt() https://github.com/numpy/numpy/pull/351 It turns out this alters existing behaviour that some people may depend on, so I was encouraged to raise the issue on this list to see what the consensus was. This behaviour happens in the specific situation where: * Comments are used in the file (the default comment character is '#', which I'll use here), AND * The kwarg names=True is given. In this case, genfromtxt() is supposed to read an initial row containing the names of the columns and return an array with a structured dtype. Currently, these options work with a file like (Example #1): # gender age weight M 21 72.10 F 35 58.33 M 33 21.99 …but NOT with a file like (Example #2): # here is a general file comment # it is spread over multiple lines gender age weight M 21 72.10 F 35 58.33 M 33 21.99 …genfromtxt() believes the column names are 'here', 'is', 'a', etc., and thinks all of the columns are strings because 'gender', 'age' and 'weight' are not numbers. This is because genfromtxt() (after skipping a number of lines as specified in the optional kwarg skip_header) will use the *first* line
Re: [Numpy-discussion] Proposed change in genfromtxt(..., comments='#', names=True) behaviour
To be ultra clear (since I want to code this), you are suggesting that 'first_commented_line' be a *new* accepted value for the kwarg 'names', to invoke the behaviour you suggest? Nope, I was just referring to some hypothetical variable name. I meant that: first_values = None try: while not first_values: first_line = fhd.next() if names is True: parsed = [m for m in first_line.split(comments) if m.strip()] if parsed: first_value = split_line(parsed[0]) else: ... (it's not tested, I'm writing it as it comes. And I didn't even use the `first_commented_line` name, sorry) If this IS what you mean, I'd counter-propose something in the same spirit, but a bit simpler…we let the kwarg 'skip_header' take some additional value, say int(0), int(-1), str('auto'), or True. In this case, instead of skipping a fixed number of lines, it will skip any number of consecutive empty OR commented lines; I really like the idea of having `skip_header=-1` skip all the empty or commented lines (that is, lines whose first non-space character is the `comments` character). That'd be rather convenient. The semantics of this are more intuitive, because this is what I am really after: to *skip* a commented *header* of arbitrary length. So my four examples below could be parsed with: 1. genfromtxt(..., names=True) 2. genfromtxt(..., names=True, skip_header=True) 3. genfromtxt(..., names=True) 4. genfromtxt(..., names=True, skip_header=True) …crucially #1 avoids the regression. Does this seem good to everyone? Sounds good w/ `skip_header=-1` But if this is NOT what you mean, then what you say does not actually work with the simple use-case of my Example #2 below. The first commented line is # here is a... with # as the first non-space character, so the part after becomes the names 'here', 'is', 'a' etc. In that case, you could always use `skip_header=2` In short, the code can't resolve the ambiguity without some extra information from the user. It's always best not to let the code guess too much anyway... Well, no regression, and you have a nice plan. I'm for it. Anybody else? ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Proposed change in genfromtxt(..., comments='#', names=True) behaviour
I've implemented this feature with skip_header=-1 as suggested by Pierre, and in doing so removed the regression. TravisBot seems to like it: https://github.com/numpy/numpy/pull/351 On Mon, 2012-07-16 at 16:12 +0200, Pierre GM wrote: To be ultra clear (since I want to code this), you are suggesting that 'first_commented_line' be a *new* accepted value for the kwarg 'names', to invoke the behaviour you suggest? Nope, I was just referring to some hypothetical variable name. I meant that: first_values = None try: while not first_values: first_line = fhd.next() if names is True: parsed = [m for m in first_line.split(comments) if m.strip()] if parsed: first_value = split_line(parsed[0]) else: ... (it's not tested, I'm writing it as it comes. And I didn't even use the `first_commented_line` name, sorry) If this IS what you mean, I'd counter-propose something in the same spirit, but a bit simpler…we let the kwarg 'skip_header' take some additional value, say int(0), int(-1), str('auto'), or True. In this case, instead of skipping a fixed number of lines, it will skip any number of consecutive empty OR commented lines; I really like the idea of having `skip_header=-1` skip all the empty or commented lines (that is, lines whose first non-space character is the `comments` character). That'd be rather convenient. The semantics of this are more intuitive, because this is what I am really after: to *skip* a commented *header* of arbitrary length. So my four examples below could be parsed with: 1. genfromtxt(..., names=True) 2. genfromtxt(..., names=True, skip_header=True) 3. genfromtxt(..., names=True) 4. genfromtxt(..., names=True, skip_header=True) …crucially #1 avoids the regression. Does this seem good to everyone? Sounds good w/ `skip_header=-1` But if this is NOT what you mean, then what you say does not actually work with the simple use-case of my Example #2 below. The first commented line is # here is a... with # as the first non-space character, so the part after becomes the names 'here', 'is', 'a' etc. In that case, you could always use `skip_header=2` In short, the code can't resolve the ambiguity without some extra information from the user. It's always best not to let the code guess too much anyway... Well, no regression, and you have a nice plan. I'm for it. Anybody else? ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion -- Paul Natsuo Kishimoto SM candidate, Technology Policy Program (2012) Research assistant, http://globalchange.mit.edu https://paul.kishimoto.name +1 617 302 6105 signature.asc Description: This is a digitally signed message part ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Proposed change in genfromtxt(..., comments='#', names=True) behaviour
On Mon, Jul 16, 2012 at 8:06 PM, Paul Natsuo Kishimoto m...@paul.kishimoto.name wrote: I've implemented this feature with skip_header=-1 as suggested by Pierre, and in doing so removed the regression. TravisBot seems to like it: https://github.com/numpy/numpy/pull/351 Can we please not use weird magic values like this? This isn't C. skip_header=comments would work just as well and be far more self-explanatory... -n ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Proposed change in genfromtxt(..., comments='#', names=True) behaviour
On Mon, Jul 16, 2012 at 3:06 PM, Paul Natsuo Kishimoto m...@paul.kishimoto.name wrote: I've implemented this feature with skip_header=-1 as suggested by Pierre, and in doing so removed the regression. TravisBot seems to like it: https://github.com/numpy/numpy/pull/351 On Mon, 2012-07-16 at 16:12 +0200, Pierre GM wrote: To be ultra clear (since I want to code this), you are suggesting that 'first_commented_line' be a *new* accepted value for the kwarg 'names', to invoke the behaviour you suggest? Nope, I was just referring to some hypothetical variable name. I meant that: first_values = None try: while not first_values: first_line = fhd.next() if names is True: parsed = [m for m in first_line.split(comments) if m.strip()] if parsed: first_value = split_line(parsed[0]) else: ... (it's not tested, I'm writing it as it comes. And I didn't even use the `first_commented_line` name, sorry) If this IS what you mean, I'd counter-propose something in the same spirit, but a bit simpler…we let the kwarg 'skip_header' take some additional value, say int(0), int(-1), str('auto'), or True. In this case, instead of skipping a fixed number of lines, it will skip any number of consecutive empty OR commented lines; I really like the idea of having `skip_header=-1` skip all the empty or commented lines (that is, lines whose first non-space character is the `comments` character). That'd be rather convenient. The semantics of this are more intuitive, because this is what I am really after: to *skip* a commented *header* of arbitrary length. So my four examples below could be parsed with: 1. genfromtxt(..., names=True) 2. genfromtxt(..., names=True, skip_header=True) 3. genfromtxt(..., names=True) 4. genfromtxt(..., names=True, skip_header=True) …crucially #1 avoids the regression. Does this seem good to everyone? Sounds good w/ `skip_header=-1` But if this is NOT what you mean, then what you say does not actually work with the simple use-case of my Example #2 below. The first commented line is # here is a... with # as the first non-space character, so the part after becomes the names 'here', 'is', 'a' etc. In that case, you could always use `skip_header=2` In short, the code can't resolve the ambiguity without some extra information from the user. It's always best not to let the code guess too much anyway... Well, no regression, and you have a nice plan. I'm for it. Anybody else? ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion -- Paul Natsuo Kishimoto SM candidate, Technology Policy Program (2012) Research assistant, http://globalchange.mit.edu https://paul.kishimoto.name +1 617 302 6105 ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion I think that the proposed solution is OK, but it does make it even trickier for the average user to predict the behavior of genfromtxt() for different situations. Perhaps as part of this pull request Paul should also update the documentation to include a section describing this behavior and usage with examples 1 to 4. - Tom ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Proposed change in genfromtxt(..., comments='#', names=True) behaviour
Well, as `skip_header` is a number of lines, I don't really see anything particular magical about a `skip_header=-1`. Plus, range(-1) == [], while range(comments) raises a TypeError. And then you'd have to figure why the exception was raised. -- Pierre GM On Monday, July 16, 2012 at 21:56 , Nathaniel Smith wrote: On Mon, Jul 16, 2012 at 8:06 PM, Paul Natsuo Kishimoto m...@paul.kishimoto.name (mailto:m...@paul.kishimoto.name) wrote: I've implemented this feature with skip_header=-1 as suggested by Pierre, and in doing so removed the regression. TravisBot seems to like it: https://github.com/numpy/numpy/pull/351 Can we please not use weird magic values like this? This isn't C. skip_header=comments would work just as well and be far more self-explanatory... -n ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org (mailto:NumPy-Discussion@scipy.org) http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Proposed change in genfromtxt(..., comments='#', names=True) behaviour
On Mon, Jul 16, 2012 at 9:01 PM, Pierre GM pgmdevl...@gmail.com wrote: Well, as `skip_header` is a number of lines, I don't really see anything particular magical about a `skip_header=-1`. The logic here is: - if names=True, then genfromtext expects the names to be given in the first line, and they may or may not be commented out - BUT, if skip_header=some special value, then any all-comment lines will be skipped before looking for names, i.e. the names are not expected to be commented out and comments are given their original meaning again. I have no idea how one could derive this understanding by looking at skip_header=-1. Ah, that -1 is a number of lines, everyone knows that skipping -1 lines is equivalent to toggling our expectation for whether the column names will appear inside a comment? The API is pretty convoluted at this point and I'm not convinced we wouldn't be better off with adding a new argument like names_in_comment=False/True, but skip_header=comments at least gives the reader a fighting chance... -n ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Proposed change in genfromtxt(..., comments='#', names=True) behaviour
Tom, I agree that the documentation should be updated (both the doctoring and the relevant parts of the user manual), and specific unit-tests added. Paul, that's a direct nudge ;) (I'm sure you don't mind). I was also considering the weird case first_line = # A B C #1 #2 #3 How many columns in that case ? 6 ? 3 ? So, instead of using a `split`, maybe we should just check index=first_line.index(comment) and take `first_line[:index]` (or `first_line[index+1:]` after depending on the case). But then again, it's a weird case. -- Pierre GM On Monday, July 16, 2012 at 22:00 , Tom Aldcroft wrote: On Mon, Jul 16, 2012 at 3:06 PM, Paul Natsuo Kishimoto m...@paul.kishimoto.name (mailto:m...@paul.kishimoto.name) wrote: I've implemented this feature with skip_header=-1 as suggested by Pierre, and in doing so removed the regression. TravisBot seems to like it: https://github.com/numpy/numpy/pull/351 On Mon, 2012-07-16 at 16:12 +0200, Pierre GM wrote: To be ultra clear (since I want to code this), you are suggesting that 'first_commented_line' be a *new* accepted value for the kwarg 'names', to invoke the behaviour you suggest? Nope, I was just referring to some hypothetical variable name. I meant that: first_values = None try: while not first_values: first_line = fhd.next() if names is True: parsed = [m for m in first_line.split(comments) if m.strip()] if parsed: first_value = split_line(parsed[0]) else: ... (it's not tested, I'm writing it as it comes. And I didn't even use the `first_commented_line` name, sorry) If this IS what you mean, I'd counter-propose something in the same spirit, but a bit simpler…we let the kwarg 'skip_header' take some additional value, say int(0), int(-1), str('auto'), or True. In this case, instead of skipping a fixed number of lines, it will skip any number of consecutive empty OR commented lines; I really like the idea of having `skip_header=-1` skip all the empty or commented lines (that is, lines whose first non-space character is the `comments` character). That'd be rather convenient. The semantics of this are more intuitive, because this is what I am really after: to *skip* a commented *header* of arbitrary length. So my four examples below could be parsed with: 1. genfromtxt(..., names=True) 2. genfromtxt(..., names=True, skip_header=True) 3. genfromtxt(..., names=True) 4. genfromtxt(..., names=True, skip_header=True) …crucially #1 avoids the regression. Does this seem good to everyone? Sounds good w/ `skip_header=-1` But if this is NOT what you mean, then what you say does not actually work with the simple use-case of my Example #2 below. The first commented line is # here is a... with # as the first non-space character, so the part after becomes the names 'here', 'is', 'a' etc. In that case, you could always use `skip_header=2` In short, the code can't resolve the ambiguity without some extra information from the user. It's always best not to let the code guess too much anyway... Well, no regression, and you have a nice plan. I'm for it. Anybody else? ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org (mailto:NumPy-Discussion@scipy.org) http://mail.scipy.org/mailman/listinfo/numpy-discussion -- Paul Natsuo Kishimoto SM candidate, Technology Policy Program (2012) Research assistant, http://globalchange.mit.edu https://paul.kishimoto.name +1 617 302 6105 ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org (mailto:NumPy-Discussion@scipy.org) http://mail.scipy.org/mailman/listinfo/numpy-discussion I think that the proposed solution is OK, but it does make it even trickier for the average user to predict the behavior of genfromtxt() for different situations. Perhaps as part of this pull request Paul should also update the documentation to include a section describing this behavior and usage with examples 1 to 4. - Tom ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org (mailto:NumPy-Discussion@scipy.org) http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Proposed change in genfromtxt(..., comments='#', names=True) behaviour
On Mon, 2012-07-16 at 21:14 +0100, Nathaniel Smith wrote: On Mon, Jul 16, 2012 at 9:01 PM, Pierre GM pgmdevl...@gmail.com wrote: Well, as `skip_header` is a number of lines, I don't really see anything particular magical about a `skip_header=-1`. The logic here is: - if names=True, then genfromtext expects the names to be given in the first line, and they may or may not be commented out - BUT, if skip_header=some special value, then any all-comment lines will be skipped before looking for names, i.e. the names are not expected to be commented out and comments are given their original meaning again. I have no idea how one could derive this understanding by looking at skip_header=-1. Ah, that -1 is a number of lines, everyone knows that skipping -1 lines is equivalent to toggling our expectation for whether the column names will appear inside a comment? The API is pretty convoluted at this point and I'm not convinced we wouldn't be better off with adding a new argument like names_in_comment=False/True, but skip_header=comments at least gives the reader a fighting chance... -n Another option is to use skip_header=True. The internal monologue accompanying this is Ah, do I want it to skip the header? Yes, true, I do, with no thought needed on the number of lines involved. Pierre, checking the type of the argument is trivial. Nathaniel, is this less weird/magical? Anyone else? I don't care what the value is and it's easy to change, but I'll await some agreement on this point so I don't have to change it yet again in response to more objections. Pierre, for a line # A B C #1 #2 #3 the user gets six columns 'A', 'B', 'C', '#1', '#2', '#3', which is messy but what they deserve for using such messy input :) Also, if you look closely, the use of index() you propose is equivalent to my current code, just more verbose. Tom, in my branch I rewrote the documentation for the `names` kwarg in an attempt to be more clear, but I agree a documentation example of the non-legacy use would go a long way. I've also realized I neglected to update the documentation for `skip_header`. I'll do these once there is consensus on the value to use. If there was willingness to tolerate a backwards-incompatible change, the resulting behaviour would be quite simple and intuitive overall, but that's out of my hands. At the moment I'm just concerned with making an intuitive behaviour *possible*. Thanks everyone for your input, -- Paul Natsuo Kishimoto SM candidate, Technology Policy Program (2012) Research assistant, http://globalchange.mit.edu https://paul.kishimoto.name +1 617 302 6105 signature.asc Description: This is a digitally signed message part ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
[Numpy-discussion] Proposed change in genfromtxt(..., comments='#', names=True) behaviour
Hello everyone, I am a longtime NumPy user, and I just filed my first contribution to the code as pull request to fix what I felt was a bug in the behaviour of genfromtxt() https://github.com/numpy/numpy/pull/351 It turns out this alters existing behaviour that some people may depend on, so I was encouraged to raise the issue on this list to see what the consensus was. This behaviour happens in the specific situation where: * Comments are used in the file (the default comment character is '#', which I'll use here), AND * The kwarg names=True is given. In this case, genfromtxt() is supposed to read an initial row containing the names of the columns and return an array with a structured dtype. Currently, these options work with a file like (Example #1): # gender age weight M 21 72.10 F 35 58.33 M 33 21.99 …but NOT with a file like (Example #2): # here is a general file comment # it is spread over multiple lines gender age weight M 21 72.10 F 35 58.33 M 33 21.99 …genfromtxt() believes the column names are 'here', 'is', 'a', etc., and thinks all of the columns are strings because 'gender', 'age' and 'weight' are not numbers. This is because genfromtxt() (after skipping a number of lines as specified in the optional kwarg skip_header) will use the *first* line it encounters to produce column names. If that line contains a comment character, genfromtxt() discards everything *up to and including* the comment character, and tries to use the content *after* the comment character as headers (Example 3): gender age weight # wrong column names M 21 72.10 F 35 58.33 M 33 21.99 …the resulting column names are 'wrong', 'column' and 'names'. My proposed change was that, if the first (or any subsequent) line contains a comment character, it should be treated as an *actual comment*, and discarded along with anything that follows it on the line. In Example 2, the result would be that the first two lines appear empty (no text before '#'), and the third line (gender age weight) is used for column names. In Example 3, the result would be that gender age weight is used for column names while # wrong column names is ignored. BUT! In Example 1, the result would be that the first line appears empty, and M 21 72.10 are used for column names. In other words, this change would do away with the previous behaviour where the very first commented line was (magically?) treated not as a comment but instead as column headers. This might break some existing code. On the positive side, it would allow the user to be more liberal with the format of input files (Example 4): # here is a general file comment # the columns in this table are gender age weight # here is a comment on the header line # following this line are the data M 21 72.10 F 35 58.33 # here is a comment on a data line M 33 21.99 I feel that this is a better/more flexible behaviour for genfromtxt(), but—as stated—I am interested in your thoughts. Cheers, -- Paul Natsuo Kishimoto SM candidate, Technology Policy Program (2012) Research assistant, http://globalchange.mit.edu https://paul.kishimoto.name +1 617 302 6105 signature.asc Description: This is a digitally signed message part ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Proposed change in genfromtxt(..., comments='#', names=True) behaviour
On Fri, Jul 13, 2012 at 11:15 AM, Paul Natsuo Kishimoto m...@paul.kishimoto.name wrote: Hello everyone, I am a longtime NumPy user, and I just filed my first contribution to the code as pull request to fix what I felt was a bug in the behaviour of genfromtxt() https://github.com/numpy/numpy/pull/351 It turns out this alters existing behaviour that some people may depend on, so I was encouraged to raise the issue on this list to see what the consensus was. This behaviour happens in the specific situation where: * Comments are used in the file (the default comment character is '#', which I'll use here), AND * The kwarg names=True is given. In this case, genfromtxt() is supposed to read an initial row containing the names of the columns and return an array with a structured dtype. Currently, these options work with a file like (Example #1): # gender age weight M 21 72.10 F 35 58.33 M 33 21.99 …but NOT with a file like (Example #2): # here is a general file comment # it is spread over multiple lines gender age weight M 21 72.10 F 35 58.33 M 33 21.99 …genfromtxt() believes the column names are 'here', 'is', 'a', etc., and thinks all of the columns are strings because 'gender', 'age' and 'weight' are not numbers. This is because genfromtxt() (after skipping a number of lines as specified in the optional kwarg skip_header) will use the *first* line it encounters to produce column names. If that line contains a comment character, genfromtxt() discards everything *up to and including* the comment character, and tries to use the content *after* the comment character as headers (Example 3): gender age weight # wrong column names M 21 72.10 F 35 58.33 M 33 21.99 …the resulting column names are 'wrong', 'column' and 'names'. My proposed change was that, if the first (or any subsequent) line contains a comment character, it should be treated as an *actual comment*, and discarded along with anything that follows it on the line. In Example 2, the result would be that the first two lines appear empty (no text before '#'), and the third line (gender age weight) is used for column names. In Example 3, the result would be that gender age weight is used for column names while # wrong column names is ignored. BUT! In Example 1, the result would be that the first line appears empty, and M 21 72.10 are used for column names. In other words, this change would do away with the previous behaviour where the very first commented line was (magically?) treated not as a comment but instead as column headers. This might break some existing code. On the positive side, it would allow the user to be more liberal with the format of input files (Example 4): # here is a general file comment # the columns in this table are gender age weight # here is a comment on the header line # following this line are the data M 21 72.10 F 35 58.33 # here is a comment on a data line M 33 21.99 I feel that this is a better/more flexible behaviour for genfromtxt(), but—as stated—I am interested in your thoughts. Cheers, -- Paul Natsuo Kishimoto SM candidate, Technology Policy Program (2012) Research assistant, http://globalchange.mit.edu https://paul.kishimoto.name +1 617 302 6105 ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion Hi Paul, At least in astronomy tabular files with the column definitions in the first commented line are reasonably common. This is driven in part by wide use of legacy packages like supermongo etc that don't have intelligent table readers, so users document the column names as a comment line. I think making this break might be unfortunate for users in astronomy. Dealing with commented header definitions is annoying. Not that it matters specifically for your genfromtext() proposal, but in the asciitable reader this case is handled with a particular reader class that expects the first comment line to contain the column definitions: http://cxc.harvard.edu/contrib/asciitable/#asciitable.CommentedHeader Cheers, Tom ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Proposed change in genfromtxt(..., comments='#', names=True) behaviour
On Fri, 2012-07-13 at 12:13 -0400, Tom Aldcroft wrote: On Fri, Jul 13, 2012 at 11:15 AM, Paul Natsuo Kishimoto m...@paul.kishimoto.name wrote: Hello everyone, I am a longtime NumPy user, and I just filed my first contribution to the code as pull request to fix what I felt was a bug in the behaviour of genfromtxt() https://github.com/numpy/numpy/pull/351 It turns out this alters existing behaviour that some people may depend on, so I was encouraged to raise the issue on this list to see what the consensus was. This behaviour happens in the specific situation where: * Comments are used in the file (the default comment character is '#', which I'll use here), AND * The kwarg names=True is given. In this case, genfromtxt() is supposed to read an initial row containing the names of the columns and return an array with a structured dtype. Currently, these options work with a file like (Example #1): # gender age weight M 21 72.10 F 35 58.33 M 33 21.99 …but NOT with a file like (Example #2): # here is a general file comment # it is spread over multiple lines gender age weight M 21 72.10 F 35 58.33 M 33 21.99 …genfromtxt() believes the column names are 'here', 'is', 'a', etc., and thinks all of the columns are strings because 'gender', 'age' and 'weight' are not numbers. This is because genfromtxt() (after skipping a number of lines as specified in the optional kwarg skip_header) will use the *first* line it encounters to produce column names. If that line contains a comment character, genfromtxt() discards everything *up to and including* the comment character, and tries to use the content *after* the comment character as headers (Example 3): gender age weight # wrong column names M 21 72.10 F 35 58.33 M 33 21.99 …the resulting column names are 'wrong', 'column' and 'names'. My proposed change was that, if the first (or any subsequent) line contains a comment character, it should be treated as an *actual comment*, and discarded along with anything that follows it on the line. In Example 2, the result would be that the first two lines appear empty (no text before '#'), and the third line (gender age weight) is used for column names. In Example 3, the result would be that gender age weight is used for column names while # wrong column names is ignored. BUT! In Example 1, the result would be that the first line appears empty, and M 21 72.10 are used for column names. In other words, this change would do away with the previous behaviour where the very first commented line was (magically?) treated not as a comment but instead as column headers. This might break some existing code. On the positive side, it would allow the user to be more liberal with the format of input files (Example 4): # here is a general file comment # the columns in this table are gender age weight # here is a comment on the header line # following this line are the data M 21 72.10 F 35 58.33 # here is a comment on a data line M 33 21.99 I feel that this is a better/more flexible behaviour for genfromtxt(), but—as stated—I am interested in your thoughts. Cheers, -- Paul Natsuo Kishimoto SM candidate, Technology Policy Program (2012) Research assistant, http://globalchange.mit.edu https://paul.kishimoto.name +1 617 302 6105 ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion Hi Paul, At least in astronomy tabular files with the column definitions in the first commented line are reasonably common. This is driven in part by wide use of legacy packages like supermongo etc that don't have intelligent table readers, so users document the column names as a comment line. I think making this break might be unfortunate for users in astronomy. Dealing with commented header definitions is annoying. Not that it matters specifically for your genfromtext() proposal, but in the asciitable reader this case is handled with a particular reader class that expects the first comment line to contain the column definitions: http://cxc.harvard.edu/contrib/asciitable/#asciitable.CommentedHeader Cheers, Tom Tom, Thanks for this information. In thinking about how people would work around this, I figured it would be fairly easy to discard a comment character that occurred as the very first character in a file, e.g.: raw = StringIO(open('example.txt').read()[1:]) data = numpy.genfromtxt(raw, comment='#',