[jira] [Commented] (ARROW-6004) [C++] CSV reader ignore_empty_lines option doesn't handle empty lines
[ https://issues.apache.org/jira/browse/ARROW-6004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16897267#comment-16897267 ]
Neal Richardson commented on ARROW-6004:
This patch does not change the default, only correctly implements the option.
> [C++] CSV reader ignore_empty_lines option doesn't handle empty lines
>
> Key: ARROW-6004
> URL: https://issues.apache.org/jira/browse/ARROW-6004
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++
> Reporter: Neal Richardson
> Assignee: Antoine Pitrou
> Priority: Minor
> Labels: csv, pull-request-available
> Time Spent: 3.5h
> Remaining Estimate: 0h
>
> Followup to https://issues.apache.org/jira/browse/ARROW-5747. If {{ignore_empty_lines}} is false and there are empty lines, it fails to parse (again, with {{Invalid: Empty CSV file}}).
> Correct behavior should be to fill those empty lines with missing data for all columns.
--
This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-6004) [C++] CSV reader ignore_empty_lines option doesn't handle empty lines
[ https://issues.apache.org/jira/browse/ARROW-6004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16897264#comment-16897264 ]
Francois Saint-Jacques commented on ARROW-6004:
I was agreeing with the (current) default instead of adding nulls. I think it's worth having as an option.
[jira] [Commented] (ARROW-6004) [C++] CSV reader ignore_empty_lines option doesn't handle empty lines
[ https://issues.apache.org/jira/browse/ARROW-6004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16897263#comment-16897263 ]
Neal Richardson commented on ARROW-6004:
It's an option many (most? all?) of the other common CSV readers have, so we should have it too.
[jira] [Commented] (ARROW-6004) [C++] CSV reader ignore_empty_lines option doesn't handle empty lines
[ https://issues.apache.org/jira/browse/ARROW-6004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16897257#comment-16897257 ]
Joris Van den Bossche commented on ARROW-6004:
[~fsaintjacques] skipping empty lines is still the default behaviour, those nulls are only introduced when specifying the keyword. I don't think we have plans to change the default, so the question is then more: is it worth having this option? (pandas.read_csv has it for example)
[jira] [Commented] (ARROW-6004) [C++] CSV reader ignore_empty_lines option doesn't handle empty lines
[ https://issues.apache.org/jira/browse/ARROW-6004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16897238#comment-16897238 ]
Francois Saint-Jacques commented on ARROW-6004:
I'd expect the empty lines to be skipped, if one wants nulls, it should be a line of the exact number of commas.
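The distinction drawn here, between a truly empty line and a line holding only delimiters, can be illustrated with Python's standard `csv` module. This is only a sketch of how one existing reader tells the two apart; it says nothing about how Arrow's C++ reader treats either case, and whether empty fields end up as nulls depends on the reader and its conversion settings.

```python
import csv

# Two-column data: one line with "the exact number of commas" (a single
# comma for two columns) and one truly empty line.
lines = ["ab,cd", "12,34", ",", "", "56,78"]

rows = list(csv.reader(lines))

comma_only_row = rows[2]   # two empty fields, one per column
empty_line_row = rows[3]   # no fields at all
```

With the standard library reader, the comma-only line yields one empty string per column, while the empty line yields an empty list, so a caller can already choose to skip it or pad it.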
[jira] [Commented] (ARROW-6004) [C++] CSV reader ignore_empty_lines option doesn't handle empty lines
[ https://issues.apache.org/jira/browse/ARROW-6004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16891078#comment-16891078 ]
Antoine Pitrou commented on ARROW-6004:
Pandas does this:
{code:python}
>>> pd.read_csv(io.BytesIO(b"""ab,cd\n12,34\n\r\n56,78\n"""))
   ab  cd
0  12  34
1  56  78
>>> pd.read_csv(io.BytesIO(b"""ab,cd\n12,34\n\r\n56,78\n"""), skip_blank_lines=False)
     ab    cd
0  12.0  34.0
1   NaN   NaN
2  56.0  78.0
{code}
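The two pandas behaviours shown in this comment can be mimicked in a few lines of plain Python. What follows is only a sketch of the semantics under discussion, not Arrow's C++ implementation: `parse_csv` is a hypothetical helper, its `ignore_empty_lines` keyword is named after the option in the issue title, and fields are left as strings with no quoting or type inference.

```python
def parse_csv(text, ignore_empty_lines=True):
    """Toy parser (hypothetical helper, not an Arrow or pandas API).

    Splits on commas only; quoting and type inference are ignored.
    """
    lines = text.splitlines()  # handles both \n and \r\n line endings
    header = lines[0].split(",")
    rows = []
    for line in lines[1:]:
        if line == "":
            if ignore_empty_lines:
                continue                       # default: drop the empty line
            rows.append([None] * len(header))  # otherwise: one null per column
        else:
            rows.append(line.split(","))
    return header, rows


# Same input as the pandas example in the comment.
data = "ab,cd\n12,34\n\r\n56,78\n"
skipped = parse_csv(data)
filled = parse_csv(data, ignore_empty_lines=False)
```

Here `skipped` holds two data rows and `filled` holds three, the middle one being all `None`, which corresponds to the `NaN` row pandas prints with `skip_blank_lines=False`.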
[jira] [Commented] (ARROW-6004) [C++] CSV reader ignore_empty_lines option doesn't handle empty lines
[ https://issues.apache.org/jira/browse/ARROW-6004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16891011#comment-16891011 ]
Wes McKinney commented on ARROW-6004:
Hm, I am not sure about this. What do other CSV readers do (e.g. {{pandas.read_csv}}, {{readr::read_csv}}, {{datatable::fread}}, etc.)?
[jira] [Commented] (ARROW-6004) [C++] CSV reader ignore_empty_lines option doesn't handle empty lines
[ https://issues.apache.org/jira/browse/ARROW-6004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16890923#comment-16890923 ]
Antoine Pitrou commented on ARROW-6004:
> Correct behavior should be to fill those empty lines with missing data for all columns.
Uh, why would it? I know CSV is not an extremely well-specified format, but that sounds to me a bit far-fetched.
It's also a bit problematic: how do you deal with empty lines at the beginning of a file? You don't know the number of columns yet.
[~wesmckinn]
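The concern about leading empty lines can be made concrete with a small sketch. One possible way around it, assumed here purely for illustration and not taken from the Arrow patch, is to hold back leading empty lines until the first non-empty line reveals the column count. `rows_with_nulls` is a hypothetical helper, and header handling is deliberately left out.

```python
def rows_with_nulls(lines):
    """Yield one list per line; empty lines become all-null rows.

    Hypothetical helper: the column count is taken from the first
    non-empty line, so leading empty lines are held back until then.
    """
    pending = 0    # empty lines seen before the width is known
    width = None   # number of columns, unknown at the start
    for line in lines:
        if line == "":
            if width is None:
                pending += 1          # cannot size the null row yet
            else:
                yield [None] * width
            continue
        fields = line.split(",")
        if width is None:
            width = len(fields)
            for _ in range(pending):  # flush the deferred null rows
                yield [None] * width
        yield fields


rows = list(rows_with_nulls(["", "", "12,34", "", "56,78"]))
only_empty = list(rows_with_nulls(["", ""]))
```

In this sketch an input made up only of empty lines yields no rows at all, because the width never becomes known; a real reader would have to decide whether that is an error or an empty table.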