[jira] [Commented] (ARROW-6004) [C++] CSV reader ignore_empty_lines option doesn't handle empty lines

2019-07-31 Thread Neal Richardson (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16897267#comment-16897267
 ] 

Neal Richardson commented on ARROW-6004:


This patch does not change the default, only correctly implements the option.

> [C++] CSV reader ignore_empty_lines option doesn't handle empty lines
> ---------------------------------------------------------------------
>
>                 Key: ARROW-6004
>                 URL: https://issues.apache.org/jira/browse/ARROW-6004
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>            Reporter: Neal Richardson
>            Assignee: Antoine Pitrou
>            Priority: Minor
>              Labels: csv, pull-request-available
>          Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> Followup to https://issues.apache.org/jira/browse/ARROW-5747. If 
> {{ignore_empty_lines}} is false and there are empty lines, it fails to parse 
> (again, with {{Invalid: Empty CSV file}}).
> Correct behavior should be to fill those empty lines with missing data for 
> all columns.
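
A minimal sketch of the two behaviours described above, written against Python's standard csv module rather than Arrow's C++ reader. The function name read_rows and its ignore_empty_lines parameter are hypothetical and only mirror the name of the option; the sketch assumes the first line of the input is the header.

```python
import csv
import io


def read_rows(text, ignore_empty_lines=True):
    """Parse CSV text into (header, rows).

    Hypothetical illustration only; this is not Arrow's implementation.
    Assumes the first line of the input is the header.
    """
    reader = csv.reader(io.StringIO(text, newline=""))
    header = next(reader)
    rows = []
    for fields in reader:
        if not fields:  # csv.reader yields [] for an empty line
            if ignore_empty_lines:
                continue  # drop the line entirely
            fields = [None] * len(header)  # one missing value per column
        rows.append(fields)
    return header, rows


data = "ab,cd\n12,34\n\n56,78\n"
print(read_rows(data))
print(read_rows(data, ignore_empty_lines=False))
```

With the option left at its default the empty line disappears; with it set to false the empty line becomes a row holding one missing value per column.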



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-6004) [C++] CSV reader ignore_empty_lines option doesn't handle empty lines

2019-07-31 Thread Francois Saint-Jacques (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16897264#comment-16897264
 ] 

Francois Saint-Jacques commented on ARROW-6004:
---

I was agreeing with the (current) default of skipping empty lines rather than 
adding nulls. I think the option is worth having.



[jira] [Commented] (ARROW-6004) [C++] CSV reader ignore_empty_lines option doesn't handle empty lines

2019-07-31 Thread Neal Richardson (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16897263#comment-16897263
 ] 

Neal Richardson commented on ARROW-6004:


It's an option many (most? all?) of the other common CSV readers have, so we 
should have it too.



[jira] [Commented] (ARROW-6004) [C++] CSV reader ignore_empty_lines option doesn't handle empty lines

2019-07-31 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16897257#comment-16897257
 ] 

Joris Van den Bossche commented on ARROW-6004:
--

[~fsaintjacques] Skipping empty lines is still the default behaviour; the 
nulls are only introduced when {{ignore_empty_lines}} is explicitly set to 
false. I don't think we have plans to change the default, so the question is 
rather: is it worth having this option? (pandas.read_csv has it, for example.)



[jira] [Commented] (ARROW-6004) [C++] CSV reader ignore_empty_lines option doesn't handle empty lines

2019-07-31 Thread Francois Saint-Jacques (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16897238#comment-16897238
 ] 

Francois Saint-Jacques commented on ARROW-6004:
---

I'd expect empty lines to be skipped. If one wants nulls, the line should 
contain exactly the number of commas that a full row would have.
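
To make that distinction concrete, here is how Python's standard csv module tokenizes a line holding only the delimiter versus a completely empty line in a two-column file. It is used purely as an illustration and says nothing about Arrow's parser.

```python
import csv

# Header, a normal row, a line holding only the delimiter,
# and a completely empty line.
lines = ["ab,cd", "12,34", ",", ""]
rows = list(csv.reader(lines))

comma_row = rows[2]  # one comma -> two empty fields
empty_row = rows[3]  # empty line -> no fields at all
print(comma_row, empty_row)
```

The comma-only line already carries the column count, while the empty line carries none, which is why the two cases can reasonably be treated differently.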



[jira] [Commented] (ARROW-6004) [C++] CSV reader ignore_empty_lines option doesn't handle empty lines

2019-07-23 Thread Antoine Pitrou (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16891078#comment-16891078
 ] 

Antoine Pitrou commented on ARROW-6004:
---

Pandas does this:
{code:python}
>>> import io
>>> import pandas as pd
>>> pd.read_csv(io.BytesIO(b"""ab,cd\n12,34\n\r\n56,78\n"""))
   ab  cd
0  12  34
1  56  78
>>> pd.read_csv(io.BytesIO(b"""ab,cd\n12,34\n\r\n56,78\n"""),
...             skip_blank_lines=False)
     ab    cd
0  12.0  34.0
1   NaN   NaN
2  56.0  78.0
{code}


> [C++] CSV reader ignore_empty_lines option doesn't handle empty lines
> ---------------------------------------------------------------------
>
>                 Key: ARROW-6004
>                 URL: https://issues.apache.org/jira/browse/ARROW-6004
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>            Reporter: Neal Richardson
>            Priority: Minor
>              Labels: csv
>
> Followup to https://issues.apache.org/jira/browse/ARROW-5747. If 
> {{ignore_empty_lines}} is false and there are empty lines, it fails to parse 
> (again, with {{Invalid: Empty CSV file}}).
> Correct behavior should be to fill those empty lines with missing data for 
> all columns.





[jira] [Commented] (ARROW-6004) [C++] CSV reader ignore_empty_lines option doesn't handle empty lines

2019-07-23 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16891011#comment-16891011
 ] 

Wes McKinney commented on ARROW-6004:
-

Hm, I am not sure about this. What do other CSV readers do (e.g. 
{{pandas.read_csv}}, {{readr::read_csv}}, {{datatable::fread}}, etc.)? 
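
As one more data point, Python's standard library shows both behaviours: csv.reader reports an empty line as an empty list, while csv.DictReader skips such rows. A short check, using made-up sample data:

```python
import csv
import io

data = "ab,cd\n12,34\n\n56,78\n"

# The basic reader keeps the empty line as an empty list of fields.
plain = list(csv.reader(io.StringIO(data)))

# DictReader drops rows that have no fields at all.
as_dicts = [dict(row) for row in csv.DictReader(io.StringIO(data))]

print(plain)
print(as_dicts)
```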



[jira] [Commented] (ARROW-6004) [C++] CSV reader ignore_empty_lines option doesn't handle empty lines

2019-07-23 Thread Antoine Pitrou (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16890923#comment-16890923
 ] 

Antoine Pitrou commented on ARROW-6004:
---

> Correct behavior should be to fill those empty lines with missing data for 
> all columns.

Uh, why would it? I know CSV is not an extremely well-specified format, but 
that sounds to me a bit far-fetched. It's also a bit problematic: how do you 
deal with empty lines at the beginning of a file? You don't know the number of 
columns yet.

[~wesmckinn]
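
One way a reader could handle empty lines at the beginning of a file, sketched in Python as an assumption rather than a description of Arrow's parser: count the empty lines seen before the first non-empty line, then emit that many all-null rows once the column count is known. The function name rows_with_leading_blanks is hypothetical, and the sketch treats every line as data (no header handling).

```python
import csv
import io


def rows_with_leading_blanks(text):
    """Yield rows, turning empty lines into all-None rows.

    Hypothetical sketch: empty lines seen before the first non-empty
    line are counted and emitted once the column count is known.
    """
    num_columns = None
    pending_blanks = 0
    for fields in csv.reader(io.StringIO(text, newline="")):
        if fields:
            if num_columns is None:
                num_columns = len(fields)
                # Flush the empty lines that preceded the first real row.
                for _ in range(pending_blanks):
                    yield [None] * num_columns
            yield fields
        elif num_columns is None:
            pending_blanks += 1  # width not known yet, so defer
        else:
            yield [None] * num_columns


result = list(rows_with_leading_blanks("\n\n12,34\n\n56,78\n"))
print(result)
```

A file made up only of empty lines never reveals a column count, so this sketch yields nothing for it; that case would still need a separate decision.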
