[jira] [Commented] (SPARK-12420) Have a built-in CSV data source implementation

2016-04-27 Thread koert kuipers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15260698#comment-15260698
 ] 

koert kuipers commented on SPARK-12420:
---

thanks for getting back so quickly, i will take a look at that PR

> Have a built-in CSV data source implementation
> --
>
> Key: SPARK-12420
> URL: https://issues.apache.org/jira/browse/SPARK-12420
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Reporter: Reynold Xin
> Attachments: Built-in CSV datasource in Spark.pdf
>
>
> CSV is the most common data format in the "small data" world. It is often the 
> first format people want to try when they see Spark on a single node. Making 
> this built-in for the most common source can provide a better experience for 
> first-time users.
> We should consider inlining https://github.com/databricks/spark-csv



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12420) Have a built-in CSV data source implementation

2016-04-27 Thread Hossein Falaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15260677#comment-15260677
 ] 

Hossein Falaki commented on SPARK-12420:


Hi [~koert]. There is a pending PR with an extensive set of controls for null 
values: https://github.com/apache/spark/pull/11947




[jira] [Commented] (SPARK-12420) Have a built-in CSV data source implementation

2016-04-27 Thread koert kuipers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15260656#comment-15260656
 ] 

koert kuipers commented on SPARK-12420:
---

hello, i see that the (admittedly somewhat crazy sounding) option 
"treatEmptyValuesAsNulls" is missing. this used to be in spark-csv. can someone 
point me to where it is or what the alternative is?

we use this setting together with the setting for "nullValue" so that empty 
values come in as nulls, and nulls get written back out as empty values. this is 
very typical behavior that is the default for many other frameworks such as 
scalding when reading csv files.

so for example a line in a file like this:
{noformat}
a,,5
{noformat}
should become Row("a", null, 5) (this is where the "treatEmptyValuesAsNulls" 
option kicks in)

and going in the other direction Row("a", null, 5) should be written out again 
as:
{noformat}
a,,5
{noformat}
(this is where "nullValue" kicks in)
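
a minimal sketch of that round trip, in plain Python rather than Spark. the names `read_rows`, `write_rows`, `treat_empty_as_null` and `null_value` are hypothetical and only mirror the two options discussed here; this is not how spark-csv or the built-in source implements them, and no type inference is done, so the last field stays the string "5".

```python
import csv
import io


def read_rows(text, treat_empty_as_null=True):
    """Parse CSV text; when the flag is set, empty fields become None."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        rows.append(
            [None if (treat_empty_as_null and field == "") else field
             for field in record]
        )
    return rows


def write_rows(rows, null_value=""):
    """Serialize rows to CSV text, writing None as null_value."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for row in rows:
        writer.writerow([null_value if field is None else field for field in row])
    return buf.getvalue()


# Read the example line, then write the parsed row back out.
rows = read_rows("a,,5\n")
round_trip = write_rows(rows)
```

with both settings in place the written text matches the input line, which is the symmetry being asked for.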





[jira] [Commented] (SPARK-12420) Have a built-in CSV data source implementation

2016-01-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15085097#comment-15085097
 ] 

Apache Spark commented on SPARK-12420:
--

User 'falaki' has created a pull request for this issue:
https://github.com/apache/spark/pull/10615

> Have a built-in CSV data source implementation
> --
>
> Key: SPARK-12420
> URL: https://issues.apache.org/jira/browse/SPARK-12420
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Reporter: Reynold Xin
> Attachments: Built-in CSV datasource in Spark.pdf
>
>
> CSV is the most common data format in the "small data" world. It is often the 
> first format people want to try when they see Spark on a single node. Having 
> to rely on a 3rd party component for this is a very bad user experience for 
> new users.
> We should consider inlining https://github.com/databricks/spark-csv






[jira] [Commented] (SPARK-12420) Have a built-in CSV data source implementation

2015-12-28 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15072580#comment-15072580
 ] 

Hyukjin Kwon commented on SPARK-12420:
--

+1, I was wondering why it has remained a third-party package.




[jira] [Commented] (SPARK-12420) Have a built-in CSV data source implementation

2015-12-18 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15063736#comment-15063736
 ] 

Jeff Zhang commented on SPARK-12420:


+1, this is a very commonly used data format. Not sure why it was not built in 
from the beginning. If there is no license issue, then we should definitely 
make it built-in.
