[jira] [Assigned] (SPARK-33940) allow configuring the max column name length in csv writer

2021-01-19 Thread Hyukjin Kwon (Jira)


 [ https://issues.apache.org/jira/browse/SPARK-33940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-33940:


Assignee: Nan Zhu

> allow configuring the max column name length in csv writer
> --
>
> Key: SPARK-33940
> URL: https://issues.apache.org/jira/browse/SPARK-33940
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Nan Zhu
>Assignee: Nan Zhu
>Priority: Major
>
> The CSV writer has an implicit limit on column name length imposed by univocity-parsers.
>  
> When the writer is initialized 
> (https://github.com/uniVocity/univocity-parsers/blob/e09114c6879fa6c2c15e7365abc02cda3e193ff7/src/main/java/com/univocity/parsers/common/AbstractWriter.java#L211), 
> it calls toIdentifierGroupArray, which eventually calls NormalizedString.valueOf 
> (https://github.com/uniVocity/univocity-parsers/blob/e09114c6879fa6c2c15e7365abc02cda3e193ff7/src/main/java/com/univocity/parsers/common/NormalizedString.java#L205-L209).
>  
> That code path goes through stringCache.get, which enforces a maxStringLength cap 
> (https://github.com/uniVocity/univocity-parsers/blob/e09114c6879fa6c2c15e7365abc02cda3e193ff7/src/main/java/com/univocity/parsers/common/StringCache.java#L104) 
> of 1024 characters by default.
>  
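> The same cap can be hit through univocity-parsers directly, without going through Spark. The sketch below is illustrative only: it assumes that setting a header longer than the cap and then writing the headers is enough to reach the cached-string path described above (exactly where the failure surfaces is not verified here).
>  
> ```
> import java.io.StringWriter
> import com.univocity.parsers.csv.{CsvWriter, CsvWriterSettings}
>
> // a single column name longer than the default 1024-character cap
> val settings = new CsvWriterSettings()
> settings.setHeaders((0 until 1025).map(_ => "c").mkString(""))
>
> val writer = new CsvWriter(new StringWriter(), settings)
> writer.writeHeaders() // expected to fail the same way as the Spark stack trace below
> ```
>  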
> Spark does not expose this cap as a configurable option, so a column name longer than 1024 characters leads to a NullPointerException:
>  
> ```
> [info]   Cause: java.lang.NullPointerException:
> [info]   at com.univocity.parsers.common.AbstractWriter.submitRow(AbstractWriter.java:349)
> [info]   at com.univocity.parsers.common.AbstractWriter.writeHeaders(AbstractWriter.java:444)
> [info]   at com.univocity.parsers.common.AbstractWriter.writeHeaders(AbstractWriter.java:410)
> [info]   at org.apache.spark.sql.catalyst.csv.UnivocityGenerator.writeHeaders(UnivocityGenerator.scala:87)
> [info]   at org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter$.writeHeaders(CsvOutputWriter.scala:58)
> [info]   at org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter.<init>(CsvOutputWriter.scala:44)
> [info]   at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anon$1.newInstance(CSVFileFormat.scala:86)
> [info]   at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:126)
> [info]   at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:111)
> [info]   at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:269)
> [info]   at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:210)
> ```
>  
> It can be reproduced with a simple unit test:
>  
> ```
> // assumes a test with an active SparkSession, spark.implicits._ in scope,
> // and org.apache.spark.sql.Row imported; dataPath is a writable temp directory
> val row1 = Row("a")
> val superLongHeader = (0 until 1025).map(_ => "c").mkString("")
> val df = Seq(s"${row1.getString(0)}").toDF(superLongHeader)
> df.repartition(1)
>   .write
>   .option("header", "true")
>   .option("maxColumnNameLength", 1025) // option proposed by this ticket; not currently recognized
>   .csv(dataPath)
> ```
>  
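> For contrast, a header at exactly the default cap should go through the normal path. This is a sketch under the assumption that only names longer than 1024 characters trigger the failure; it uses the same test context as the repro above, with okPath standing in for any writable temp directory:
>  
> ```
> val okHeader = (0 until 1024).map(_ => "c").mkString("") // exactly at the default cap
> Seq("a").toDF(okHeader)
>   .repartition(1)
>   .write
>   .option("header", "true")
>   .csv(okPath)
> ```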



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33940) allow configuring the max column name length in csv writer

2020-12-29 Thread Apache Spark (Jira)


 [ https://issues.apache.org/jira/browse/SPARK-33940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-33940:


Assignee: (was: Apache Spark)




[jira] [Assigned] (SPARK-33940) allow configuring the max column name length in csv writer

2020-12-29 Thread Apache Spark (Jira)


 [ https://issues.apache.org/jira/browse/SPARK-33940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-33940:


Assignee: Apache Spark
