[GitHub] spark pull request #20949: [SPARK-19018][SQL] Add support for custom encodin...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20949
[GitHub] spark pull request #20949: [SPARK-19018][SQL] Add support for custom encodin...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20949#discussion_r204988217

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---

@@ -514,6 +516,41 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te
     }
   }

+  test("SPARK-19018: Save csv with custom charset") {
+
+    // scalastyle:off nonascii
+    val content = "µà áâä ÃÃÃ"
+    // scalastyle:on nonascii
+
+    Seq("iso-8859-1", "utf-8", "utf-16", "utf-32", "windows-1250").foreach { encoding =>
+      withTempPath { path =>
+        val csvDir = new File(path, "csv")
+        Seq(content).toDF().write
+          .option("encoding", encoding)
+          .csv(csvDir.getCanonicalPath)
+
+        csvDir.listFiles().filter(_.getName.endsWith("csv")).foreach({ csvFile =>
+          val readback = Files.readAllBytes(csvFile.toPath)
+          val expected = (content + Properties.lineSeparator).getBytes(Charset.forName(encoding))
+          assert(readback === expected)
+        })
+      }
+    }
+  }
+
+  test("SPARK-19018: error handling for unsupported charsets") {
+    val exception = intercept[SparkException] {
+      withTempPath { path =>
+        val csvDir = new File(path, "csv").getCanonicalPath
+        Seq("a,A,c,A,b,B").toDF().write
+          .option("encoding", "1-9588-osi")
+          .csv(csvDir)

--- End diff --

nit: you could use `path.getCanonicalPath` directly
[GitHub] spark pull request #20949: [SPARK-19018][SQL] Add support for custom encodin...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20949#discussion_r204988168

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---

@@ -514,6 +516,41 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te
     }
   }

+  test("SPARK-19018: Save csv with custom charset") {
+
+    // scalastyle:off nonascii
+    val content = "µà áâä ÃÃÃ"
+    // scalastyle:on nonascii
+
+    Seq("iso-8859-1", "utf-8", "utf-16", "utf-32", "windows-1250").foreach { encoding =>
+      withTempPath { path =>
+        val csvDir = new File(path, "csv")
+        Seq(content).toDF().write
+          .option("encoding", encoding)
+          .csv(csvDir.getCanonicalPath)
+
+        csvDir.listFiles().filter(_.getName.endsWith("csv")).foreach({ csvFile =>

--- End diff --

nit: `.foreach({` -> `.foreach {` per https://github.com/databricks/scala-style-guide#anonymous-methods
[GitHub] spark pull request #20949: [SPARK-19018][SQL] Add support for custom encodin...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20949#discussion_r204988574

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---

@@ -514,6 +516,41 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te
     }
   }

+  test("SPARK-19018: Save csv with custom charset") {
+
+    // scalastyle:off nonascii
+    val content = "µà áâä ÃÃÃ"
+    // scalastyle:on nonascii
+
+    Seq("iso-8859-1", "utf-8", "utf-16", "utf-32", "windows-1250").foreach { encoding =>
+      withTempPath { path =>
+        val csvDir = new File(path, "csv")
+        Seq(content).toDF().write

--- End diff --

nit: `.write.repartition(1)` to make sure we write only one file
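[Editor's note] A minimal sketch of the suggested change, reusing the names from the test above (`content`, `encoding`, and `csvDir` are assumed to be in scope); note that `repartition` lives on the Dataset, so it goes before `.write`:

```scala
// repartition(1) collapses the data to a single partition, so the write
// produces exactly one CSV part file for the byte-level assertion to target.
Seq(content).toDF().repartition(1).write
  .option("encoding", encoding)
  .csv(csvDir.getCanonicalPath)
```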
[GitHub] spark pull request #20949: [SPARK-19018][SQL] Add support for custom encodin...
Github user crafty-coder commented on a diff in the pull request: https://github.com/apache/spark/pull/20949#discussion_r203306174

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---

@@ -512,6 +513,43 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te
     }
   }

+  test("SPARK-19018: Save csv with custom charset") {
+
+    // scalastyle:off nonascii
+    val content = "µà áâä ÃÃÃ"
+    // scalastyle:on nonascii
+
+    Seq("iso-8859-1", "utf-8", "utf-16", "utf-32", "windows-1250").foreach { encoding =>
+      withTempDir { dir =>
+        val csvDir = new File(dir, "csv")
+
+        val originalDF = Seq(content).toDF("_c0").repartition(1)
+        originalDF.write
+          .option("encoding", encoding)
+          .csv(csvDir.getCanonicalPath)
+
+        csvDir.listFiles().filter(_.getName.endsWith("csv")).foreach({ csvFile =>

--- End diff --

What do you mean?
[GitHub] spark pull request #20949: [SPARK-19018][SQL] Add support for custom encodin...
Github user crafty-coder commented on a diff in the pull request: https://github.com/apache/spark/pull/20949#discussion_r203286908

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---

@@ -512,6 +513,43 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te
     }
   }

+  test("SPARK-19018: Save csv with custom charset") {
+
+    // scalastyle:off nonascii
+    val content = "µà áâä ÃÃÃ"
+    // scalastyle:on nonascii
+
+    Seq("iso-8859-1", "utf-8", "utf-16", "utf-32", "windows-1250").foreach { encoding =>
+      withTempDir { dir =>
+        val csvDir = new File(dir, "csv")
+
+        val originalDF = Seq(content).toDF("_c0").repartition(1)
+        originalDF.write
+          .option("encoding", encoding)
+          .csv(csvDir.getCanonicalPath)
+
+        csvDir.listFiles().filter(_.getName.endsWith("csv")).foreach({ csvFile =>
+          val readback = Files.readAllBytes(csvFile.toPath)
+          val expected = (content + "\n").getBytes(Charset.forName(encoding))

--- End diff --

Good Point!
[GitHub] spark pull request #20949: [SPARK-19018][SQL] Add support for custom encodin...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20949#discussion_r203229065

--- Diff: python/pyspark/sql/readwriter.py ---

@@ -895,6 +895,8 @@ def csv(self, path, mode=None, compression=None, sep=None, quote=None, escape=No
                        the quote character. If None is set, the default value is escape character
                        when escape and quote characters are different, ``\0`` otherwise..

+        :param encoding: sets the encoding (charset) to be used on the csv file. If None is set, it
+                         uses the default value, ``UTF-8``.

--- End diff --

Likewise, let's match the doc to JSON's.
[GitHub] spark pull request #20949: [SPARK-19018][SQL] Add support for custom encodin...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20949#discussion_r203228930

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---

@@ -512,6 +513,43 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te
     }
   }

+  test("SPARK-19018: Save csv with custom charset") {
+
+    // scalastyle:off nonascii
+    val content = "µà áâä ÃÃÃ"
+    // scalastyle:on nonascii
+
+    Seq("iso-8859-1", "utf-8", "utf-16", "utf-32", "windows-1250").foreach { encoding =>
+      withTempDir { dir =>
+        val csvDir = new File(dir, "csv")
+
+        val originalDF = Seq(content).toDF("_c0").repartition(1)

--- End diff --

`toDF("_c0")` -> `toDF()`
[GitHub] spark pull request #20949: [SPARK-19018][SQL] Add support for custom encodin...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20949#discussion_r203228844

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---

@@ -512,6 +513,43 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te
     }
   }

+  test("SPARK-19018: Save csv with custom charset") {
+
+    // scalastyle:off nonascii
+    val content = "µà áâä ÃÃÃ"
+    // scalastyle:on nonascii
+
+    Seq("iso-8859-1", "utf-8", "utf-16", "utf-32", "windows-1250").foreach { encoding =>
+      withTempDir { dir =>

--- End diff --

`withTempDir` -> `withTempPath`
[GitHub] spark pull request #20949: [SPARK-19018][SQL] Add support for custom encodin...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20949#discussion_r203228679

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---

@@ -512,6 +513,43 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te
     }
   }

+  test("SPARK-19018: Save csv with custom charset") {
+
+    // scalastyle:off nonascii
+    val content = "µà áâä ÃÃÃ"
+    // scalastyle:on nonascii
+
+    Seq("iso-8859-1", "utf-8", "utf-16", "utf-32", "windows-1250").foreach { encoding =>
+      withTempDir { dir =>
+        val csvDir = new File(dir, "csv")
+
+        val originalDF = Seq(content).toDF("_c0").repartition(1)
+        originalDF.write
+          .option("encoding", encoding)
+          .csv(csvDir.getCanonicalPath)
+
+        csvDir.listFiles().filter(_.getName.endsWith("csv")).foreach({ csvFile =>
+          val readback = Files.readAllBytes(csvFile.toPath)
+          val expected = (content + "\n").getBytes(Charset.forName(encoding))
+          assert(readback === expected)
+        })
+      }
+    }
+  }
+
+  test("SPARK-19018: error handling for unsupported charsets") {
+    val exception = intercept[SparkException] {
+      withTempDir { dir =>

--- End diff --

`withTempPath`
[GitHub] spark pull request #20949: [SPARK-19018][SQL] Add support for custom encodin...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20949#discussion_r203228640

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---

@@ -512,6 +513,43 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te
     }
   }

+  test("SPARK-19018: Save csv with custom charset") {
+
+    // scalastyle:off nonascii
+    val content = "µà áâä ÃÃÃ"
+    // scalastyle:on nonascii
+
+    Seq("iso-8859-1", "utf-8", "utf-16", "utf-32", "windows-1250").foreach { encoding =>
+      withTempDir { dir =>
+        val csvDir = new File(dir, "csv")
+
+        val originalDF = Seq(content).toDF("_c0").repartition(1)
+        originalDF.write
+          .option("encoding", encoding)
+          .csv(csvDir.getCanonicalPath)
+
+        csvDir.listFiles().filter(_.getName.endsWith("csv")).foreach({ csvFile =>
+          val readback = Files.readAllBytes(csvFile.toPath)
+          val expected = (content + "\n").getBytes(Charset.forName(encoding))

--- End diff --

Currently the newline is dependent on Univocity, so this test is going to break on Windows. Let's use the platform's newline.
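[Editor's note] A sketch of the suggested fix, which the later revision of this test adopts via `scala.util.Properties.lineSeparator` (`content` and `encoding` as in the test above):

```scala
import java.nio.charset.Charset
import scala.util.Properties

// Properties.lineSeparator is "\n" on Unix and "\r\n" on Windows, so the
// expected bytes match whatever newline the writer emits on the host OS.
val expected = (content + Properties.lineSeparator).getBytes(Charset.forName(encoding))
```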
[GitHub] spark pull request #20949: [SPARK-19018][SQL] Add support for custom encodin...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20949#discussion_r203228403

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---

@@ -512,6 +513,43 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te
     }
   }

+  test("SPARK-19018: Save csv with custom charset") {
+
+    // scalastyle:off nonascii
+    val content = "µà áâä ÃÃÃ"
+    // scalastyle:on nonascii
+
+    Seq("iso-8859-1", "utf-8", "utf-16", "utf-32", "windows-1250").foreach { encoding =>
+      withTempDir { dir =>
+        val csvDir = new File(dir, "csv")
+
+        val originalDF = Seq(content).toDF("_c0").repartition(1)
+        originalDF.write
+          .option("encoding", encoding)
+          .csv(csvDir.getCanonicalPath)
+
+        csvDir.listFiles().filter(_.getName.endsWith("csv")).foreach({ csvFile =>

--- End diff --

`h({ ` => `h { `
[GitHub] spark pull request #20949: [SPARK-19018][SQL] Add support for custom encodin...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20949#discussion_r203228243

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala ---

@@ -146,7 +148,12 @@ private[csv] class CsvOutputWriter(
     context: TaskAttemptContext,
     params: CSVOptions) extends OutputWriter with Logging {

-  private val writer = CodecStreams.createOutputStreamWriter(context, new Path(path))
+  private val charset = Charset.forName(params.charset)
+
+  private val writer = CodecStreams.createOutputStreamWriter(
+    context,

--- End diff --

tiny nit:

```scala
  private val writer = CodecStreams.createOutputStreamWriter(
    context, new Path(path), charset)
```
[GitHub] spark pull request #20949: [SPARK-19018][SQL] Add support for custom encodin...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20949#discussion_r203227873

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala ---

@@ -625,6 +625,7 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) {
    * enclosed in quotes. Default is to only escape values containing a quote character.
    * `header` (default `false`): writes the names of columns as the first line.
    * `nullValue` (default empty string): sets the string representation of a null value.
+   * `encoding` (default `UTF-8`): encoding to use when saving to file.

--- End diff --

I think we should match the doc with JSON's: https://github.com/apache/spark/blob/6ea582e36ab0a2e4e01340f6fc8cfb8d493d567d/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L525-L526
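[Editor's note] For context, the JSON writer documentation at the linked revision reads approximately as follows (paraphrased; treat the exact wording as an assumption rather than a verbatim quote):

```scala
// From DataFrameWriter.scala, json writer options (approximate wording):
//   `encoding` (by default it is not set): specifies encoding (charset) of
//   saved json files. If it is not set, the UTF-8 charset will be used.
```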
[GitHub] spark pull request #20949: [SPARK-19018][SQL] Add support for custom encodin...
Github user crafty-coder commented on a diff in the pull request: https://github.com/apache/spark/pull/20949#discussion_r203023263

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---

@@ -512,6 +512,43 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te
     }
   }

+  test("Save csv with custom charset") {
+    Seq("iso-8859-1", "utf-8", "windows-1250").foreach { encoding =>
+      withTempDir { dir =>
+        val csvDir = new File(dir, "csv").getCanonicalPath
+        // scalastyle:off
+        val originalDF = Seq("µà áâä ÃÃÃ").toDF("_c0")
+        // scalastyle:on
+        originalDF.write
+          .option("header", "false")

--- End diff --

My bad, there is no reason. It's fixed in the next commit.
[GitHub] spark pull request #20949: [SPARK-19018][SQL] Add support for custom encodin...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/20949#discussion_r202781709

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---

@@ -512,6 +513,44 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te
     }
   }

+  test("Save csv with custom charset") {

--- End diff --

Could you prepend `SPARK-19018` to the test title.
[GitHub] spark pull request #20949: [SPARK-19018][SQL] Add support for custom encodin...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20949#discussion_r197662087

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---

@@ -513,6 +513,43 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
     }
   }

+  test("Save csv with custom charset") {
+    Seq("iso-8859-1", "utf-8", "windows-1250").foreach { encoding =>
+      withTempDir { dir =>
+        val csvDir = new File(dir, "csv").getCanonicalPath
+        // scalastyle:off
+        val originalDF = Seq("µà áâä ÃÃÃ").toDF("_c0")
+        // scalastyle:on
+        originalDF.write
+          .option("header", "false")
+          .option("encoding", encoding)
+          .csv(csvDir)
+
+        val df = spark
+          .read
+          .option("header", "false")
+          .option("encoding", encoding)

--- End diff --

Now it's fine. I think we decided to support encoding in the CSV/JSON datasources. Ignore the comment above; we can proceed separately.
[GitHub] spark pull request #20949: [SPARK-19018][SQL] Add support for custom encodin...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20949#discussion_r197662012

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---

@@ -512,6 +512,43 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te
     }
   }

+  test("Save csv with custom charset") {
+    Seq("iso-8859-1", "utf-8", "windows-1250").foreach { encoding =>
+      withTempDir { dir =>
+        val csvDir = new File(dir, "csv").getCanonicalPath
+        // scalastyle:off

--- End diff --

Let's ignore the specific rule for this, e.g.:

```
// scalastyle:off nonascii
...
// scalastyle:on nonascii
```
[GitHub] spark pull request #20949: [SPARK-19018][SQL] Add support for custom encodin...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/20949#discussion_r197644850

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala ---

@@ -146,7 +148,13 @@ private[csv] class CsvOutputWriter(
     context: TaskAttemptContext,
     params: CSVOptions) extends OutputWriter with Logging {

-  private val writer = CodecStreams.createOutputStreamWriter(context, new Path(path))
+  private val charset = Charset.forName(params.charset)
+
+  private val writer = CodecStreams.createOutputStreamWriter(
+    context,
+    new Path(path),
+    charset
+  )

--- End diff --

Move the `)` up, like `charset)`. See https://github.com/databricks/scala-style-guide
[GitHub] spark pull request #20949: [SPARK-19018][SQL] Add support for custom encodin...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/20949#discussion_r197644657

--- Diff: python/pyspark/sql/readwriter.py ---

@@ -895,6 +895,8 @@ def csv(self, path, mode=None, compression=None, sep=None, quote=None, escape=No
                        the quote character. If None is set, the default value is escape character
                        when escape and quote characters are different, ``\0`` otherwise..

+        :param encoding: sets encoding used for encoding the file. If None is set, it

--- End diff --

Could you reformulate this: `encoding used for encoding`?
[GitHub] spark pull request #20949: [SPARK-19018][SQL] Add support for custom encodin...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/20949#discussion_r197644302

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---

@@ -512,6 +512,43 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te
     }
   }

+  test("Save csv with custom charset") {
+    Seq("iso-8859-1", "utf-8", "windows-1250").foreach { encoding =>
+      withTempDir { dir =>
+        val csvDir = new File(dir, "csv").getCanonicalPath
+        // scalastyle:off
+        val originalDF = Seq("µà áâä ÃÃÃ").toDF("_c0")
+        // scalastyle:on
+        originalDF.write
+          .option("header", "false")

--- End diff --

The header flag is disabled by default. Just in case, are there any specific reasons for testing without a CSV header?
[GitHub] spark pull request #20949: [SPARK-19018][SQL] Add support for custom encodin...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/20949#discussion_r197643948

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---

@@ -512,6 +512,43 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te
     }
   }

+  test("Save csv with custom charset") {
+    Seq("iso-8859-1", "utf-8", "windows-1250").foreach { encoding =>

--- End diff --

Could you check the `UTF-16` and `UTF-32` encodings too. The written csv files must contain [BOMs](https://en.wikipedia.org/wiki/Byte_order_mark) for such encodings. I am not sure that the Spark CSV datasource is able to read them in per-line mode (`multiLine` set to `false`). Probably you need to switch to multiLine mode or read the files with Scala's library, as in JsonSuite: https://github.com/apache/spark/blob/c7e2742f9bce2fcb7c717df80761939272beff54/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala#L2322-L2338
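[Editor's note] A hedged sketch of the kind of byte-level check being suggested. For `UTF-16`, the JVM's charset encoder writes a big-endian byte order mark at the start of the stream, so reading the raw bytes makes the BOM visible without depending on the CSV reader's line parsing; whether `UTF-32` output carries a BOM is left as the reviewer's expectation. `csvFile` is assumed to be one of the written part files, as in the test above:

```scala
import java.io.File
import java.nio.file.Files

// The standard "UTF-16" charset encodes big-endian with a BOM of 0xFE 0xFF,
// so a freshly written UTF-16 CSV part file should start with those bytes.
def startsWithUtf16Bom(csvFile: File): Boolean = {
  val bytes = Files.readAllBytes(csvFile.toPath)
  bytes.length >= 2 && bytes(0) == 0xFE.toByte && bytes(1) == 0xFF.toByte
}
```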
[GitHub] spark pull request #20949: [SPARK-19018][SQL] Add support for custom encodin...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20949#discussion_r178424628

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---

@@ -513,6 +513,43 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
     }
   }

+  test("Save csv with custom charset") {
+    Seq("iso-8859-1", "utf-8", "windows-1250").foreach { encoding =>
+      withTempDir { dir =>
+        val csvDir = new File(dir, "csv").getCanonicalPath
+        // scalastyle:off
+        val originalDF = Seq("µà áâä ÃÃÃ").toDF("_c0")
+        // scalastyle:on
+        originalDF.write
+          .option("header", "false")
+          .option("encoding", encoding)
+          .csv(csvDir)
+
+        val df = spark
+          .read
+          .option("header", "false")
+          .option("encoding", encoding)

--- End diff --

I think our CSV read encoding option is incomplete for now; there are many ongoing discussions about it. I am going to fix the read path soon. Let me revisit this after fixing it.
[GitHub] spark pull request #20949: [SPARK-19018][SQL] Add support for custom encodin...
GitHub user crafty-coder opened a pull request: https://github.com/apache/spark/pull/20949

[SPARK-19018][SQL] Add support for custom encoding on csv writer

## What changes were proposed in this pull request?

Add support for a custom encoding on the csv writer, see https://issues.apache.org/jira/browse/SPARK-19018

## How was this patch tested?

Added two unit tests in CSVSuite.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/crafty-coder/spark master

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20949.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20949

commit b9a7bf03b312da151e1d7e37338092bbf5bcb38a
Author: crafty-coder
Date: 2018-03-30T19:35:04Z
[SPARK-19018][SQL] Add support for custom encoding on csv writer
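[Editor's note] For readers skimming this thread, a minimal usage sketch of the option this PR adds. The option name `encoding` comes from the diffs above; the output path and example data are hypothetical:

```scala
// Write a DataFrame as CSV in a non-default charset. Any name accepted by
// java.nio.charset.Charset.forName should work; an unknown name fails the
// write job with a SparkException (see the "unsupported charsets" test above).
spark.range(3).selectExpr("CAST(id AS STRING) AS value")
  .write
  .option("encoding", "ISO-8859-1")
  .csv("/tmp/csv-latin1")  // hypothetical output path
```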