[GitHub] spark pull request #20949: [SPARK-19018][SQL] Add support for custom encodin...

2018-07-24 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/20949


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20949: [SPARK-19018][SQL] Add support for custom encodin...

2018-07-24 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20949#discussion_r204988217
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
@@ -514,6 +516,41 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te
 }
   }
 
+  test("SPARK-19018: Save csv with custom charset") {
+
+    // scalastyle:off nonascii
+    val content = "µß áâä ÁÂÄ"
+    // scalastyle:on nonascii
+
+    Seq("iso-8859-1", "utf-8", "utf-16", "utf-32", "windows-1250").foreach { encoding =>
+      withTempPath { path =>
+        val csvDir = new File(path, "csv")
+        Seq(content).toDF().write
+          .option("encoding", encoding)
+          .csv(csvDir.getCanonicalPath)
+
+        csvDir.listFiles().filter(_.getName.endsWith("csv")).foreach({ csvFile =>
+          val readback = Files.readAllBytes(csvFile.toPath)
+          val expected = (content + Properties.lineSeparator).getBytes(Charset.forName(encoding))
+          assert(readback === expected)
+        })
+      }
+    }
+  }
+
+  test("SPARK-19018: error handling for unsupported charsets") {
+    val exception = intercept[SparkException] {
+      withTempPath { path =>
+        val csvDir = new File(path, "csv").getCanonicalPath
+        Seq("a,A,c,A,b,B").toDF().write
+          .option("encoding", "1-9588-osi")
+          .csv(csvDir)
--- End diff --

nit: you could use `path.getCanonicalPath` directly


---




[GitHub] spark pull request #20949: [SPARK-19018][SQL] Add support for custom encodin...

2018-07-24 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20949#discussion_r204988168
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
@@ -514,6 +516,41 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te
 }
   }
 
+  test("SPARK-19018: Save csv with custom charset") {
+
+    // scalastyle:off nonascii
+    val content = "µß áâä ÁÂÄ"
+    // scalastyle:on nonascii
+
+    Seq("iso-8859-1", "utf-8", "utf-16", "utf-32", "windows-1250").foreach { encoding =>
+      withTempPath { path =>
+        val csvDir = new File(path, "csv")
+        Seq(content).toDF().write
+          .option("encoding", encoding)
+          .csv(csvDir.getCanonicalPath)
+
+        csvDir.listFiles().filter(_.getName.endsWith("csv")).foreach({ csvFile =>
--- End diff --

nit: `.foreach({` -> `.foreach {` per https://github.com/databricks/scala-style-guide#anonymous-methods


---




[GitHub] spark pull request #20949: [SPARK-19018][SQL] Add support for custom encodin...

2018-07-24 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20949#discussion_r204988574
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
@@ -514,6 +516,41 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te
 }
   }
 
+  test("SPARK-19018: Save csv with custom charset") {
+
+    // scalastyle:off nonascii
+    val content = "µß áâä ÁÂÄ"
+    // scalastyle:on nonascii
+
+    Seq("iso-8859-1", "utf-8", "utf-16", "utf-32", "windows-1250").foreach { encoding =>
+      withTempPath { path =>
+        val csvDir = new File(path, "csv")
+        Seq(content).toDF().write
--- End diff --

nit: `.write.repartition(1)` to make sure we write only one file


---




[GitHub] spark pull request #20949: [SPARK-19018][SQL] Add support for custom encodin...

2018-07-18 Thread crafty-coder
Github user crafty-coder commented on a diff in the pull request:

https://github.com/apache/spark/pull/20949#discussion_r203306174
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
 ---
@@ -512,6 +513,43 @@ class CSVSuite extends QueryTest with SharedSQLContext 
with SQLTestUtils with Te
 }
   }
 
+  test("SPARK-19018: Save csv with custom charset") {
+
+// scalastyle:off nonascii
+val content = "µß áâä ÁÂÄ"
+// scalastyle:on nonascii
+
+Seq("iso-8859-1", "utf-8", "utf-16", "utf-32", "windows-1250").foreach 
{ encoding =>
+  withTempDir { dir =>
+val csvDir = new File(dir, "csv")
+
+val originalDF = Seq(content).toDF("_c0").repartition(1)
+originalDF.write
+  .option("encoding", encoding)
+  .csv(csvDir.getCanonicalPath)
+
+csvDir.listFiles().filter(_.getName.endsWith("csv")).foreach({ 
csvFile =>
--- End diff --

What do you mean?


---




[GitHub] spark pull request #20949: [SPARK-19018][SQL] Add support for custom encodin...

2018-07-18 Thread crafty-coder
Github user crafty-coder commented on a diff in the pull request:

https://github.com/apache/spark/pull/20949#discussion_r203286908
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
@@ -512,6 +513,43 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te
 }
   }
 
+  test("SPARK-19018: Save csv with custom charset") {
+
+    // scalastyle:off nonascii
+    val content = "µß áâä ÁÂÄ"
+    // scalastyle:on nonascii
+
+    Seq("iso-8859-1", "utf-8", "utf-16", "utf-32", "windows-1250").foreach { encoding =>
+      withTempDir { dir =>
+        val csvDir = new File(dir, "csv")
+
+        val originalDF = Seq(content).toDF("_c0").repartition(1)
+        originalDF.write
+          .option("encoding", encoding)
+          .csv(csvDir.getCanonicalPath)
+
+        csvDir.listFiles().filter(_.getName.endsWith("csv")).foreach({ csvFile =>
+          val readback = Files.readAllBytes(csvFile.toPath)
+          val expected = (content + "\n").getBytes(Charset.forName(encoding))
--- End diff --

Good Point!


---




[GitHub] spark pull request #20949: [SPARK-19018][SQL] Add support for custom encodin...

2018-07-17 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20949#discussion_r203229065
  
--- Diff: python/pyspark/sql/readwriter.py ---
@@ -895,6 +895,8 @@ def csv(self, path, mode=None, compression=None, sep=None, quote=None, escape=No
   the quote character. If None is set, the default value is
   escape character when escape and quote characters are
   different, ``\0`` otherwise..
+:param encoding: sets the encoding (charset) to be used on the csv file. If None is set, it
+  uses the default value, ``UTF-8``.
--- End diff --

Likewise, let's match the doc to JSON's.


---




[GitHub] spark pull request #20949: [SPARK-19018][SQL] Add support for custom encodin...

2018-07-17 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20949#discussion_r203228930
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
 ---
@@ -512,6 +513,43 @@ class CSVSuite extends QueryTest with SharedSQLContext 
with SQLTestUtils with Te
 }
   }
 
+  test("SPARK-19018: Save csv with custom charset") {
+
+// scalastyle:off nonascii
+val content = "µß áâä ÁÂÄ"
+// scalastyle:on nonascii
+
+Seq("iso-8859-1", "utf-8", "utf-16", "utf-32", "windows-1250").foreach 
{ encoding =>
+  withTempDir { dir =>
+val csvDir = new File(dir, "csv")
+
+val originalDF = Seq(content).toDF("_c0").repartition(1)
--- End diff --

`toDF("_c0")` -> `toDF()`


---




[GitHub] spark pull request #20949: [SPARK-19018][SQL] Add support for custom encodin...

2018-07-17 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20949#discussion_r203228844
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
 ---
@@ -512,6 +513,43 @@ class CSVSuite extends QueryTest with SharedSQLContext 
with SQLTestUtils with Te
 }
   }
 
+  test("SPARK-19018: Save csv with custom charset") {
+
+// scalastyle:off nonascii
+val content = "µß áâä ÁÂÄ"
+// scalastyle:on nonascii
+
+Seq("iso-8859-1", "utf-8", "utf-16", "utf-32", "windows-1250").foreach 
{ encoding =>
+  withTempDir { dir =>
--- End diff --

`withTempDir` -> `withTempPath`


---




[GitHub] spark pull request #20949: [SPARK-19018][SQL] Add support for custom encodin...

2018-07-17 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20949#discussion_r203228679
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
 ---
@@ -512,6 +513,43 @@ class CSVSuite extends QueryTest with SharedSQLContext 
with SQLTestUtils with Te
 }
   }
 
+  test("SPARK-19018: Save csv with custom charset") {
+
+// scalastyle:off nonascii
+val content = "µß áâä ÁÂÄ"
+// scalastyle:on nonascii
+
+Seq("iso-8859-1", "utf-8", "utf-16", "utf-32", "windows-1250").foreach 
{ encoding =>
+  withTempDir { dir =>
+val csvDir = new File(dir, "csv")
+
+val originalDF = Seq(content).toDF("_c0").repartition(1)
+originalDF.write
+  .option("encoding", encoding)
+  .csv(csvDir.getCanonicalPath)
+
+csvDir.listFiles().filter(_.getName.endsWith("csv")).foreach({ 
csvFile =>
+  val readback = Files.readAllBytes(csvFile.toPath)
+  val expected = (content + 
"\n").getBytes(Charset.forName(encoding))
+  assert(readback === expected)
+})
+  }
+}
+  }
+
+  test("SPARK-19018: error handling for unsupported charsets") {
+val exception = intercept[SparkException] {
+  withTempDir { dir =>
--- End diff --

`withTempPath`


---




[GitHub] spark pull request #20949: [SPARK-19018][SQL] Add support for custom encodin...

2018-07-17 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20949#discussion_r203228640
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
 ---
@@ -512,6 +513,43 @@ class CSVSuite extends QueryTest with SharedSQLContext 
with SQLTestUtils with Te
 }
   }
 
+  test("SPARK-19018: Save csv with custom charset") {
+
+// scalastyle:off nonascii
+val content = "µß áâä ÁÂÄ"
+// scalastyle:on nonascii
+
+Seq("iso-8859-1", "utf-8", "utf-16", "utf-32", "windows-1250").foreach 
{ encoding =>
+  withTempDir { dir =>
+val csvDir = new File(dir, "csv")
+
+val originalDF = Seq(content).toDF("_c0").repartition(1)
+originalDF.write
+  .option("encoding", encoding)
+  .csv(csvDir.getCanonicalPath)
+
+csvDir.listFiles().filter(_.getName.endsWith("csv")).foreach({ 
csvFile =>
+  val readback = Files.readAllBytes(csvFile.toPath)
+  val expected = (content + 
"\n").getBytes(Charset.forName(encoding))
--- End diff --

Currently, the newline depends on Univocity, so this test will break on Windows. Let's use the platform's newline.


---




[GitHub] spark pull request #20949: [SPARK-19018][SQL] Add support for custom encodin...

2018-07-17 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20949#discussion_r203228403
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
 ---
@@ -512,6 +513,43 @@ class CSVSuite extends QueryTest with SharedSQLContext 
with SQLTestUtils with Te
 }
   }
 
+  test("SPARK-19018: Save csv with custom charset") {
+
+// scalastyle:off nonascii
+val content = "µß áâä ÁÂÄ"
+// scalastyle:on nonascii
+
+Seq("iso-8859-1", "utf-8", "utf-16", "utf-32", "windows-1250").foreach 
{ encoding =>
+  withTempDir { dir =>
+val csvDir = new File(dir, "csv")
+
+val originalDF = Seq(content).toDF("_c0").repartition(1)
+originalDF.write
+  .option("encoding", encoding)
+  .csv(csvDir.getCanonicalPath)
+
+csvDir.listFiles().filter(_.getName.endsWith("csv")).foreach({ 
csvFile =>
--- End diff --

`h({ ` => `h { `


---




[GitHub] spark pull request #20949: [SPARK-19018][SQL] Add support for custom encodin...

2018-07-17 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20949#discussion_r203228243
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala ---
@@ -146,7 +148,12 @@ private[csv] class CsvOutputWriter(
 context: TaskAttemptContext,
 params: CSVOptions) extends OutputWriter with Logging {
 
-  private val writer = CodecStreams.createOutputStreamWriter(context, new Path(path))
+  private val charset = Charset.forName(params.charset)
+
+  private val writer = CodecStreams.createOutputStreamWriter(
+    context,
--- End diff --

tiny nit:

```scala
private val writer = CodecStreams.createOutputStreamWriter(
  context, new Path(path), charset)
```
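For context, the JDK mechanism this change ultimately relies on is an `OutputStreamWriter` constructed with an explicit `Charset`; a self-contained sketch using plain JDK APIs (not Spark's `CodecStreams`):

```scala
import java.io.{ByteArrayOutputStream, OutputStreamWriter}
import java.nio.charset.Charset

// An OutputStreamWriter with an explicit Charset encodes characters into the
// requested encoding as they are written.
val out = new ByteArrayOutputStream()
val writer = new OutputStreamWriter(out, Charset.forName("windows-1250"))
writer.write("µß")  // two non-ASCII characters
writer.close()

// windows-1250 is a single-byte charset, so two characters become two bytes.
println(out.toByteArray.length)  // 2
```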



---




[GitHub] spark pull request #20949: [SPARK-19018][SQL] Add support for custom encodin...

2018-07-17 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20949#discussion_r203227873
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala ---
@@ -625,6 +625,7 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) {
 * enclosed in quotes. Default is to only escape values containing a quote character.
 * `header` (default `false`): writes the names of columns as the first line.
 * `nullValue` (default empty string): sets the string representation of a null value.
+   * `encoding` (default `UTF-8`): encoding to use when saving to file.
--- End diff --

I think we should match the doc with JSON's


https://github.com/apache/spark/blob/6ea582e36ab0a2e4e01340f6fc8cfb8d493d567d/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L525-L526


---




[GitHub] spark pull request #20949: [SPARK-19018][SQL] Add support for custom encodin...

2018-07-17 Thread crafty-coder
Github user crafty-coder commented on a diff in the pull request:

https://github.com/apache/spark/pull/20949#discussion_r203023263
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
@@ -512,6 +512,43 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te
 }
   }
 
+  test("Save csv with custom charset") {
+    Seq("iso-8859-1", "utf-8", "windows-1250").foreach { encoding =>
+      withTempDir { dir =>
+        val csvDir = new File(dir, "csv").getCanonicalPath
+        // scalastyle:off
+        val originalDF = Seq("µß áâä ÁÂÄ").toDF("_c0")
+        // scalastyle:on
+        originalDF.write
+          .option("header", "false")
--- End diff --

My bad, there is no reason. It's fixed in the next commit.


---




[GitHub] spark pull request #20949: [SPARK-19018][SQL] Add support for custom encodin...

2018-07-16 Thread MaxGekk
Github user MaxGekk commented on a diff in the pull request:

https://github.com/apache/spark/pull/20949#discussion_r202781709
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
@@ -512,6 +513,44 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te
 }
   }
 
+  test("Save csv with custom charset") {
--- End diff --

Could you prepend `SPARK-19018` to the test title.


---




[GitHub] spark pull request #20949: [SPARK-19018][SQL] Add support for custom encodin...

2018-06-24 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20949#discussion_r197662087
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
@@ -513,6 +513,43 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
 }
   }
 
+  test("Save csv with custom charset") {
+    Seq("iso-8859-1", "utf-8", "windows-1250").foreach { encoding =>
+      withTempDir { dir =>
+        val csvDir = new File(dir, "csv").getCanonicalPath
+        // scalastyle:off
+        val originalDF = Seq("µß áâä ÁÂÄ").toDF("_c0")
+        // scalastyle:on
+        originalDF.write
+          .option("header", "false")
+          .option("encoding", encoding)
+          .csv(csvDir)
+
+        val df = spark
+          .read
+          .option("header", "false")
+          .option("encoding", encoding)
--- End diff --

Now it's fine. I think we decided to support encoding in CSV/JSON datasources. Ignore the comment above; we can proceed separately.


---




[GitHub] spark pull request #20949: [SPARK-19018][SQL] Add support for custom encodin...

2018-06-24 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20949#discussion_r197662012
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
 ---
@@ -512,6 +512,43 @@ class CSVSuite extends QueryTest with SharedSQLContext 
with SQLTestUtils with Te
 }
   }
 
+  test("Save csv with custom charset") {
+Seq("iso-8859-1", "utf-8", "windows-1250").foreach { encoding =>
+  withTempDir { dir =>
+val csvDir = new File(dir, "csv").getCanonicalPath
+// scalastyle:off
--- End diff --

Let's ignore the specific rule for this, e.g.:

```
// scalastyle:off nonascii
...
// scalastyle:on nonascii
```


---




[GitHub] spark pull request #20949: [SPARK-19018][SQL] Add support for custom encodin...

2018-06-24 Thread MaxGekk
Github user MaxGekk commented on a diff in the pull request:

https://github.com/apache/spark/pull/20949#discussion_r197644850
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala
 ---
@@ -146,7 +148,13 @@ private[csv] class CsvOutputWriter(
 context: TaskAttemptContext,
 params: CSVOptions) extends OutputWriter with Logging {
 
-  private val writer = CodecStreams.createOutputStreamWriter(context, new 
Path(path))
+  private val charset = Charset.forName(params.charset)
+
+  private val writer = CodecStreams.createOutputStreamWriter(
+context,
+new Path(path),
+charset
+  )
--- End diff --

Move the `)` up like `charset)`. See https://github.com/databricks/scala-style-guide


---




[GitHub] spark pull request #20949: [SPARK-19018][SQL] Add support for custom encodin...

2018-06-24 Thread MaxGekk
Github user MaxGekk commented on a diff in the pull request:

https://github.com/apache/spark/pull/20949#discussion_r197644657
  
--- Diff: python/pyspark/sql/readwriter.py ---
@@ -895,6 +895,8 @@ def csv(self, path, mode=None, compression=None, sep=None, quote=None, escape=No
   the quote character. If None is set, the default value is
   escape character when escape and quote characters are
   different, ``\0`` otherwise..
+:param encoding: sets encoding used for encoding the file. If None is set, it
--- End diff --

Could you reformulate this: `encoding used for encoding`?


---




[GitHub] spark pull request #20949: [SPARK-19018][SQL] Add support for custom encodin...

2018-06-24 Thread MaxGekk
Github user MaxGekk commented on a diff in the pull request:

https://github.com/apache/spark/pull/20949#discussion_r197644302
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
 ---
@@ -512,6 +512,43 @@ class CSVSuite extends QueryTest with SharedSQLContext 
with SQLTestUtils with Te
 }
   }
 
+  test("Save csv with custom charset") {
+Seq("iso-8859-1", "utf-8", "windows-1250").foreach { encoding =>
+  withTempDir { dir =>
+val csvDir = new File(dir, "csv").getCanonicalPath
+// scalastyle:off
+val originalDF = Seq("µß áâä ÁÂÄ").toDF("_c0")
+// scalastyle:on
+originalDF.write
+  .option("header", "false")
--- End diff --

The header flag is disabled by default. Just in case, are there any specific reasons for testing without the CSV header?


---




[GitHub] spark pull request #20949: [SPARK-19018][SQL] Add support for custom encodin...

2018-06-24 Thread MaxGekk
Github user MaxGekk commented on a diff in the pull request:

https://github.com/apache/spark/pull/20949#discussion_r197643948
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
 ---
@@ -512,6 +512,43 @@ class CSVSuite extends QueryTest with SharedSQLContext 
with SQLTestUtils with Te
 }
   }
 
+  test("Save csv with custom charset") {
+Seq("iso-8859-1", "utf-8", "windows-1250").foreach { encoding =>
--- End diff --

Could you check the `UTF-16` and `UTF-32` encodings too. The written csv files must contain [BOMs](https://en.wikipedia.org/wiki/Byte_order_mark) for such encodings. I am not sure that the Spark CSV datasource is able to read them in per-line mode (`multiLine` set to `false`). Probably, you need to switch to multiLine mode or read the files with Scala's library like in JsonSuite: https://github.com/apache/spark/blob/c7e2742f9bce2fcb7c717df80761939272beff54/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala#L2322-L2338
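To illustrate the BOM point with a standalone sketch (plain JDK charsets, not Spark):

```scala
import java.nio.charset.Charset

// Java's "UTF-16" encoder prepends a byte order mark (0xFE 0xFF, big-endian),
// so a file written this way starts with bytes that a naive per-line reader
// would treat as part of the first record.
val utf16 = "abc".getBytes(Charset.forName("UTF-16"))
println(utf16.take(2).map(b => f"${b & 0xff}%02x").mkString(" "))  // fe ff

// A single-byte charset like ISO-8859-1 adds no BOM: one byte per character.
val latin1 = "abc".getBytes(Charset.forName("ISO-8859-1"))
println(latin1.length)  // 3
```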


---




[GitHub] spark pull request #20949: [SPARK-19018][SQL] Add support for custom encodin...

2018-03-30 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20949#discussion_r178424628
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
 ---
@@ -513,6 +513,43 @@ class CSVSuite extends QueryTest with SharedSQLContext 
with SQLTestUtils {
 }
   }
 
+  test("Save csv with custom charset") {
+Seq("iso-8859-1", "utf-8", "windows-1250").foreach { encoding =>
+  withTempDir { dir =>
+val csvDir = new File(dir, "csv").getCanonicalPath
+// scalastyle:off
+val originalDF = Seq("µß áâä ÁÂÄ").toDF("_c0")
+// scalastyle:on
+originalDF.write
+  .option("header", "false")
+  .option("encoding", encoding)
+  .csv(csvDir)
+
+val df = spark
+  .read
+  .option("header", "false")
+  .option("encoding", encoding)
--- End diff --

I think our CSV read encoding option is incomplete for now; there are many discussions about this at the moment. I am going to fix the read path soon. Let me revisit this after fixing it.


---




[GitHub] spark pull request #20949: [SPARK-19018][SQL] Add support for custom encodin...

2018-03-30 Thread crafty-coder
GitHub user crafty-coder opened a pull request:

https://github.com/apache/spark/pull/20949

[SPARK-19018][SQL] Add support for custom encoding on csv writer

## What changes were proposed in this pull request?

Add support for custom encoding on csv writer, see https://issues.apache.org/jira/browse/SPARK-19018

## How was this patch tested?

Added two unit tests in CSVSuite


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/crafty-coder/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20949.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20949


commit b9a7bf03b312da151e1d7e37338092bbf5bcb38a
Author: crafty-coder 
Date:   2018-03-30T19:35:04Z

[SPARK-19018][SQL] Add support for custom encoding on csv writer




---
