[GitHub] spark pull request #20885: [SPARK-23724][SPARK-23765][SQL] Line separator fo...
Github user MaxGekk closed the pull request at: https://github.com/apache/spark/pull/20885

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20885#discussion_r176958504

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala ---
@@ -85,6 +85,38 @@ private[sql] class JSONOptions(
   val multiLine = parameters.get("multiLine").map(_.toBoolean).getOrElse(false)
+  val charset: Option[String] = Some("UTF-8")
--- End diff --

It sounds like we need to review https://github.com/apache/spark/pull/20849 first
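[Editor's note] The `charset` and line-separator options under discussion boil down to two steps: split the raw bytes on an explicit delimiter, then decode each record with the configured charset. A minimal Python sketch of those semantics (not Spark's actual implementation; `split_json_records` is a hypothetical helper):

```python
import json

def split_json_records(data: bytes, line_sep: bytes, charset: str = "UTF-8"):
    """Split a byte buffer into JSON records on an explicit delimiter,
    then decode each record with the given charset before parsing."""
    return [json.loads(chunk.decode(charset))
            for chunk in data.split(line_sep) if chunk.strip()]

# Records separated by a custom delimiter instead of '\n'.
payload = b'{"id": 1}||{"id": 2}||{"id": 3}'
print(split_json_records(payload, b"||"))
```

The point of taking the delimiter as bytes rather than a string is that the split must happen before charset decoding, which is why the two options interact.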
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20885#discussion_r176958338

--- Diff: python/pyspark/sql/readwriter.py ---
@@ -176,7 +176,7 @@ def json(self, path, schema=None, primitivesAsString=None, prefersDecimal=None,
              allowComments=None, allowUnquotedFieldNames=None, allowSingleQuotes=None,
              allowNumericLeadingZero=None, allowBackslashEscapingAnyCharacter=None,
              mode=None, columnNameOfCorruptRecord=None, dateFormat=None, timestampFormat=None,
-             multiLine=None, allowUnquotedControlChars=None):
+             multiLine=None, allowUnquotedControlChars=None, lineSep=None):
--- End diff --

rename it to `recordDelimiter`
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/20885#discussion_r176867866

--- Diff: python/pyspark/sql/readwriter.py ---
@@ -770,12 +773,15 @@ def json(self, path, mode=None, compression=None, dateFormat=None, timestampFormat=None,
         formats follow the formats at ``java.text.SimpleDateFormat``. This applies to
         timestamp type. If None is set, it uses the default value,
         ``yyyy-MM-dd'T'HH:mm:ss.SSSXXX``.
+        :param lineSep: defines the line separator that should be used for writing. If None is
+        set, it uses the default value, ``\\n``.
--- End diff --

It is a method of DataFrameWriter. It writes exactly `'\n'`
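[Editor's note] The write-side behavior being documented here is simpler than the read side: the writer always emits exactly one separator string between records, defaulting to `'\n'`. A minimal sketch of that semantics (`write_json_lines` is a hypothetical helper, not pyspark's code):

```python
import json

def write_json_lines(records, line_sep="\n"):
    """Serialize each record to one JSON document and join them with
    exactly line_sep between records (default '\n')."""
    return line_sep.join(json.dumps(r) for r in records)

print(write_json_lines([{"id": 1}, {"id": 2}]))
print(repr(write_json_lines([{"id": 1}, {"id": 2}], "\r\n")))
```

This is why the docstring for writing should say exactly ``\\n``, while the reading docstring has to describe auto-detection of several terminators.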
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20885#discussion_r176821110

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala ---
@@ -85,6 +85,38 @@ private[sql] class JSONOptions(
   val multiLine = parameters.get("multiLine").map(_.toBoolean).getOrElse(false)
+  val charset: Option[String] = Some("UTF-8")
+
+  /**
+   * A sequence of bytes between two consecutive json records. Format of the option is:
+   * selector (1 char) + delimiter body (any length) | sequence of chars
--- End diff --

I'm afraid of defining our own rule here. Is there any standard we can follow?
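[Editor's note] The option grammar being questioned here ("selector (1 char) + delimiter body | sequence of chars") is illustrated by the `x0d 0a` hex form mentioned in the commit log. A sketch of one possible interpretation, assuming a leading `x` selects the hex form and anything else is taken literally (the exact grammar was still under discussion in this PR):

```python
def parse_delimiter(spec: str) -> bytes:
    """Parse a delimiter spec: a leading 'x' selects the hex-bytes form
    ("x0d 0a" -> b'\r\n'); any other spec is used as literal characters."""
    if spec.startswith("x"):
        # Strip the selector, ignore spaces between byte pairs.
        return bytes.fromhex(spec[1:].replace(" ", ""))
    return spec.encode("utf-8")

print(parse_delimiter("x0d 0a"))  # b'\r\n'
print(parse_delimiter("||"))      # b'||'
```

cloud-fan's concern applies exactly here: a literal delimiter starting with the character `x` becomes inexpressible unless the grammar adds an escape, which is the cost of inventing a custom rule instead of following a standard.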
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20885#discussion_r176820312

--- Diff: python/pyspark/sql/readwriter.py ---
@@ -770,12 +773,15 @@ def json(self, path, mode=None, compression=None, dateFormat=None, timestampFormat=None,
         formats follow the formats at ``java.text.SimpleDateFormat``. This applies to
         timestamp type. If None is set, it uses the default value,
         ``yyyy-MM-dd'T'HH:mm:ss.SSSXXX``.
+        :param lineSep: defines the line separator that should be used for writing. If None is
+        set, it uses the default value, ``\\n``.
--- End diff --

```
it covers all ``\\r``, ``\\r\\n`` and ``\\n``.
```
GitHub user MaxGekk opened a pull request: https://github.com/apache/spark/pull/20885

[SPARK-23724][SPARK-23765][SQL] Line separator for the json datasource

## What changes were proposed in this pull request?

Currently, [TextInputJsonDataSource](https://github.com/databricks/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonDataSource.scala#L86) uses [HadoopFileLinesReader](https://github.com/databricks/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonDataSource.scala#L125) to split a json file into separate lines. The former [splits](https://github.com/databricks/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonDataSource.scala#L125) json lines via [LineRecordReader](https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/LineRecordReader.java#L68) without providing a recordDelimiter. As a consequence, the hadoop library [reads lines terminated by one of CR, LF, or CRLF](https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/LineReader.java#L185-L254). The changes allow specifying the line separator explicitly instead of relying on the auto-detection method of the hadoop library. If the separator is not specified, the line separation method of Hadoop is used by default.

## How was this patch tested?

Added new tests for writing/reading json files with a custom line separator.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/MaxGekk/spark-1 json-line-sep

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20885.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20885

commit a794988407b6fd28364f5d993a6a52ac0b85ec5f
Author: Maxim Gekk
Date: 2018-02-24T20:11:00Z
    Adding the delimiter option encoded in base64

commit dccdaa2e97cb4e2f6f8ea7e03320cdb05a43668c
Author: Maxim Gekk
Date: 2018-02-24T20:59:46Z
    Separator encoded as a sequence of bytes in hex

commit d0abab7e4b74dd42e06972f9484bc712b8f11c63
Author: Maxim Gekk
Date: 2018-02-24T21:06:08Z
    Refactoring: removed unused imports and renaming a parameter

commit 674179601b4c82e315eb1156df0f3f5035e91154
Author: Maxim Gekk
Date: 2018-03-04T17:24:42Z
    The sep option is renamed to recordSeparator. The supported format is a sequence of bytes in hex like x0d 0a

commit e4faae155cb5b0761da9ac72a12f67cdde6b2e6b
Author: Maxim Gekk
Date: 2018-03-18T12:40:21Z
    Renaming recordSeparator to recordDelimiter

commit 01f4ef584a2cc1ce460359f260ebbe22808d034e
Author: Maxim Gekk
Date: 2018-03-18T13:17:59Z
    Comments for the recordDelimiter option

commit 24cedb9d809b026fa36b01fb2b425918b43857df
Author: Maxim Gekk
Date: 2018-03-18T14:36:31Z
    Support other formats of recordDelimiter

commit d40dda22587deaf79cfad3b20ccf6854554fc11d
Author: Maxim Gekk
Date: 2018-03-18T16:30:26Z
    Checking different charsets and record delimiters

commit ad6496c6d9415bcd49630272b5d6c327ffcb1378
Author: Maxim Gekk
Date: 2018-03-18T16:39:07Z
    Renaming test's method to make it more readable

commit 358863d91bf0c0d9761aa13698eb7f8532e5fc90
Author: Maxim Gekk
Date: 2018-03-18T17:20:38Z
    Test of reading json in different charsets and delimiters

commit 7e5be5e2b4cf7f77914a0d91e74ea31ab8c272d0
Author: Maxim Gekk
Date: 2018-03-18T20:25:47Z
    Fix inferring of csv schema for any charsets

commit d138d2d4e7b6e0c3e46d73939ff06a875128d59d
Author: Maxim Gekk
Date: 2018-03-18T21:02:44Z
    Fix errors of scalastyle check

commit c26ef5d3d2a3970c80c973eec696805929bd7725
Author: Maxim Gekk
Date: 2018-03-22T11:20:34Z
    Reserving format for regular expressions and concatenated json

commit 5f0b0694f142bd69127c8991d83a24f528316b2b
Author: Maxim Gekk
Date: 2018-03-22T20:18:21Z
    Fix recordDelimiter tests

commit ef8248f862949becdb3d370ac94a1cfc1f7c3068
Author: Maxim Gekk
Date: 2018-03-22T20:34:56Z
    Additional cases are added to the delimiter test

commit 2efac082ea4e40b89b4d01274851c0dcdd49eb44
Author: Maxim Gekk
Date: 2018-03-22T21:01:56Z
    Renaming recordDelimiter to lineSeparator

commit b2020fa99584d03e1754a4a1b5991dce4875f448
Author: Maxim Gekk
Date: 2018-03-22T21:38:33Z
    Adding HyukjinKwon changes

commit f99c1e16f2ad90c2a94e8c4b206b5b740506e136
Author: Maxim Gekk
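[Editor's note] The PR description contrasts Hadoop's auto-detection (a line ends at any of CR, LF, or CRLF) with an explicit, user-specified separator. A minimal Python sketch of the two behaviors (`read_lines` is a hypothetical helper, not Hadoop's `LineReader`):

```python
import re

def read_lines(data, line_sep=None):
    """When line_sep is None, mimic Hadoop's auto-detection by treating
    any of CRLF, CR, or LF as a line terminator; otherwise split on the
    explicit separator only."""
    if line_sep is None:
        # CRLF must be tried before CR so "\r\n" is one terminator, not two.
        return re.split(r"\r\n|\r|\n", data)
    return data.split(line_sep)

print(read_lines('{"a":1}\r\n{"a":2}\n{"a":3}'))
print(read_lines('{"a":1}^{"a":2}', line_sep="^"))
```

The fallback branch matches the PR's stated default: when no separator is specified, the Hadoop-style auto-detection is kept, so existing pipelines see no behavior change.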