[GitHub] spark pull request #20885: [SPARK-23724][SPARK-23765][SQL] Line separator fo...

2018-03-29 Thread MaxGekk
Github user MaxGekk closed the pull request at:

https://github.com/apache/spark/pull/20885


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20885: [SPARK-23724][SPARK-23765][SQL] Line separator fo...

2018-03-25 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20885#discussion_r176958504
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala
 ---
@@ -85,6 +85,38 @@ private[sql] class JSONOptions(
 
  val multiLine = parameters.get("multiLine").map(_.toBoolean).getOrElse(false)
 
+  val charset: Option[String] = Some("UTF-8")
--- End diff --

It sounds like we need to review https://github.com/apache/spark/pull/20849 first


---




[GitHub] spark pull request #20885: [SPARK-23724][SPARK-23765][SQL] Line separator fo...

2018-03-25 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20885#discussion_r176958338
  
--- Diff: python/pyspark/sql/readwriter.py ---
@@ -176,7 +176,7 @@ def json(self, path, schema=None, primitivesAsString=None, prefersDecimal=None,
  allowComments=None, allowUnquotedFieldNames=None, allowSingleQuotes=None,
  allowNumericLeadingZero=None, allowBackslashEscapingAnyCharacter=None,
  mode=None, columnNameOfCorruptRecord=None, dateFormat=None, timestampFormat=None,
- multiLine=None, allowUnquotedControlChars=None):
+ multiLine=None, allowUnquotedControlChars=None, lineSep=None):
--- End diff --

rename it to `recordDelimiter`


---




[GitHub] spark pull request #20885: [SPARK-23724][SPARK-23765][SQL] Line separator fo...

2018-03-23 Thread MaxGekk
Github user MaxGekk commented on a diff in the pull request:

https://github.com/apache/spark/pull/20885#discussion_r176867866
  
--- Diff: python/pyspark/sql/readwriter.py ---
@@ -770,12 +773,15 @@ def json(self, path, mode=None, compression=None, dateFormat=None, timestampForm
 formats follow the formats at ``java.text.SimpleDateFormat``.
 This applies to timestamp type. If None is set, it uses the
 default value, ``yyyy-MM-dd'T'HH:mm:ss.SSSXXX``.
+:param lineSep: defines the line separator that should be used for writing. If None is
+set, it uses the default value, ``\\n``.
--- End diff --

It is a method of DataFrameWriter. It writes exactly `'\n'` 


---




[GitHub] spark pull request #20885: [SPARK-23724][SPARK-23765][SQL] Line separator fo...

2018-03-23 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20885#discussion_r176821110
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala
 ---
@@ -85,6 +85,38 @@ private[sql] class JSONOptions(
 
  val multiLine = parameters.get("multiLine").map(_.toBoolean).getOrElse(false)
 
+  val charset: Option[String] = Some("UTF-8")
+
+  /**
+   * A sequence of bytes between two consecutive json records. Format of the option is:
+   *   selector (1 char) + delimiter body (any length) | sequence of chars
--- End diff --

I'm wary of defining our own rule here; is there any standard we can follow?


---




[GitHub] spark pull request #20885: [SPARK-23724][SPARK-23765][SQL] Line separator fo...

2018-03-23 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20885#discussion_r176820312
  
--- Diff: python/pyspark/sql/readwriter.py ---
@@ -770,12 +773,15 @@ def json(self, path, mode=None, compression=None, dateFormat=None, timestampForm
 formats follow the formats at ``java.text.SimpleDateFormat``.
 This applies to timestamp type. If None is set, it uses the
 default value, ``yyyy-MM-dd'T'HH:mm:ss.SSSXXX``.
+:param lineSep: defines the line separator that should be used for writing. If None is
+set, it uses the default value, ``\\n``.
--- End diff --

```
it covers all ``\\r``, ``\\r\\n`` and ``\\n``.
```
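To make the read/write asymmetry concrete, here is a minimal pure-Python sketch (not Spark or Hadoop code; `read_lines` and `write_lines` are hypothetical helpers for illustration): reading accepts any of ``\\r``, ``\\r\\n`` and ``\\n``, as Hadoop's auto-detection does, while writing emits exactly ``\\n``.

```python
import re

def read_lines(text):
    # Accept any of \r, \r\n, \n as a terminator, mimicking Hadoop's
    # LineReader auto-detection (try \r\n before the single-char forms).
    return [line for line in re.split(r"\r\n|\r|\n", text) if line]

def write_lines(lines):
    # The writer emits exactly '\n' between records by default.
    return "\n".join(lines) + "\n"
```

Round-tripping works even when the input mixes all three terminators, which is why the read-side docs should mention all of them.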



---




[GitHub] spark pull request #20885: [SPARK-23724][SPARK-23765][SQL] Line separator fo...

2018-03-22 Thread MaxGekk
GitHub user MaxGekk opened a pull request:

https://github.com/apache/spark/pull/20885

[SPARK-23724][SPARK-23765][SQL] Line separator for the json datasource

## What changes were proposed in this pull request?

Currently, 
[TextInputJsonDataSource](https://github.com/databricks/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonDataSource.scala#L86) uses [HadoopFileLinesReader](https://github.com/databricks/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonDataSource.scala#L125) to split a json file into separate lines. The latter [splits](https://github.com/databricks/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonDataSource.scala#L125) json lines via [LineRecordReader](https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/LineRecordReader.java#L68) without providing a recordDelimiter. As a consequence, the hadoop library [reads lines terminated by one of CR, LF, or CRLF](https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/LineReader.java#L185-L254). The changes allow specifying the line separator explicitly instead of relying on the hadoop library's auto-detection. If the separator is not specified, hadoop's line-splitting behavior is used by default.
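As a rough illustration of the behavior described above, here is a pure-Python sketch (the `split_records` helper is hypothetical, not the actual Spark/Hadoop implementation) of splitting a byte stream into JSON records either by Hadoop-style auto-detection or by an explicit separator:

```python
import json

def split_records(data, line_sep=None):
    """Split a byte stream into parsed JSON records.

    If line_sep is None, mimic Hadoop's auto-detection by accepting
    CR, LF, or CRLF as terminators; otherwise split on the exact
    byte sequence supplied by the user.
    """
    if line_sep is None:
        # Normalize CRLF and lone CR to LF, then split on LF.
        normalized = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
        parts = normalized.split(b"\n")
    else:
        parts = data.split(line_sep)
    return [json.loads(p) for p in parts if p.strip()]
```

With an explicit separator such as ``b"|^|"``, records may themselves contain newlines; with auto-detection, any of the three standard terminators is honored.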

## How was this patch tested?

Added new tests for writing/reading json files with a custom line separator.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/MaxGekk/spark-1 json-line-sep

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20885.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20885


commit a794988407b6fd28364f5d993a6a52ac0b85ec5f
Author: Maxim Gekk 
Date:   2018-02-24T20:11:00Z

Adding the delimiter option encoded in base64

commit dccdaa2e97cb4e2f6f8ea7e03320cdb05a43668c
Author: Maxim Gekk 
Date:   2018-02-24T20:59:46Z

Separator encoded as a sequence of bytes in hex

commit d0abab7e4b74dd42e06972f9484bc712b8f11c63
Author: Maxim Gekk 
Date:   2018-02-24T21:06:08Z

Refactoring: removed unused imports and renaming a parameter

commit 674179601b4c82e315eb1156df0f3f5035e91154
Author: Maxim Gekk 
Date:   2018-03-04T17:24:42Z

The sep option is renamed to recordSeparator. The supported format is 
sequence of bytes in hex like x0d 0a
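The "sequence of bytes in hex" format mentioned in this commit could be parsed along these lines (a hypothetical sketch, not the code from the PR; the `parse_hex_sep` name is invented for illustration):

```python
def parse_hex_sep(spec):
    """Parse a separator given as hex byte codes, e.g. "x0d 0a" -> b"\r\n".

    Each whitespace-separated token is a byte in hex, optionally
    prefixed with 'x'.
    """
    return bytes(int(token.lstrip("x"), 16) for token in spec.split())
```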

commit e4faae155cb5b0761da9ac72a12f67cdde6b2e6b
Author: Maxim Gekk 
Date:   2018-03-18T12:40:21Z

Renaming recordSeparator to recordDelimiter

commit 01f4ef584a2cc1ce460359f260ebbe22808d034e
Author: Maxim Gekk 
Date:   2018-03-18T13:17:59Z

Comments for the recordDelimiter option

commit 24cedb9d809b026fa36b01fb2b425918b43857df
Author: Maxim Gekk 
Date:   2018-03-18T14:36:31Z

Support other formats of recordDelimiter

commit d40dda22587deaf79cfad3b20ccf6854554fc11d
Author: Maxim Gekk 
Date:   2018-03-18T16:30:26Z

Checking different charsets and record delimiters

commit ad6496c6d9415bcd49630272b5d6c327ffcb1378
Author: Maxim Gekk 
Date:   2018-03-18T16:39:07Z

Renaming test's method to make it more readable

commit 358863d91bf0c0d9761aa13698eb7f8532e5fc90
Author: Maxim Gekk 
Date:   2018-03-18T17:20:38Z

Test of reading json in different charsets and delimiters

commit 7e5be5e2b4cf7f77914a0d91e74ea31ab8c272d0
Author: Maxim Gekk 
Date:   2018-03-18T20:25:47Z

Fix inferring of csv schema for any charsets

commit d138d2d4e7b6e0c3e46d73939ff06a875128d59d
Author: Maxim Gekk 
Date:   2018-03-18T21:02:44Z

Fix errors of scalastyle check

commit c26ef5d3d2a3970c80c973eec696805929bd7725
Author: Maxim Gekk 
Date:   2018-03-22T11:20:34Z

Reserving format for regular expressions and concatenated json

commit 5f0b0694f142bd69127c8991d83a24f528316b2b
Author: Maxim Gekk 
Date:   2018-03-22T20:18:21Z

Fix recordDelimiter tests

commit ef8248f862949becdb3d370ac94a1cfc1f7c3068
Author: Maxim Gekk 
Date:   2018-03-22T20:34:56Z

Additional cases are added to the delimiter test

commit 2efac082ea4e40b89b4d01274851c0dcdd49eb44
Author: Maxim Gekk 
Date:   2018-03-22T21:01:56Z

Renaming recordDelimiter to lineSeparator

commit b2020fa99584d03e1754a4a1b5991dce4875f448
Author: Maxim Gekk 
Date:   2018-03-22T21:38:33Z

Adding HyukjinKwon changes

commit f99c1e16f2ad90c2a94e8c4b206b5b740506e136
Author: Maxim Gekk