[GitHub] spark pull request #20302: [SPARK-23094] Fix invalid character handling in J...

2018-01-19 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20302#discussion_r162682668
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/sources/JsonHadoopFsRelationSuite.scala
 ---
@@ -105,4 +107,36 @@ class JsonHadoopFsRelationSuite extends 
HadoopFsRelationTest {
   )
 }
   }
+
+  test("invalid json with leading nulls - from file (multiLine=true)") {
+import testImplicits._
+withTempDir { tempDir =>
+  val path = tempDir.getAbsolutePath
+  Seq(badJson, """{"a":1}""").toDS().write.mode("overwrite").text(path)
+  val expected = s"""$badJson\n{"a":1}\n"""
+  val schema = new StructType().add("a", 
IntegerType).add("_corrupt_record", StringType)
+  val df =
+spark.read.format(dataSourceName).option("multiLine", 
true).schema(schema).load(path)
+  checkAnswer(df, Row(null, expected))
+}
+  }
+
+  test("invalid json with leading nulls - from file (multiLine=false)") {
+import testImplicits._
+withTempDir { tempDir =>
+  val path = tempDir.getAbsolutePath
+  Seq(badJson, """{"a":1}""").toDS().write.mode("overwrite").text(path)
+  val schema = new StructType().add("a", 
IntegerType).add("_corrupt_record", StringType)
+  val df =
+spark.read.format(dataSourceName).option("multiLine", 
false).schema(schema).load(path)
+  checkAnswer(df, Seq(Row(1, null), Row(null, badJson)))
+}
+  }
+
+  test("invalid json with leading nulls - from dataset") {
--- End diff --

See the PR https://github.com/apache/spark/pull/20331


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20302: [SPARK-23094] Fix invalid character handling in J...

2018-01-19 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20302#discussion_r162682091
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/sources/JsonHadoopFsRelationSuite.scala
 ---
@@ -105,4 +107,36 @@ class JsonHadoopFsRelationSuite extends 
HadoopFsRelationTest {
   )
 }
   }
+
+  test("invalid json with leading nulls - from file (multiLine=true)") {
+import testImplicits._
+withTempDir { tempDir =>
+  val path = tempDir.getAbsolutePath
+  Seq(badJson, """{"a":1}""").toDS().write.mode("overwrite").text(path)
+  val expected = s"""$badJson\n{"a":1}\n"""
+  val schema = new StructType().add("a", 
IntegerType).add("_corrupt_record", StringType)
+  val df =
+spark.read.format(dataSourceName).option("multiLine", 
true).schema(schema).load(path)
+  checkAnswer(df, Row(null, expected))
+}
+  }
+
+  test("invalid json with leading nulls - from file (multiLine=false)") {
+import testImplicits._
+withTempDir { tempDir =>
+  val path = tempDir.getAbsolutePath
+  Seq(badJson, """{"a":1}""").toDS().write.mode("overwrite").text(path)
+  val schema = new StructType().add("a", 
IntegerType).add("_corrupt_record", StringType)
+  val df =
+spark.read.format(dataSourceName).option("multiLine", 
false).schema(schema).load(path)
+  checkAnswer(df, Seq(Row(1, null), Row(null, badJson)))
+}
+  }
+
+  test("invalid json with leading nulls - from dataset") {
--- End diff --

This test suite is still in ` org.apache.spark.sql.sources`. We should move 
these test suite to `/sql/core`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20302: [SPARK-23094] Fix invalid character handling in J...

2018-01-18 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20302#discussion_r162495107
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/sources/JsonHadoopFsRelationSuite.scala
 ---
@@ -105,4 +107,36 @@ class JsonHadoopFsRelationSuite extends 
HadoopFsRelationTest {
   )
 }
   }
+
+  test("invalid json with leading nulls - from file (multiLine=true)") {
+import testImplicits._
+withTempDir { tempDir =>
+  val path = tempDir.getAbsolutePath
+  Seq(badJson, """{"a":1}""").toDS().write.mode("overwrite").text(path)
+  val expected = s"""$badJson\n{"a":1}\n"""
+  val schema = new StructType().add("a", 
IntegerType).add("_corrupt_record", StringType)
+  val df =
+spark.read.format(dataSourceName).option("multiLine", 
true).schema(schema).load(path)
+  checkAnswer(df, Row(null, expected))
+}
+  }
+
+  test("invalid json with leading nulls - from file (multiLine=false)") {
+import testImplicits._
+withTempDir { tempDir =>
+  val path = tempDir.getAbsolutePath
+  Seq(badJson, """{"a":1}""").toDS().write.mode("overwrite").text(path)
+  val schema = new StructType().add("a", 
IntegerType).add("_corrupt_record", StringType)
+  val df =
+spark.read.format(dataSourceName).option("multiLine", 
false).schema(schema).load(path)
+  checkAnswer(df, Seq(Row(1, null), Row(null, badJson)))
+}
+  }
+
+  test("invalid json with leading nulls - from dataset") {
--- End diff --

I think usually we encourage to put the test cases in a better suite, 
in-place.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20302: [SPARK-23094] Fix invalid character handling in J...

2018-01-18 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20302#discussion_r162491837
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/sources/JsonHadoopFsRelationSuite.scala
 ---
@@ -105,4 +107,36 @@ class JsonHadoopFsRelationSuite extends 
HadoopFsRelationTest {
   )
 }
   }
+
+  test("invalid json with leading nulls - from file (multiLine=true)") {
+import testImplicits._
+withTempDir { tempDir =>
+  val path = tempDir.getAbsolutePath
+  Seq(badJson, """{"a":1}""").toDS().write.mode("overwrite").text(path)
+  val expected = s"""$badJson\n{"a":1}\n"""
+  val schema = new StructType().add("a", 
IntegerType).add("_corrupt_record", StringType)
+  val df =
+spark.read.format(dataSourceName).option("multiLine", 
true).schema(schema).load(path)
+  checkAnswer(df, Row(null, expected))
+}
+  }
+
+  test("invalid json with leading nulls - from file (multiLine=false)") {
+import testImplicits._
+withTempDir { tempDir =>
+  val path = tempDir.getAbsolutePath
+  Seq(badJson, """{"a":1}""").toDS().write.mode("overwrite").text(path)
+  val schema = new StructType().add("a", 
IntegerType).add("_corrupt_record", StringType)
+  val df =
+spark.read.format(dataSourceName).option("multiLine", 
false).schema(schema).load(path)
+  checkAnswer(df, Seq(Row(1, null), Row(null, badJson)))
+}
+  }
+
+  test("invalid json with leading nulls - from dataset") {
--- End diff --

The whole suite should be moved. Not related to this PR. I will take this. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20302: [SPARK-23094] Fix invalid character handling in J...

2018-01-18 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/20302


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20302: [SPARK-23094] Fix invalid character handling in J...

2018-01-17 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20302#discussion_r162248399
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/sources/JsonHadoopFsRelationSuite.scala
 ---
@@ -105,4 +107,36 @@ class JsonHadoopFsRelationSuite extends 
HadoopFsRelationTest {
   )
 }
   }
+
+  test("invalid json with leading nulls - from file (multiLine=true)") {
+import testImplicits._
+withTempDir { tempDir =>
+  val path = tempDir.getAbsolutePath
+  Seq(badJson, """{"a":1}""").toDS().write.mode("overwrite").text(path)
+  val expected = s"""$badJson\n{"a":1}\n"""
+  val schema = new StructType().add("a", 
IntegerType).add("_corrupt_record", StringType)
+  val df =
+spark.read.format(dataSourceName).option("multiLine", 
true).schema(schema).load(path)
+  checkAnswer(df, Row(null, expected))
+}
+  }
+
+  test("invalid json with leading nulls - from file (multiLine=false)") {
+import testImplicits._
+withTempDir { tempDir =>
+  val path = tempDir.getAbsolutePath
+  Seq(badJson, """{"a":1}""").toDS().write.mode("overwrite").text(path)
+  val schema = new StructType().add("a", 
IntegerType).add("_corrupt_record", StringType)
+  val df =
+spark.read.format(dataSourceName).option("multiLine", 
false).schema(schema).load(path)
+  checkAnswer(df, Seq(Row(1, null), Row(null, badJson)))
+}
+  }
+
+  test("invalid json with leading nulls - from dataset") {
--- End diff --

Nit: this case looks already passing tho, and I thought we should put this 
in `JsonSuite`.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20302: [SPARK-23094] Fix invalid character handling in J...

2018-01-17 Thread brkyvz
GitHub user brkyvz opened a pull request:

https://github.com/apache/spark/pull/20302

[SPARK-23094] Fix invalid character handling in JsonDataSource

## What changes were proposed in this pull request?

There were two related fixes regarding `from_json`, `get_json_object` and 
`json_tuple` ([Fix 
#1](https://github.com/apache/spark/commit/c8803c06854683c8761fdb3c0e4c55d5a9e22a95),
 [Fix 
#2](https://github.com/apache/spark/commit/86174ea89b39a300caaba6baffac70f3dc702788)),
 but they weren't comprehensive it seems. I wanted to extend those fixes to all 
the parsers, and add tests for each case.

## How was this patch tested?

Regression tests

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/brkyvz/spark json-invfix

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20302.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20302


commit 231152ab0d615166bdea354916a9f6ca0aaf7e6a
Author: Burak Yavuz 
Date:   2018-01-18T01:01:00Z

[ES-4104][WARMFIX] Fix invalid character handling in JsonDataSource

Cherry-pick of https://github.com/databricks/spark/pull/1135

## What changes were proposed in this pull request?

I shall also merge this upstream, but wanted to merge here first since it 
was an ES ticket related to Qubole workloads.

There were two related fixes regarding `from_json`, `get_json_object` and 
`json_tuple` in OSS ([Fix 
#1](https://github.com/apache/spark/commit/c8803c06854683c8761fdb3c0e4c55d5a9e22a95),
 [Fix 
#2](https://github.com/apache/spark/commit/86174ea89b39a300caaba6baffac70f3dc702788)),
 but they weren't comprehensive it seems. I wanted to extend those fixes to all 
the parsers, and add tests for each case.

## How was this patch tested?

Regression tests

Author: Burak Yavuz 

Closes #1135 from brkyvz/json-32-dbx.

## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration 
tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)

Please review http://spark.apache.org/contributing.html before opening a 
pull request.

## **IMPORTANT** Warmfix instructions

If this PR needs to be warmfixed (i.e. merged into the release branch after 
the code freeze), please follow steps below.

What type of warmfix is this? Please select **exactly one choice**, or 
write description in Other.

- [ ] Regression (e.g. fixing the behavior of a feature that regressed in 
the current release cycle)
- [ ] ES ticket fix (e.g. Customer or internally requested update/fix)
- [ ] Other (please describe):

Make the following updates to this PR:

- [ ] Add `[WARMFIX]` in the title of this PR.
- [ ] Label the PR using label(s) corrsponding to the WARMFIX branch(es). 
The label name should be in the format of `dbr-branch-a.b` (e.g. 
`dbr-branch-3.2`), which matches the release branch name for Runtime release 
`a.b`.
- [ ] Ask your team lead to sign off this warmfix and add the 
`warmfix-approved` label.
- [ ] When merging the PR using the merge script, make sure to get this PR 
merged into the following branches:

  - The branch against which your PR is opened, and
  - Any extra release branch(es) corresponding to the `dbr-branch-a.b` 
label(s) applied to your PR.

Author: Burak Yavuz 

Closes #1616 from brkyvz/inv-json4x.




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org