[GitHub] spark pull request #20963: [SPARK-23849][SQL] Tests for the samplingRatio op...

2018-04-10 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20963#discussion_r180610436
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala ---
@@ -2127,4 +2127,39 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
       assert(df.schema === expectedSchema)
     }
   }
+
+  test("SPARK-23849: schema inferring touches less data if samplingRation < 1.0") {
+    val predefinedSample = Set[Int](2, 8, 15, 27, 30, 34, 35, 37, 44, 46,
+      57, 62, 68, 72)
+    withTempPath { path =>
+      val writer = Files.newBufferedWriter(Paths.get(path.getAbsolutePath),
+        StandardCharsets.UTF_8, StandardOpenOption.CREATE_NEW)
+      for (i <- 0 until 100) {
+        if (predefinedSample.contains(i)) {
+          writer.write(s"""{"f1":${i.toString}}""" + "\n")
+        } else {
+          writer.write(s"""{"f1":${(i.toDouble + 0.1).toString}}""" + "\n")
+        }
+      }
+      writer.close()
+
+      val ds = spark.read.option("samplingRatio", 0.1).json(path.getCanonicalPath)
--- End diff --

^ It's based on actual experience. There was a similar case where a test broke because of the number of partitions, and it took me a while to debug: https://issues.apache.org/jira/browse/SPARK-13728


---




[GitHub] spark pull request #20963: [SPARK-23849][SQL] Tests for the samplingRatio op...

2018-04-10 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20963#discussion_r180605579
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala ---
@@ -2127,4 +2127,39 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
       assert(df.schema === expectedSchema)
     }
   }
+
+  test("SPARK-23849: schema inferring touches less data if samplingRation < 1.0") {
+    val predefinedSample = Set[Int](2, 8, 15, 27, 30, 34, 35, 37, 44, 46,
+      57, 62, 68, 72)
+    withTempPath { path =>
+      val writer = Files.newBufferedWriter(Paths.get(path.getAbsolutePath),
+        StandardCharsets.UTF_8, StandardOpenOption.CREATE_NEW)
+      for (i <- 0 until 100) {
+        if (predefinedSample.contains(i)) {
+          writer.write(s"""{"f1":${i.toString}}""" + "\n")
+        } else {
+          writer.write(s"""{"f1":${(i.toDouble + 0.1).toString}}""" + "\n")
+        }
+      }
+      writer.close()
+
+      val ds = spark.read.option("samplingRatio", 0.1).json(path.getCanonicalPath)
--- End diff --

Yup, please set the appropriate numbers. I think it's fine as long as there are some comments, so that we can understand and fix the test if it breaks.


---




[GitHub] spark pull request #20963: [SPARK-23849][SQL] Tests for the samplingRatio op...

2018-04-10 Thread MaxGekk
Github user MaxGekk commented on a diff in the pull request:

https://github.com/apache/spark/pull/20963#discussion_r180535344
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala ---
@@ -2127,4 +2127,39 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
       assert(df.schema === expectedSchema)
     }
   }
+
+  test("SPARK-23849: schema inferring touches less data if samplingRation < 1.0") {
+    val predefinedSample = Set[Int](2, 8, 15, 27, 30, 34, 35, 37, 44, 46,
+      57, 62, 68, 72)
+    withTempPath { path =>
+      val writer = Files.newBufferedWriter(Paths.get(path.getAbsolutePath),
+        StandardCharsets.UTF_8, StandardOpenOption.CREATE_NEW)
+      for (i <- 0 until 100) {
+        if (predefinedSample.contains(i)) {
+          writer.write(s"""{"f1":${i.toString}}""" + "\n")
+        } else {
+          writer.write(s"""{"f1":${(i.toDouble + 0.1).toString}}""" + "\n")
+        }
+      }
+      writer.close()
+
+      val ds = spark.read.option("samplingRatio", 0.1).json(path.getCanonicalPath)
--- End diff --

It seems specifying only `spark.sql.files.maxPartitionBytes` is not enough. Please look at the [formula](https://github.com/apache/spark/blob/400a1d9e25c1196f0be87323bd89fb3af0660166/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L406) and the [slicing of input files](https://github.com/apache/spark/blob/400a1d9e25c1196f0be87323bd89fb3af0660166/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L415):

```
val maxSplitBytes = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
```

Is it OK if I just check that the file size is less than `maxSplitBytes`?
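
For concreteness, a small sketch of how that formula plays out with illustrative numbers; the constants below are the Spark defaults as far as I know, not values taken from this PR:

```
// Hypothetical inputs; the real values come from SQLConf and the scanned files.
val defaultMaxSplitBytes = 128L * 1024 * 1024 // spark.sql.files.maxPartitionBytes default
val openCostInBytes = 4L * 1024 * 1024        // spark.sql.files.openCostInBytes default
val totalBytes = 2L * 1024                    // ~2 KB test file (plus open cost per file)
val defaultParallelism = 8                    // spark.default.parallelism on the test machine
val bytesPerCore = totalBytes / defaultParallelism

// The quoted formula: a file stays in one split only if its size <= maxSplitBytes.
val maxSplitBytes = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
```

With these numbers `maxSplitBytes` comes out to 4 MB, so the ~2 KB test file would be read as a single partition.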


---




[GitHub] spark pull request #20963: [SPARK-23849][SQL] Tests for the samplingRatio op...

2018-04-08 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20963#discussion_r179937139
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala ---
@@ -2127,4 +2127,39 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
       assert(df.schema === expectedSchema)
     }
   }
+
+  test("SPARK-23849: schema inferring touches less data if samplingRation < 1.0") {
+    val predefinedSample = Set[Int](2, 8, 15, 27, 30, 34, 35, 37, 44, 46,
+      57, 62, 68, 72)
+    withTempPath { path =>
+      val writer = Files.newBufferedWriter(Paths.get(path.getAbsolutePath),
+        StandardCharsets.UTF_8, StandardOpenOption.CREATE_NEW)
+      for (i <- 0 until 100) {
+        if (predefinedSample.contains(i)) {
+          writer.write(s"""{"f1":${i.toString}}""" + "\n")
+        } else {
+          writer.write(s"""{"f1":${(i.toDouble + 0.1).toString}}""" + "\n")
+        }
+      }
+      writer.close()
+
+      val ds = spark.read.option("samplingRatio", 0.1).json(path.getCanonicalPath)
--- End diff --

OK, but I don't think we are guaranteed to always have one partition here. Shall we at least explicitly set `spark.sql.files.maxPartitionBytes` big enough, with some comments?

I don't think we should encourage this approach because it is likely easy to break, IMHO. That said, I am fine with it, since I can't think of a better way either.
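
A minimal sketch of what I mean, assuming the test can use the `withSQLConf` helper from `SQLTestUtils`; the value below is illustrative, not from the PR:

```
// Pin the scan to one partition: the test file is far smaller than
// maxPartitionBytes, so the JSON read should produce a single split and the
// sampler will always see the same partition index.
withSQLConf("spark.sql.files.maxPartitionBytes" -> (128L * 1024 * 1024).toString) {
  val ds = spark.read.option("samplingRatio", 0.1).json(path.getCanonicalPath)
  // ... assert on the inferred schema as the test already does
}
```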


---




[GitHub] spark pull request #20963: [SPARK-23849][SQL] Tests for the samplingRatio op...

2018-04-07 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/20963


---




[GitHub] spark pull request #20963: [SPARK-23849][SQL] Tests for the samplingRatio op...

2018-04-07 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20963#discussion_r179934861
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala ---
@@ -2127,4 +2127,39 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
       assert(df.schema === expectedSchema)
     }
   }
+
+  test("SPARK-23849: schema inferring touches less data if samplingRation < 1.0") {
+    val predefinedSample = Set[Int](2, 8, 15, 27, 30, 34, 35, 37, 44, 46,
+      57, 62, 68, 72)
--- End diff --

There is no need for so many elements in this set. Please combine these tests with the ones in your CSV PR.

Instead of calling `json()`, we can use `format("json")`. Then you can combine the test cases for both CSV and JSON.
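
A sketch of the combined test, assuming the CSV datasource accepts the same `samplingRatio` option as in your CSV PR; the loop below is illustrative, not code from either PR:

```
// One test body exercising both datasources via format(...)/load(...).
Seq("json", "csv").foreach { format =>
  val ds = spark.read
    .option("samplingRatio", 0.1)
    .format(format)
    .load(path.getCanonicalPath)
  // ... per-format schema assertions
}
```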


---




[GitHub] spark pull request #20963: [SPARK-23849][SQL] Tests for the samplingRatio op...

2018-04-07 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20963#discussion_r179934815
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala ---
@@ -2127,4 +2127,39 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
       assert(df.schema === expectedSchema)
     }
   }
+
+  test("SPARK-23849: schema inferring touches less data if samplingRation < 1.0") {
+    val predefinedSample = Set[Int](2, 8, 15, 27, 30, 34, 35, 37, 44, 46,
+      57, 62, 68, 72)
+    withTempPath { path =>
+      val writer = Files.newBufferedWriter(Paths.get(path.getAbsolutePath),
+        StandardCharsets.UTF_8, StandardOpenOption.CREATE_NEW)
+      for (i <- 0 until 100) {
+        if (predefinedSample.contains(i)) {
+          writer.write(s"""{"f1":${i.toString}}""" + "\n")
+        } else {
+          writer.write(s"""{"f1":${(i.toDouble + 0.1).toString}}""" + "\n")
+        }
+      }
+      writer.close()
+
+      val ds = spark.read.option("samplingRatio", 0.1).json(path.getCanonicalPath)
--- End diff --

Yes. The seed is also given. 


---




[GitHub] spark pull request #20963: [SPARK-23849][SQL] Tests for the samplingRatio op...

2018-04-04 Thread MaxGekk
Github user MaxGekk commented on a diff in the pull request:

https://github.com/apache/spark/pull/20963#discussion_r179257489
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala ---
@@ -2127,4 +2127,39 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
       assert(df.schema === expectedSchema)
     }
   }
+
+  test("SPARK-23849: schema inferring touches less data if samplingRation < 1.0") {
+    val predefinedSample = Set[Int](2, 8, 15, 27, 30, 34, 35, 37, 44, 46,
+      57, 62, 68, 72)
+    withTempPath { path =>
+      val writer = Files.newBufferedWriter(Paths.get(path.getAbsolutePath),
+        StandardCharsets.UTF_8, StandardOpenOption.CREATE_NEW)
+      for (i <- 0 until 100) {
+        if (predefinedSample.contains(i)) {
+          writer.write(s"""{"f1":${i.toString}}""" + "\n")
+        } else {
+          writer.write(s"""{"f1":${(i.toDouble + 0.1).toString}}""" + "\n")
+        }
+      }
+      writer.close()
+
+      val ds = spark.read.option("samplingRatio", 0.1).json(path.getCanonicalPath)
--- End diff --

It could be flaky if the partitionIndex were unstable here: https://github.com/apache/spark/blob/2ce37b50fc01558f49ad22f89c8659f50544ffec/sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala#L320, but in both tests there is only one partition, with the stable index `0`.
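
As a toy illustration (not Spark internals): if the per-partition seed is derived as `seed + partitionIndex`, the sample is reproducible exactly as long as the partition layout is fixed, which holds trivially for a single partition with index `0`:

```
import scala.util.Random

// Hypothetical per-partition sampler: deterministic for a fixed (seed, index) pair.
def sampledRows(rows: Seq[Int], index: Int, seed: Long, ratio: Double): Seq[Int] = {
  val rng = new Random(seed + index)
  rows.filter(_ => rng.nextDouble() < ratio)
}

// The same partition index always yields the same sample for a fixed seed;
// a different index would select different rows.
assert(sampledRows(0 until 100, 0, 42L, 0.1) == sampledRows(0 until 100, 0, 42L, 0.1))
```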


---




[GitHub] spark pull request #20963: [SPARK-23849][SQL] Tests for the samplingRatio op...

2018-04-02 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20963#discussion_r178691639
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala ---
@@ -2127,4 +2127,39 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
       assert(df.schema === expectedSchema)
     }
   }
+
+  test("SPARK-23849: schema inferring touches less data if samplingRation < 1.0") {
+    val predefinedSample = Set[Int](2, 8, 15, 27, 30, 34, 35, 37, 44, 46,
+      57, 62, 68, 72)
+    withTempPath { path =>
+      val writer = Files.newBufferedWriter(Paths.get(path.getAbsolutePath),
+        StandardCharsets.UTF_8, StandardOpenOption.CREATE_NEW)
+      for (i <- 0 until 100) {
+        if (predefinedSample.contains(i)) {
+          writer.write(s"""{"f1":${i.toString}}""" + "\n")
+        } else {
+          writer.write(s"""{"f1":${(i.toDouble + 0.1).toString}}""" + "\n")
+        }
+      }
+      writer.close()
+
+      val ds = spark.read.option("samplingRatio", 0.1).json(path.getCanonicalPath)
--- End diff --

@MaxGekk, wouldn't this test be flaky?


---




[GitHub] spark pull request #20963: [SPARK-23849][SQL] Tests for the samplingRatio op...

2018-04-02 Thread MaxGekk
GitHub user MaxGekk opened a pull request:

https://github.com/apache/spark/pull/20963

[SPARK-23849][SQL] Tests for the samplingRatio option of JSON datasource

## What changes were proposed in this pull request?

The proposed tests check that only a subset of the input dataset is touched during schema inference.
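
A minimal usage sketch (with a hypothetical path): a `samplingRatio` below 1.0 tells schema inference to look at only a fraction of the input rows:

```
val schema = spark.read
  .option("samplingRatio", 0.1) // infer field types from roughly 10% of the rows
  .json("/path/to/input.json")  // hypothetical input path
  .schema
```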


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/MaxGekk/spark-1 json-sampling-tests

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20963.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20963


commit 76e38a8670303afe20b530fc2a837a1363a01974
Author: Maxim Gekk 
Date:   2018-04-02T13:28:32Z

Adding samplingRation tests for json

commit 1acc3ece4121ffc3209fb65a20526dec574820e5
Author: Maxim Gekk 
Date:   2018-04-02T13:34:06Z

Removing debug code

commit 0d5fcfb59e112cc989753404813d32432257a0f6
Author: Maxim Gekk 
Date:   2018-04-02T18:45:18Z

Adding the ticket ref to test titles

commit a664465d8a3042f06cf7a866283c34a674b24939
Author: Maxim Gekk 
Date:   2018-04-02T18:48:56Z

Making Scala style checker happy




---
