[GitHub] spark pull request #20963: [SPARK-23849][SQL] Tests for the samplingRatio op...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20963#discussion_r180610436

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala ---
@@ -2127,4 +2127,39 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
       assert(df.schema === expectedSchema)
     }
   }
+
+  test("SPARK-23849: schema inferring touches less data if samplingRation < 1.0") {
+    val predefinedSample = Set[Int](2, 8, 15, 27, 30, 34, 35, 37, 44, 46,
+      57, 62, 68, 72)
+    withTempPath { path =>
+      val writer = Files.newBufferedWriter(Paths.get(path.getAbsolutePath),
+        StandardCharsets.UTF_8, StandardOpenOption.CREATE_NEW)
+      for (i <- 0 until 100) {
+        if (predefinedSample.contains(i)) {
+          writer.write(s"""{"f1":${i.toString}}""" + "\n")
+        } else {
+          writer.write(s"""{"f1":${(i.toDouble + 0.1).toString}}""" + "\n")
+        }
+      }
+      writer.close()
+
+      val ds = spark.read.option("samplingRatio", 0.1).json(path.getCanonicalPath)
--- End diff --

^ It's based on actual experience: there was a similar case where a test broke because of the number of partitions, and it took me a while to debug it, https://issues.apache.org/jira/browse/SPARK-13728

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20963#discussion_r180605579

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala ---
(comment on: `val ds = spark.read.option("samplingRatio", 0.1).json(path.getCanonicalPath)`)
--- End diff --

yup, please set the appropriate numbers. I think it's fine if it has some comments so that we can read and fix the test if it breaks.
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/20963#discussion_r180535344

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala ---
(comment on: `val ds = spark.read.option("samplingRatio", 0.1).json(path.getCanonicalPath)`)
--- End diff --

It seems specifying only `spark.sql.files.maxPartitionBytes` is not enough. Please look at the [formula](https://github.com/apache/spark/blob/400a1d9e25c1196f0be87323bd89fb3af0660166/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L406) and the [slicing of input files](https://github.com/apache/spark/blob/400a1d9e25c1196f0be87323bd89fb3af0660166/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L415):

```
val maxSplitBytes = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
```

Is it ok if I just check that the file size is less than `maxSplitBytes`?
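The quoted formula can be sketched stand-alone. The following is a hypothetical illustration (the function name `maxSplitBytes` and the example values are mine; the two constants use Spark's documented defaults for `spark.sql.files.maxPartitionBytes` and `spark.sql.files.openCostInBytes`):

```scala
// Hypothetical stand-alone sketch of the split-size formula quoted above.
// Defaults: spark.sql.files.maxPartitionBytes = 128 MB,
//           spark.sql.files.openCostInBytes   = 4 MB.
val defaultMaxSplitBytes: Long = 128L * 1024 * 1024
val openCostInBytes: Long = 4L * 1024 * 1024

def maxSplitBytes(totalBytes: Long, defaultParallelism: Int): Long = {
  val bytesPerCore = totalBytes / defaultParallelism
  math.min(defaultMaxSplitBytes, math.max(openCostInBytes, bytesPerCore))
}

// A ~1 KB test file on an 8-core box: bytesPerCore is tiny, so the
// openCostInBytes floor wins and the whole file fits in one split.
println(maxSplitBytes(totalBytes = 1024, defaultParallelism = 8)) // 4194304
```

So for a small test file the computed split size comfortably exceeds the file size, which is what the proposed "file size is less than `maxSplitBytes`" check relies on.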
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20963#discussion_r179937139

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala ---
(comment on: `val ds = spark.read.option("samplingRatio", 0.1).json(path.getCanonicalPath)`)
--- End diff --

OK, but I don't think we are guaranteed to always have one partition here. Shall we at least explicitly set `spark.sql.files.maxPartitionBytes` big enough, with some comments? I think we shouldn't encourage this approach because it is likely to be easily broken, IMHO. I am fine with it anyway, as I can't think of a better way either.
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20963
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20963#discussion_r179934861

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala ---
(comment on:)
+    val predefinedSample = Set[Int](2, 8, 15, 27, 30, 34, 35, 37, 44, 46,
+      57, 62, 68, 72)
--- End diff --

No need to have so many elements in this set. Please combine these with the tests in your CSV PR: instead of calling `json()`, we can use `format("json")`; then you can combine the test cases for both CSV and JSON.
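The suggestion above can be sketched as a minimal, hypothetical shape for a shared test. The `inferColumnType` stub below is mine and stands in for building the reader via `spark.read.format(fmt).option("samplingRatio", 0.1)` and inspecting the inferred schema; it is not real Spark code:

```scala
// Hypothetical shape of one test body shared between CSV and JSON via
// format("..."). `inferColumnType` is a stub standing in for the real
// reader and schema inspection in the suite.
def inferColumnType(fmt: String): String = fmt match {
  case "json" | "csv" => "double" // stub result: sampled rows are doubles
  case other => sys.error(s"unsupported format: $other")
}

// One assertion body, parameterized only by the datasource name.
Seq("json", "csv").foreach { fmt =>
  assert(inferColumnType(fmt) == "double", s"schema mismatch for $fmt")
}
println("both formats covered by one test body")
```

The design point is simply that `format(fmt)` makes the datasource a parameter, so the writer loop, sampling option, and schema assertion are written once.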
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20963#discussion_r179934815

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala ---
(comment on: `val ds = spark.read.option("samplingRatio", 0.1).json(path.getCanonicalPath)`)
--- End diff --

Yes. The seed is also given.
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/20963#discussion_r179257489

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala ---
(comment on: `val ds = spark.read.option("samplingRatio", 0.1).json(path.getCanonicalPath)`)
--- End diff --

It could be, if the partitionIndex were flaky here: https://github.com/apache/spark/blob/2ce37b50fc01558f49ad22f89c8659f50544ffec/sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala#L320, but in both tests there is only one partition, with the stable index `0`.
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20963#discussion_r178691639

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala ---
(comment on: `val ds = spark.read.option("samplingRatio", 0.1).json(path.getCanonicalPath)`)
--- End diff --

@MaxGekk, wouldn't this test be flaky?
GitHub user MaxGekk opened a pull request: https://github.com/apache/spark/pull/20963

[SPARK-23849][SQL] Tests for the samplingRatio option of JSON datasource

## What changes were proposed in this pull request?

The proposed tests check that only a subset of the input dataset is touched during schema inferring.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/MaxGekk/spark-1 json-sampling-tests

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20963.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #20963

commit 76e38a8670303afe20b530fc2a837a1363a01974
Author: Maxim Gekk
Date: 2018-04-02T13:28:32Z

    Adding samplingRation tests for json

commit 1acc3ece4121ffc3209fb65a20526dec574820e5
Author: Maxim Gekk
Date: 2018-04-02T13:34:06Z

    Removing debug code

commit 0d5fcfb59e112cc989753404813d32432257a0f6
Author: Maxim Gekk
Date: 2018-04-02T18:45:18Z

    Adding the ticket ref to test titles

commit a664465d8a3042f06cf7a866283c34a674b24939
Author: Maxim Gekk
Date: 2018-04-02T18:48:56Z

    Making Scala style checker happy
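The data the proposed test writes can be previewed without Spark. Below is a minimal stand-alone sketch of the generator loop (plain Scala, no Spark dependency): rows whose index is in `predefinedSample` get an integer `f1`, all others a double, so the type inferred for `f1` reveals which rows the sampler actually read.

```scala
// Same generation rule as in the proposed test, as pure line construction:
// indices in predefinedSample produce integer f1 values, the rest doubles.
val predefinedSample = Set(2, 8, 15, 27, 30, 34, 35, 37, 44, 46, 57, 62, 68, 72)

val lines: Seq[String] = (0 until 100).map { i =>
  if (predefinedSample.contains(i)) s"""{"f1":$i}"""
  else s"""{"f1":${i.toDouble + 0.1}}"""
}

println(lines(2)) // {"f1":2}
println(lines(3)) // {"f1":3.1}
```

If schema inference visits only the predefined sample, `f1` comes out as an integer type; if it touches any other row, it widens to double, which is what the test asserts on.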