[GitHub] spark issue #20851: [SPARK-23727][SQL] Support for pushing down filters for ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20851 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/88353/ --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20851: [SPARK-23727][SQL] Support for pushing down filters for ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20851 **[Test build #88353 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88353/testReport)** for PR 20851 at commit [`15bd28d`](https://github.com/apache/spark/commit/15bd28d93613acf0adb0f2762977bcd233cf3b9f). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20798: [SPARK-23645][PYTHON] Allow python udfs to be called wit...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20798 **[Test build #88360 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88360/testReport)** for PR 20798 at commit [`65de58f`](https://github.com/apache/spark/commit/65de58f04c0e54ce13274a89e8aae1346dfa93be).
[GitHub] spark issue #18906: [SPARK-21692][PYSPARK][SQL] Add nullability support to P...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18906 Merged build finished. Test FAILed.
[GitHub] spark issue #18906: [SPARK-21692][PYSPARK][SQL] Add nullability support to P...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18906 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/88354/
[GitHub] spark issue #18906: [SPARK-21692][PYSPARK][SQL] Add nullability support to P...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18906 **[Test build #88354 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88354/testReport)** for PR 18906 at commit [`64f0500`](https://github.com/apache/spark/commit/64f05000a2a323f260e0ef7a385096b7a10b2ef1). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20856: [SPARK-23731][SQL] FileSourceScanExec throws NullPointer...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20856 Merged build finished. Test PASSed.
[GitHub] spark issue #20856: [SPARK-23731][SQL] FileSourceScanExec throws NullPointer...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20856 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1595/
[GitHub] spark issue #20856: [SPARK-23731][SQL] FileSourceScanExec throws NullPointer...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20856 **[Test build #88359 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88359/testReport)** for PR 20856 at commit [`3981421`](https://github.com/apache/spark/commit/39814216026da32eee5aabf3886bbedd3b90ed08).
[GitHub] spark pull request #20856: [SPARK-23731][SQL] FileSourceScanExec throws Null...
GitHub user jaceklaskowski opened a pull request: https://github.com/apache/spark/pull/20856 [SPARK-23731][SQL] FileSourceScanExec throws NullPointerException in subexpression elimination

## What changes were proposed in this pull request?

Avoids ("fixes") a NullPointerException in subexpression elimination for subqueries with FileSourceScanExec.

## How was this patch tested?

Local build. No new tests as I could not reproduce it other than using the query and data under NDA. Waiting for Jenkins.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jaceklaskowski/spark SPARK-23731-FileSourceScanExec-throws-NPE

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20856.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20856

commit 39814216026da32eee5aabf3886bbedd3b90ed08
Author: Jacek Laskowski
Date: 2018-03-18T17:12:32Z

[SPARK-23731][SQL] FileSourceScanExec throws NullPointerException in subexpression elimination
[GitHub] spark issue #17254: [SPARK-19917][SQL]qualified partition path stored in cat...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17254 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1594/
[GitHub] spark issue #17254: [SPARK-19917][SQL]qualified partition path stored in cat...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17254 Build finished. Test PASSed.
[GitHub] spark pull request #20855: [SPARK-23731][SQL] FileSourceScanExec throws Null...
GitHub user jaceklaskowski opened a pull request: https://github.com/apache/spark/pull/20855 [SPARK-23731][SQL] FileSourceScanExec throws NullPointerException in subexpression elimination

## What changes were proposed in this pull request?

Avoids (not necessarily fixes) a NullPointerException in subexpression elimination for subqueries with FileSourceScanExec.

## How was this patch tested?

Local build. No new tests as I could not reproduce it other than using the query and data under NDA. Waiting for Jenkins.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jaceklaskowski/spark SPARK-23731-FileSourceScanExec-throws-NPE

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20855.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20855

commit 8ef323c572cee181e3bdbddeeb7119eda03d78f4
Author: Dongjoon Hyun
Date: 2018-01-17T06:32:18Z

[SPARK-23072][SQL][TEST] Add a Unicode schema test for file-based data sources

## What changes were proposed in this pull request?

After [SPARK-20682](https://github.com/apache/spark/pull/19651), Apache Spark 2.3 is able to read ORC files with a Unicode schema. Previously, it raised `org.apache.spark.sql.catalyst.parser.ParseException`. This PR adds a Unicode schema test for the CSV/JSON/ORC/Parquet file-based data sources. Note that the TEXT data source only has [a single column with the fixed name 'value'](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala#L71).

## How was this patch tested?

Pass the newly added test case.

Author: Dongjoon Hyun

Closes #20266 from dongjoon-hyun/SPARK-23072.

(cherry picked from commit a0aedb0ded4183cc33b27e369df1cbf862779e26)
Signed-off-by: Wenchen Fan

commit bfbc2d41b8a9278b347b6df2d516fe4679b41076
Author: Henry Robinson
Date: 2018-01-17T08:01:41Z

[SPARK-23062][SQL] Improve EXCEPT documentation

## What changes were proposed in this pull request?

Make the default behavior of EXCEPT (i.e. EXCEPT DISTINCT) more explicit in the documentation, and call out the change in behavior from 1.x.

Author: Henry Robinson

Closes #20254 from henryr/spark-23062.

(cherry picked from commit 1f3d933e0bd2b1e934a233ed699ad39295376e71)
Signed-off-by: gatorsmile

commit cbb6bda437b0d2832496b5c45f8264e5527f1cce
Author: Dongjoon Hyun
Date: 2018-01-17T13:53:36Z

[SPARK-21783][SQL] Turn on ORC filter push-down by default

## What changes were proposed in this pull request?

ORC filter push-down has been disabled by default from the beginning, [SPARK-2883](https://github.com/apache/spark/commit/aa31e431fc09f0477f1c2351c6275769a31aca90#diff-41ef65b9ef5b518f77e2a03559893f4dR149). Now, Apache Spark starts to depend on Apache ORC 1.4.1. For Apache Spark 2.3, this PR turns on ORC filter push-down by default like Parquet ([SPARK-9207](https://issues.apache.org/jira/browse/SPARK-21783)) as a part of [SPARK-20901](https://issues.apache.org/jira/browse/SPARK-20901), "Feature parity for ORC with Parquet".

## How was this patch tested?

Pass the existing tests.

Author: Dongjoon Hyun

Closes #20265 from dongjoon-hyun/SPARK-21783.

(cherry picked from commit 0f8a28617a0742d5a99debfbae91222c2e3b5cec)
Signed-off-by: Wenchen Fan

commit aae73a21a42fa366a09c2be1a4b91308ef211beb
Author: Wang Gengliang
Date: 2018-01-17T16:05:26Z

[SPARK-23079][SQL] Fix query constraints propagation with aliases

## What changes were proposed in this pull request?

Previously, PR #19201 fixed the problem of non-converging constraints. After that, PR #19149 improved the loop so that constraints are inferred only once, and the problem of non-converging constraints was gone. However, the case below will fail.

```
spark.range(5).write.saveAsTable("t")
val t = spark.read.table("t")
val left = t.withColumn("xid", $"id" + lit(1)).as("x")
val right = t.withColumnRenamed("id", "xid").as("y")
val df = left.join(right, "xid").filter("id = 3").toDF()
checkAnswer(df, Row(4, 3))
```

This is because `aliasMap` replaces all the aliased children. See the test case in the PR for details. This PR fixes the bug by removing useless code for preventing non-converging constraints. It could also be fixed with #20270, but this is much simpler and cleans up the code.

## How was this patch tested?

Unit test

Author: Wang Gengliang

Closes #20278 from gengliangwang/FixConstraintSimple.

(cherry picked from commit 8598a982b4147abe5f1aae005fea0
[GitHub] spark pull request #20855: [SPARK-23731][SQL] FileSourceScanExec throws Null...
Github user jaceklaskowski closed the pull request at: https://github.com/apache/spark/pull/20855
[GitHub] spark issue #20742: [SPARK-23572][docs] Bring "security.md" up to date.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20742 Merged build finished. Test PASSed.
[GitHub] spark issue #20742: [SPARK-23572][docs] Bring "security.md" up to date.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20742 **[Test build #88358 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88358/testReport)** for PR 20742 at commit [`53c1710`](https://github.com/apache/spark/commit/53c1710f54888714744b3f0934ceeb732ed88f81).
[GitHub] spark issue #20742: [SPARK-23572][docs] Bring "security.md" up to date.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20742 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1593/
[GitHub] spark issue #20742: [SPARK-23572][docs] Bring "security.md" up to date.
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/20742 retest this please
[GitHub] spark issue #20774: [SPARK-23549][SQL] Cast to timestamp when comparing time...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20774 **[Test build #88357 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88357/testReport)** for PR 20774 at commit [`5fbbc30`](https://github.com/apache/spark/commit/5fbbc30625b756b3671bce1e6677e7382fde5eec).
[GitHub] spark issue #20774: [SPARK-23549][SQL] Cast to timestamp when comparing time...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20774 Merged build finished. Test PASSed.
[GitHub] spark issue #20774: [SPARK-23549][SQL] Cast to timestamp when comparing time...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20774 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1592/
[GitHub] spark issue #20851: [SPARK-23727][SQL] Support for pushing down filters for ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20851 **[Test build #88356 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88356/testReport)** for PR 20851 at commit [`1f2b450`](https://github.com/apache/spark/commit/1f2b45013305fddd7bbf75a56ae5d1e3b6979d94).
[GitHub] spark issue #20717: [SPARK-23564][SQL] Add isNotNull check for left anti and...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20717 Any more comments, @cloud-fan?
[GitHub] spark issue #20719: [SPARK-23568][ML] Use metadata numAttributes if availabl...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20719 @holdenk @sethah @srowen @viirya could you please help review this PR if you have time? Thanks.
[GitHub] spark issue #20701: [SPARK-23528][ML] Add numIter to ClusteringSummary
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20701 **[Test build #88355 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88355/testReport)** for PR 20701 at commit [`f6ee4a2`](https://github.com/apache/spark/commit/f6ee4a2b4bb2444d65ab0e26a141304b327bd998).
[GitHub] spark issue #20701: [SPARK-23528][ML] Add numIter to ClusteringSummary
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20701 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1591/
[GitHub] spark issue #20701: [SPARK-23528][ML] Add numIter to ClusteringSummary
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20701 Merged build finished. Test PASSed.
[GitHub] spark issue #18906: [SPARK-21692][PYSPARK][SQL] Add nullability support to P...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18906 **[Test build #88354 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88354/testReport)** for PR 18906 at commit [`64f0500`](https://github.com/apache/spark/commit/64f05000a2a323f260e0ef7a385096b7a10b2ef1).
[GitHub] spark pull request #20851: [SPARK-23727][SQL] Support for pushing down filte...
Github user yucai commented on a diff in the pull request: https://github.com/apache/spark/pull/20851#discussion_r175293032 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala --- @@ -148,6 +193,15 @@ private[parquet] object ParquetFilters { case BinaryType => (n: String, v: Any) => FilterApi.gtEq(binaryColumn(n), Binary.fromReusedByteArray(v.asInstanceOf[Array[Byte]])) +case DateType => --- End diff -- I have added it; kindly help review.
[GitHub] spark pull request #20851: [SPARK-23727][SQL] Support for pushing down filte...
Github user yucai commented on a diff in the pull request: https://github.com/apache/spark/pull/20851#discussion_r175293026 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala --- @@ -313,6 +314,36 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex } } + test("filter pushdown - date") { +implicit class IntToDate(int: Int) { + def d: Date = new Date(Date.valueOf("2018-03-01").getTime + 24 * 60 * 60 * 1000 * (int - 1)) +} + +withParquetDataFrame((1 to 4).map(i => Tuple1(i.d))) { implicit df => --- End diff -- Could you kindly give me some examples of what kinds of boundary tests you mean? I checked the Parquet integer push-down and the ORC date type push-down, and it seems all of their tests are covered.
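For readers following the discussion, the `IntToDate` helper quoted above can be reproduced standalone: `i.d` is the date `(i - 1)` days after 2018-03-01, built with millisecond arithmetic. The sketch below mirrors the quoted test diff (object and value names are illustrative, not part of the PR) and uses a `Long` constant for the milliseconds-per-day factor, since the quoted `24 * 60 * 60 * 1000 * (int - 1)` is `Int` arithmetic that would overflow for offsets beyond roughly 24 days:

```scala
import java.sql.Date

// Standalone sketch of the IntToDate helper from the quoted test diff.
// i.d is the date (i - 1) days after 2018-03-01.
object IntToDateSketch {
  // 24 * 60 * 60 * 1000 milliseconds per day, as a Long to avoid Int overflow.
  val MillisPerDay: Long = 86400000L

  implicit class IntToDate(val int: Int) extends AnyVal {
    def d: Date = new Date(Date.valueOf("2018-03-01").getTime + MillisPerDay * (int - 1))
  }

  def main(args: Array[String]): Unit = {
    val (one, four) = (1, 4)
    // 1.d is the start date; consecutive values are exactly one day apart.
    assert(one.d == Date.valueOf("2018-03-01"))
    assert(four.d.getTime - one.d.getTime == 3 * MillisPerDay)
  }
}
```

The reviewer's question about boundary tests would then amount to choosing predicates at the edges of this range (e.g. strictly before `1.d` or strictly after `4.d`).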
[GitHub] spark issue #20854: [SPARK-23712][SQL] Interpreted UnsafeRowJoiner [WIP]
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20854 Merged build finished. Test PASSed.
[GitHub] spark issue #20854: [SPARK-23712][SQL] Interpreted UnsafeRowJoiner [WIP]
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20854 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1590/
[GitHub] spark issue #20854: [SPARK-23712][SQL] Interpreted UnsafeRowJoiner [WIP]
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20854 **[Test build #88352 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88352/testReport)** for PR 20854 at commit [`d0b40a9`](https://github.com/apache/spark/commit/d0b40a9ff6368051d737224dd9931a7ef1b428cb).
[GitHub] spark issue #20851: [SPARK-23727][SQL] Support for pushing down filters for ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20851 **[Test build #88353 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88353/testReport)** for PR 20851 at commit [`15bd28d`](https://github.com/apache/spark/commit/15bd28d93613acf0adb0f2762977bcd233cf3b9f).
[GitHub] spark issue #20854: [SPARK-23712][SQL] Interpreted UnsafeRowJoiner [WIP]
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/20854 retest this please
[GitHub] spark pull request #20701: [SPARK-23528][ML] Add numIter to ClusteringSummar...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20701#discussion_r175292115 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeansModel.scala --- @@ -36,8 +36,9 @@ import org.apache.spark.sql.{Row, SparkSession} * A clustering model for K-means. Each point belongs to the cluster with the closest center. */ @Since("0.8.0") -class KMeansModel @Since("2.4.0") (@Since("1.0.0") val clusterCenters: Array[Vector], - @Since("2.4.0") val distanceMeasure: String) +class KMeansModel private[spark] (@Since("1.0.0") val clusterCenters: Array[Vector], --- End diff -- I just didn't want the user to be able to create a KMeansModel setting the number of iterations. I moved the other constructor, which is still available. I don't have strong reasons against making this public, so I am removing the private clause if you think it is best to leave it public.
[GitHub] spark pull request #20701: [SPARK-23528][ML] Add numIter to ClusteringSummar...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20701#discussion_r175292059 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala --- @@ -312,4 +312,5 @@ class BisectingKMeansSummary private[clustering] ( predictions: DataFrame, predictionCol: String, featuresCol: String, -k: Int) extends ClusteringSummary(predictions, predictionCol, featuresCol, k) +k: Int, +numIter: Int) extends ClusteringSummary(predictions, predictionCol, featuresCol, k, numIter) --- End diff -- Thanks for pointing this out; I completely missed it. I am adding them.
[GitHub] spark issue #20854: [SPARK-23712][SQL] Interpreted UnsafeRowJoiner [WIP]
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20854 Merged build finished. Test FAILed.
[GitHub] spark pull request #20701: [SPARK-23528][ML] Add numIter to ClusteringSummar...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20701#discussion_r175292026 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeansModel.scala --- @@ -46,6 +47,10 @@ class KMeansModel @Since("2.4.0") (@Since("1.0.0") val clusterCenters: Array[Vec private val clusterCentersWithNorm = if (clusterCenters == null) null else clusterCenters.map(new VectorWithNorm(_)) + @Since("2.4.0") --- End diff -- I think this is the right one. 0.8.0 is the annotation for the `KMeansModel` class, while the previous main constructor was added (by me) in a previous PR for 2.4.0 in order to add the `distanceMeasure` variable.
[GitHub] spark issue #20854: [SPARK-23712][SQL] Interpreted UnsafeRowJoiner [WIP]
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20854 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/88351/
[GitHub] spark issue #20854: [SPARK-23712][SQL] Interpreted UnsafeRowJoiner [WIP]
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20854 **[Test build #88351 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88351/testReport)** for PR 20854 at commit [`d0b40a9`](https://github.com/apache/spark/commit/d0b40a9ff6368051d737224dd9931a7ef1b428cb). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20774: [SPARK-23549][SQL] Cast to timestamp when comparing time...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20774 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/88348/
[GitHub] spark issue #20774: [SPARK-23549][SQL] Cast to timestamp when comparing time...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20774 Merged build finished. Test PASSed.
[GitHub] spark issue #20774: [SPARK-23549][SQL] Cast to timestamp when comparing time...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20774 **[Test build #88348 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88348/testReport)** for PR 20774 at commit [`a16deaa`](https://github.com/apache/spark/commit/a16deaa2ba54657a69b0cb0f09ec86c80339baa9). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * ` case class PromoteStrings(conf: SQLConf) extends TypeCoercionRule ` * ` case class InConversion(conf: SQLConf) extends TypeCoercionRule `
[GitHub] spark issue #20830: [SPARK-23691][PYTHON] Use sql_conf util in PySpark tests...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20830 Merged build finished. Test PASSed.
[GitHub] spark issue #20830: [SPARK-23691][PYTHON] Use sql_conf util in PySpark tests...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20830 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/88350/
[GitHub] spark issue #20830: [SPARK-23691][PYTHON] Use sql_conf util in PySpark tests...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20830 **[Test build #88350 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88350/testReport)** for PR 20830 at commit [`b7a4a91`](https://github.com/apache/spark/commit/b7a4a914fbdaddb4c56ee24257f477ff984e170e). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20854: [SPARK-23712][SQL] Interpreted UnsafeRowJoiner [WIP]
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20854 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1589/
[GitHub] spark issue #20854: [SPARK-23712][SQL] Interpreted UnsafeRowJoiner [WIP]
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20854 Merged build finished. Test PASSed.
[GitHub] spark issue #20854: [SPARK-23712][SQL] Interpreted UnsafeRowJoiner [WIP]
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20854 **[Test build #88351 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88351/testReport)** for PR 20854 at commit [`d0b40a9`](https://github.com/apache/spark/commit/d0b40a9ff6368051d737224dd9931a7ef1b428cb).
[GitHub] spark pull request #20854: [SPARK-23712][SQL] Interpreted UnsafeRowJoiner [W...
GitHub user hvanhovell opened a pull request: https://github.com/apache/spark/pull/20854 [SPARK-23712][SQL] Interpreted UnsafeRowJoiner [WIP]

## What changes were proposed in this pull request?

This PR adds an interpreted version of `UnsafeRowJoiner` to Spark SQL. Its performance is almost on par with the code-generated `UnsafeRowJoiner`. There seems to be an overhead of 10ns per call. It might be an idea to not use code generation at all for an `UnsafeRowJoiner`.

## How was this patch tested?

Modified existing row joiner tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/hvanhovell/spark SPARK-23712

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20854.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20854

commit b637ded5ddd38f58e2c0d1b5172ebed5cb9014e2
Author: Herman van Hovell
Date: 2018-03-17T13:42:13Z

Add interpreted unsafe row joiner

commit d0b40a9ff6368051d737224dd9931a7ef1b428cb
Author: Herman van Hovell
Date: 2018-03-18T12:16:30Z

Add benchmark
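For intuition about the interpreted-versus-generated trade-off this PR description mentions, the sketch below is a simplified analogy, not Spark's actual `UnsafeRowJoiner` (which concatenates binary `UnsafeRow` payloads and fixes up null bitsets and field offsets). An interpreted joiner runs one generic loop for every schema, trading a small per-call overhead for not having to compile specialized code:

```scala
// Hedged illustration, not Spark's UnsafeRowJoiner: an "interpreted" joiner
// concatenates two generic rows field by field at runtime. A code-generated
// joiner would instead emit copy code specialized to one fixed pair of schemas.
object InterpretedJoinerSketch {
  type Row = Array[Any]

  // One generic routine handles every schema; the genericity is where the
  // small per-call overhead (the PR observes ~10ns) comes from.
  def join(left: Row, right: Row): Row = {
    val out = new Array[Any](left.length + right.length)
    System.arraycopy(left, 0, out, 0, left.length)
    System.arraycopy(right, 0, out, left.length, right.length)
    out
  }

  def main(args: Array[String]): Unit = {
    val left: Row = Array(1, "a")
    val right: Row = Array(2.0)
    assert(join(left, right).toSeq == Seq(1, "a", 2.0))
  }
}
```

When the per-call cost is this close to the generated version, skipping code generation (as the author suggests) avoids compilation latency and code-cache pressure at a negligible runtime price.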
[GitHub] spark issue #20830: [SPARK-23691][PYTHON] Use sql_conf util in PySpark tests...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20830 **[Test build #88350 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88350/testReport)** for PR 20830 at commit [`b7a4a91`](https://github.com/apache/spark/commit/b7a4a914fbdaddb4c56ee24257f477ff984e170e).
[GitHub] spark issue #20830: [SPARK-23691][PYTHON] Use sql_conf util in PySpark tests...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20830 Merged build finished. Test PASSed.
[GitHub] spark issue #20830: [SPARK-23691][PYTHON] Use sql_conf util in PySpark tests...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20830 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1588/ Test PASSed.
[GitHub] spark issue #20830: [SPARK-23691][PYTHON] Use sql_conf util in PySpark tests...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20830 retest this please
[GitHub] spark issue #20830: [SPARK-23691][PYTHON] Use sql_conf util in PySpark tests...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20830 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/88349/ Test FAILed.
[GitHub] spark issue #20830: [SPARK-23691][PYTHON] Use sql_conf util in PySpark tests...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20830 **[Test build #88349 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88349/testReport)** for PR 20830 at commit [`b7a4a91`](https://github.com/apache/spark/commit/b7a4a914fbdaddb4c56ee24257f477ff984e170e). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20830: [SPARK-23691][PYTHON] Use sql_conf util in PySpark tests...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20830 Merged build finished. Test FAILed.
[GitHub] spark issue #20830: [SPARK-23691][PYTHON] Use sql_conf util in PySpark tests...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20830 Merged build finished. Test PASSed.
[GitHub] spark issue #20830: [SPARK-23691][PYTHON] Use sql_conf util in PySpark tests...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20830 **[Test build #88349 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88349/testReport)** for PR 20830 at commit [`b7a4a91`](https://github.com/apache/spark/commit/b7a4a914fbdaddb4c56ee24257f477ff984e170e).
[GitHub] spark issue #20830: [SPARK-23691][PYTHON] Use sql_conf util in PySpark tests...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20830 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1587/ Test PASSed.
[GitHub] spark issue #20830: [SPARK-23691][PYTHON] Use sql_conf util in PySpark tests...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20830 retest this please
[GitHub] spark issue #20841: [SPARK-23706][PYTHON] spark.conf.get(value, default=None...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20841 Merged to master and branch-2.3. Thank you @ueshin, @BryanCutler and @viirya for reviewing this.
[GitHub] spark pull request #20841: [SPARK-23706][PYTHON] spark.conf.get(value, defau...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20841
[GitHub] spark issue #20774: [SPARK-23549][SQL] Cast to timestamp when comparing time...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20774 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1586/ Test PASSed.
[GitHub] spark issue #20774: [SPARK-23549][SQL] Cast to timestamp when comparing time...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20774 Merged build finished. Test PASSed.
[GitHub] spark issue #20774: [SPARK-23549][SQL] Cast to timestamp when comparing time...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20774 **[Test build #88348 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88348/testReport)** for PR 20774 at commit [`a16deaa`](https://github.com/apache/spark/commit/a16deaa2ba54657a69b0cb0f09ec86c80339baa9).
[GitHub] spark issue #20774: [SPARK-23549][SQL] Cast to timestamp when comparing time...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/20774 retest this please
[GitHub] spark pull request #20851: [SPARK-23727][SQL] Support DATE predict push down...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20851#discussion_r175283818 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala --- @@ -50,6 +50,15 @@ private[parquet] object ParquetFilters { (n: String, v: Any) => FilterApi.eq( binaryColumn(n), Option(v).map(b => Binary.fromReusedByteArray(v.asInstanceOf[Array[Byte]])).orNull) +case DateType => + (n: String, v: Any) => { +FilterApi.eq( + intColumn(n), + Option(v).map{ date => --- End diff -- nit: `map{` -> `map {`
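For context on the diff above: Parquet physically stores a DATE as an INT32 holding days since the Unix epoch, which is why the pushed-down filter compares against `intColumn`. A minimal sketch of that conversion using plain `java.time` (Spark itself uses its internal `DateTimeUtils`, so the helper name here is illustrative only):

```scala
import java.time.LocalDate

// Days since 1970-01-01: the INT32 value Parquet keeps for a DATE column.
def daysSinceEpoch(date: LocalDate): Int = date.toEpochDay.toInt
```
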
[GitHub] spark pull request #20849: [SPARK-23723] New charset option for json datasou...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20849#discussion_r175283674 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala --- @@ -85,6 +85,12 @@ private[sql] class JSONOptions( val multiLine = parameters.get("multiLine").map(_.toBoolean).getOrElse(false) + /** + * Standard charset name. For example UTF-8, UTF-16 and UTF-32. + * If charset is not specified (None), it will be detected automatically. --- End diff -- > Could you tell me how this PR blocks solving the problem in Hadoop's LineReader? Because the exposed `charset` option is incomplete here due to the encodings of the line delimiters. Also, I want to see how we can solve that problem in SPARK-23724 first too. I am actually not that worried about the whole set of changes proposed here for now. Why don't we just fix that problem first if you plan to fix both eventually anyway?
[GitHub] spark pull request #20849: [SPARK-23723] New charset option for json datasou...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20849#discussion_r175283491 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala --- @@ -2063,4 +2063,178 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData { ) } } + + def testFile(fileName: String): String = { + Thread.currentThread().getContextClassLoader.getResource(fileName).toString + } + + test("json in UTF-16 with BOM") { +val fileName = "json-tests/utf16WithBOM.json" +val schema = new StructType().add("firstName", StringType).add("lastName", StringType) +val jsonDF = spark.read.schema(schema) + // The mode filters null rows produced because new line delimiter + // for UTF-8 is used by default. --- End diff -- Also, this is where we need a decision, right? It already does not work correctly. Another option for a minimal fix to follow RFC 7159 is to document that we don't support other encodings for now, to be clear. I approved https://github.com/apache/spark/pull/20614 only on the assumption that it causes an actual issue for some sites and the release was close (which is true, I guess, now).
[GitHub] spark pull request #20849: [SPARK-23723] New charset option for json datasou...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/20849#discussion_r175283468 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala --- @@ -85,6 +85,12 @@ private[sql] class JSONOptions( val multiLine = parameters.get("multiLine").map(_.toBoolean).getOrElse(false) + /** + * Standard charset name. For example UTF-8, UTF-16 and UTF-32. + * If charset is not specified (None), it will be detected automatically. --- End diff -- A fix in the Hadoop line reader and this PR solve two different problems. Any fix in the Hadoop line reader will not fix the problem of wrong encoding detection. I don't understand why this PR must depend on a fix in the line reader. I would say a custom record separator will solve the newline problem too (https://issues.apache.org/jira/browse/SPARK-23724). > Shouldn't we better fix the text datasource with Hadoop's line reader first? Could you tell me how this PR blocks solving the problem in Hadoop's LineReader?
[GitHub] spark pull request #20849: [SPARK-23723] New charset option for json datasou...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20849#discussion_r175283216 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala --- @@ -2063,4 +2063,178 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData { ) } } + + def testFile(fileName: String): String = { + Thread.currentThread().getContextClassLoader.getResource(fileName).toString + } + + test("json in UTF-16 with BOM") { +val fileName = "json-tests/utf16WithBOM.json" +val schema = new StructType().add("firstName", StringType).add("lastName", StringType) +val jsonDF = spark.read.schema(schema) + // The mode filters null rows produced because new line delimiter + // for UTF-8 is used by default. --- End diff -- @MaxGekk, see what happens in the test code here now. Lines are separated by a newline with UTF-8 and then the records are parsed by a different encoding.
[GitHub] spark pull request #20849: [SPARK-23723] New charset option for json datasou...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20849#discussion_r175282994 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala --- @@ -85,6 +85,12 @@ private[sql] class JSONOptions( val multiLine = parameters.get("multiLine").map(_.toBoolean).getOrElse(false) + /** + * Standard charset name. For example UTF-8, UTF-16 and UTF-32. + * If charset is not specified (None), it will be detected automatically. --- End diff -- JSON's schema inference uses the text datasource to separate the lines before we go through the Jackson parser, where the charset for newlines should be respected. Shouldn't we better fix the text datasource with Hadoop's line reader first?
[GitHub] spark pull request #20849: [SPARK-23723] New charset option for json datasou...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/20849#discussion_r175282421 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala --- @@ -2063,4 +2063,178 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData { ) } } + + def testFile(fileName: String): String = { + Thread.currentThread().getContextClassLoader.getResource(fileName).toString + } + + test("json in UTF-16 with BOM") { +val fileName = "json-tests/utf16WithBOM.json" +val schema = new StructType().add("firstName", StringType).add("lastName", StringType) +val jsonDF = spark.read.schema(schema) + // The mode filters null rows produced because new line delimiter + // for UTF-8 is used by default. --- End diff -- We declare that we are able to read JSON. According to RFC 7159 (8.1 Character Encoding): ``` JSON text SHALL be encoded in UTF-8, UTF-16, or UTF-32. The default encoding is UTF-8, and JSON texts that are encoded in UTF-8 are interoperable in the sense that they will be read successfully by the maximum number of implementations; there are many implementations that cannot successfully read texts in other encodings (such as UTF-16 and UTF-32). ``` Users can think that Spark can read JSON in a charset different from UTF-8 because it SHALL do that according to the RFC, and we DON'T directly declare that JSON in such encodings cannot be read successfully.
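Encoding detection for JSON text conventionally starts by sniffing a leading byte-order mark; a simplified sketch of that step (a toy helper, not Jackson's actual code; Jackson's real detection also inspects the zero-byte pattern of BOM-less input, which this sketch does not):

```scala
// Detect a charset from a leading byte-order mark, if any.
// Returns None for BOM-less input, where detection must fall back to
// heuristics (and can guess wrong, as discussed in this thread).
// The 4-byte UTF-32 marks are checked before the 2-byte UTF-16 marks,
// because 0xFF 0xFE is a prefix of the UTF-32LE BOM.
def detectBom(bytes: Array[Byte]): Option[String] = {
  def u(i: Int) = bytes(i) & 0xFF
  if (bytes.length >= 4 && u(0) == 0x00 && u(1) == 0x00 && u(2) == 0xFE && u(3) == 0xFF) Some("UTF-32BE")
  else if (bytes.length >= 4 && u(0) == 0xFF && u(1) == 0xFE && u(2) == 0x00 && u(3) == 0x00) Some("UTF-32LE")
  else if (bytes.length >= 3 && u(0) == 0xEF && u(1) == 0xBB && u(2) == 0xBF) Some("UTF-8")
  else if (bytes.length >= 2 && u(0) == 0xFE && u(1) == 0xFF) Some("UTF-16BE")
  else if (bytes.length >= 2 && u(0) == 0xFF && u(1) == 0xFE) Some("UTF-16LE")
  else None
}
```
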
[GitHub] spark pull request #20849: [SPARK-23723] New charset option for json datasou...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/20849#discussion_r175282099 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala --- @@ -85,6 +85,12 @@ private[sql] class JSONOptions( val multiLine = parameters.get("multiLine").map(_.toBoolean).getOrElse(false) + /** + * Standard charset name. For example UTF-8, UTF-16 and UTF-32. + * If charset is not specified (None), it will be detected automatically. --- End diff -- ok. How does this one help to solve the problem I am trying to solve in this PR: Jackson's charset auto-detection mechanism can fail even on UTF-8 input and can infer the wrong charset (see https://github.com/apache/spark/pull/20302) for many reasons. And a user doesn't have any way to fix the issue and bypass the auto-detection.
[GitHub] spark pull request #20796: [SPARK-23649][SQL] Skipping chars disallowed in U...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20796#discussion_r175281945 --- Diff: common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java --- @@ -57,12 +57,39 @@ public Object getBaseObject() { return base; } public long getBaseOffset() { return offset; } - private static int[] bytesOfCodePointInUTF8 = {2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, -2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, -3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, -4, 4, 4, 4, 4, 4, 4, 4, -5, 5, 5, 5, -6, 6}; + /** + * A char in UTF-8 encoding can take 1-4 bytes depending on the first byte which + * indicates the size of the char. See Unicode standard in page 126: + * http://www.unicode.org/versions/Unicode10.0.0/UnicodeStandard-10.0.pdf + * + * Binary Hex Comments + * 0xxx 0x00..0x7F Only byte of a 1-byte character encoding + * 10xx 0x80..0xBF Continuation bytes (1-3 continuation bytes) + * 110x 0xC0..0xDF First byte of a 2-byte character encoding --- End diff -- hmm, is this `0xC2..0xDF`?
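The lookup in the diff above can be sketched as a standalone function (hypothetical name, not the actual UTF8String code; note viirya's point that in well-formed UTF-8 a 2-byte sequence actually starts at 0xC2, since 0xC0 and 0xC1 could only begin overlong encodings):

```scala
// Width in bytes of a UTF-8 encoded character, derived from its first byte.
// Returns -1 for continuation bytes (0x80..0xBF), which never start a character.
def utf8CharWidth(b: Byte): Int = {
  val u = b & 0xFF
  if (u < 0x80) 1       // 0xxxxxxx: single-byte (ASCII)
  else if (u < 0xC0) -1 // 10xxxxxx: continuation byte, invalid as a start
  else if (u < 0xE0) 2  // 110xxxxx: 2-byte sequence (0xC0/0xC1 are overlong)
  else if (u < 0xF0) 3  // 1110xxxx: 3-byte sequence
  else 4                // 11110xxx: 4-byte sequence
}
```
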
[GitHub] spark issue #20841: [SPARK-23706][PYTHON] spark.conf.get(value, default=None...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/20841 LGTM.
[GitHub] spark pull request #20849: [SPARK-23723] New charset option for json datasou...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20849#discussion_r175281639 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala --- @@ -85,6 +85,12 @@ private[sql] class JSONOptions( val multiLine = parameters.get("multiLine").map(_.toBoolean).getOrElse(false) + /** + * Standard charset name. For example UTF-8, UTF-16 and UTF-32. + * If charset is not specified (None), it will be detected automatically. --- End diff -- See this https://github.com/apache/spark/commit/8fb2a02e2ce6832e3d9338a7d0148dfac9fa24c2. It uses the Text datasource to load lines when we infer the schema. If we fix newline encodings, the fix is required in the Text datasource first, I believe.
[GitHub] spark pull request #20849: [SPARK-23723] New charset option for json datasou...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20849#discussion_r175281659 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala --- @@ -2063,4 +2063,178 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData { ) } } + + def testFile(fileName: String): String = { + Thread.currentThread().getContextClassLoader.getResource(fileName).toString + } + + test("json in UTF-16 with BOM") { +val fileName = "json-tests/utf16WithBOM.json" +val schema = new StructType().add("firstName", StringType).add("lastName", StringType) +val jsonDF = spark.read.schema(schema) + // The mode filters null rows produced because new line delimiter + // for UTF-8 is used by default. --- End diff -- Could you point out where we support other encodings?
[GitHub] spark pull request #20849: [SPARK-23723] New charset option for json datasou...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/20849#discussion_r175281528 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala --- @@ -2063,4 +2063,178 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData { ) } } + + def testFile(fileName: String): String = { + Thread.currentThread().getContextClassLoader.getResource(fileName).toString + } + + test("json in UTF-16 with BOM") { +val fileName = "json-tests/utf16WithBOM.json" +val schema = new StructType().add("firstName", StringType).add("lastName", StringType) +val jsonDF = spark.read.schema(schema) + // The mode filters null rows produced because new line delimiter + // for UTF-8 is used by default. --- End diff -- Could you point me to the place in the docs where we strictly restrict the input charset to UTF-8 only?
[GitHub] spark pull request #20849: [SPARK-23723] New charset option for json datasou...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/20849#discussion_r175281468 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala --- @@ -85,6 +85,12 @@ private[sql] class JSONOptions( val multiLine = parameters.get("multiLine").map(_.toBoolean).getOrElse(false) + /** + * Standard charset name. For example UTF-8, UTF-16 and UTF-32. + * If charset is not specified (None), it will be detected automatically. --- End diff -- > schema inference in JSON is dependent on Text datasource Could you clarify this, please? It is not completely clear to me what you mean.
[GitHub] spark pull request #20849: [SPARK-23723] New charset option for json datasou...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20849#discussion_r175281525 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala --- @@ -2063,4 +2063,178 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData { ) } } + + def testFile(fileName: String): String = { + Thread.currentThread().getContextClassLoader.getResource(fileName).toString + } + + test("json in UTF-16 with BOM") { +val fileName = "json-tests/utf16WithBOM.json" +val schema = new StructType().add("firstName", StringType).add("lastName", StringType) +val jsonDF = spark.read.schema(schema) --- End diff -- I think you should have explained this in the PR description.
[GitHub] spark pull request #20849: [SPARK-23723] New charset option for json datasou...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20849#discussion_r175281445 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala --- @@ -2063,4 +2063,178 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData { ) } } + + def testFile(fileName: String): String = { + Thread.currentThread().getContextClassLoader.getResource(fileName).toString + } + + test("json in UTF-16 with BOM") { +val fileName = "json-tests/utf16WithBOM.json" +val schema = new StructType().add("firstName", StringType).add("lastName", StringType) +val jsonDF = spark.read.schema(schema) + // The mode filters null rows produced because new line delimiter + // for UTF-8 is used by default. --- End diff -- This test case checks if a JSON file encoded with UTF-16 can be parsed line by line with UTF-8 newlines, and then each line is parsed by Jackson. Your customer's use case accidentally works, and I think we have never documented this behaviour.
[GitHub] spark pull request #20849: [SPARK-23723] New charset option for json datasou...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/20849#discussion_r175281373 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala --- @@ -2063,4 +2063,178 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData { ) } } + + def testFile(fileName: String): String = { + Thread.currentThread().getContextClassLoader.getResource(fileName).toString + } + + test("json in UTF-16 with BOM") { +val fileName = "json-tests/utf16WithBOM.json" +val schema = new StructType().add("firstName", StringType).add("lastName", StringType) +val jsonDF = spark.read.schema(schema) --- End diff -- No, because of the many empty strings produced by Hadoop's LineRecordReader. It will be fixed in separate PRs for the issues SPARK-23725 and/or SPARK-23724. For now you have to specify the schema or use multiLine mode as a temporary workaround.
[GitHub] spark pull request #20849: [SPARK-23723] New charset option for json datasou...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20849#discussion_r175281365 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala --- @@ -85,6 +85,12 @@ private[sql] class JSONOptions( val multiLine = parameters.get("multiLine").map(_.toBoolean).getOrElse(false) + /** + * Standard charset name. For example UTF-8, UTF-16 and UTF-32. + * If charset is not specified (None), it will be detected automatically. --- End diff -- Shall we fix that first in the text datasource, since JSON schema inference depends on the Text datasource? You are exposing an incomplete option now.
[GitHub] spark pull request #20849: [SPARK-23723] New charset option for json datasou...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/20849#discussion_r175281238 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala --- @@ -2063,4 +2063,178 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData { ) } } + + def testFile(fileName: String): String = { + Thread.currentThread().getContextClassLoader.getResource(fileName).toString + } + + test("json in UTF-16 with BOM") { +val fileName = "json-tests/utf16WithBOM.json" +val schema = new StructType().add("firstName", StringType).add("lastName", StringType) +val jsonDF = spark.read.schema(schema) + // The mode filters null rows produced because new line delimiter + // for UTF-8 is used by default. --- End diff -- The test came from a customer's use case, where we broke backward compatibility with previous versions by forcibly setting the input stream to UTF-8: https://github.com/apache/spark/pull/20302 . You can see the test case in that PR where the Jackson parser is not able to detect the charset correctly.
[GitHub] spark pull request #20849: [SPARK-23723] New charset option for json datasou...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/20849#discussion_r175280808 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala --- @@ -85,6 +85,12 @@ private[sql] class JSONOptions( val multiLine = parameters.get("multiLine").map(_.toBoolean).getOrElse(false) + /** + * Standard charset name. For example UTF-8, UTF-16 and UTF-32. + * If charset is not specified (None), it will be detected automatically. --- End diff -- Do you mean the encoding of the record/line delimiter? It depends on the mode. In multiLine mode, Jackson is able to do that. In per-line mode, Hadoop's LineRecordReader could accept delimiters in any charset, but by default it splits the input by `'\r'`, `'\n'`, and `'\r\n'` encoded in UTF-8. This will be fixed in separate PRs for the issues: https://issues.apache.org/jira/browse/SPARK-23724 and https://issues.apache.org/jira/browse/SPARK-23725
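The delimiter mismatch described above is visible directly in the bytes of a newline under each encoding (plain JDK calls, nothing Spark-specific):

```scala
// Byte patterns of a newline under different encodings. A line reader
// that scans only for the single byte 0x0A (the UTF-8 newline) will
// split UTF-16 text in the middle of a two-byte code unit, leaving a
// stray 0x00 attached to a neighbouring record.
val utf8    = "\n".getBytes("UTF-8")     // 0x0A
val utf16le = "\n".getBytes("UTF-16LE")  // 0x0A 0x00
val utf16be = "\n".getBytes("UTF-16BE")  // 0x00 0x0A
```
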
[GitHub] spark issue #20850: [SPARK-23713][SQL] Cleanup UnsafeWriter and BufferHolder...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20850 Merged build finished. Test FAILed.
[GitHub] spark issue #20850: [SPARK-23713][SQL] Cleanup UnsafeWriter and BufferHolder...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20850 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/88347/ Test FAILed.
[GitHub] spark issue #20850: [SPARK-23713][SQL] Cleanup UnsafeWriter and BufferHolder...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20850 **[Test build #88347 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88347/testReport)** for PR 20850 at commit [`06e7435`](https://github.com/apache/spark/commit/06e7435c7a9f5278f468b75605c9aedc26d0f304). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20851: [SPARK-23727][SQL] Support DATE predict push down in par...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20851 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/88344/ Test FAILed.
[GitHub] spark issue #20851: [SPARK-23727][SQL] Support DATE predict push down in par...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20851 Merged build finished. Test FAILed.
[GitHub] spark issue #20774: [SPARK-23549][SQL] Cast to timestamp when comparing time...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20774 Merged build finished. Test FAILed.
[GitHub] spark issue #20774: [SPARK-23549][SQL] Cast to timestamp when comparing time...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20774 **[Test build #88346 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88346/testReport)** for PR 20774 at commit [`a16deaa`](https://github.com/apache/spark/commit/a16deaa2ba54657a69b0cb0f09ec86c80339baa9). * This patch **fails due to an unknown error code, -9**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * ` case class PromoteStrings(conf: SQLConf) extends TypeCoercionRule ` * ` case class InConversion(conf: SQLConf) extends TypeCoercionRule `
[GitHub] spark issue #20774: [SPARK-23549][SQL] Cast to timestamp when comparing time...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20774 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/88346/ Test FAILed.
[GitHub] spark issue #20851: [SPARK-23727][SQL] Support DATE predicate push down in par...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20851 **[Test build #88344 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88344/testReport)** for PR 20851 at commit [`079af71`](https://github.com/apache/spark/commit/079af71359bd49dc59c863f1a9a4f6fa28d5a8a0).
* This patch **fails due to an unknown error code, -9**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #20850: [SPARK-23713][SQL] Cleanup UnsafeWriter and BufferHolder...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20850 **[Test build #88347 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88347/testReport)** for PR 20850 at commit [`06e7435`](https://github.com/apache/spark/commit/06e7435c7a9f5278f468b75605c9aedc26d0f304).
[GitHub] spark issue #20850: [SPARK-23713][SQL] Cleanup UnsafeWriter and BufferHolder...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20850 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1585/ Test PASSed.
[GitHub] spark issue #20850: [SPARK-23713][SQL] Cleanup UnsafeWriter and BufferHolder...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20850 Merged build finished. Test PASSed.