[GitHub] spark issue #22693: [SPARK-25701][SQL] Supports calculation of table statist...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22693 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22693: [SPARK-25701][SQL] Supports calculation of table ...
GitHub user fjh100456 opened a pull request: https://github.com/apache/spark/pull/22693 [SPARK-25701][SQL] Supports calculation of table statistics from partition's catalog statistics. ## What changes were proposed in this pull request? When determining table statistics, if the `totalSize` of the table is not defined, we fall back to HDFS to get the table statistics when `spark.sql.statistics.fallBackToHdfs` is `true`; otherwise the default value (`spark.sql.defaultSizeInBytes`) is taken, which may prevent tables without the `totalSize` property from being broadcast (except Parquet tables). Fortunately, in most cases the data is written into the table by an insertion command, which saves the data size in the metastore, so it is possible to use the metastore to calculate the table statistics. ## How was this patch tested? Added a test. You can merge this pull request into a Git repository by running: $ git pull https://github.com/fjh100456/spark StatisticCommit Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22693.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22693 commit e610477063b4f326b8261d59b55abce83cbb82e7 Author: fjh100456 Date: 2018-10-11T06:43:52Z [SPARK-25701][SQL] Supports calculation of table statistics from partition's catalog statistics. ## What changes were proposed in this pull request? When obtaining table statistics, if the `totalSize` of the table is not defined, we fall back to HDFS to get the table statistics when `spark.sql.statistics.fallBackToHdfs` is `true`; otherwise the default value (`spark.sql.defaultSizeInBytes`) is taken. Fortunately, in most cases the data is written into the table by an insertion command, which saves the data size in the metastore, so it is possible to use the metastore to calculate the table statistics. ## How was this patch tested? Added a test.
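The fallback order described in the PR can be sketched as follows. This is a hedged Python sketch of the decision logic only — Spark's actual implementation is Scala analyzer code, and the function and parameter names here are hypothetical stand-ins for catalog and configuration lookups:

```python
# Hypothetical sketch of the size-estimation order this PR proposes:
# prefer the table's own totalSize, then the sum of per-partition sizes
# recorded in the metastore, then the HDFS fallback, and only last the
# configured default (spark.sql.defaultSizeInBytes).

def estimate_table_size(total_size, partition_sizes, fall_back_to_hdfs,
                        hdfs_scan, default_size):
    """All parameters are stand-ins for catalog/conf lookups."""
    if total_size is not None:
        return total_size
    if partition_sizes and all(s is not None for s in partition_sizes):
        # New behaviour: every partition carries catalog statistics,
        # so the table size is just their sum -- no HDFS scan needed.
        return sum(partition_sizes)
    if fall_back_to_hdfs:
        return hdfs_scan()   # expensive: lists files on HDFS
    return default_size

print(estimate_table_size(None, [10, 20, 30], False, lambda: 0, 2**40))  # prints 60
```

The point of the change is the second branch: because insertion commands record per-partition sizes in the metastore, the cheap summation usually succeeds and the expensive HDFS scan (or the overly pessimistic default) is avoided.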
[GitHub] spark pull request #22678: [SPARK-25685][BUILD] Allow running tests in Jenki...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/22678#discussion_r22445 --- Diff: docs/building-spark.md --- @@ -272,3 +272,31 @@ For SBT, specify a complete scala version using (e.g. 2.12.6): ./build/sbt -Dscala.version=2.12.6 Otherwise, the sbt-pom-reader plugin will use the `scala.version` specified in the spark-parent pom. + +## Running Jenkins tests with enterprise Github + +To run tests with Jenkins: + +./dev/run-tests-jenkins + +If use an individual repository or an enterprise GitHub, export below environment variables before running above command. + +### Related environment variables + + +Variable NameDefaultMeaning + + GITHUB_API_BASE + https://api.github.com/repos/apache/spark + +The GitHub server API URL. It could be pointed to an enterprise GitHub. + + + + SPARK_PROJECT_URL + https://github.com/apache/spark + +The Spark project URL of (enterprise) GitHub. --- End diff -- ditto
[GitHub] spark pull request #22678: [SPARK-25685][BUILD] Allow running tests in Jenki...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/22678#discussion_r224333028 --- Diff: docs/building-spark.md --- @@ -272,3 +272,31 @@ For SBT, specify a complete scala version using (e.g. 2.12.6): ./build/sbt -Dscala.version=2.12.6 Otherwise, the sbt-pom-reader plugin will use the `scala.version` specified in the spark-parent pom. + +## Running Jenkins tests with enterprise Github + +To run tests with Jenkins: + +./dev/run-tests-jenkins + +If use an individual repository or an enterprise GitHub, export below environment variables before running above command. + +### Related environment variables + + +Variable NameDefaultMeaning + + GITHUB_API_BASE + https://api.github.com/repos/apache/spark + +The GitHub server API URL. It could be pointed to an enterprise GitHub. --- End diff -- ditto
[GitHub] spark pull request #22678: [SPARK-25685][BUILD] Allow running tests in Jenki...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/22678#discussion_r224332828 --- Diff: docs/building-spark.md --- @@ -272,3 +272,31 @@ For SBT, specify a complete scala version using (e.g. 2.12.6): ./build/sbt -Dscala.version=2.12.6 Otherwise, the sbt-pom-reader plugin will use the `scala.version` specified in the spark-parent pom. + +## Running Jenkins tests with enterprise Github --- End diff -- nit: `enterprise Github` -> `GitHub Enterprise`
[GitHub] spark pull request #22678: [SPARK-25685][BUILD] Allow running tests in Jenki...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/22678#discussion_r224332984 --- Diff: docs/building-spark.md --- @@ -272,3 +272,31 @@ For SBT, specify a complete scala version using (e.g. 2.12.6): ./build/sbt -Dscala.version=2.12.6 Otherwise, the sbt-pom-reader plugin will use the `scala.version` specified in the spark-parent pom. + +## Running Jenkins tests with enterprise Github + +To run tests with Jenkins: + +./dev/run-tests-jenkins + +If use an individual repository or an enterprise GitHub, export below environment variables before running above command. --- End diff -- ditto
[GitHub] spark issue #22318: [SPARK-25150][SQL] Rewrite condition when deduplicate Jo...
Github user peter-toth commented on the issue: https://github.com/apache/spark/pull/22318 @srowen, I saw your last comment on https://github.com/peter-toth/spark/tree/SPARK-25150. I submitted this PR to solve that ticket, and I believe the description here explains what the real issue there is. I would appreciate your thoughts on this PR; unfortunately, it has gotten stuck a bit lately. Thanks.
[GitHub] spark issue #22674: [SPARK-25680][SQL] SQL execution listener shouldn't happ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22674 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97231/ Test FAILed.
[GitHub] spark issue #22674: [SPARK-25680][SQL] SQL execution listener shouldn't happ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22674 Merged build finished. Test FAILed.
[GitHub] spark issue #22674: [SPARK-25680][SQL] SQL execution listener shouldn't happ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22674 **[Test build #97231 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97231/testReport)** for PR 22674 at commit [`3ffa536`](https://github.com/apache/spark/commit/3ffa536f3c29f6655843a4d45c215393f51e23c9). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22664: [SPARK-25662][TEST] Refactor DataSourceReadBenchmark to ...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/22664 Could you add `[SQL]` before `[TEST]`, too?
[GitHub] spark issue #22688: [SPARK-25700][SQL] Creates ReadSupport in only Append Mo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22688 **[Test build #97238 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97238/testReport)** for PR 22688 at commit [`2a42253`](https://github.com/apache/spark/commit/2a422535451c186546a2ce3da66d422805f7db32).
[GitHub] spark issue #22688: [SPARK-25700][SQL] Creates ReadSupport in only Append Mo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22688 Merged build finished. Test PASSed.
[GitHub] spark issue #22688: [SPARK-25700][SQL] Creates ReadSupport in only Append Mo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22688 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3872/ Test PASSed.
[GitHub] spark issue #22664: [SPARK-25662][TEST] Refactor DataSourceReadBenchmark to ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22664 **[Test build #97237 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97237/testReport)** for PR 22664 at commit [`7cef8db`](https://github.com/apache/spark/commit/7cef8db25e5839277f9fec3f9585f7669caca405).
[GitHub] spark pull request #22678: [SPARK-25685][BUILD] Allow running tests in Jenki...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/22678#discussion_r224327625 --- Diff: docs/building-spark.md --- @@ -272,3 +272,31 @@ For SBT, specify a complete scala version using (e.g. 2.12.6): ./build/sbt -Dscala.version=2.12.6 Otherwise, the sbt-pom-reader plugin will use the `scala.version` specified in the spark-parent pom. + +## Running Jenkins tests with enterprise Github + +To run tests with Jenkins: + +./dev/run-tests-jenkins + +If you use an individual repository or an enterprise GitHub, you should export below environment variables before running above command. + +### Related environment variables + + +variable NameDefaultMeaning --- End diff -- `variable` -> `Variable`
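The two variables quoted in these diffs (`GITHUB_API_BASE` and `SPARK_PROJECT_URL`, with the Apache defaults shown in the docs table) are plain environment overrides exported before running `./dev/run-tests-jenkins`. A hedged Python sketch of the default-with-override lookup such scripts typically perform; the `resolve` helper is hypothetical, not Spark's actual code:

```python
import os

# Defaults copied from the quoted docs table; an enterprise GitHub
# deployment would export its own values before ./dev/run-tests-jenkins.
DEFAULTS = {
    "GITHUB_API_BASE": "https://api.github.com/repos/apache/spark",
    "SPARK_PROJECT_URL": "https://github.com/apache/spark",
}

def resolve(name, env=None):
    """Return the exported value if present, else the Apache default."""
    env = os.environ if env is None else env
    return env.get(name, DEFAULTS[name])

# Simulated enterprise override (hypothetical URL):
print(resolve("GITHUB_API_BASE",
              {"GITHUB_API_BASE": "https://ghe.example.com/api/v3/repos/org/spark"}))
```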
[GitHub] spark issue #22664: [SPARK-25662][TEST] Refactor DataSourceReadBenchmark to ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22664 **[Test build #97236 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97236/testReport)** for PR 22664 at commit [`5bccfc6`](https://github.com/apache/spark/commit/5bccfc6fcd3cfe338c619c4f549ef7b6b038c5b3). * This patch **fails Scala style tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22664: [SPARK-25662][TEST] Refactor DataSourceReadBenchmark to ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22664 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97236/ Test FAILed.
[GitHub] spark issue #22664: [SPARK-25662][TEST] Refactor DataSourceReadBenchmark to ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22664 Merged build finished. Test FAILed.
[GitHub] spark pull request #22688: [SPARK-25700][SQL] Creates ReadSupport in only Ap...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22688#discussion_r224326923 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/v2/DataSourceV2Suite.scala --- @@ -351,6 +351,21 @@ class DataSourceV2Suite extends QueryTest with SharedSQLContext { } } } + + test("SPARK-25700: do not read schema when writing in other modes except append mode") { +withTempPath { file => + val cls = classOf[SimpleWriteOnlyDataSource] + val path = file.getCanonicalPath + val df = spark.range(5).select('id as 'i, -'id as 'j) + try { +df.write.format(cls.getName).option("path", path).mode("error").save() +df.write.format(cls.getName).option("path", path).mode("overwrite").save() +df.write.format(cls.getName).option("path", path).mode("ignore").save() + } catch { +case e: SchemaReadAttemptException => fail("Schema read was attempted.", e) + } --- End diff -- Yup
[GitHub] spark issue #22664: [SPARK-25662][TEST] Refactor DataSourceReadBenchmark to ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22664 **[Test build #97236 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97236/testReport)** for PR 22664 at commit [`5bccfc6`](https://github.com/apache/spark/commit/5bccfc6fcd3cfe338c619c4f549ef7b6b038c5b3).
[GitHub] spark pull request #22688: [SPARK-25700][SQL] Creates ReadSupport in only Ap...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/22688#discussion_r224326576 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/v2/DataSourceV2Suite.scala --- @@ -351,6 +351,21 @@ class DataSourceV2Suite extends QueryTest with SharedSQLContext { } } } + + test("SPARK-25700: do not read schema when writing in other modes except append mode") { +withTempPath { file => + val cls = classOf[SimpleWriteOnlyDataSource] + val path = file.getCanonicalPath + val df = spark.range(5).select('id as 'i, -'id as 'j) + try { +df.write.format(cls.getName).option("path", path).mode("error").save() +df.write.format(cls.getName).option("path", path).mode("overwrite").save() +df.write.format(cls.getName).option("path", path).mode("ignore").save() + } catch { +case e: SchemaReadAttemptException => fail("Schema read was attempted.", e) + } --- End diff -- To validate new code path [line 250](https://github.com/apache/spark/pull/22688/files#diff-94fbd986b04087223f53697d4b6cab24R250), could you add `intercept[SchemaReadAttemptException]` and do `append`, too?
[GitHub] spark pull request #22668: [SPARK-25675] [Spark Job History] Job UI page doe...
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/22668#discussion_r224323509 --- Diff: core/src/main/scala/org/apache/spark/ui/PagedTable.scala --- @@ -154,9 +150,6 @@ private[ui] trait PagedTable[T] { * }}} */ private[ui] def pageNavigation(page: Int, pageSize: Int, totalPages: Int): Seq[Node] = { -if (totalPages == 1) { - Nil -} else { --- End diff -- One more comment: need to adjust the indent of the following code block.
[GitHub] spark pull request #22309: [SPARK-20384][SQL] Support value class in schema ...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/22309#discussion_r224318955 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/ScalaReflectionSuite.scala --- @@ -108,6 +108,16 @@ object TestingUDT { } } +object TestingValueClass { + case class IntWrapper(i: Int) extends AnyVal + case class StrWrapper(s: String) extends AnyVal + + case class ValueClassData( +intField: Int, +wrappedInt: IntWrapper, +strField: String, +wrappedStr: StrWrapper) --- End diff -- We might need a comment to describe what this class looks like in Java. It seems to have 2 int fields, `intField` and `wrappedInt`, and 2 string fields, `strField` and `wrappedStr`. I'm not sure it is the same in Scala 2.12, though.
[GitHub] spark issue #22309: [SPARK-20384][SQL] Support value class in schema of Data...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22309 Merged build finished. Test FAILed.
[GitHub] spark issue #22309: [SPARK-20384][SQL] Support value class in schema of Data...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22309 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97232/ Test FAILed.
[GitHub] spark issue #22309: [SPARK-20384][SQL] Support value class in schema of Data...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22309 **[Test build #97232 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97232/testReport)** for PR 22309 at commit [`5613217`](https://github.com/apache/spark/commit/5613217771b1929b9f66106468fd2da2c3ea7dec). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #22675: [SPARK-25347][ML][DOC] Spark datasource for image...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22675#discussion_r224322470 --- Diff: docs/ml-datasource.md --- @@ -0,0 +1,49 @@ +--- +layout: global +title: Data sources +displayTitle: Data sources +--- + +In this section, we introduce how to use data source in ML to load data. +Beside some general data sources like Parquet, CSV, JSON, JDBC, we also provide some specific data source for ML. + +**Table of Contents** + +* This will become a table of contents (this text will be scraped). +{:toc} + +## Image data source + +This image data source is used to load image files from a directory. +The loaded DataFrame has one StructType column: "image". containing image data stored as image schema. + + + +[`ImageDataSource`](api/scala/index.html#org.apache.spark.ml.source.image.ImageDataSource) +implements a Spark SQL data source API for loading image data as a DataFrame. + +{% highlight scala %} +scala> spark.read.format("image").load("data/mllib/images/origin") +res1: org.apache.spark.sql.DataFrame = [image: struct] +{% endhighlight %} + + + +[`ImageDataSource`](api/java/org/apache/spark/ml/source/image/ImageDataSource.html) +implements Spark SQL data source API for loading image data as DataFrame. +
{% highlight java %} +Dataset imagesDF = spark.read().format("image").load("data/mllib/images/origin"); --- End diff -- Can we add a simple transformation to show how the image data source can be utilized?
[GitHub] spark pull request #22675: [SPARK-25347][ML][DOC] Spark datasource for image...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22675#discussion_r224322298 --- Diff: docs/ml-datasource.md --- @@ -0,0 +1,49 @@ +--- +layout: global +title: Data sources +displayTitle: Data sources +--- + +In this section, we introduce how to use data source in ML to load data. +Beside some general data sources like Parquet, CSV, JSON, JDBC, we also provide some specific data source for ML. + +**Table of Contents** + +* This will become a table of contents (this text will be scraped). +{:toc} + +## Image data source + +This image data source is used to load image files from a directory. +The loaded DataFrame has one StructType column: "image". containing image data stored as image schema. + + + +[`ImageDataSource`](api/scala/index.html#org.apache.spark.ml.source.image.ImageDataSource) +implements a Spark SQL data source API for loading image data as a DataFrame. + +{% highlight scala %} +scala> spark.read.format("image").load("data/mllib/images/origin") +res1: org.apache.spark.sql.DataFrame = [image: struct] +{% endhighlight %} + + + +[`ImageDataSource`](api/java/org/apache/spark/ml/source/image/ImageDataSource.html) +implements Spark SQL data source API for loading image data as DataFrame. + +{% highlight java %} +Dataset imagesDF = spark.read().format("image").load("data/mllib/images/origin"); +{% endhighlight %} + + + --- End diff -- how about SQL syntax? I think we can use `CREATE TABLE tableA USING LOCATION 'data/image.png'`
[GitHub] spark pull request #22675: [SPARK-25347][ML][DOC] Spark datasource for image...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22675#discussion_r224321873 --- Diff: docs/ml-datasource.md --- @@ -0,0 +1,49 @@ +--- +layout: global +title: Data sources +displayTitle: Data sources +--- + +In this section, we introduce how to use data source in ML to load data. +Beside some general data sources like Parquet, CSV, JSON, JDBC, we also provide some specific data source for ML. + +**Table of Contents** + +* This will become a table of contents (this text will be scraped). +{:toc} + +## Image data source + +This image data source is used to load image files from a directory. +The loaded DataFrame has one StructType column: "image". containing image data stored as image schema. --- End diff -- Shall we describe which images we can load? For instance, I think this delegates to ImageIO in Java, which allows reading compressed formats like PNG or JPG into a raw image representation like BMP so that OpenCV can handle them.
[GitHub] spark pull request #22675: [SPARK-25347][ML][DOC] Spark datasource for image...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22675#discussion_r224321949 --- Diff: docs/ml-datasource.md --- @@ -0,0 +1,49 @@ +--- +layout: global +title: Data sources +displayTitle: Data sources +--- + +In this section, we introduce how to use data source in ML to load data. +Beside some general data sources like Parquet, CSV, JSON, JDBC, we also provide some specific data source for ML. + +**Table of Contents** + +* This will become a table of contents (this text will be scraped). +{:toc} + +## Image data source + +This image data source is used to load image files from a directory. +The loaded DataFrame has one StructType column: "image". containing image data stored as image schema. --- End diff -- I would also describe the schema structure and what each field means.
[GitHub] spark pull request #22675: [SPARK-25347][ML][DOC] Spark datasource for image...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22675#discussion_r224321446 --- Diff: docs/ml-datasource.md --- @@ -0,0 +1,49 @@ +--- +layout: global +title: Data sources +displayTitle: Data sources +--- + +In this section, we introduce how to use data source in ML to load data. +Beside some general data sources like Parquet, CSV, JSON, JDBC, we also provide some specific data source for ML. --- End diff -- `JSON, JDBC` -> `JSON and JDBC`
[GitHub] spark pull request #22668: [SPARK-25675] [Spark Job History] Job UI page doe...
Github user shivusondur commented on a diff in the pull request: https://github.com/apache/spark/pull/22668#discussion_r224318421 --- Diff: core/src/main/scala/org/apache/spark/ui/PagedTable.scala --- @@ -123,10 +123,9 @@ private[ui] trait PagedTable[T] { /** * Return a page navigation. * - * If the totalPages is 1, the page navigation will be empty * - * If the totalPages is more than 1, it will create a page navigation including a group of - * page numbers and a form to submit the page number. + * It will create a page navigation including a group of page numbers and a form --- End diff -- @gengliangwang @felixcheung I have updated according to your suggestions. Please check.
[GitHub] spark pull request #22685: [SQL][MINOR][Refactor] Refactor on sql/core
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22685#discussion_r224317853 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala --- @@ -96,7 +95,7 @@ case class DataSource( private val caseInsensitiveOptions = CaseInsensitiveMap(options) private val equality = sparkSession.sessionState.conf.resolver - bucketSpec.map { bucket => + bucketSpec.foreach { bucket => --- End diff -- Yea, this is a legitimate change.
[GitHub] spark pull request #22419: [SPARK-23906][SQL] Add built-in UDF TRUNCATE(numb...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/22419#discussion_r224318028 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/mathExpressions.scala --- @@ -1245,3 +1245,27 @@ case class BRound(child: Expression, scale: Expression) with Serializable with ImplicitCastInputTypes { def this(child: Expression) = this(child, Literal(0)) } + +/** + * The number truncated to scale decimal places. + */ +// scalastyle:off line.size.limit +@ExpressionDescription( + usage = "_FUNC_(number, scale) - Returns number truncated to scale decimal places. " +
"If scale is omitted, then number is truncated to 0 places. " +
"scale can be negative to truncate (make zero) scale digits left of the decimal point.", + examples = """ +Examples: + > SELECT _FUNC_(1234567891.1234567891, 4); + 1234567891.1234 + > SELECT _FUNC_(1234567891.1234567891, -4); + 123456 + > SELECT _FUNC_(1234567891.1234567891); + 1234567891 + """) +// scalastyle:on line.size.limit +case class Truncate(child: Expression, scale: Expression) --- End diff -- In that case, it's OK to handle the string as a date. How about only accepting float, double, and decimal for number truncation?
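The truncation semantics under discussion (truncate toward zero at `scale` decimal places, with a negative `scale` zeroing digits left of the decimal point) can be sketched outside Spark like this. This is a hedged Python sketch for illustration, not the Scala expression under review:

```python
from decimal import Decimal, ROUND_DOWN

def truncate(number, scale=0):
    """Truncate `number` toward zero at `scale` decimal places.

    A negative `scale` zeroes out abs(scale) digits to the left of the
    decimal point, mirroring the TRUNCATE(number, scale) semantics
    described in the quoted docstring.
    """
    d = Decimal(str(number))
    factor = Decimal(10) ** scale
    # ROUND_DOWN truncates toward zero, for negative values as well.
    return (d * factor).to_integral_value(rounding=ROUND_DOWN) / factor

print(truncate("3.14159", 2))  # prints 3.14
```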
[GitHub] spark issue #22676: [SPARK-25684][SQL] Organize header related codes in CSV ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22676 **[Test build #97235 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97235/testReport)** for PR 22676 at commit [`c504356`](https://github.com/apache/spark/commit/c504356b847e183f571a09ce5f808d4a7f229255).
[GitHub] spark issue #22676: [SPARK-25684][SQL] Organize header related codes in CSV ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22676 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3871/ Test PASSed.
[GitHub] spark issue #22676: [SPARK-25684][SQL] Organize header related codes in CSV ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22676 Merged build finished. Test PASSed.
[GitHub] spark issue #22676: [SPARK-25684][SQL] Organize header related codes in CSV ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22676 retest this please
[GitHub] spark issue #22594: [SPARK-25674][SQL] If the records are incremented by mor...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22594 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97230/ Test PASSed.
[GitHub] spark issue #22594: [SPARK-25674][SQL] If the records are incremented by mor...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22594 Merged build finished. Test PASSed.
[GitHub] spark issue #22594: [SPARK-25674][SQL] If the records are incremented by mor...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22594 **[Test build #97230 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97230/testReport)** for PR 22594 at commit [`04eba30`](https://github.com/apache/spark/commit/04eba3019fa8e05b73823c91db48a50c544e8350). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22688: [SPARK-25700][SQL] Creates ReadSupport in only Append Mo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22688 **[Test build #97234 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97234/testReport)** for PR 22688 at commit [`ded852c`](https://github.com/apache/spark/commit/ded852c3f99d9fe904a6b54691ac6c170da9a298). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22688: [SPARK-25700][SQL] Creates ReadSupport in only Append Mo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22688 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22688: [SPARK-25700][SQL] Creates ReadSupport in only Append Mo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22688 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3870/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22688: [SPARK-25700][SQL] Creates ReadSupport in only Ap...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22688#discussion_r224316297

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/v2/DataSourceV2Suite.scala ---
@@ -351,6 +351,21 @@ class DataSourceV2Suite extends QueryTest with SharedSQLContext {
       }
     }
   }
+
+  test("SPARK-25700: do not read schema when writing in other modes except append mode") {
+    withTempPath { file =>
+      val cls = classOf[SimpleWriteOnlyDataSource]
+      val path = file.getCanonicalPath
+      val df = spark.range(5).select('id as 'i, -'id as 'j)
--- End diff --

The write path looks like it requires two columns: https://github.com/apache/spark/blob/e06da95cd9423f55cdb154a2778b0bddf7be984c/sql/core/src/test/scala/org/apache/spark/sql/sources/v2/SimpleWritableDataSource.scala#L214
[GitHub] spark pull request #22688: [SPARK-25700][SQL] Creates ReadSupport in only Ap...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22688#discussion_r224316130

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/v2/DataSourceV2Suite.scala ---
@@ -351,6 +351,21 @@ class DataSourceV2Suite extends QueryTest with SharedSQLContext {
       }
     }
   }
+
+  test("SPARK-25700: do not read schema when writing in other modes except append mode") {
+    withTempPath { file =>
+      val cls = classOf[SimpleWriteOnlyDataSource]
+      val path = file.getCanonicalPath
+      val df = spark.range(5).select($"id", $"id")
--- End diff --

The write path looks like it requires two columns: https://github.com/apache/spark/blob/e06da95cd9423f55cdb154a2778b0bddf7be984c/sql/core/src/test/scala/org/apache/spark/sql/sources/v2/SimpleWritableDataSource.scala#L214
[GitHub] spark pull request #22668: [SPARK-25675] [Spark Job History] Job UI page doe...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22668#discussion_r224316034

--- Diff: core/src/main/scala/org/apache/spark/ui/PagedTable.scala ---
@@ -123,10 +123,9 @@ private[ui] trait PagedTable[T] {
   /**
    * Return a page navigation.
    *
-   * If the totalPages is 1, the page navigation will be empty
-   *
-   * If the totalPages is more than 1, it will create a page navigation including a group of
-   * page numbers and a form to submit the page number.
+   * It will create a page navigation including a group of page numbers and a form
--- End diff --

true.
[GitHub] spark issue #22688: [SPARK-25700][SQL] Creates ReadSupport in only Append Mo...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22688 I have no idea why it passes in my local environment. I fixed the test.
[GitHub] spark issue #22689: [SPARK-25697][CORE]When zstd compression enabled, InProg...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22689 **[Test build #97233 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97233/testReport)** for PR 22689 at commit [`c309f34`](https://github.com/apache/spark/commit/c309f3464522341f286fd4791d7989dcde988cac).
[GitHub] spark issue #22688: [SPARK-25700][SQL] Creates ReadSupport in only Append Mo...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22688 Hm, yeah, this passed in my local environment, so I expected it was flaky, but it seems I should fix it.
[GitHub] spark issue #22689: [SPARK-25697][CORE]When zstd compression enabled, InProg...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22689 ok to test
[GitHub] spark pull request #22681: [SPARK-25682][k8s] Package example jars in same t...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22681#discussion_r224314585

--- Diff: resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile ---
@@ -18,6 +18,7 @@
 FROM openjdk:8-alpine

 ARG spark_jars=jars
+ARG example_jars=examples/jars
--- End diff --

Could we make this optional, in case someone wants to build a smaller image without the examples?
[GitHub] spark issue #22688: [SPARK-25700][SQL] Creates ReadSupport in only Append Mo...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/22688 Seems the same test failed?
[GitHub] spark issue #22466: [SPARK-25464][SQL] Create Database to the location,only ...
Github user sandeep-katta commented on the issue: https://github.com/apache/spark/pull/22466 > The major comments are in the test cases. Could you help clean up the existing test cases? All the comments have been addressed and the test cases corrected.
[GitHub] spark issue #22688: [SPARK-25700][SQL] Creates ReadSupport in only Append Mo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22688 Merged build finished. Test FAILed.
[GitHub] spark issue #22688: [SPARK-25700][SQL] Creates ReadSupport in only Append Mo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22688 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97229/ Test FAILed.
[GitHub] spark issue #22688: [SPARK-25700][SQL] Creates ReadSupport in only Append Mo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22688 **[Test build #97229 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97229/testReport)** for PR 22688 at commit [`9377bc3`](https://github.com/apache/spark/commit/9377bc35050408512c28f47ca0535b66c4dfcaf8). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class SchemaReadAttemptException(m: String) extends RuntimeException(m)`
[GitHub] spark pull request #22678: [SPARK-25685][BUILD] Allow running tests in Jenki...
Github user LantaoJin commented on a diff in the pull request: https://github.com/apache/spark/pull/22678#discussion_r224309582

--- Diff: dev/run-tests-jenkins.py ---
@@ -39,7 +39,8 @@ def print_err(msg):
 def post_message_to_github(msg, ghprb_pull_id):
     print("Attempting to post to Github...")
-    url = "https://api.github.com/repos/apache/spark/issues/" + ghprb_pull_id + "/comments"
+    api_url = os.getenv("GITHUB_SERVER_API_URL", "https://api.github.com/repos/apache/spark")
--- End diff --

Sure. @kiszk
[GitHub] spark issue #22690: [SPARK-19287][CORE][STREAMING] JavaPairRDD flatMapValues...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22690 Merged build finished. Test PASSed.
[GitHub] spark issue #22690: [SPARK-19287][CORE][STREAMING] JavaPairRDD flatMapValues...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22690 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97226/ Test PASSed.
[GitHub] spark issue #22690: [SPARK-19287][CORE][STREAMING] JavaPairRDD flatMapValues...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22690 **[Test build #97226 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97226/testReport)** for PR 22690 at commit [`a35b54f`](https://github.com/apache/spark/commit/a35b54fbb000665a87998c14ed940316d45d). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22612: [SPARK-24958] Add executors' process tree total memory i...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22612 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97228/ Test FAILed.
[GitHub] spark issue #22612: [SPARK-24958] Add executors' process tree total memory i...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22612 Merged build finished. Test FAILed.
[GitHub] spark issue #22612: [SPARK-24958] Add executors' process tree total memory i...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22612 **[Test build #97228 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97228/testReport)** for PR 22612 at commit [`067b81d`](https://github.com/apache/spark/commit/067b81d24de7999afe5b9660e89d9a2e41de6d21). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22678: [SPARK-25685][BUILD] Allow running tests in Jenkins in e...
Github user LantaoJin commented on the issue: https://github.com/apache/spark/pull/22678 Sorry for closing the conversation mistakenly @dongjoon-hyun . I will update the documentation soon.
[GitHub] spark issue #22309: [SPARK-20384][SQL] Support value class in schema of Data...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22309 **[Test build #97232 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97232/testReport)** for PR 22309 at commit [`5613217`](https://github.com/apache/spark/commit/5613217771b1929b9f66106468fd2da2c3ea7dec).
[GitHub] spark issue #22309: [SPARK-20384][SQL] Support value class in schema of Data...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22309 ok to test
[GitHub] spark issue #22674: [SPARK-25680][SQL] SQL execution listener shouldn't happ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22674 **[Test build #97231 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97231/testReport)** for PR 22674 at commit [`3ffa536`](https://github.com/apache/spark/commit/3ffa536f3c29f6655843a4d45c215393f51e23c9).
[GitHub] spark issue #22674: [SPARK-25680][SQL] SQL execution listener shouldn't happ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22674 Merged build finished. Test PASSed.
[GitHub] spark issue #22674: [SPARK-25680][SQL] SQL execution listener shouldn't happ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22674 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3869/ Test PASSed.
[GitHub] spark issue #22309: [SPARK-20384][SQL] Support value class in schema of Data...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22309 somehow I lost track of this PR. ok to test
[GitHub] spark pull request #22309: [SPARK-20384][SQL] Support value class in schema ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22309#discussion_r224300113

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/ScalaReflectionSuite.scala ---
@@ -108,6 +108,16 @@ object TestingUDT {
   }
 }

+object TestingValueClass {
+  case class IntWrapper(i: Int) extends AnyVal
--- End diff --

Must a value class be a case class?
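For context on the question above: in Scala, a value class only needs a single public `val` constructor parameter and must extend `AnyVal`; it does not have to be a case class. A minimal sketch (the `IntWrapper` name follows the test code in the diff; everything else here is illustrative):

```scala
object ValueClassSketch {
  // A plain (non-case) value class: one public val parameter, extends AnyVal.
  class IntWrapper(val i: Int) extends AnyVal {
    def doubled: Int = i * 2
  }

  // A case class value class also works, and adds equals/hashCode/apply for free.
  case class StrWrapper(s: String) extends AnyVal

  def main(args: Array[String]): Unit = {
    val w = new IntWrapper(21)
    println(w.doubled)                           // 42
    println(StrWrapper("a") == StrWrapper("a"))  // true, via case-class equality
  }
}
```

Making the test wrapper a case class is still convenient, since it gives structural equality when comparing deserialized rows.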
[GitHub] spark pull request #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark t...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/22661#discussion_r224300031

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/JoinBenchmark.scala ---
@@ -19,229 +19,165 @@ package org.apache.spark.sql.execution.benchmark

 import org.apache.spark.sql.execution.joins._
 import org.apache.spark.sql.functions._
+import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.types.IntegerType

 /**
  * Benchmark to measure performance for aggregate primitives.
- * To run this:
- *   build/sbt "sql/test-only *benchmark.JoinBenchmark"
- *
- * Benchmarks in this file are skipped in normal builds.
+ * To run this benchmark:
+ * {{{
+ *   1. without sbt: bin/spark-submit --class --jars
+ *   2. build/sbt "sql/test:runMain "
+ *   3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain "
+ *      Results will be written to "benchmarks/JoinBenchmark-results.txt".
+ * }}}
  */
-class JoinBenchmark extends BenchmarkWithCodegen {
+object JoinBenchmark extends SqlBasedBenchmark {

-  ignore("broadcast hash join, long key") {
+  def broadcastHashJoinLongKey(): Unit = {
     val N = 20 << 20
     val M = 1 << 16

-    val dim = broadcast(sparkSession.range(M).selectExpr("id as k", "cast(id as string) as v"))
-    runBenchmark("Join w long", N) {
-      val df = sparkSession.range(N).join(dim, (col("id") % M) === col("k"))
+    val dim = broadcast(spark.range(M).selectExpr("id as k", "cast(id as string) as v"))
+    codegenBenchmark("Join w long", N) {
+      val df = spark.range(N).join(dim, (col("id") % M) === col("k"))
       assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[BroadcastHashJoinExec]).isDefined)
       df.count()
     }
-
-    /*
-    Java HotSpot(TM) 64-Bit Server VM 1.7.0_60-b19 on Mac OS X 10.9.5
-    Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
-    Join w long:                  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-    -------------------------------------------------------------------------------------
-    Join w long codegen=false          3002 / 3262          7.0         143.2       1.0X
-    Join w long codegen=true            321 /  371         65.3          15.3       9.3X
-    */
   }

-  ignore("broadcast hash join, long key with duplicates") {
+  def broadcastHashJoinLongKeyWithDuplicates(): Unit = {
     val N = 20 << 20
     val M = 1 << 16
-    val dim = broadcast(sparkSession.range(M).selectExpr("id as k", "cast(id as string) as v"))
--- End diff --

Yes
[GitHub] spark issue #22674: [SPARK-25680][SQL] SQL execution listener shouldn't happ...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22674 retest this please
[GitHub] spark issue #22692: [SPARK-25598][STREAMING][BUILD] Remove flume connector i...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22692 Merged build finished. Test PASSed.
[GitHub] spark issue #22692: [SPARK-25598][STREAMING][BUILD] Remove flume connector i...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22692 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97221/ Test PASSed.
[GitHub] spark issue #22692: [SPARK-25598][STREAMING][BUILD] Remove flume connector i...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22692 **[Test build #97221 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97221/testReport)** for PR 22692 at commit [`4b39ac3`](https://github.com/apache/spark/commit/4b39ac3500d1ee6f8b3d93f4822c6e5f36e30e3b). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19330: [SPARK-18134][SQL] Orderable MapType
Github user maropu commented on the issue: https://github.com/apache/spark/pull/19330 Thanks!
[GitHub] spark issue #19330: [SPARK-18134][SQL] Orderable MapType
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/19330 @maropu Thanks, and yes, I'm still here and can keep going if there is interest in this PR. I will update it this weekend.
[GitHub] spark issue #22692: [SPARK-25598][STREAMING][BUILD] Remove flume connector i...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22692 sounds reasonable, also cc @tdas @zsxwing @jose-torres
[GitHub] spark pull request #22259: [SPARK-25044][SQL] (take 2) Address translation o...
Github user maryannxue commented on a diff in the pull request: https://github.com/apache/spark/pull/22259#discussion_r224295469

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala ---
@@ -47,7 +48,8 @@ case class ScalaUDF(
     inputTypes: Seq[DataType] = Nil,
     udfName: Option[String] = None,
     nullable: Boolean = true,
-    udfDeterministic: Boolean = true)
+    udfDeterministic: Boolean = true,
+    nullableTypes: Seq[Boolean] = Nil)
--- End diff --

Yes, the test should not pass after removing the `isInstanceOf[KnownNotNull]` condition from the `needsNullCheck` test (https://github.com/apache/spark/pull/22259/files#diff-57b3d87be744b7d79a9beacf8e5e5eb2L2160). The idea was to add a `KnownNotNull` node on top of the original node to mark it as null-checked, so the rule won't add redundant null checks even if it is accidentally applied again. I'm not sure about the exact reason why you removed the `isInstanceOf[KnownNotNull]` condition in this PR, but I think it should be left there alongside the new nullable type check. After adding the `nullableTypes` parameter in the test, the issue can be reproduced:
```
test("SPARK-24891 Fix HandleNullInputsForUDF rule") {
  val a = testRelation.output(0)
  val func = (x: Int, y: Int) => x + y
  val udf1 = ScalaUDF(func, IntegerType, a :: a :: Nil, nullableTypes = false :: false :: Nil)
  val udf2 = ScalaUDF(func, IntegerType, a :: udf1 :: Nil, nullableTypes = false :: false :: Nil)
  val plan = Project(Alias(udf2, "")() :: Nil, testRelation)
  comparePlans(plan.analyze, plan.analyze.analyze)
}
```
BTW, I'm just curious: it looks like `nullableTypes` indicates something opposite to "nullable" as used in a schema. I would assume that when `nullableTypes` is `Seq(false)`, it means the type is not nullable and we need not add the null check, and vice versa. Did I miss something here?
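The idempotence concern raised above can be illustrated in isolation (the types and rule below are an illustrative sketch, not Spark's actual Catalyst classes): a rewrite rule that wraps inputs in null checks must mark nodes it has already processed, otherwise running the rule a second time wraps them again. A marker wrapper, playing the role of `KnownNotNull`, makes the rule a no-op on its own output:

```scala
object IdempotentRuleSketch {
  // Hypothetical mini expression tree, standing in for Catalyst expressions.
  sealed trait Expr
  case class Attr(name: String) extends Expr
  case class NullCheck(child: Expr) extends Expr // inserted by the rule
  case class Marked(child: Expr) extends Expr    // plays the role of KnownNotNull

  // The rule: wrap unmarked attributes in a null check, then mark them.
  // Marked nodes are left untouched, which is what makes the rule idempotent.
  def addNullChecks(e: Expr): Expr = e match {
    case a: Attr      => Marked(NullCheck(a))
    case m: Marked    => m // already processed: no-op
    case NullCheck(c) => NullCheck(addNullChecks(c))
  }

  def main(args: Array[String]): Unit = {
    val once  = addNullChecks(Attr("x"))
    val twice = addNullChecks(once)
    // Applying the rule to its own output changes nothing.
    assert(once == twice)
    println(once)
  }
}
```

The `comparePlans(plan.analyze, plan.analyze.analyze)` check in the test above asserts exactly this property: analyzing an already-analyzed plan must not change it.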
[GitHub] spark issue #22594: [SPARK-25674][SQL] If the records are incremented by mor...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22594 **[Test build #97230 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97230/testReport)** for PR 22594 at commit [`04eba30`](https://github.com/apache/spark/commit/04eba3019fa8e05b73823c91db48a50c544e8350).
[GitHub] spark issue #22594: [SPARK-25674][SQL] If the records are incremented by mor...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22594 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3868/ Test PASSed.
[GitHub] spark issue #22594: [SPARK-25674][SQL] If the records are incremented by mor...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22594 Merged build finished. Test PASSed.
[GitHub] spark issue #21669: [SPARK-23257][K8S] Kerberos Support for Spark on K8S
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21669 Merged build finished. Test PASSed.
[GitHub] spark issue #21669: [SPARK-23257][K8S] Kerberos Support for Spark on K8S
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21669 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97220/ Test PASSed.
[GitHub] spark issue #21669: [SPARK-23257][K8S] Kerberos Support for Spark on K8S
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21669 **[Test build #97220 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97220/testReport)** for PR 21669 at commit [`dd95fca`](https://github.com/apache/spark/commit/dd95fcab754e71e9465f4e46818c3cef09e86c8b). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22691: [SPARK-24109][CORE] Remove class SnappyOutputStreamWrapp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22691 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97222/ Test FAILed.
[GitHub] spark issue #22691: [SPARK-24109][CORE] Remove class SnappyOutputStreamWrapp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22691 Merged build finished. Test FAILed.
[GitHub] spark issue #22691: [SPARK-24109][CORE] Remove class SnappyOutputStreamWrapp...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22691 **[Test build #97222 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97222/testReport)** for PR 22691 at commit [`8850c7a`](https://github.com/apache/spark/commit/8850c7a7d563cf6bc46a84b7480b4d338d58b80f). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22688: [SPARK-25700][SQL] Creates ReadSupport in only Append Mo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22688 Merged build finished. Test PASSed.
[GitHub] spark issue #22688: [SPARK-25700][SQL] Creates ReadSupport in only Append Mo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22688 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3867/ Test PASSed.
[GitHub] spark issue #22688: [SPARK-25700][SQL] Creates ReadSupport in only Append Mo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22688 **[Test build #97229 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97229/testReport)** for PR 22688 at commit [`9377bc3`](https://github.com/apache/spark/commit/9377bc35050408512c28f47ca0535b66c4dfcaf8).
[GitHub] spark issue #22688: [SPARK-25700][SQL] Creates ReadSupport in only Append Mo...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22688 retest this please
[GitHub] spark issue #22664: [SPARK-25662][TEST] Refactor DataSourceReadBenchmark to ...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/22664 Hi, @peter-toth . Could you review and merge https://github.com/peter-toth/spark/pull/1 which contains the result on EC2 r3.xlarge?
[GitHub] spark issue #22688: [SPARK-25700][SQL] Creates ReadSupport in only Append Mo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22688 Merged build finished. Test FAILed.
[GitHub] spark issue #22688: [SPARK-25700][SQL] Creates ReadSupport in only Append Mo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22688 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97224/ Test FAILed.
[GitHub] spark issue #22688: [SPARK-25700][SQL] Creates ReadSupport in only Append Mo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22688 **[Test build #97224 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97224/testReport)** for PR 22688 at commit [`9377bc3`](https://github.com/apache/spark/commit/9377bc35050408512c28f47ca0535b66c4dfcaf8). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class SchemaReadAttemptException(m: String) extends RuntimeException(m)`
[GitHub] spark issue #22689: [SPARK-25697][CORE]When zstd compression enabled, InProg...
Github user shahidki31 commented on the issue: https://github.com/apache/spark/pull/22689 @srowen . Yes. We should read only from the finished frames of zstd. When the listener try to read from the unfinished frame, zstd input reader throws an exception (unless we make set continuous true). Currently the behavior is, it reads from the finished frames, but after that it tried to read from the unfinished frame and throws exception while loading the webui. So, the solution should be, we should not parse from the unfinished frame, and load the UI based on only the finish frames. @vanzin has good idea about the history server. Hi @vanzin , could you please give your inputs? Thanks --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org