[GitHub] spark issue #19390: [SPARK-18935][MESOS] Fix dynamic reservations on mesos
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19390 **[Test build #83548 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83548/testReport)** for PR 19390 at commit [`dbf7875`](https://github.com/apache/spark/commit/dbf787593d72afe16cadd9731a1dd1cba608451c). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19390: [SPARK-18935][MESOS] Fix dynamic reservations on ...
Github user skonto commented on a diff in the pull request: https://github.com/apache/spark/pull/19390#discussion_r149367812 --- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala --- @@ -451,15 +465,20 @@ trait MesosSchedulerUtils extends Logging { } /** Creates a mesos resource for a specific port number. */ - private def createResourcesFromPorts(portsAndRoles: List[(Long, String)]) : List[Resource] = { -portsAndRoles.flatMap{ case (port, role) => - createMesosPortResource(List((port, port)), Some(role))} + private def createResourcesFromPorts( + portsAndResourcesInfo: List[(Long, (String, AllocationInfo, Option[ReservationInfo]))]) +: List[Resource] = { +portsAndResourcesInfo.flatMap{ case (port, rInfo) => + createMesosPortResource(List((port, port)), Option(rInfo._1), --- End diff -- fixed
[GitHub] spark pull request #19640: [SPARK-16986][CORE][WEB-UI] Support configure his...
Github user jiangxb1987 commented on a diff in the pull request: https://github.com/apache/spark/pull/19640#discussion_r149365816 --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala --- @@ -2742,6 +2742,11 @@ private[spark] object Utils extends Logging { } } + def getTimeZone: TimeZone = { +val sparkConf = new SparkConf(false).loadFromSystemProperties(true) --- End diff -- Can we make `SparkConf` an input parameter instead of creating a new instance on every function call?
[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19479 **[Test build #83547 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83547/testReport)** for PR 19479 at commit [`fa338dd`](https://github.com/apache/spark/commit/fa338ddcb655f6e421b1be35fdd8dcd5cd866df0).
[GitHub] spark issue #19684: [SPARK-22461][ML] Move Spark ML model summaries into a d...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/19684 @srowen , sorry, I hadn't seen it. Thanks.
[GitHub] spark pull request #19684: [SPARK-22461][ML] Move Spark ML model summaries i...
Github user mgaido91 closed the pull request at: https://github.com/apache/spark/pull/19684
[GitHub] spark issue #19684: [SPARK-22461][ML] Move Spark ML model summaries into a d...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/19684 This duplicates https://github.com/apache/spark/pull/19680 @mgaido91
[GitHub] spark issue #19680: [SPARK-22641][ML] Refactor Spark ML model summaries
Github user srowen commented on the issue: https://github.com/apache/spark/pull/19680 Oh, this is tagged to the wrong JIRA. Should be SPARK-22461
[GitHub] spark pull request #19684: [SPARK-22461][ML] Move Spark ML model summaries i...
GitHub user mgaido91 opened a pull request: https://github.com/apache/spark/pull/19684 [SPARK-22461][ML] Move Spark ML model summaries into a dedicated package

## What changes were proposed in this pull request?

Added a common abstraction (the `Summary` trait) for all the summaries in ML and moved them all to a new `summary` package.

## How was this patch tested?

Existing UTs and manual execution of the changed examples.

You can merge this pull request into a Git repository by running: $ git pull https://github.com/mgaido91/spark SPARK-22461 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19684.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19684 commit a663b7bbcb1591cd123b95478bb9771ed3674135 Author: Marco Gaido Date: 2017-11-07T11:58:27Z [SPARK-22461][ML] Move Spark ML model summaries into a dedicated package
[GitHub] spark pull request #19631: [SPARK-22372][core, yarn] Make cluster submission...
Github user ArtRand commented on a diff in the pull request: https://github.com/apache/spark/pull/19631#discussion_r149361297 --- Diff: core/src/main/scala/org/apache/spark/SecurityManager.scala --- @@ -542,7 +496,55 @@ private[spark] class SecurityManager( * Gets the secret key. * @return the secret key as a String if authentication is enabled, otherwise returns null */ - def getSecretKey(): String = secretKey + def getSecretKey(): String = { +if (isAuthenticationEnabled) { + Option(sparkConf.getenv(ENV_AUTH_SECRET)) +.orElse(sparkConf.getOption(SPARK_AUTH_SECRET_CONF)) +.getOrElse { + throw new IllegalArgumentException( +s"A secret key must be specified via the $SPARK_AUTH_SECRET_CONF config") +} +} else { + null +} + } + + /** + * Initialize the configuration object held by this class for authentication. + * + * If authentication is disabled, do nothing. + * + * In YARN mode, generate a secret key and store it in the configuration object, setting it up to + * also be propagated to executors using an env variable. + * + * In other modes, assert that the auth secret is set in the configuration. + */ + def initializeAuth(): Unit = { +if (!sparkConf.get(NETWORK_AUTH_ENABLED)) { + return +} + +if (sparkConf.get(SparkLauncher.SPARK_MASTER, null) != "yarn") { + require(sparkConf.contains(SPARK_AUTH_SECRET_CONF), +s"A secret key must be specified via the $SPARK_AUTH_SECRET_CONF config.") + return +} + +// In YARN, force creation of a new secret if this is client mode. This ensures each --- End diff -- Is there a reason this _has_ to be unique to YARN? Will this solve the problem (in Mesos currently) where when the Executors [bootstrap](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L193) they do so without security (unless you "bake" the secret and secret config into the container image)? 
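The lookup order implemented in `getSecretKey` in the diff above (environment variable first, then the config entry, otherwise fail loudly) can be sketched in plain Python. This is an illustrative sketch, not Spark's implementation; the concrete env-variable name and config key below are assumptions for the sketch, since the diff only shows the constant names.

```python
import os

# Assumed values for the ENV_AUTH_SECRET / SPARK_AUTH_SECRET_CONF constants
# referenced in the Scala diff; treat them as placeholders.
ENV_AUTH_SECRET = "_SPARK_AUTH_SECRET"
SPARK_AUTH_SECRET_CONF = "spark.authenticate.secret"

def get_secret_key(conf, auth_enabled):
    """Mirror of the lookup order: env var, then config, else an error."""
    if not auth_enabled:
        return None
    secret = os.environ.get(ENV_AUTH_SECRET) or conf.get(SPARK_AUTH_SECRET_CONF)
    if secret is None:
        raise ValueError(
            "A secret key must be specified via the %s config"
            % SPARK_AUTH_SECRET_CONF)
    return secret
```

The `or` chain reproduces the `Option(...).orElse(...).getOrElse(...)` fallback from the Scala code: the first non-empty source wins.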
[GitHub] spark issue #19649: [SPARK-22405][SQL] Add more ExternalCatalogEvent
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19649 Merged build finished. Test PASSed.
[GitHub] spark issue #19649: [SPARK-22405][SQL] Add more ExternalCatalogEvent
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19649 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83538/ Test PASSed.
[GitHub] spark issue #19649: [SPARK-22405][SQL] Add more ExternalCatalogEvent
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19649 **[Test build #83538 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83538/testReport)** for PR 19649 at commit [`c79c314`](https://github.com/apache/spark/commit/c79c31452d285c3b36b8adda55b9daccd6cea0d4). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case class AlterTablePreEvent(` * `case class AlterTableEvent(`
[GitHub] spark issue #19674: [DO-NOT-MERGE] Investigate test failures related with SP...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19674 @ueshin, now I am getting what you meant by DST. I think you roughly knew this problem already, but let me describe it in more detail given my debugging. It looks like the problem is in `mktime`, and `mktime` appears to be platform-dependent (roughly assuming from the code and docs). I made a minimised reproducer:

```python
import time
import os
from datetime import datetime

os.environ["TZ"] = "America/Los_Angeles"
time.tzset()
time.mktime(datetime(2100, 4, 4, 4, 4, 4).timetuple())
```

On my local machine, it prints:

```
4110523444.0
```

On Ubuntu 14.04:

```
4110519844.0
```

On Jenkins, it prints:

```
4110519844.0
```

I am not sure if this is easily fixable within Spark, as it looks dependent on the Python implementation and/or the underlying C library, to my knowledge and from my reading of some docs. Could you maybe avoid this time within DST for now in your PR? I currently don't have a good idea for a simple and surgical fix.
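One hedged suggestion, not part of the PR above: when a test only needs a stable epoch value for a given tuple, `calendar.timegm` avoids the platform dependence entirely, because it interprets the tuple as UTC and never consults the local DST tables the way `time.mktime` does.

```python
import calendar
from datetime import datetime

# timegm treats the struct_time as UTC, so the result is identical on every
# platform; mktime, by contrast, applies the local zone's (possibly
# divergent) DST rules, which is what made the reproducer above differ
# between machines.
ts = calendar.timegm(datetime(2100, 4, 4, 4, 4, 4).timetuple())
print(ts)  # epoch seconds for 2100-04-04 04:04:04 UTC
```

The 25200-second (7-hour) gap between this value and the Jenkins `mktime` output is exactly the PDT offset, which is consistent with `mktime` applying `America/Los_Angeles` DST rules extrapolated to 2100.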
[GitHub] spark issue #19674: [DO-NOT-MERGE] Investigate test failures related with SP...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19674 **[Test build #83546 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83546/testReport)** for PR 19674 at commit [`d03f768`](https://github.com/apache/spark/commit/d03f7689d531a1421af2f90b6c7735f0184d8b76). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19674: [DO-NOT-MERGE] Investigate test failures related with SP...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19674 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83546/ Test FAILed.
[GitHub] spark issue #19674: [DO-NOT-MERGE] Investigate test failures related with SP...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19674 Merged build finished. Test FAILed.
[GitHub] spark issue #19208: [SPARK-21087] [ML] CrossValidator, TrainValidationSplit ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19208 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83534/ Test PASSed.
[GitHub] spark issue #19208: [SPARK-21087] [ML] CrossValidator, TrainValidationSplit ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19208 Merged build finished. Test PASSed.
[GitHub] spark issue #19208: [SPARK-21087] [ML] CrossValidator, TrainValidationSplit ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19208 **[Test build #83534 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83534/testReport)** for PR 19208 at commit [`654e4d5`](https://github.com/apache/spark/commit/654e4d580889dcd2fcf7c0bea2060349190faaac). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19674: [DO-NOT-MERGE] Investigate test failures related with SP...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19674 **[Test build #83546 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83546/testReport)** for PR 19674 at commit [`d03f768`](https://github.com/apache/spark/commit/d03f7689d531a1421af2f90b6c7735f0184d8b76).
[GitHub] spark issue #19667: [SPARK-21127][SQL][followup] fix a config name typo
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19667 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83537/ Test PASSed.
[GitHub] spark issue #19667: [SPARK-21127][SQL][followup] fix a config name typo
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19667 Merged build finished. Test PASSed.
[GitHub] spark issue #19667: [SPARK-21127][SQL][followup] fix a config name typo
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19667 **[Test build #83537 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83537/testReport)** for PR 19667 at commit [`55b949e`](https://github.com/apache/spark/commit/55b949e0e039ea981aeb69d6c3699c829071e368). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19683: [SPARK-21657][SQL] optimize explode quadratic memory con...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19683 Can one of the admins verify this patch?
[GitHub] spark pull request #19479: [SPARK-17074] [SQL] Generate equi-height histogra...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19479#discussion_r149344559 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala --- @@ -275,6 +317,118 @@ object ColumnStat extends Logging { avgLen = row.getLong(4), maxLen = row.getLong(5) ) +if (row.isNullAt(6)) { + cs +} else { + val ndvs = row.getArray(6).toLongArray() + assert(percentiles.get.length == ndvs.length + 1) + val endpoints = percentiles.get.map(_.toString.toDouble) + // Construct equi-height histogram + val buckets = ndvs.zipWithIndex.map { case (ndv, i) => +EquiHeightBucket(endpoints(i), endpoints(i + 1), ndv) + } + val nonNullRows = rowCount - cs.nullCount + val ehHistogram = EquiHeightHistogram(nonNullRows.toDouble / ndvs.length, buckets) + cs.copy(histogram = Some(ehHistogram)) +} + } + +} + +/** + * There are a few types of histograms in state-of-the-art estimation methods. E.g. equi-width + * histogram, equi-height histogram, frequency histogram (value-frequency pairs) and hybrid + * histogram, etc. + * Currently in Spark, we support equi-height histogram since it is good at handling skew + * distribution, and also provides reasonable accuracy in other cases. + * We can add other histograms in the future, which will make estimation logic more complicated. + * This is because we will have to deal with computation between different types of histograms in + * some cases, e.g. for join columns. + */ +trait Histogram + +/** + * Equi-height histogram represents column value distribution by a sequence of buckets. Each bucket + * has a value range and contains approximately the same number of rows. + * @param height number of rows in each bucket + * @param ehBuckets equi-height histogram buckets + */ +case class EquiHeightHistogram(height: Double, ehBuckets: Array[EquiHeightBucket]) + extends Histogram { + + // Only for histogram equality test. 
+ override def equals(other: Any): Boolean = other match { +case otherEHH: EquiHeightHistogram => + height == otherEHH.height && ehBuckets.sameElements(otherEHH.ehBuckets) +case _ => false + } + + override def hashCode(): Int = super.hashCode() +} + +/** + * A bucket in an equi-height histogram. We use double type for lower/higher bound for simplicity. + * @param lo lower bound of the value range in this bucket + * @param hi higher bound of the value range in this bucket + * @param ndv approximate number of distinct values in this bucket + */ +case class EquiHeightBucket(lo: Double, hi: Double, ndv: Long) + +object HistogramSerializer { + // A flag to indicate the type of histogram + val EQUI_HEIGHT_HISTOGRAM_TYPE: Byte = 1 + + /** + * Serializes a given histogram to a string. For advanced statistics like histograms, sketches, + * etc, we don't provide readability for their serialized formats in metastore (as + * string-to-string table properties). This is because it's hard or unnatural for these + * statistics to be human readable. For example, histogram is probably split into multiple + * key-value properties, instead of a single, self-described property. And for + * count-min-sketch, it's essentially unnatural to make it a readable string. + */ + final def serialize(histogram: Histogram): String = histogram match { +case h: EquiHeightHistogram => + // type + numBuckets + height + numBuckets * (lo + hi + ndv) --- End diff -- Thanks for the advice. The default number of buckets is 254. Tests showed that after compression, the serialized length is reduced by more than 50%.
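The bucket-construction step shown in the diff above (pair consecutive percentile endpoints with the per-bucket NDV array; height is non-null rows divided by the number of buckets) can be sketched in Python. Names here are illustrative, not Spark's API.

```python
from collections import namedtuple

# Mirrors EquiHeightBucket(lo, hi, ndv) from the Scala diff.
Bucket = namedtuple("Bucket", ["lo", "hi", "ndv"])

def build_equi_height_histogram(endpoints, ndvs, non_null_rows):
    # As asserted in the Scala code: one more endpoint than NDV entries,
    # since bucket i spans [endpoints[i], endpoints[i + 1]].
    assert len(endpoints) == len(ndvs) + 1
    buckets = [Bucket(endpoints[i], endpoints[i + 1], ndv)
               for i, ndv in enumerate(ndvs)]
    # "Equi-height": each bucket holds roughly the same number of rows.
    height = non_null_rows / len(ndvs)
    return height, buckets

height, buckets = build_equi_height_histogram([0.0, 10.0, 20.0], [5, 7], 100)
```

With 100 non-null rows and two buckets, each bucket's height is 50 rows, which is the invariant the estimator relies on when computing selectivity over skewed data.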
[GitHub] spark pull request #19683: [SPARK-21657][SQL] optimize explode quadratic mem...
GitHub user uzadude opened a pull request: https://github.com/apache/spark/pull/19683 [SPARK-21657][SQL] optimize explode quadratic memory consumpation

## What changes were proposed in this pull request?

The issue has been raised in two Jira tickets: [SPARK-21657](https://issues.apache.org/jira/browse/SPARK-21657), [SPARK-16998](https://issues.apache.org/jira/browse/SPARK-16998). Basically, in collection generators like explode/inline we create many rows from each input row. Currently each exploded row also contains the column from which it was created. So if, for example, one row holds a 10k-element array, that array gets copied 10k times, once into each output row; this results in quadratic memory consumption. However, it is a common case that the original column gets projected out after the explode, so we can avoid duplicating it. In this solution we propose to identify this situation in the optimizer and turn on a flag for omitting the original column in the generation process.

## How was this patch tested?

1. We added a benchmark test to MiscBenchmark that shows a 16x improvement in runtimes.
2. We ran some of the other tests in MiscBenchmark and they show 15% improvements.
3. We ran this code on a specific case from our production data with rows containing arrays of size ~200k, and it reduced the runtime from 6 hours to 3 minutes.
You can merge this pull request into a Git repository by running: $ git pull https://github.com/uzadude/spark optimize_explode Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19683.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19683 commit ce7c3694a99584348957dc756234bb667466be4e Author: oraviv Date: 2017-11-07T11:34:21Z [SPARK-21657][SQL] optimize explode quadratic memory consumpation
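The memory effect the PR describes can be seen in a toy sketch (plain Python, not Spark internals): keeping the source column on every exploded row materializes a number of cells quadratic in the array length, while omitting it when the column is projected out afterwards keeps the output linear. In Spark's row format each output row holds an actual copy of the array; the Python tuples below only hold references, so the sketch counts cells rather than bytes.

```python
def explode_keeping_source(arr):
    # Current behaviour: every output row still carries the whole array.
    return [(arr, elem) for elem in arr]

def explode_dropping_source(arr):
    # Proposed optimization when the source column is projected out anyway.
    return [(elem,) for elem in arr]

arr = list(range(1000))
# n rows, each referencing all n elements -> O(n^2) cells overall.
quadratic_cells = sum(len(row[0]) for row in explode_keeping_source(arr))
# n rows of one element each -> O(n) cells overall.
linear_cells = len(explode_dropping_source(arr))
```

For a 1000-element array this is 1,000,000 cells versus 1,000, which matches the quadratic-versus-linear growth argued in the PR description.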
[GitHub] spark issue #19657: [SPARK-22344][SPARKR] clean up install dir if running te...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19657 Merged build finished. Test PASSed.
[GitHub] spark issue #19657: [SPARK-22344][SPARKR] clean up install dir if running te...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19657 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83544/ Test PASSed.
[GitHub] spark issue #19657: [SPARK-22344][SPARKR] clean up install dir if running te...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19657 **[Test build #83544 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83544/testReport)** for PR 19657 at commit [`ca5349b`](https://github.com/apache/spark/commit/ca5349bfc0dae03c2402b104e51c78a841541b09). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19657: [SPARK-22344][SPARKR] clean up install dir if running te...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19657 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83542/ Test PASSed.
[GitHub] spark issue #19657: [SPARK-22344][SPARKR] clean up install dir if running te...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19657 Merged build finished. Test PASSed.
[GitHub] spark issue #19657: [SPARK-22344][SPARKR] clean up install dir if running te...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19657 **[Test build #83542 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83542/testReport)** for PR 19657 at commit [`31f3bd0`](https://github.com/apache/spark/commit/31f3bd06cc7d2b7bf482eddfe2f2738244cfbca7). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19479 **[Test build #83545 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83545/testReport)** for PR 19479 at commit [`804b375`](https://github.com/apache/spark/commit/804b37565b2f5d61edd492d415475a59afec41f5).
[GitHub] spark pull request #19479: [SPARK-17074] [SQL] Generate equi-height histogra...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19479#discussion_r149341693 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala --- @@ -216,65 +218,61 @@ object ColumnStat extends Logging { } } - /** - * Constructs an expression to compute column statistics for a given column. - * - * The expression should create a single struct column with the following schema: - * distinctCount: Long, min: T, max: T, nullCount: Long, avgLen: Long, maxLen: Long - * - * Together with [[rowToColumnStat]], this function is used to create [[ColumnStat]] and - * as a result should stay in sync with it. - */ - def statExprs(col: Attribute, relativeSD: Double): CreateNamedStruct = { -def struct(exprs: Expression*): CreateNamedStruct = CreateStruct(exprs.map { expr => - expr.transformUp { case af: AggregateFunction => af.toAggregateExpression() } -}) -val one = Literal(1, LongType) + private def convertToHistogram(s: String): EquiHeightHistogram = { +val idx = s.indexOf(",") +if (idx <= 0) { + throw new AnalysisException("Failed to parse histogram.") +} +val height = s.substring(0, idx).toDouble +val pattern = "Bucket\\(([^,]+), ([^,]+), ([^\\)]+)\\)".r +val buckets = pattern.findAllMatchIn(s).map { m => + EquiHeightBucket(m.group(1).toDouble, m.group(2).toDouble, m.group(3).toLong) +}.toSeq +EquiHeightHistogram(height, buckets) + } -// the approximate ndv (num distinct value) should never be larger than the number of rows -val numNonNulls = if (col.nullable) Count(col) else Count(one) -val ndv = Least(Seq(HyperLogLogPlusPlus(col, relativeSD), numNonNulls)) -val numNulls = Subtract(Count(one), numNonNulls) -val defaultSize = Literal(col.dataType.defaultSize, LongType) +} -def fixedLenTypeStruct(castType: DataType) = { - // For fixed width types, avg size should be the same as max size. 
- struct(ndv, Cast(Min(col), castType), Cast(Max(col), castType), numNulls, defaultSize, -defaultSize) -} +/** + * There are a few types of histograms in state-of-the-art estimation methods. E.g. equi-width + * histogram, equi-height histogram, frequency histogram (value-frequency pairs) and hybrid + * histogram, etc. + * Currently in Spark, we support equi-height histogram since it is good at handling skew + * distribution, and also provides reasonable accuracy in other cases. + * We can add other histograms in the future, which will make estimation logic more complicated. --- End diff -- OK. Removed the trait `Histogram`.
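The `convertToHistogram` parsing shown in this diff (split the height off at the first comma, then regex-match the repeated `Bucket(lo, hi, ndv)` groups) translates almost line-for-line. A Python sketch under the same serialized-format assumption, with illustrative names:

```python
import re

# Same pattern as the Scala diff: three comma-separated groups per bucket.
BUCKET_PATTERN = re.compile(r"Bucket\(([^,]+), ([^,]+), ([^)]+)\)")

def convert_to_histogram(s):
    idx = s.find(",")
    if idx <= 0:
        raise ValueError("Failed to parse histogram.")
    height = float(s[:idx])  # everything before the first comma
    buckets = [(float(lo), float(hi), int(ndv))
               for lo, hi, ndv in BUCKET_PATTERN.findall(s)]
    return height, buckets

h, buckets = convert_to_histogram(
    "50.0,Bucket(0.0, 10.0, 5)Bucket(10.0, 20.0, 7)")
```

Note the `[^,]+` character classes mean the bounds themselves must not contain commas, which holds because they are plain `Double` values in this format.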
[GitHub] spark pull request #19661: [SPARK-22450][Core][Mllib]safely register class f...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19661#discussion_r14934 --- Diff: core/src/main/scala/org/apache/spark/serializer/KryoSerializer.scala --- @@ -178,10 +178,40 @@ class KryoSerializer(conf: SparkConf) kryo.register(Utils.classForName("scala.collection.immutable.Map$EmptyMap$")) kryo.register(classOf[ArrayBuffer[Any]]) +// We can't load those class directly in order to avoid unnecessary jar dependencies. +// We load them safely, ignore it if the class not found. +Seq("org.apache.spark.mllib.linalg.Vector", + "org.apache.spark.mllib.linalg.DenseVector", + "org.apache.spark.mllib.linalg.SparseVector", + "org.apache.spark.mllib.linalg.Matrix", + "org.apache.spark.mllib.linalg.DenseMatrix", + "org.apache.spark.mllib.linalg.SparseMatrix", + "org.apache.spark.ml.linalg.Vector", + "org.apache.spark.ml.linalg.DenseVector", + "org.apache.spark.ml.linalg.SparseVector", + "org.apache.spark.ml.linalg.Matrix", + "org.apache.spark.ml.linalg.DenseMatrix", + "org.apache.spark.ml.linalg.SparseMatrix", + "org.apache.spark.ml.feature.Instance", + "org.apache.spark.ml.feature.OffsetInstance" +).flatMap(safeClassLoader(_)).foreach(kryo.register(_)) --- End diff -- please inline this `safeClassLoader`
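The "register only what is present" pattern discussed in this diff has a direct analogue in Python with `importlib` (an illustrative sketch, not the Kryo/Scala code): resolve each fully-qualified name defensively and silently skip the ones that cannot be loaded, so the core module carries no hard dependency on the optional package.

```python
import importlib

def safe_load_class(qualified_name):
    """Return the class for a fully-qualified name, or None if unavailable."""
    module_name, _, attr = qualified_name.rpartition(".")
    try:
        return getattr(importlib.import_module(module_name), attr)
    except (ImportError, AttributeError):
        # Module or attribute missing: the optional package is not on the
        # path, so there is nothing to register.
        return None

# One real stdlib class and one deliberately bogus name for illustration.
candidates = ["collections.OrderedDict", "nonexistent.pkg.SparseVector"]
available = [c for c in map(safe_load_class, candidates) if c is not None]
# Only the loadable classes would then be registered with the serializer.
```

This is the same shape as the Scala `flatMap(safeClassLoader(_))`: failed lookups collapse to nothing and the registration loop only ever sees classes that actually resolved.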
[GitHub] spark issue #19156: [SPARK-19634][SQL][ML][FOLLOW-UP] Improve interface of d...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19156 the SQL part LGTM
[GitHub] spark issue #19674: [DO-NOT-MERGE] Investigate test failures related with SP...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19674 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83541/ Test FAILed.
[GitHub] spark issue #19674: [DO-NOT-MERGE] Investigate test failures related with SP...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19674 Merged build finished. Test FAILed.
[GitHub] spark issue #19674: [DO-NOT-MERGE] Investigate test failures related with SP...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19674 **[Test build #83541 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83541/testReport)** for PR 19674 at commit [`3ef8ec0`](https://github.com/apache/spark/commit/3ef8ec0acb23cd6e10346dbbb6d2d6510e019ac1). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19682: [SPARK-22464] [SQL] No pushdown for Hive metastore parti...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19682 Merged build finished. Test PASSed.
[GitHub] spark issue #19682: [SPARK-22464] [SQL] No pushdown for Hive metastore parti...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19682 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83535/ Test PASSed.
[GitHub] spark issue #19682: [SPARK-22464] [SQL] No pushdown for Hive metastore parti...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19682 **[Test build #83535 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83535/testReport)** for PR 19682 at commit [`be0e276`](https://github.com/apache/spark/commit/be0e276a371319eb19467707a105d968d685a0c3). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16578 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83532/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16578 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19657: [SPARK-22344][SPARKR] clean up install dir if running te...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19657 **[Test build #83544 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83544/testReport)** for PR 19657 at commit [`ca5349b`](https://github.com/apache/spark/commit/ca5349bfc0dae03c2402b104e51c78a841541b09). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16578 **[Test build #83532 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83532/testReport)** for PR 16578 at commit [`48a509e`](https://github.com/apache/spark/commit/48a509e8602ed44a4a0fd5268d91d917bb8e0748). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19651 **[Test build #83543 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83543/testReport)** for PR 19651 at commit [`f644c6a`](https://github.com/apache/spark/commit/f644c6a88b4f24376c67028d0e927a2ee49fedbe). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19657: [SPARK-22344][SPARKR] clean up install dir if running te...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19657 **[Test build #83542 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83542/testReport)** for PR 19657 at commit [`31f3bd0`](https://github.com/apache/spark/commit/31f3bd06cc7d2b7bf482eddfe2f2738244cfbca7). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19657: [SPARK-22344][SPARKR] clean up install dir if run...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/19657#discussion_r149331874 --- Diff: R/pkg/vignettes/sparkr-vignettes.Rmd --- @@ -1183,3 +1183,24 @@ env | map ```{r, echo=FALSE} sparkR.session.stop() ``` + +```{r cleanup, include=FALSE} +# clean up if Spark was downloaded +# get0 not supported before R 3.2.0 +sparkDownloaded <- mget(".sparkDownloaded"[1L], --- End diff -- since this needs to go into 2.2, let's not add a public method for now; we can revisit this for 2.3 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19662: [SPARK-22446][SQL][ML] Declare StringIndexerModel...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19662#discussion_r149331633 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala --- @@ -75,6 +75,7 @@ abstract class Expression extends TreeNode[Expression] { * - it relies on some mutable internal state, or * - it relies on some implicit input that is not part of the children expression list. * - it has non-deterministic child or children. + * - it is an UDF that can cause runtime exception on some specific input. --- End diff -- Ok. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new O...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19651#discussion_r149331329 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala --- @@ -0,0 +1,370 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.sql.execution.datasources.orc + +import java.io.IOException + +import scala.collection.JavaConverters._ +import scala.collection.mutable.ArrayBuffer + +import org.apache.hadoop.conf.Configuration +import org.apache.hadoop.fs.{FileStatus, Path} +import org.apache.hadoop.io._ +import org.apache.orc.{OrcFile, TypeDescription} +import org.apache.orc.mapred.{OrcList, OrcMap, OrcStruct, OrcTimestamp} +import org.apache.orc.storage.common.`type`.HiveDecimal +import org.apache.orc.storage.serde2.io.{DateWritable, HiveDecimalWritable} + +import org.apache.spark.deploy.SparkHadoopUtil +import org.apache.spark.internal.Logging +import org.apache.spark.sql.SparkSession +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.analysis.Resolver +import org.apache.spark.sql.catalyst.expressions.SpecificInternalRow +import org.apache.spark.sql.catalyst.parser.CatalystSqlParser +import org.apache.spark.sql.catalyst.util._ +import org.apache.spark.sql.types._ +import org.apache.spark.unsafe.types.UTF8String + +object OrcUtils extends Logging { + + def listOrcFiles(pathStr: String, conf: Configuration): Seq[Path] = { +val origPath = new Path(pathStr) +val fs = origPath.getFileSystem(conf) +val paths = SparkHadoopUtil.get.listLeafStatuses(fs, origPath) + .filterNot(_.isDirectory) + .map(_.getPath) + .filterNot(_.getName.startsWith("_")) + .filterNot(_.getName.startsWith(".")) +paths + } + + private[orc] def readSchema(file: Path, conf: Configuration): Option[TypeDescription] = { +try { + val fs = file.getFileSystem(conf) + val readerOptions = OrcFile.readerOptions(conf).filesystem(fs) + val reader = OrcFile.createReader(file, readerOptions) + val schema = reader.getSchema + if (schema.getFieldNames.size == 0) { +None + } else { +Some(schema) + } +} catch { + case _: IOException => None +} + } + + private[orc] def readSchema(sparkSession: SparkSession, files: Seq[FileStatus]) + : Option[StructType] = { +val conf = 
sparkSession.sparkContext.hadoopConfiguration +files.map(_.getPath).flatMap(readSchema(_, conf)).headOption.map { schema => + logDebug(s"Reading schema from file $files, got Hive schema string: $schema") + CatalystSqlParser.parseDataType(schema.toString).asInstanceOf[StructType] +} + } + + private[orc] def getSchemaString(schema: StructType): String = { +schema.fields.map(f => s"${f.name}:${f.dataType.catalogString}").mkString("struct<", ",", ">") + } + + private[orc] def getTypeDescription(dataType: DataType) = dataType match { +case st: StructType => TypeDescription.fromString(getSchemaString(st)) +case _ => TypeDescription.fromString(dataType.catalogString) + } + + /** + * Return a missing schema in a give ORC file. + */ + private[orc] def getMissingSchema( + resolver: Resolver, + dataSchema: StructType, + partitionSchema: StructType, + file: Path, + conf: Configuration): Option[StructType] = { +try { + val fs = file.getFileSystem(conf) + val readerOptions = OrcFile.readerOptions(conf).filesystem(fs) + val reader = OrcFile.createReader(file, readerOptions) + val schema = reader.getSchema + if (schema.getFieldNames.size == 0) { +None + } else { +val orcSchema = if (schema.getFieldNames.asScala.forall(_.startsWith("_col"))) { +
[GitHub] spark pull request #19665: [SPARK-22376][TESTS] Makes dev/run-tests.py scrip...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19665 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19665: [SPARK-22376][TESTS] Makes dev/run-tests.py script compa...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19665 Thank you @srowen for reviewing this. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19665: [SPARK-22376][TESTS] Makes dev/run-tests.py script compa...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19665 Merged to master. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19674: [DO-NOT-MERGE] Investigate test failures related with SP...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19674 **[Test build #83541 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83541/testReport)** for PR 19674 at commit [`3ef8ec0`](https://github.com/apache/spark/commit/3ef8ec0acb23cd6e10346dbbb6d2d6510e019ac1). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19666: [SPARK-22451][ML] Reduce decision tree aggregate size fo...
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/19666 I believe that unordered features will benefit a lot from this idea; however, I have two questions:

1. I'm a little confused by 964L in `traverseUnorderedSplits`. Is it a backtracking algorithm?

```scala
dfs(binIndex + 1, combNumber, stats)
```

2. `traverseUnorderedSplits` succeeds in decoupling an abstraction from its implementation. Cool! However, I wonder whether we could write a simpler function, say, by removing `seqOp`, `finalizer`, and `T` and collecting all the logic in one place?

Anyway, thanks for your good work, @WeichenXu123. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
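For readers following the thread: the quoted recursion is indeed a backtracking-style depth-first enumeration. A toy sketch of the same idea (not Spark's actual `traverseUnorderedSplits`, which also threads aggregate stats through `seqOp`; names here are illustrative):

```scala
// Illustrative sketch: enumerate the candidate subsets for an unordered
// categorical feature by DFS, deciding per category whether it goes left.
object UnorderedSplits {
  /** Returns every non-empty subset (as a bitmask) that excludes category 0,
   *  which covers all distinct binary partitions exactly once. */
  def enumerate(numCategories: Int): Seq[Int] = {
    val result = scala.collection.mutable.ArrayBuffer.empty[Int]
    // dfs(i, mask): for each category i >= 1, branch on left vs. right.
    def dfs(i: Int, mask: Int): Unit = {
      if (i == numCategories) {
        if (mask != 0) result += mask // skip the empty split
      } else {
        dfs(i + 1, mask)              // category i goes right
        dfs(i + 1, mask | (1 << i))   // category i goes left (the backtrack point)
      }
    }
    dfs(1, 0)
    result.toSeq
  }
}
```

For `M` categories this yields `2^(M-1) - 1` candidate splits, which is why unordered features dominate the aggregate size that this PR tries to reduce.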
[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/19651 Right, @HyukjinKwon. I'll follow the final decision on this PR. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19651 @dongjoon-hyun, btw, if I understood correctly, > Note that this PR intentionally removes old ORCFileFormat to demonstrate a complete replacement. We will bring back the old ORCFileFormat and make them switchable in SPARK-20728 we don't necessarily need to remove the old `sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala` itself when this is ready for merging, right? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19674: [DO-NOT-MERGE] Investigate test failures related with SP...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19674 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83540/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19674: [DO-NOT-MERGE] Investigate test failures related with SP...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19674 **[Test build #83540 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83540/testReport)** for PR 19674 at commit [`e64a95e`](https://github.com/apache/spark/commit/e64a95e21a8b06b077369e12a9e28c5d90d507d2). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19674: [DO-NOT-MERGE] Investigate test failures related with SP...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19674 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19674: [DO-NOT-MERGE] Investigate test failures related with SP...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19674 **[Test build #83540 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83540/testReport)** for PR 19674 at commit [`e64a95e`](https://github.com/apache/spark/commit/e64a95e21a8b06b077369e12a9e28c5d90d507d2). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19664: [SPARK-22442][SQL] ScalaReflection should produce correc...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19664 good catch! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19664: [SPARK-22442][SQL] ScalaReflection should produce...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19664#discussion_r149316667 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/ScalaReflectionSuite.scala --- @@ -335,4 +338,17 @@ class ScalaReflectionSuite extends SparkFunSuite { assert(linkedHashMapDeserializer.dataType == ObjectType(classOf[LHMap[_, _]])) } + test("SPARK-22442: Generate correct field names for special characters") { +val serializer = serializerFor[SpecialCharAsFieldData](BoundReference( + 0, ObjectType(classOf[SpecialCharAsFieldData]), nullable = false)) +val deserializer = deserializerFor[SpecialCharAsFieldData] +assert(serializer.dataType(0).name == "field.1") +assert(serializer.dataType(1).name == "field 2") + +val argumentsFields = deserializer.asInstanceOf[NewInstance].arguments.flatMap { _.collect { + case UpCast(u: UnresolvedAttribute, _, _) => u.name +}} +assert(argumentsFields(0) == "`field.1`") --- End diff -- why it has backticks? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
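Context for the question: column names containing dots are ambiguous with nested-field access, so the deserializer quotes them before building the `UnresolvedAttribute`. A minimal illustration (the `Quoting.quote` helper below is hypothetical, not Spark's internal API):

```scala
// Hypothetical helper showing why "field.1" gains backticks: unquoted, the
// dot would be parsed as nested-field access (struct "field", inner field "1").
object Quoting {
  def quote(name: String): String =
    if (name.contains(".")) s"`${name.replace("`", "``")}`" else name
}
```

Names without dots pass through unchanged, which is why only `field.1` shows up backticked in the test's expected values.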
[GitHub] spark pull request #19649: [SPARK-22405][SQL] Add more ExternalCatalogEvent
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/19649#discussion_r149315115 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/events.scala --- @@ -110,7 +120,31 @@ case class RenameTableEvent( extends TableEvent /** - * Event fired when a function is created, dropped or renamed. + * Enumeration to indicate which part of table is altered. If a plain alterTable API is called, then + * type will generally be Table. + */ +object AlterTableKind extends Enumeration { + val Table, DataSchema, Stats = Value --- End diff -- I'm OK with using String, but I'd prefer a strong type to avoid nasty issues. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19664: [SPARK-22442][SQL] ScalaReflection should produce...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19664#discussion_r149316254 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala --- @@ -214,11 +215,13 @@ case class Invoke( override def eval(input: InternalRow): Any = throw new UnsupportedOperationException("Only code-generated evaluation is supported.") + private lazy val encodedFunctionName = TermName(functionName).encodedName.toString --- End diff -- does `StaticInvoke` have some issue? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
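Background on the `encodedName` call in the diff: Scala member names containing operator characters are encoded before they become JVM method names, so generated code must invoke the encoded form. A small demonstration with Scala reflection (caching the result in a `lazy val`, as the diff does, avoids re-encoding on every codegen call):

```scala
import scala.reflect.runtime.universe.TermName

// Operator characters are rewritten into JVM-legal identifiers.
val plus = TermName("+").encodedName.toString  // "$plus"
val cons = TermName("::").encodedName.toString // "$colon$colon"
```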
[GitHub] spark issue #19674: [DO-NOT-MERGE] Investigate test failures related with SP...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19674 **[Test build #83539 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83539/testReport)** for PR 19674 at commit [`2e4823d`](https://github.com/apache/spark/commit/2e4823d0991de438b82f4c274f19515d5c193075). * This patch **fails Scala style tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19674: [DO-NOT-MERGE] Investigate test failures related with SP...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19674 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83539/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19674: [DO-NOT-MERGE] Investigate test failures related with SP...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19674 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19649: [SPARK-22405][SQL] Add more ExternalCatalogEvent
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19649#discussion_r149314315 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/events.scala --- @@ -110,7 +120,31 @@ case class RenameTableEvent( extends TableEvent /** - * Event fired when a function is created, dropped or renamed. + * Enumeration to indicate which part of table is altered. If a plain alterTable API is called, then + * type will generally be Table. + */ +object AlterTableKind extends Enumeration { + val Table, DataSchema, Stats = Value --- End diff -- shall we just use string? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19674: [DO-NOT-MERGE] Investigate test failures related with SP...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19674 **[Test build #83539 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83539/testReport)** for PR 19674 at commit [`2e4823d`](https://github.com/apache/spark/commit/2e4823d0991de438b82f4c274f19515d5c193075). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19649: [SPARK-22405][SQL] Add more ExternalCatalogEvent
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19649#discussion_r149315470 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/events.scala --- @@ -110,7 +120,31 @@ case class RenameTableEvent( extends TableEvent /** - * Event fired when a function is created, dropped or renamed. + * Enumeration to indicate which part of table is altered. If a plain alterTable API is called, then + * type will generally be Table. + */ +object AlterTableKind extends Enumeration { + val Table, DataSchema, Stats = Value --- End diff -- String is better for backward compatibility, but it's easier to make mistakes with. I don't have a strong preference, cc @hvanhovell @gatorsmile --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
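To make the trade-off in this thread concrete, here is a minimal sketch (toy event class, not the actual `events.scala`): the `Enumeration` variant rejects an unknown kind at compile time, while a `String`-typed field would accept any value, including typos.

```scala
// Same shape as the diff under review.
object AlterTableKind extends Enumeration {
  val Table, DataSchema, Stats = Value
}

// Toy event carrying the kind; with a String field, a typo like "Statz"
// would compile and only fail (if ever) at runtime.
case class AlterTableEvent(database: String, name: String, kind: AlterTableKind.Value)

val event = AlterTableEvent("default", "t1", AlterTableKind.Stats)
// AlterTableEvent("default", "t1", "Statz") would be a compile error here.
```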
[GitHub] spark pull request #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new O...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19651#discussion_r149315182 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala --- @@ -0,0 +1,178 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources.orc + +import org.apache.orc.storage.ql.io.sarg.{PredicateLeaf, SearchArgument, SearchArgumentFactory} +import org.apache.orc.storage.ql.io.sarg.SearchArgument.Builder +import org.apache.orc.storage.serde2.io.HiveDecimalWritable + +import org.apache.spark.sql.sources.Filter +import org.apache.spark.sql.types._ + +/** + * Utility functions to convert Spark data source filters to ORC filters. + */ +private[orc] object OrcFilters { + + /** + * Create ORC filter as a SearchArgument instance. 
+ */ + def createFilter(schema: StructType, filters: Seq[Filter]): Option[SearchArgument] = { +val dataTypeMap = schema.map(f => f.name -> f.dataType).toMap + +val convertibleFilters = for { + filter <- filters + _ <- buildSearchArgument(dataTypeMap, filter, SearchArgumentFactory.newBuilder()) --- End diff -- This is a two-step approach which validates each individual filter is convertible. I'll add the comment of [SPARK-12218](https://github.com/apache/spark/commit/8e23d8db7f28a97e2f4394cdf9d4c4260abbd750#diff-6cac9bc2656e3782b0312dceb8c55d47R60). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
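The two-step shape being referenced can be sketched with a toy filter ADT (not Spark's `sources.Filter` or the real `SearchArgument` builder): the first pass drops any filter that cannot convert on its own, so one unconvertible leaf does not invalidate the conjunction built in the second pass.

```scala
sealed trait Filter
case class EqualTo(column: String, value: Any) extends Filter
case class Unsupported(description: String) extends Filter

// Convert a single filter, reporting failure with None.
def convert(filter: Filter): Option[String] = filter match {
  case EqualTo(c, v) => Some(s"$c = $v")
  case _             => None
}

// Step 1: keep only individually convertible filters.
// Step 2: conjoin the survivors into one predicate.
def createFilter(filters: Seq[Filter]): Option[String] = {
  val convertible = filters.filter(convert(_).isDefined)
  val parts = convertible.flatMap(convert)
  if (parts.isEmpty) None else Some(parts.mkString("(", ") AND (", ")"))
}
```

Without step 1, a single unsupported leaf anywhere in the predicate tree would force the whole pushdown to be abandoned, which is the SPARK-12218 pitfall the comment links to.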
[GitHub] spark pull request #19682: [SPARK-22464] [SQL] No pushdown for Hive metastor...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19682#discussion_r149314957 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala --- @@ -592,6 +592,19 @@ private[client] class Shim_v0_13 extends Shim_v0_12 { } } + + /** + * An extractor that matches all binary comparison operators except null-safe equality. + * + * null-safe equality is not supported by Hive metastore partition predicate pushdown + */ + object OperatorsInMetastorePartitionFPD { --- End diff -- This name looks weird... maybe just `SpecialBinaryComparison`, and explain the details in the doc comment. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
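The extractor pattern under discussion can be sketched with a toy expression tree (the real one in `HiveShim.scala` matches Catalyst expressions): it matches every binary comparison except null-safe equality (`<=>`), which Hive metastore partition predicate pushdown cannot express.

```scala
// Toy expression ADT, standing in for Catalyst's.
sealed trait Expression
case class Attr(name: String) extends Expression
case class Lit(value: Any) extends Expression
sealed trait BinaryComparison extends Expression {
  def left: Expression
  def right: Expression
}
case class EqualTo(left: Expression, right: Expression) extends BinaryComparison
case class GreaterThan(left: Expression, right: Expression) extends BinaryComparison
case class EqualNullSafe(left: Expression, right: Expression) extends BinaryComparison

// Matches any binary comparison except null-safe equality, so callers can
// push down only the comparisons the metastore understands.
object SpecialBinaryComparison {
  def unapply(e: Expression): Option[(Expression, Expression)] = e match {
    case _: EqualNullSafe    => None // <=> is deliberately excluded
    case b: BinaryComparison => Some((b.left, b.right))
    case _                   => None
  }
}
```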
[GitHub] spark issue #19649: [SPARK-22405][SQL] Add more ExternalCatalogEvent
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19649 **[Test build #83538 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83538/testReport)** for PR 19649 at commit [`c79c314`](https://github.com/apache/spark/commit/c79c31452d285c3b36b8adda55b9daccd6cea0d4). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19674: [DO-NOT-MERGE] Investigate test failures related with SP...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19674 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83536/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19674: [DO-NOT-MERGE] Investigate test failures related with SP...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19674 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19674: [DO-NOT-MERGE] Investigate test failures related with SP...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19674 **[Test build #83536 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83536/testReport)** for PR 19674 at commit [`32d4796`](https://github.com/apache/spark/commit/32d4796c5e029cdbdf54d2625365e774d05b148c). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19674: [DO-NOT-MERGE] Investigate test failures related with SP...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19674 I can directly add anything you'd like to check, @ueshin. Feel free to let me know, or, if you'd prefer, I'd also support opening another PR to investigate. Meanwhile, let me print out a few things I want to check. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19666: [SPARK-22451][ML] Reduce decision tree aggregate ...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/19666#discussion_r149313427 --- Diff: mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala --- @@ -631,6 +614,42 @@ class RandomForestSuite extends SparkFunSuite with MLlibTestSparkContext { val expected = Map(0 -> 1.0 / 3.0, 2 -> 2.0 / 3.0) assert(mapToVec(map.toMap) ~== mapToVec(expected) relTol 0.01) } + + test("traverseUnorderedSplits") { + --- End diff -- Since `traverseUnorderedSplits` is a private method, I wonder whether we can check the unordered splits on DecisionTree directly? For example, create a tiny dataset and generate a shallow tree (depth = 1?). I know such a test case is difficult (maybe impossible) to design; however, it would focus on behavior instead of implementation. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new O...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19651#discussion_r149311408 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileOperator.scala --- @@ -22,9 +22,10 @@ import org.apache.hadoop.fs.Path import org.apache.hadoop.hive.ql.io.orc.{OrcFile, Reader} import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector -import org.apache.spark.deploy.SparkHadoopUtil import org.apache.spark.internal.Logging import org.apache.spark.sql.catalyst.parser.CatalystSqlParser +import org.apache.spark.sql.execution.datasources.orc.OrcUtils +import org.apache.spark.sql.hive.HiveShim import org.apache.spark.sql.types.StructType private[hive] object OrcFileOperator extends Logging { --- End diff -- `OrcFileOperator` defines functions depending on Hive. We cannot merge these functions into `sql/core`. ``` import org.apache.hadoop.hive.ql.io.orc.{OrcFile, Reader} import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new O...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19651#discussion_r149310995 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala --- @@ -39,3 +58,134 @@ private[sql] object OrcFileFormat { names.foreach(checkFieldName) } } + +class DefaultSource extends OrcFileFormat + +/** + * New ORC File Format based on Apache ORC 1.4.1 and above. + */ +class OrcFileFormat + extends FileFormat + with DataSourceRegister + with Serializable { + + override def shortName(): String = "orc" + + override def toString: String = "ORC_1.4" + + override def hashCode(): Int = getClass.hashCode() + + override def equals(other: Any): Boolean = other.isInstanceOf[OrcFileFormat] + + override def inferSchema( + sparkSession: SparkSession, + options: Map[String, String], + files: Seq[FileStatus]): Option[StructType] = { +OrcUtils.readSchema(sparkSession, files) + } + + override def prepareWrite( + sparkSession: SparkSession, + job: Job, + options: Map[String, String], + dataSchema: StructType): OutputWriterFactory = { +val orcOptions = new OrcOptions(options, sparkSession.sessionState.conf) + +val conf = job.getConfiguration + +conf.set(MAPRED_OUTPUT_SCHEMA.getAttribute, OrcUtils.getSchemaString(dataSchema)) + +conf.set(COMPRESS.getAttribute, orcOptions.compressionCodec) + +conf.asInstanceOf[JobConf] + .setOutputFormat(classOf[org.apache.orc.mapred.OrcOutputFormat[OrcStruct]]) + +new OutputWriterFactory { + override def newInstance( + path: String, + dataSchema: StructType, + context: TaskAttemptContext): OutputWriter = { +new OrcOutputWriter(path, dataSchema, context) + } + + override def getFileExtension(context: TaskAttemptContext): String = { +val compressionExtension: String = { + val name = context.getConfiguration.get(COMPRESS.getAttribute) + OrcOptions.extensionsForCompressionCodecNames.getOrElse(name, "") +} + +compressionExtension + ".orc" + } +} + } + + override def isSplitable( + 
sparkSession: SparkSession, + options: Map[String, String], + path: Path): Boolean = { +true + } + + override def buildReaderWithPartitionValues( --- End diff -- Yep. I see. It was because I preferred to be consistent with `ParquetFileFormat` here. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new O...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19651#discussion_r149311068

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcOptions.scala ---
@@ -67,4 +67,11 @@ object OrcOptions {
     "snappy" -> "SNAPPY",
     "zlib" -> "ZLIB",
     "lzo" -> "LZO")
+
+  // The extensions for ORC compression codecs
+  val extensionsForCompressionCodecNames = Map(
--- End diff --

It's moved to OrcUtils.
[GitHub] spark pull request #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new O...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19651#discussion_r149310585

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala ---
@@ -0,0 +1,370 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.orc
+
+import java.io.IOException
+
+import scala.collection.JavaConverters._
+import scala.collection.mutable.ArrayBuffer
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.{FileStatus, Path}
+import org.apache.hadoop.io._
+import org.apache.orc.{OrcFile, TypeDescription}
+import org.apache.orc.mapred.{OrcList, OrcMap, OrcStruct, OrcTimestamp}
+import org.apache.orc.storage.common.`type`.HiveDecimal
+import org.apache.orc.storage.serde2.io.{DateWritable, HiveDecimalWritable}
+
+import org.apache.spark.deploy.SparkHadoopUtil
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.analysis.Resolver
+import org.apache.spark.sql.catalyst.expressions.SpecificInternalRow
+import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
+import org.apache.spark.sql.catalyst.util._
+import org.apache.spark.sql.types._
+import org.apache.spark.unsafe.types.UTF8String
+
+object OrcUtils extends Logging {
+
+  def listOrcFiles(pathStr: String, conf: Configuration): Seq[Path] = {
+    val origPath = new Path(pathStr)
+    val fs = origPath.getFileSystem(conf)
+    val paths = SparkHadoopUtil.get.listLeafStatuses(fs, origPath)
+      .filterNot(_.isDirectory)
+      .map(_.getPath)
+      .filterNot(_.getName.startsWith("_"))
+      .filterNot(_.getName.startsWith("."))
+    paths
+  }
+
+  private[orc] def readSchema(file: Path, conf: Configuration): Option[TypeDescription] = {
+    try {
+      val fs = file.getFileSystem(conf)
+      val readerOptions = OrcFile.readerOptions(conf).filesystem(fs)
+      val reader = OrcFile.createReader(file, readerOptions)
+      val schema = reader.getSchema
+      if (schema.getFieldNames.size == 0) {
+        None
+      } else {
+        Some(schema)
+      }
+    } catch {
+      case _: IOException => None
+    }
+  }
+
+  private[orc] def readSchema(sparkSession: SparkSession, files: Seq[FileStatus])
+    : Option[StructType] = {
+    val conf = sparkSession.sparkContext.hadoopConfiguration
+    files.map(_.getPath).flatMap(readSchema(_, conf)).headOption.map { schema =>
--- End diff --

Later, I will implement schema merging in a parallel manner like Parquet.
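The `headOption` in the diff above means schema inference stops at the first file whose footer yields a readable schema rather than merging across files, which is why parallel merging is left as follow-up work. A toy model of that behavior, with a hypothetical `readOne` standing in for the per-file `readSchema`:

```scala
// Toy model: each file either yields Some(schema) or None (as on an
// IOException or an empty field list); headOption keeps the first success,
// so no merging happens and later files never influence the result.
def readOne(fileName: String): Option[String] =
  if (fileName.endsWith(".orc")) Some(s"schema-of-$fileName") else None

val files = List("_SUCCESS", "part-00000.orc", "part-00001.orc")
val inferred: Option[String] = files.flatMap(readOne).headOption
// inferred comes from part-00000.orc alone; part-00001.orc is ignored.
```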
[GitHub] spark pull request #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new O...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19651#discussion_r149309798

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala ---
@@ -0,0 +1,370 @@ (same hunk as quoted in the previous comment, at the line:)
+    val conf = sparkSession.sparkContext.hadoopConfiguration
--- End diff --

Sure!
[GitHub] spark issue #19674: [DO-NOT-MERGE] Investigate test failures related with SP...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19674 **[Test build #83536 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83536/testReport)** for PR 19674 at commit [`32d4796`](https://github.com/apache/spark/commit/32d4796c5e029cdbdf54d2625365e774d05b148c).
[GitHub] spark issue #19667: [SPARK-21127][SQL][followup] fix a config name typo
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19667 **[Test build #83537 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83537/testReport)** for PR 19667 at commit [`55b949e`](https://github.com/apache/spark/commit/55b949e0e039ea981aeb69d6c3699c829071e368).
[GitHub] spark pull request #19662: [SPARK-22446][SQL][ML] Declare StringIndexerModel...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19662#discussion_r149308545

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala ---
@@ -75,6 +75,7 @@ abstract class Expression extends TreeNode[Expression] {
  * - it relies on some mutable internal state, or
  * - it relies on some implicit input that is not part of the children expression list.
  * - it has non-deterministic child or children.
+ * - it is an UDF that can cause runtime exception on some specific input.
--- End diff --

how about `it assumes the input satisfies some certain condition via the child operator`?
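The determinism flag matters because the optimizer is free to reorder deterministic expressions across operators. A plain-Scala sketch (hypothetical names, collections standing in for a query plan) of why a UDF that throws on some inputs must stay above the predicate that guards it:

```scala
// riskyUdf assumes its input satisfies a condition enforced by a child
// operator (here, the filter); evaluating it on unguarded data throws.
def riskyUdf(x: Int): Int = {
  require(x != 0, s"unexpected input: $x") // runtime exception on bad input
  100 / x
}

val data = Seq(0, 2, 5)
// Guarded order: the filter removes the 0 before riskyUdf ever sees it.
val safe = data.filter(_ != 0).map(riskyUdf)
// If riskyUdf were treated as freely reorderable, evaluating it below the
// filter -- data.map(riskyUdf) first -- would throw on the 0.
```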
[GitHub] spark issue #19662: [SPARK-22446][SQL][ML] Declare StringIndexerModel indexe...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19662 LGTM
[GitHub] spark issue #19674: [DO-NOT-MERGE] Investigate test failures related with SP...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19674

Mine was:

```
Before
('KST', 'KST')
After
('PST', 'PDT')
```
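The KST-to-PST flip suggests the JVM/process default timezone leaked between tests. A sketch of isolating such tests by saving and restoring the default (plain `java.util.TimeZone` API, not Spark's actual test harness):

```scala
import java.util.TimeZone

// Run a block under a temporary default timezone, always restoring the
// original afterwards so later tests still see the zone they expect.
def withDefaultTimeZone[T](zoneId: String)(body: => T): T = {
  val original = TimeZone.getDefault
  TimeZone.setDefault(TimeZone.getTimeZone(zoneId))
  try body finally TimeZone.setDefault(original)
}

val seenInside = withDefaultTimeZone("America/Los_Angeles") {
  TimeZone.getDefault.getID // the temporary zone, visible only in the block
}
```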
[GitHub] spark issue #19674: [DO-NOT-MERGE] Investigate test failures related with SP...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19674 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83533/ Test FAILed.
[GitHub] spark issue #19674: [DO-NOT-MERGE] Investigate test failures related with SP...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19674 Merged build finished. Test FAILed.
[GitHub] spark issue #19674: [DO-NOT-MERGE] Investigate test failures related with SP...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19674 **[Test build #83533 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83533/testReport)** for PR 19674 at commit [`16e0614`](https://github.com/apache/spark/commit/16e0614300b092688e1d9f09a4575918e5d7358a).

* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #19673: [SPARK-21640][SQL][PYTHON][R][FOLLOWUP] Add errorifexist...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19673

Oh, it looked as if the test and code were not printing the message out. It actually does :) (newlines inserted for readability):

```r
> write.df(createDataFrame(iris), path = "/tmp/foo", mode = "abc")
...
Error in mode : illegal argument - Unknown save mode: abc. Accepted save modes are
  'overwrite', 'append', 'ignore', 'error', 'errorifexists'.
```

Thank you for your review @felixcheung.
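A minimal sketch of the save-mode validation behind that error message (hypothetical helper; the accepted-mode list is taken verbatim from the output above):

```scala
// Validate a user-supplied save mode, rejecting anything outside the
// accepted set with the kind of message shown in the R session above.
val acceptedSaveModes = Seq("overwrite", "append", "ignore", "error", "errorifexists")

def checkSaveMode(mode: String): String = {
  if (acceptedSaveModes.contains(mode.toLowerCase)) mode
  else throw new IllegalArgumentException(
    s"Unknown save mode: $mode. Accepted save modes are " +
      acceptedSaveModes.map(m => s"'$m'").mkString(", ") + ".")
}
```

For example, `checkSaveMode("append")` passes the mode through, while `checkSaveMode("abc")` raises an `IllegalArgumentException` like the one surfaced to R.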
[GitHub] spark issue #19682: [SPARK-22464] [SQL] No pushdown for Hive metastore parti...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19682 **[Test build #83535 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83535/testReport)** for PR 19682 at commit [`be0e276`](https://github.com/apache/spark/commit/be0e276a371319eb19467707a105d968d685a0c3).
[GitHub] spark pull request #17953: [SPARK-20680][SQL] Spark-sql do not support for v...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/17953
[GitHub] spark pull request #16803: [SPARK-19458][BUILD]load hive jars from local rep...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/16803
[GitHub] spark pull request #17402: [SPARK-7200] Check that memory is not leaked in T...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/17402
[GitHub] spark pull request #14158: [SPARK-13547] [SQL] [WEBUI] Add SQL query in web ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/14158