[GitHub] spark pull request #14335: [SPARK-16697][ML][MLLib] improve LDA submitMiniBa...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/14335#discussion_r72011858 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala --- @@ -472,12 +473,13 @@ final class OnlineLDAOptimizer extends LDAOptimizer { gammaPart = gammad :: gammaPart } Iterator((stat, gammaPart)) -} +}.persist(StorageLevel.MEMORY_AND_DISK) --- End diff -- Oh, I mean, it's not always clear that it's OK to use the memory/disk, but, I think it's justified here. Yes it will serialize. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14335: [SPARK-16697][ML][MLLib] improve LDA submitMiniBa...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/14335#discussion_r72011821 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala --- @@ -472,12 +473,13 @@ final class OnlineLDAOptimizer extends LDAOptimizer { gammaPart = gammad :: gammaPart } Iterator((stat, gammaPart)) -} +}.persist(StorageLevel.MEMORY_AND_DISK) val statsSum: BDM[Double] = stats.map(_._1).treeAggregate(BDM.zeros[Double](k, vocabSize))( _ += _, _ += _) -expElogbetaBc.unpersist() val gammat: BDM[Double] = breeze.linalg.DenseMatrix.vertcat( stats.map(_._2).flatMap(list => list).collect().map(_.toDenseMatrix): _*) --- End diff -- Yes, I'm wondering why not do this in a distributed way; am I missing why it's only done serially on the driver? is it that the dense matrix class doesn't serialize or serialize well? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14336: Don't run over all numFeatures if debug log disab...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/14336#discussion_r72011689 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala --- @@ -118,9 +118,11 @@ private[spark] object RandomForest extends Logging { val splits = findSplits(retaggedInput, metadata, seed) timer.stop("findSplits") logDebug("numBins: feature: number of bins") -logDebug(Range(0, metadata.numFeatures).map { featureIndex => - s"\t$featureIndex\t${metadata.numBins(featureIndex)}" -}.mkString("\n")) +if (log.isDebugEnabled) { + logDebug(Range(0, metadata.numFeatures).map { featureIndex => --- End diff -- Also trivial, but I woulda written this as just `(0 until metadata.numFeatures.map(i => s"\t$i\t${metadata.numBins(i)}").mkString("\n")` I know it was like this already. Up to your taste. Does this actually help much though? `logDebug` is already declared to take `(msg: => String)` meaning it's only evaluated lazily. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r72011582 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1774,6 +1775,49 @@ class Analyzer( } /** + * Substitute Hints. + * - BROADCAST/BROADCASTJOIN/MAPJOIN match the closest table with the given name parameters. + * + * This rule substitutes `UnresolvedRelation`s in `Substitute` batch before `ResolveRelations` + * rule is applied. Here are two reasons. + * - To support `MetastoreRelation` in Hive module. + * - To reduce the effect of `Hint` on the other rules. + * + * After this rule, it is guaranteed that there exists no unknown `Hint` in the plan. + * All new `Hint`s should be transformed into concrete Hint classes `BroadcastHint` here. + */ + object SubstituteHints extends Rule[LogicalPlan] { --- End diff -- First of all, I'll take a look again. I need some boot-up time. :) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r72011413 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1774,6 +1775,49 @@ class Analyzer( } /** + * Substitute Hints. + * - BROADCAST/BROADCASTJOIN/MAPJOIN match the closest table with the given name parameters. + * + * This rule substitutes `UnresolvedRelation`s in `Substitute` batch before `ResolveRelations` + * rule is applied. Here are two reasons. + * - To support `MetastoreRelation` in Hive module. + * - To reduce the effect of `Hint` on the other rules. + * + * After this rule, it is guaranteed that there exists no unknown `Hint` in the plan. + * All new `Hint`s should be transformed into concrete Hint classes `BroadcastHint` here. + */ + object SubstituteHints extends Rule[LogicalPlan] { --- End diff -- For the second one, I will revise again. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14336: Don't run over all numFeatures if debug log disab...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/14336#discussion_r72011376 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala --- @@ -118,9 +118,11 @@ private[spark] object RandomForest extends Logging { val splits = findSplits(retaggedInput, metadata, seed) timer.stop("findSplits") logDebug("numBins: feature: number of bins") --- End diff -- Trivial, but why not also pull this into the block? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r72011394 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1774,6 +1775,49 @@ class Analyzer( } /** + * Substitute Hints. + * - BROADCAST/BROADCASTJOIN/MAPJOIN match the closest table with the given name parameters. + * + * This rule substitutes `UnresolvedRelation`s in `Substitute` batch before `ResolveRelations` + * rule is applied. Here are two reasons. + * - To support `MetastoreRelation` in Hive module. + * - To reduce the effect of `Hint` on the other rules. + * + * After this rule, it is guaranteed that there exists no unknown `Hint` in the plan. + * All new `Hint`s should be transformed into concrete Hint classes `BroadcastHint` here. + */ + object SubstituteHints extends Rule[LogicalPlan] { --- End diff -- I see. Since this is `SELECT hint`, which one do you mean? - `Hint(Project)` - `Project(Hint)` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14333: [SPARK-16696][ML][MLLib] destroy KMeans bcNewCenters whe...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/14333 I don't understand the last change. As far as I can see it can be destroyed inside the loop iteration. It's also possible to reuse the broadcast (declare outside the loop), and unpersist each iteration, and destroy afterwards. But I don't see the need to hold on to all these broadcasts? Yes you make a fair point about recovery. I think your changes are safe in this respect, good. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14296: [SPARK-16639][SQL] The query with having conditio...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/14296#discussion_r72011203 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1207,6 +1207,12 @@ class Analyzer( val alias = Alias(ae, ae.toString)() aggregateExpressions += alias alias.toAttribute + case ne: NamedExpression => ne --- End diff -- but `case e: Expression if grouping.exists(_.semanticEquals(e))` will filter them out right? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r72011145 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1774,6 +1775,49 @@ class Analyzer( } /** + * Substitute Hints. + * - BROADCAST/BROADCASTJOIN/MAPJOIN match the closest table with the given name parameters. + * + * This rule substitutes `UnresolvedRelation`s in `Substitute` batch before `ResolveRelations` + * rule is applied. Here are two reasons. + * - To support `MetastoreRelation` in Hive module. + * - To reduce the effect of `Hint` on the other rules. + * + * After this rule, it is guaranteed that there exists no unknown `Hint` in the plan. + * All new `Hint`s should be transformed into concrete Hint classes `BroadcastHint` here. + */ + object SubstituteHints extends Rule[LogicalPlan] { --- End diff -- then should we wrap the whole query plan with `Hint` instead of just the relation? https://github.com/apache/spark/pull/14132#discussion-diff-71820260R343 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14331: [SPARK-16691][SQL] move BucketSpec to catalyst module an...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14331 **[Test build #62798 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62798/consoleFull)** for PR 14331 at commit [`069419d`](https://github.com/apache/spark/commit/069419d4048e05f829618d124acc784cc6c01397). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14334: [SPARK-16703][SQL] Fixes window specification SQL format
Github user rxin commented on the issue: https://github.com/apache/spark/pull/14334 BTW can you change the title to say "Remove extra whitespace in SQL generation for window functions"? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14297: [SPARK-16660][SQL] CreateViewCommand should not take Cat...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14297 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14297: [SPARK-16660][SQL] CreateViewCommand should not take Cat...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14297 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62786/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14297: [SPARK-16660][SQL] CreateViewCommand should not take Cat...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14297 **[Test build #62786 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62786/consoleFull)** for PR 14297 at commit [`a5cf2b3`](https://github.com/apache/spark/commit/a5cf2b3261bd974b215ceaf82d88181a2f87f446). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14334: [SPARK-16703][SQL] Fixes window specification SQL format
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14334 **[Test build #62797 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62797/consoleFull)** for PR 14334 at commit [`5f051dc`](https://github.com/apache/spark/commit/5f051dc64b795e665a27c36786eb1d5046c8715c). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14331: [SPARK-16691][SQL] move BucketSpec to catalyst mo...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/14331#discussion_r72010784 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala --- @@ -173,8 +190,18 @@ case class CatalogTable( override def toString: String = { val tableProperties = properties.map(p => p._1 + "=" + p._2).mkString("[", ", ", "]") val partitionColumns = partitionColumnNames.map("`" + _ + "`").mkString("[", ", ", "]") -val sortColumns = sortColumnNames.map("`" + _ + "`").mkString("[", ", ", "]") -val bucketColumns = bucketColumnNames.map("`" + _ + "`").mkString("[", ", ", "]") +val bucketStrings = bucketSpec match { + case Some(BucketSpec(numBuckets, bucketColumnNames, sortColumnNames)) => +val bucketColumnsString = bucketColumnNames.map("`" + _ + "`").mkString("[", ", ", "]") +val sortColumnsString = sortColumnNames.map("`" + _ + "`").mkString("[", ", ", "]") --- End diff -- which `quoteIdentifier`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13077: [SPARK-10748] [Mesos] Log error instead of crashi...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/13077#discussion_r72010691 --- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterScheduler.scala --- @@ -502,41 +502,53 @@ private[spark] class MesosClusterScheduler( } else { val offer = offerOption.get val taskId = TaskID.newBuilder().setValue(submission.submissionId).build() -val (remainingResources, cpuResourcesToUse) = - partitionResources(offer.resources, "cpus", driverCpu) -val (finalResources, memResourcesToUse) = - partitionResources(remainingResources.asJava, "mem", driverMem) -val commandInfo = buildDriverCommand(submission) -val appName = submission.schedulerProperties("spark.app.name") -val taskInfo = TaskInfo.newBuilder() - .setTaskId(taskId) - .setName(s"Driver for $appName") - .setSlaveId(offer.slaveId) - .setCommand(commandInfo) - .addAllResources(cpuResourcesToUse.asJava) - .addAllResources(memResourcesToUse.asJava) -offer.resources = finalResources.asJava - submission.schedulerProperties.get("spark.mesos.executor.docker.image").foreach { image => - val container = taskInfo.getContainerBuilder() - val volumes = submission.schedulerProperties -.get("spark.mesos.executor.docker.volumes") -.map(MesosSchedulerBackendUtil.parseVolumesSpec) - val portmaps = submission.schedulerProperties -.get("spark.mesos.executor.docker.portmaps") -.map(MesosSchedulerBackendUtil.parsePortMappingsSpec) - MesosSchedulerBackendUtil.addDockerInfo( -container, image, volumes = volumes, portmaps = portmaps) - taskInfo.setContainer(container.build()) +var commandInfo: CommandInfo = null +try { + commandInfo = buildDriverCommand(submission) +} catch { + case e: SparkException => +afterLaunchCallback(submission.submissionId) +finishedDrivers += new MesosClusterSubmissionState(submission, taskId, + SlaveID.newBuilder().setValue("").build(), None, null, None) +logError(s"Failed to launch the driver with id: ${submission.submissionId}, " + + s"cpu: $driverCpu, mem: $driverMem, reason: ${e.getMessage}") +} +if (commandInfo != null) { --- End diff -- Rather than make a big long if statement, is it simpler and clearer to just return from the method in the catch block? then you can even ... ``` val commandInfo = try { buildDriverCommand(submission) } ... ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14339: [SPARK-16698][SQL] Field names having dots should be all...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14339 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14339: [SPARK-16698][SQL] Field names having dots should be all...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14339 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62787/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14339: [SPARK-16698][SQL] Field names having dots should be all...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14339 **[Test build #62787 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62787/consoleFull)** for PR 14339 at commit [`f5c566c`](https://github.com/apache/spark/commit/f5c566cf2609dfbbc56b76276becd9976b34f2dd). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14334: [SPARK-16703][SQL] Fixes window specification SQL format
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/14334 @rxin @dongjoon-hyun Thanks! JIRA ticket added and updated the golden files. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r72010521 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1774,6 +1775,49 @@ class Analyzer( } /** + * Substitute Hints. + * - BROADCAST/BROADCASTJOIN/MAPJOIN match the closest table with the given name parameters. + * + * This rule substitutes `UnresolvedRelation`s in `Substitute` batch before `ResolveRelations` + * rule is applied. Here are two reasons. + * - To support `MetastoreRelation` in Hive module. + * - To reduce the effect of `Hint` on the other rules. + * + * After this rule, it is guaranteed that there exists no unknown `Hint` in the plan. + * All new `Hint`s should be transformed into concrete Hint classes `BroadcastHint` here. + */ + object SubstituteHints extends Rule[LogicalPlan] { --- End diff -- Hi, @cloud-fan . Thank you again. Yes. That's true. So, at my first commit of this PR, I actually implemented to attach just the concrete `BroadcastHint` to the relations in `AstBuilder.scala`. But, there were some advices to - Move out the logic from the parser module (for preventing future updates on parser module) - Generalize the hint syntax like the current code. Since a general `Hint` is attached on a logical plan not a table relation, we can substitute any other plans (filter/project/aggregator), too. The following is my previous comment about this situation. The child of hint is `JOIN`. > Yep. I grasp your point. First, the given relation is just a logical plan, not a table. In the following case, the relation is Join. > SELECT /*+ MAPJOIN(parquet_t0) */ * FROM parquet_t0, parquet_t1 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14257: [SPARK-16621][SQL][WIP] Use a stable ID generation metho...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14257 **[Test build #62796 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62796/consoleFull)** for PR 14257 at commit [`06fb99c`](https://github.com/apache/spark/commit/06fb99c9485f353c166197d75f1f39a74cde3262). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14296: [SPARK-16639][SQL] The query with having conditio...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/14296#discussion_r72009959 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1207,6 +1207,12 @@ class Analyzer( val alias = Alias(ae, ae.toString)() aggregateExpressions += alias alias.toAttribute + case ne: NamedExpression => ne --- End diff -- I think we don't need to replace named expr (attributes, alias)? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14332: [SPARK-16694] [CORE] Use for/foreach rather than ...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/14332#discussion_r72009839 --- Diff: examples/src/main/scala/org/apache/spark/examples/ml/DataFrameExample.scala --- @@ -54,14 +54,13 @@ object DataFrameExample { } } -parser.parse(args, defaultParams).map { params => - run(params) -}.getOrElse { - sys.exit(1) +parser.parse(args, defaultParams) match { + case Some(params) => run(params) + case _ => sys.exit(1) } --- End diff -- Yes, `Option` has `foreach` as well and would be fine if the use case here were also to "do _X_ if the `Option` exists" but here we also need "... or do _Y_ if it doesn't". I'm not sure it's that `Unit` is special or not, but that it signals there is no meaningful result, which highlights the real issue: the only purpose is the _side effect_. The difference between transforming a return value with `fold` or whatever and then exiting with it, and using these constructs to exit, is precisely the difference here. You are arguing that side-effects are neither here nor there when thinking about this from a theoretical _functional programming_ perspective? can't agree with that at all! Transform without side effects? makes total sense for FP idioms. "Transforms" with no result and only side effects? not so much. The blog you link makes the point that `fold` is concise, but less readable, and I agree, and I think that's a negative. The Scala API doc shows `match` in an example where side-effects are desired (`println`) and not for transformation. How about `if-else`? I do not particularly think `match` is necessary. It's a good discussion but for consistency and readability I think this change (or to `if-else`) is positive. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14331: [SPARK-16691][SQL] move BucketSpec to catalyst module an...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/14331 Overall LGTM, a few minor comments. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14331: [SPARK-16691][SQL] move BucketSpec to catalyst mo...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/14331#discussion_r72009811 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/sources/BucketedReadSuite.scala --- @@ -20,10 +20,11 @@ package org.apache.spark.sql.sources import java.io.File import org.apache.spark.sql._ +import org.apache.spark.sql.catalyst.catalog.BucketSpec import org.apache.spark.sql.catalyst.expressions._ import org.apache.spark.sql.catalyst.plans.physical.HashPartitioning import org.apache.spark.sql.execution.DataSourceScanExec -import org.apache.spark.sql.execution.datasources.{BucketSpec, DataSourceStrategy} +import org.apache.spark.sql.execution.datasources.{DataSourceStrategy} --- End diff -- Nit: Remove the braces. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14331: [SPARK-16691][SQL] move BucketSpec to catalyst mo...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/14331#discussion_r72009802 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala --- @@ -110,6 +110,25 @@ case class CatalogTablePartition( /** + * A container for bucketing information. + * Bucketing is a technology for decomposing data sets into more manageable parts, and the number + * of buckets is fixed so it does not fluctuate with data. + * + * @param numBuckets number of buckets. + * @param bucketColumnNames the names of the columns that used to generate the bucket id. + * @param sortColumnNames the names of the columns that used to sort data in each bucket. + */ +case class BucketSpec( +numBuckets: Int, +bucketColumnNames: Seq[String], +sortColumnNames: Seq[String]) { + if (numBuckets <= 0) { +throw new AnalysisException(s"Expected positive number of buckets, but got `$numBuckets`.") + } +} + + --- End diff -- Nit: Remove extra newline. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14331: [SPARK-16691][SQL] move BucketSpec to catalyst mo...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/14331#discussion_r72009804 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala --- @@ -173,8 +190,18 @@ case class CatalogTable( override def toString: String = { val tableProperties = properties.map(p => p._1 + "=" + p._2).mkString("[", ", ", "]") val partitionColumns = partitionColumnNames.map("`" + _ + "`").mkString("[", ", ", "]") -val sortColumns = sortColumnNames.map("`" + _ + "`").mkString("[", ", ", "]") -val bucketColumns = bucketColumnNames.map("`" + _ + "`").mkString("[", ", ", "]") +val bucketStrings = bucketSpec match { + case Some(BucketSpec(numBuckets, bucketColumnNames, sortColumnNames)) => +val bucketColumnsString = bucketColumnNames.map("`" + _ + "`").mkString("[", ", ", "]") +val sortColumnsString = sortColumnNames.map("`" + _ + "`").mkString("[", ", ", "]") --- End diff -- Use `quoteIdentifier` instead? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14340: [SPARK-16534][Streaming][Kafka] Add Python API support f...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14340 **[Test build #62795 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62795/consoleFull)** for PR 14340 at commit [`38a89f6`](https://github.com/apache/spark/commit/38a89f68dac408453ad8cdb871d2b64f23d758e1). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14342: [SPARK-16685] Remove audit-release scripts.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14342 **[Test build #62794 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62794/consoleFull)** for PR 14342 at commit [`c11042e`](https://github.com/apache/spark/commit/c11042e287dae9622a3d3ead8ea40833ad3c6078). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14296: [SPARK-16639][SQL] The query with having conditio...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/14296#discussion_r72009643 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1207,6 +1207,12 @@ class Analyzer( val alias = Alias(ae, ae.toString)() aggregateExpressions += alias alias.toAttribute + case ne: NamedExpression => ne + case e: Expression if grouping.exists(_.semanticEquals(e)) && + !ResolveGroupingAnalytics.hasGroupingFunction(e) => --- End diff -- It will be greate if we can figure it out and add comment here. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14296: [SPARK-16639][SQL] The query with having conditio...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/14296#discussion_r72009599 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala --- @@ -46,6 +46,14 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext { Row("one", 6) :: Row("three", 3) :: Nil) } + test("having condition contains grouping column") { +Seq(("one", 1), ("two", 2), ("three", 3), ("one", 5)).toDF("k", "v") --- End diff -- wrap it with `withTempView` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14296: [SPARK-16639][SQL] The query with having conditio...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/14296#discussion_r72009571 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1207,6 +1207,12 @@ class Analyzer( val alias = Alias(ae, ae.toString)() aggregateExpressions += alias alias.toAttribute + case ne: NamedExpression => ne --- End diff -- why this case? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14342: [SPARK-16685] Remove audit-release scripts.
GitHub user rxin opened a pull request: https://github.com/apache/spark/pull/14342 [SPARK-16685] Remove audit-release scripts. ## What changes were proposed in this pull request? This patch removes dev/audit-release. It was initially created to do basic release auditing, but we haven't run them for the last one year+. ## How was this patch tested? N/A You can merge this pull request into a Git repository by running: $ git pull https://github.com/rxin/spark SPARK-16685 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14342.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14342 commit c11042e287dae9622a3d3ead8ea40833ad3c6078 Author: Reynold XinDate: 2016-07-25T05:13:34Z Remove audit-release scripts. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14333: [SPARK-16696][ML][MLLib] destroy KMeans bcNewCenters whe...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14333 **[Test build #62793 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62793/consoleFull)** for PR 14333 at commit [`01f4d3a`](https://github.com/apache/spark/commit/01f4d3ab2e4c591b7b27b1ed92fbe07e1fc876d1). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14327: [SPARK-16686][SQL] Project shouldn't be pushed do...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/14327#discussion_r72009385 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala --- @@ -150,11 +150,13 @@ class SimpleTestOptimizer extends Optimizer( /** * Pushes projects down beneath Sample to enable column pruning with sampling. + * This rule is only doable when the projects don't add new attributes. */ object PushProjectThroughSample extends Rule[LogicalPlan] { --- End diff -- yah. I will update this. At least one optimizer test uses this rule. The test should be changed too. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14339: [SPARK-16698][SQL] Field names having dots should be all...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14339 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14339: [SPARK-16698][SQL] Field names having dots should be all...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14339 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62785/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14339: [SPARK-16698][SQL] Field names having dots should be all...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14339 **[Test build #62785 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62785/consoleFull)** for PR 14339 at commit [`cd5d04a`](https://github.com/apache/spark/commit/cd5d04a2661dc2e56b517478200a074ac075dec1). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14313: [SPARK-16674][SQL] Avoid per-record type dispatch in JDB...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14313 **[Test build #62792 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62792/consoleFull)** for PR 14313 at commit [`a385318`](https://github.com/apache/spark/commit/a3853182b375539bd329fb10464c1a746d4eaa47). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13077: [SPARK-10748] [Mesos] Log error instead of crashing Spar...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13077 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62783/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14337: [SPARK-16699][SQL] Fix performance bug in hash ag...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/14337 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13077: [SPARK-10748] [Mesos] Log error instead of crashing Spar...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13077 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14337: [SPARK-16699][SQL] Fix performance bug in hash aggregate...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/14337 @ooq next time please open a pull request against master branch rather than 2.0. Thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13077: [SPARK-10748] [Mesos] Log error instead of crashing Spar...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13077 **[Test build #62783 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62783/consoleFull)** for PR 13077 at commit [`ae85f81`](https://github.com/apache/spark/commit/ae85f8154e506c223a018d9c04c58967a82fa580). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14341: [Minor][Doc][SQL] Fix two documents regarding size in by...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/14341 LGTM --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14337: [SPARK-16699][SQL] Fix performance bug in hash aggregate...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/14337 Merging in master/2.0. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14339: [SPARK-16698][SQL] Field names having dots should be all...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14339 **[Test build #62791 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62791/consoleFull)** for PR 14339 at commit [`a3bb8d8`](https://github.com/apache/spark/commit/a3bb8d821b19c85261e1601fb4a2b92dcfc1814f). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14327: [SPARK-16686][SQL] Project shouldn't be pushed do...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/14327#discussion_r72008673 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala --- @@ -150,11 +150,13 @@ class SimpleTestOptimizer extends Optimizer( /** * Pushes projects down beneath Sample to enable column pruning with sampling. + * This rule is only doable when the projects don't add new attributes. */ object PushProjectThroughSample extends Rule[LogicalPlan] { --- End diff -- the last case in `ColumnPruning`, it will generate a new `Project` under `Sample` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14341: [Minor][Doc][SQL] Fix two documents regarding size in by...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/14341 cc @cloud-fan @liancheng Just minor document changes. Please review if this change is proper. Thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14339: [SPARK-16698][SQL] Field names having dots should be all...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/14339 LGTM, cc @liancheng to take another look --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14327: [SPARK-16686][SQL] Project shouldn't be pushed do...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/14327#discussion_r72008576 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala --- @@ -150,11 +150,13 @@ class SimpleTestOptimizer extends Optimizer( /** * Pushes projects down beneath Sample to enable column pruning with sampling. + * This rule is only doable when the projects don't add new attributes. */ object PushProjectThroughSample extends Rule[LogicalPlan] { --- End diff -- Not sure which part you mean? I don't see `ColumnPruning` handling `Sample`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14339: [SPARK-16698][SQL] Field names having dots should...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/14339#discussion_r72008560 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala --- @@ -2982,4 +2982,19 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext { """.stripMargin), Nil) } } + + test("SPARK-16674: field names containing dots for both fields and partitioned fields") { +withTempPath { path => + val data = (1 to 10).map(i => (i, s"data-$i", i % 2, if ((i % 2) == 0) "a" else "b")) +.toDF("col.1", "col.2", "part.col1", "part.col2") + data.write +.format("parquet") +.partitionBy("part.col1", "part.col2") +.save(path.getCanonicalPath) + val copyData = spark.read.format("parquet").load(path.getCanonicalPath) --- End diff -- `copyData` -> `readBack`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14341: [Minor][Doc][SQL] Fix two documents regarding size in by...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14341 **[Test build #62790 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62790/consoleFull)** for PR 14341 at commit [`2b555f4`](https://github.com/apache/spark/commit/2b555f4393482c16424dfb743f86f56105d8fe7d). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14313: [SPARK-16674][SQL] Avoid per-record type dispatch in JDB...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/14313 LGTM except some style comments, thanks for working on it! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14313: [SPARK-16674][SQL] Avoid per-record type dispatch...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/14313#discussion_r72008447 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala --- @@ -322,46 +322,133 @@ private[sql] class JDBCRDD( } } - // Each JDBC-to-Catalyst conversion corresponds to a tag defined here so that - // we don't have to potentially poke around in the Metadata once for every - // row. - // Is there a better way to do this? I'd rather be using a type that - // contains only the tags I define. - abstract class JDBCConversion - case object BooleanConversion extends JDBCConversion - case object DateConversion extends JDBCConversion - case class DecimalConversion(precision: Int, scale: Int) extends JDBCConversion - case object DoubleConversion extends JDBCConversion - case object FloatConversion extends JDBCConversion - case object IntegerConversion extends JDBCConversion - case object LongConversion extends JDBCConversion - case object BinaryLongConversion extends JDBCConversion - case object StringConversion extends JDBCConversion - case object TimestampConversion extends JDBCConversion - case object BinaryConversion extends JDBCConversion - case class ArrayConversion(elementConversion: JDBCConversion) extends JDBCConversion + // A `JDBCConversion` is responsible for converting and setting a value from `ResultSet` + // into a field for `MutableRow`. The last argument `Int` means the index for the + // value to be set in the row and also used for the value to retrieve from `ResultSet`. + private type JDBCConversion = (ResultSet, MutableRow, Int) => Unit /** - * Maps a StructType to a type tag list. + * Maps a StructType to conversions for each type. */ def getConversions(schema: StructType): Array[JDBCConversion] = schema.fields.map(sf => getConversions(sf.dataType, sf.metadata)) private def getConversions(dt: DataType, metadata: Metadata): JDBCConversion = dt match { -case BooleanType => BooleanConversion -case DateType => DateConversion -case DecimalType.Fixed(p, s) => DecimalConversion(p, s) -case DoubleType => DoubleConversion -case FloatType => FloatConversion -case IntegerType => IntegerConversion -case LongType => if (metadata.contains("binarylong")) BinaryLongConversion else LongConversion -case StringType => StringConversion -case TimestampType => TimestampConversion -case BinaryType => BinaryConversion -case ArrayType(et, _) => ArrayConversion(getConversions(et, metadata)) +case BooleanType => + (rs: ResultSet, row: MutableRow, pos: Int) => +row.setBoolean(pos, rs.getBoolean(pos + 1)) + +case DateType => + (rs: ResultSet, row: MutableRow, pos: Int) => +// DateTimeUtils.fromJavaDate does not handle null value, so we need to check it. +val dateVal = rs.getDate(pos + 1) +if (dateVal != null) { + row.setInt(pos, DateTimeUtils.fromJavaDate(dateVal)) +} else { + row.update(pos, null) +} + +// When connecting with Oracle DB through JDBC, the precision and scale of BigDecimal +// object returned by ResultSet.getBigDecimal is not correctly matched to the table +// schema reported by ResultSetMetaData.getPrecision and ResultSetMetaData.getScale. +// If inserting values like 1 into a column with NUMBER(12, 2) type, you get through +// a BigDecimal object with scale as 0. But the dataframe schema has correct type as +// DecimalType(12, 2). Thus, after saving the dataframe into parquet file and then +// retrieve it, you will get wrong result 199.99. +// So it is needed to set precision and scale for Decimal based on JDBC metadata. +case DecimalType.Fixed(p, s) => + (rs: ResultSet, row: MutableRow, pos: Int) => +val decimal = + nullSafeConvert[java.math.BigDecimal](rs.getBigDecimal(pos + 1), d => Decimal(d, p, s)) +row.update(pos, decimal) + +case DoubleType => + (rs: ResultSet, row: MutableRow, pos: Int) => +row.setDouble(pos, rs.getDouble(pos + 1)) + +case FloatType => + (rs: ResultSet, row: MutableRow, pos: Int) => +row.setFloat(pos, rs.getFloat(pos + 1)) + +case IntegerType => + (rs: ResultSet, row: MutableRow, pos: Int) => +row.setInt(pos, rs.getInt(pos + 1)) + +case LongType if metadata.contains("binarylong") => + (rs: ResultSet, row: MutableRow, pos: Int) => +val bytes = rs.getBytes(pos + 1) +var ans = 0L +var j = 0 +
[GitHub] spark pull request #14341: [Minor][Doc][SQL] Fix two documents regarding siz...
GitHub user viirya opened a pull request: https://github.com/apache/spark/pull/14341 [Minor][Doc][SQL] Fix two documents regarding size in bytes ## What changes were proposed in this pull request? Fix two places in SQLConf documents regarding size in bytes and statistics. ## How was this patch tested? No. Just change document. You can merge this pull request into a Git repository by running: $ git pull https://github.com/viirya/spark-1 fix-doc-size-in-bytes Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14341.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14341 commit 2b555f4393482c16424dfb743f86f56105d8fe7d Author: Liang-Chi HsiehDate: 2016-07-25T04:41:27Z Fix doc. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14313: [SPARK-16674][SQL] Avoid per-record type dispatch...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/14313#discussion_r72008403 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala --- @@ -322,46 +322,133 @@ private[sql] class JDBCRDD( } } - // Each JDBC-to-Catalyst conversion corresponds to a tag defined here so that - // we don't have to potentially poke around in the Metadata once for every - // row. - // Is there a better way to do this? I'd rather be using a type that - // contains only the tags I define. - abstract class JDBCConversion - case object BooleanConversion extends JDBCConversion - case object DateConversion extends JDBCConversion - case class DecimalConversion(precision: Int, scale: Int) extends JDBCConversion - case object DoubleConversion extends JDBCConversion - case object FloatConversion extends JDBCConversion - case object IntegerConversion extends JDBCConversion - case object LongConversion extends JDBCConversion - case object BinaryLongConversion extends JDBCConversion - case object StringConversion extends JDBCConversion - case object TimestampConversion extends JDBCConversion - case object BinaryConversion extends JDBCConversion - case class ArrayConversion(elementConversion: JDBCConversion) extends JDBCConversion + // A `JDBCConversion` is responsible for converting and setting a value from `ResultSet` --- End diff -- `JDBCConversion` seems not a good name now, do you have better any ideas? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14327: [SPARK-16686][SQL] Project shouldn't be pushed do...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/14327#discussion_r72008303 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala --- @@ -150,11 +150,13 @@ class SimpleTestOptimizer extends Optimizer( /** * Pushes projects down beneath Sample to enable column pruning with sampling. + * This rule is only doable when the projects don't add new attributes. */ object PushProjectThroughSample extends Rule[LogicalPlan] { --- End diff -- ah, looks like `ColumnPruning` already handles it, can we just remove this rule? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14327: [SPARK-16686][SQL] Project shouldn't be pushed do...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/14327#discussion_r72008210 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala --- @@ -150,11 +150,13 @@ class SimpleTestOptimizer extends Optimizer( /** * Pushes projects down beneath Sample to enable column pruning with sampling. + * This rule is only doable when the projects don't add new attributes. */ object PushProjectThroughSample extends Rule[LogicalPlan] { --- End diff -- should we merge this rule into `ColumnPruning`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14340: [SPARK-16534][Streaming][Kafka] Add Python API support f...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14340 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14340: [SPARK-16534][Streaming][Kafka] Add Python API support f...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14340 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62784/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14340: [SPARK-16534][Streaming][Kafka] Add Python API support f...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14340 **[Test build #62784 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62784/consoleFull)** for PR 14340 at commit [`dcca337`](https://github.com/apache/spark/commit/dcca3371da0c73bbcadb4331a2869e0303bd8ae8). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class Kafka010StreamTests(PySparkStreamingTestCase):` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r72008134 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1774,6 +1775,49 @@ class Analyzer( } /** + * Substitute Hints. + * - BROADCAST/BROADCASTJOIN/MAPJOIN match the closest table with the given name parameters. + * + * This rule substitutes `UnresolvedRelation`s in `Substitute` batch before `ResolveRelations` + * rule is applied. Here are two reasons. + * - To support `MetastoreRelation` in Hive module. + * - To reduce the effect of `Hint` on the other rules. + * + * After this rule, it is guaranteed that there exists no unknown `Hint` in the plan. + * All new `Hint`s should be transformed into concrete Hint classes `BroadcastHint` here. + */ + object SubstituteHints extends Rule[LogicalPlan] { --- End diff -- how does hive handle sql hints? This rule looks a little complex to me, according to the fact that hints can only be applied to table relation. It will be great if we can wrap `UnresolvedRelation` with `Hint` in the parser, and then it will be very easy to resolve hints at analyzer. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14327: [SPARK-16686][SQL] Project shouldn't be pushed down thro...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14327 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62782/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14327: [SPARK-16686][SQL] Project shouldn't be pushed down thro...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14327 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14327: [SPARK-16686][SQL] Project shouldn't be pushed down thro...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14327 **[Test build #62782 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62782/consoleFull)** for PR 14327 at commit [`31a6f6f`](https://github.com/apache/spark/commit/31a6f6fef63b0f130738a6fc7a5f628e60947cb9). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14313: [SPARK-16674][SQL] Avoid per-record type dispatch in JDB...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/14313 @cloud-fan I just addressed your comments. I added another argument in `JDBCConversion` for `MutableRow` so that we can avoid type-boxing. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14313: [SPARK-16674][SQL] Avoid per-record type dispatch in JDB...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14313 **[Test build #62789 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62789/consoleFull)** for PR 14313 at commit [`8ac66b1`](https://github.com/apache/spark/commit/8ac66b17f81a2fdc0866df26889b5e2fcc634c51). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14313: [SPARK-16674][SQL] Avoid per-record type dispatch...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/14313#discussion_r72007301 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala --- @@ -407,84 +496,8 @@ private[sql] class JDBCRDD( var i = 0 while (i < conversions.length) { val pos = i + 1 - conversions(i) match { -case BooleanConversion => mutableRow.setBoolean(i, rs.getBoolean(pos)) -case DateConversion => - // DateTimeUtils.fromJavaDate does not handle null value, so we need to check it. - val dateVal = rs.getDate(pos) - if (dateVal != null) { -mutableRow.setInt(i, DateTimeUtils.fromJavaDate(dateVal)) - } else { -mutableRow.update(i, null) - } -// When connecting with Oracle DB through JDBC, the precision and scale of BigDecimal -// object returned by ResultSet.getBigDecimal is not correctly matched to the table -// schema reported by ResultSetMetaData.getPrecision and ResultSetMetaData.getScale. -// If inserting values like 1 into a column with NUMBER(12, 2) type, you get through -// a BigDecimal object with scale as 0. But the dataframe schema has correct type as -// DecimalType(12, 2). Thus, after saving the dataframe into parquet file and then -// retrieve it, you will get wrong result 199.99. -// So it is needed to set precision and scale for Decimal based on JDBC metadata. --- End diff -- Fxied. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14327: [SPARK-16686][SQL] Project shouldn't be pushed down thro...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14327 **[Test build #62788 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62788/consoleFull)** for PR 14327 at commit [`5c4e7ff`](https://github.com/apache/spark/commit/5c4e75bd438ad7abbb37dc84cff9e5036567). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14327: [SPARK-16686][SQL] Project shouldn't be pushed down thro...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/14327 Besides, according to the JIRA, this may need to be backported to 1.6. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14327: [SPARK-16686][SQL] Project shouldn't be pushed down thro...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/14327 cc @cloud-fan @liancheng @yhuai Please review this change. Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14339: [SPARK-16698][SQL] Field names having dots should be all...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14339 **[Test build #62787 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62787/consoleFull)** for PR 14339 at commit [`f5c566c`](https://github.com/apache/spark/commit/f5c566cf2609dfbbc56b76276becd9976b34f2dd). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14331: [SPARK-16691][SQL] move BucketSpec to catalyst module an...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14331 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62777/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14331: [SPARK-16691][SQL] move BucketSpec to catalyst module an...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14331 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14331: [SPARK-16691][SQL] move BucketSpec to catalyst module an...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14331 **[Test build #62777 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62777/consoleFull)** for PR 14331 at commit [`cc63f01`](https://github.com/apache/spark/commit/cc63f01b7362aa8c34d82ef0941bc8919e5f90b7). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14331: [SPARK-16691][SQL] move BucketSpec to catalyst module an...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14331 **[Test build #62776 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62776/consoleFull)** for PR 14331 at commit [`59321d0`](https://github.com/apache/spark/commit/59321d003c1394c47dba5894d9363fc9ea65a803). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14297: [SPARK-16660][SQL] CreateViewCommand should not take Cat...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14297 **[Test build #62786 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62786/consoleFull)** for PR 14297 at commit [`a5cf2b3`](https://github.com/apache/spark/commit/a5cf2b3261bd974b215ceaf82d88181a2f87f446). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14331: [SPARK-16691][SQL] move BucketSpec to catalyst module an...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14331 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14331: [SPARK-16691][SQL] move BucketSpec to catalyst module an...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14331 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62776/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14204: [SPARK-16520] [WEBUI] Link executors to corresponding wo...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/14204 In my test, the column does not exist. On Sun, Jul 24, 2016 at 6:41 PM -0700, "Tao Lin"wrote: @yhuai Does the column exist? Are all of the cells in the column blank? That's weird. It was OK when I committed. I'll check it out later. â You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14339: [SPARK-16698][SQL] Field names having dots should be all...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/14339 Sure! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14297: [SPARK-16660][SQL] CreateViewCommand should not take Cat...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14297 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62775/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14313: [SPARK-16674][SQL] Avoid per-record type dispatch...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/14313#discussion_r72005471 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala --- @@ -407,84 +496,8 @@ private[sql] class JDBCRDD( var i = 0 while (i < conversions.length) { val pos = i + 1 - conversions(i) match { -case BooleanConversion => mutableRow.setBoolean(i, rs.getBoolean(pos)) -case DateConversion => - // DateTimeUtils.fromJavaDate does not handle null value, so we need to check it. - val dateVal = rs.getDate(pos) - if (dateVal != null) { -mutableRow.setInt(i, DateTimeUtils.fromJavaDate(dateVal)) - } else { -mutableRow.update(i, null) - } -// When connecting with Oracle DB through JDBC, the precision and scale of BigDecimal -// object returned by ResultSet.getBigDecimal is not correctly matched to the table -// schema reported by ResultSetMetaData.getPrecision and ResultSetMetaData.getScale. -// If inserting values like 1 into a column with NUMBER(12, 2) type, you get through -// a BigDecimal object with scale as 0. But the dataframe schema has correct type as -// DecimalType(12, 2). Thus, after saving the dataframe into parquet file and then -// retrieve it, you will get wrong result 199.99. -// So it is needed to set precision and scale for Decimal based on JDBC metadata. -case DecimalConversion(p, s) => - val decimalVal = rs.getBigDecimal(pos) - if (decimalVal == null) { -mutableRow.update(i, null) - } else { -mutableRow.update(i, Decimal(decimalVal, p, s)) - } -case DoubleConversion => mutableRow.setDouble(i, rs.getDouble(pos)) -case FloatConversion => mutableRow.setFloat(i, rs.getFloat(pos)) -case IntegerConversion => mutableRow.setInt(i, rs.getInt(pos)) -case LongConversion => mutableRow.setLong(i, rs.getLong(pos)) -// TODO(davies): use getBytes for better performance, if the encoding is UTF-8 -case StringConversion => mutableRow.update(i, UTF8String.fromString(rs.getString(pos))) -case TimestampConversion => - val t = rs.getTimestamp(pos) - if (t != null) { -mutableRow.setLong(i, DateTimeUtils.fromJavaTimestamp(t)) - } else { -mutableRow.update(i, null) - } -case BinaryConversion => mutableRow.update(i, rs.getBytes(pos)) -case BinaryLongConversion => - val bytes = rs.getBytes(pos) - var ans = 0L - var j = 0 - while (j < bytes.size) { -ans = 256 * ans + (255 & bytes(j)) -j = j + 1 - } - mutableRow.setLong(i, ans) -case ArrayConversion(elementConversion) => - val array = rs.getArray(pos).getArray - if (array != null) { -val data = elementConversion match { - case TimestampConversion => -array.asInstanceOf[Array[java.sql.Timestamp]].map { timestamp => - nullSafeConvert(timestamp, DateTimeUtils.fromJavaTimestamp) -} - case StringConversion => -array.asInstanceOf[Array[java.lang.String]] - .map(UTF8String.fromString) - case DateConversion => -array.asInstanceOf[Array[java.sql.Date]].map { date => - nullSafeConvert(date, DateTimeUtils.fromJavaDate) -} - case DecimalConversion(p, s) => -array.asInstanceOf[Array[java.math.BigDecimal]].map { decimal => - nullSafeConvert[java.math.BigDecimal](decimal, d => Decimal(d, p, s)) -} - case BinaryLongConversion => -throw new IllegalArgumentException(s"Unsupported array element conversion $i") - case _: ArrayConversion => -throw new IllegalArgumentException("Nested arrays unsupported") - case _ => array.asInstanceOf[Array[Any]] -} -mutableRow.update(i, new GenericArrayData(data)) - } else { -mutableRow.update(i, null) -
[GitHub] spark pull request #14313: [SPARK-16674][SQL] Avoid per-record type dispatch...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/14313#discussion_r72005487 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala --- @@ -322,46 +322,135 @@ private[sql] class JDBCRDD( } } - // Each JDBC-to-Catalyst conversion corresponds to a tag defined here so that - // we don't have to potentially poke around in the Metadata once for every - // row. - // Is there a better way to do this? I'd rather be using a type that - // contains only the tags I define. - abstract class JDBCConversion - case object BooleanConversion extends JDBCConversion - case object DateConversion extends JDBCConversion - case class DecimalConversion(precision: Int, scale: Int) extends JDBCConversion - case object DoubleConversion extends JDBCConversion - case object FloatConversion extends JDBCConversion - case object IntegerConversion extends JDBCConversion - case object LongConversion extends JDBCConversion - case object BinaryLongConversion extends JDBCConversion - case object StringConversion extends JDBCConversion - case object TimestampConversion extends JDBCConversion - case object BinaryConversion extends JDBCConversion - case class ArrayConversion(elementConversion: JDBCConversion) extends JDBCConversion + // A `JDBCConversion` is responsible for converting a value from `ResultSet` + // to a value in a field for `InternalRow`. + private type JDBCConversion = (ResultSet, Int) => Any + + // This `ArrayElementConversion` is responsible for converting elements in + // an array from `ResultSet`. + private type ArrayElementConversion = (Object) => Any /** - * Maps a StructType to a type tag list. + * Maps a StructType to conversions for each type. */ def getConversions(schema: StructType): Array[JDBCConversion] = schema.fields.map(sf => getConversions(sf.dataType, sf.metadata)) private def getConversions(dt: DataType, metadata: Metadata): JDBCConversion = dt match { -case BooleanType => BooleanConversion -case DateType => DateConversion -case DecimalType.Fixed(p, s) => DecimalConversion(p, s) -case DoubleType => DoubleConversion -case FloatType => FloatConversion -case IntegerType => IntegerConversion -case LongType => if (metadata.contains("binarylong")) BinaryLongConversion else LongConversion -case StringType => StringConversion -case TimestampType => TimestampConversion -case BinaryType => BinaryConversion -case ArrayType(et, _) => ArrayConversion(getConversions(et, metadata)) +case BooleanType => + (rs: ResultSet, pos: Int) => rs.getBoolean(pos) + +case DateType => + (rs: ResultSet, pos: Int) => +// DateTimeUtils.fromJavaDate does not handle null value, so we need to check it. +val dateVal = rs.getDate(pos) +if (dateVal != null) { + DateTimeUtils.fromJavaDate(dateVal) +} else { + null +} + +case DecimalType.Fixed(p, s) => + (rs: ResultSet, pos: Int) => +val decimalVal = rs.getBigDecimal(pos) +if (decimalVal != null) { + Decimal(decimalVal, p, s) +} else { + null +} + +case DoubleType => + (rs: ResultSet, pos: Int) => rs.getDouble(pos) + +case FloatType => + (rs: ResultSet, pos: Int) => rs.getFloat(pos) + +case IntegerType => + (rs: ResultSet, pos: Int) => rs.getInt(pos) + +case LongType if metadata.contains("binarylong") => + (rs: ResultSet, pos: Int) => +val bytes = rs.getBytes(pos) +var ans = 0L +var j = 0 +while (j < bytes.size) { + ans = 256 * ans + (255 & bytes(j)) + j = j + 1 +} +ans + +case LongType => + (rs: ResultSet, pos: Int) => rs.getLong(pos) + +case StringType => + (rs: ResultSet, pos: Int) => +// TODO(davies): use getBytes for better performance, if the encoding is UTF-8 +UTF8String.fromString(rs.getString(pos)) + +case TimestampType => + (rs: ResultSet, pos: Int) => +val t = rs.getTimestamp(pos) +if (t != null) { + DateTimeUtils.fromJavaTimestamp(t) +} else { + null +} + +case BinaryType => + (rs: ResultSet, pos: Int) => rs.getBytes(pos) + +case ArrayType(et, _) => + val elementConversion: ArrayElementConversion = getArrayElementConversion(et, metadata) + (rs: ResultSet, pos: Int) => +
[GitHub] spark issue #14297: [SPARK-16660][SQL] CreateViewCommand should not take Cat...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14297 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14297: [SPARK-16660][SQL] CreateViewCommand should not take Cat...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14297 **[Test build #62775 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62775/consoleFull)** for PR 14297 at commit [`5072959`](https://github.com/apache/spark/commit/50729594c1956d185bde4c2b41891ce1567cfd5a). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14313: [SPARK-16674][SQL] Avoid per-record type dispatch...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/14313#discussion_r72005398 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala --- @@ -322,46 +322,135 @@ private[sql] class JDBCRDD( } } - // Each JDBC-to-Catalyst conversion corresponds to a tag defined here so that - // we don't have to potentially poke around in the Metadata once for every - // row. - // Is there a better way to do this? I'd rather be using a type that - // contains only the tags I define. - abstract class JDBCConversion - case object BooleanConversion extends JDBCConversion - case object DateConversion extends JDBCConversion - case class DecimalConversion(precision: Int, scale: Int) extends JDBCConversion - case object DoubleConversion extends JDBCConversion - case object FloatConversion extends JDBCConversion - case object IntegerConversion extends JDBCConversion - case object LongConversion extends JDBCConversion - case object BinaryLongConversion extends JDBCConversion - case object StringConversion extends JDBCConversion - case object TimestampConversion extends JDBCConversion - case object BinaryConversion extends JDBCConversion - case class ArrayConversion(elementConversion: JDBCConversion) extends JDBCConversion + // A `JDBCConversion` is responsible for converting a value from `ResultSet` + // to a value in a field for `InternalRow`. + private type JDBCConversion = (ResultSet, Int) => Any + + // This `ArrayElementConversion` is responsible for converting elements in + // an array from `ResultSet`. + private type ArrayElementConversion = (Object) => Any --- End diff -- looks like this type alias is not very useful. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14313: [SPARK-16674][SQL] Avoid per-record type dispatch...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/14313#discussion_r72005309 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala --- @@ -322,46 +322,135 @@ private[sql] class JDBCRDD( } } - // Each JDBC-to-Catalyst conversion corresponds to a tag defined here so that - // we don't have to potentially poke around in the Metadata once for every - // row. - // Is there a better way to do this? I'd rather be using a type that - // contains only the tags I define. - abstract class JDBCConversion - case object BooleanConversion extends JDBCConversion - case object DateConversion extends JDBCConversion - case class DecimalConversion(precision: Int, scale: Int) extends JDBCConversion - case object DoubleConversion extends JDBCConversion - case object FloatConversion extends JDBCConversion - case object IntegerConversion extends JDBCConversion - case object LongConversion extends JDBCConversion - case object BinaryLongConversion extends JDBCConversion - case object StringConversion extends JDBCConversion - case object TimestampConversion extends JDBCConversion - case object BinaryConversion extends JDBCConversion - case class ArrayConversion(elementConversion: JDBCConversion) extends JDBCConversion + // A `JDBCConversion` is responsible for converting a value from `ResultSet` + // to a value in a field for `InternalRow`. + private type JDBCConversion = (ResultSet, Int) => Any + + // This `ArrayElementConversion` is responsible for converting elements in + // an array from `ResultSet`. + private type ArrayElementConversion = (Object) => Any /** - * Maps a StructType to a type tag list. + * Maps a StructType to conversions for each type. */ def getConversions(schema: StructType): Array[JDBCConversion] = schema.fields.map(sf => getConversions(sf.dataType, sf.metadata)) private def getConversions(dt: DataType, metadata: Metadata): JDBCConversion = dt match { -case BooleanType => BooleanConversion -case DateType => DateConversion -case DecimalType.Fixed(p, s) => DecimalConversion(p, s) -case DoubleType => DoubleConversion -case FloatType => FloatConversion -case IntegerType => IntegerConversion -case LongType => if (metadata.contains("binarylong")) BinaryLongConversion else LongConversion -case StringType => StringConversion -case TimestampType => TimestampConversion -case BinaryType => BinaryConversion -case ArrayType(et, _) => ArrayConversion(getConversions(et, metadata)) +case BooleanType => + (rs: ResultSet, pos: Int) => rs.getBoolean(pos) + +case DateType => + (rs: ResultSet, pos: Int) => +// DateTimeUtils.fromJavaDate does not handle null value, so we need to check it. +val dateVal = rs.getDate(pos) +if (dateVal != null) { + DateTimeUtils.fromJavaDate(dateVal) +} else { + null +} + +case DecimalType.Fixed(p, s) => + (rs: ResultSet, pos: Int) => +val decimalVal = rs.getBigDecimal(pos) +if (decimalVal != null) { --- End diff -- use `nullSafeConvert`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14313: [SPARK-16674][SQL] Avoid per-record type dispatch...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/14313#discussion_r72005232 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala --- @@ -407,84 +496,8 @@ private[sql] class JDBCRDD( var i = 0 while (i < conversions.length) { val pos = i + 1 - conversions(i) match { -case BooleanConversion => mutableRow.setBoolean(i, rs.getBoolean(pos)) -case DateConversion => - // DateTimeUtils.fromJavaDate does not handle null value, so we need to check it. - val dateVal = rs.getDate(pos) - if (dateVal != null) { -mutableRow.setInt(i, DateTimeUtils.fromJavaDate(dateVal)) - } else { -mutableRow.update(i, null) - } -// When connecting with Oracle DB through JDBC, the precision and scale of BigDecimal -// object returned by ResultSet.getBigDecimal is not correctly matched to the table -// schema reported by ResultSetMetaData.getPrecision and ResultSetMetaData.getScale. -// If inserting values like 1 into a column with NUMBER(12, 2) type, you get through -// a BigDecimal object with scale as 0. But the dataframe schema has correct type as -// DecimalType(12, 2). Thus, after saving the dataframe into parquet file and then -// retrieve it, you will get wrong result 199.99. -// So it is needed to set precision and scale for Decimal based on JDBC metadata. --- End diff -- Oh, my mistake. I will add this. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14333: [SPARK-16696][ML][MLLib] fix KMeans bcNewCenters unpersi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14333 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14333: [SPARK-16696][ML][MLLib] fix KMeans bcNewCenters unpersi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14333 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62779/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14333: [SPARK-16696][ML][MLLib] fix KMeans bcNewCenters unpersi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14333 **[Test build #62779 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62779/consoleFull)** for PR 14333 at commit [`ecd15b2`](https://github.com/apache/spark/commit/ecd15b2b7bac21a9e37f998747a752903530). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14313: [SPARK-16674][SQL] Avoid per-record type dispatch...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/14313#discussion_r72005172 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala --- @@ -407,84 +496,8 @@ private[sql] class JDBCRDD( var i = 0 while (i < conversions.length) { val pos = i + 1 - conversions(i) match { -case BooleanConversion => mutableRow.setBoolean(i, rs.getBoolean(pos)) -case DateConversion => - // DateTimeUtils.fromJavaDate does not handle null value, so we need to check it. - val dateVal = rs.getDate(pos) - if (dateVal != null) { -mutableRow.setInt(i, DateTimeUtils.fromJavaDate(dateVal)) - } else { -mutableRow.update(i, null) - } -// When connecting with Oracle DB through JDBC, the precision and scale of BigDecimal -// object returned by ResultSet.getBigDecimal is not correctly matched to the table -// schema reported by ResultSetMetaData.getPrecision and ResultSetMetaData.getScale. -// If inserting values like 1 into a column with NUMBER(12, 2) type, you get through -// a BigDecimal object with scale as 0. But the dataframe schema has correct type as -// DecimalType(12, 2). Thus, after saving the dataframe into parquet file and then -// retrieve it, you will get wrong result 199.99. -// So it is needed to set precision and scale for Decimal based on JDBC metadata. --- End diff -- why do we remove these comments? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14313: [SPARK-16674][SQL] Avoid per-record type dispatch...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/14313#discussion_r72005085 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala --- @@ -322,46 +322,135 @@ private[sql] class JDBCRDD( } } - // Each JDBC-to-Catalyst conversion corresponds to a tag defined here so that - // we don't have to potentially poke around in the Metadata once for every - // row. - // Is there a better way to do this? I'd rather be using a type that - // contains only the tags I define. - abstract class JDBCConversion - case object BooleanConversion extends JDBCConversion - case object DateConversion extends JDBCConversion - case class DecimalConversion(precision: Int, scale: Int) extends JDBCConversion - case object DoubleConversion extends JDBCConversion - case object FloatConversion extends JDBCConversion - case object IntegerConversion extends JDBCConversion - case object LongConversion extends JDBCConversion - case object BinaryLongConversion extends JDBCConversion - case object StringConversion extends JDBCConversion - case object TimestampConversion extends JDBCConversion - case object BinaryConversion extends JDBCConversion - case class ArrayConversion(elementConversion: JDBCConversion) extends JDBCConversion + // A `JDBCConversion` is responsible for converting a value from `ResultSet` + // to a value in a field for `InternalRow`. + private type JDBCConversion = (ResultSet, Int) => Any --- End diff -- also explain what's the 2 arguments in the comment? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14339: [SPARK-16698][SQL] Field names having dots should be all...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/14339 can we test it in `SQLQuerySuite` or something? I think `FileSourceStrategySuite` should be used to test the strategy. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14339: [SPARK-16698][SQL] Field names having dots should be all...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14339 **[Test build #62785 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62785/consoleFull)** for PR 14339 at commit [`cd5d04a`](https://github.com/apache/spark/commit/cd5d04a2661dc2e56b517478200a074ac075dec1). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org