[GitHub] SparkQA commented on issue #23521: [SPARK-26604][CORE] Register channel for stream request
SparkQA commented on issue #23521: [SPARK-26604][CORE] Register channel for stream request URL: https://github.com/apache/spark/pull/23521#issuecomment-454030918 **[Test build #101187 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/101187/testReport)** for PR 23521 at commit [`9082f01`](https://github.com/apache/spark/commit/9082f01b9fec637399a1cf43eb55141cc0dcee6b). This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] SparkQA removed a comment on issue #23521: [SPARK-26604][CORE] Clean up channel registration for StreamManager
SparkQA removed a comment on issue #23521: [SPARK-26604][CORE] Clean up channel registration for StreamManager URL: https://github.com/apache/spark/pull/23521#issuecomment-454030918 **[Test build #101187 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/101187/testReport)** for PR 23521 at commit [`9082f01`](https://github.com/apache/spark/commit/9082f01b9fec637399a1cf43eb55141cc0dcee6b). This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins removed a comment on issue #23521: [SPARK-26604][CORE] Clean up channel registration for StreamManager
AmplabJenkins removed a comment on issue #23521: [SPARK-26604][CORE] Clean up channel registration for StreamManager URL: https://github.com/apache/spark/pull/23521#issuecomment-454032368 Merged build finished. Test FAILed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] viirya commented on a change in pull request #22807: [SPARK-25811][PySpark] Raise a proper error when unsafe cast is detected by PyArrow
viirya commented on a change in pull request #22807: [SPARK-25811][PySpark] Raise a proper error when unsafe cast is detected by PyArrow URL: https://github.com/apache/spark/pull/22807#discussion_r247521892 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ## @@ -1331,6 +1331,16 @@ object SQLConf { .booleanConf .createWithDefault(true) + val PANDAS_ARROW_SAFE_TYPE_CONVERSION = +buildConf("spark.sql.execution.pandas.arrowSafeTypeConversion") + .internal() + .doc("When true, enabling Arrow do safe type conversion check when converting" + +"Pandas.Series to Arrow Array during serialization. Arrow will raise errors " + +"when detecting unsafe type conversion. When false, disabling Arrow's type " + +"check and do type conversions anyway.") + .booleanConf + .createWithDefault(true) Review comment: Thanks @kszucs This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] srowen commented on a change in pull request #23486: [SPARK-26457] Show hadoop configurations in HistoryServer environment tab
srowen commented on a change in pull request #23486: [SPARK-26457] Show hadoop configurations in HistoryServer environment tab URL: https://github.com/apache/spark/pull/23486#discussion_r247521171 ## File path: core/src/main/scala/org/apache/spark/SparkEnv.scala ## @@ -442,9 +445,13 @@ object SparkEnv extends Logging { val addedJarsAndFiles = (addedJars ++ addedFiles).map((_, "Added By User")) val classPaths = (addedJarsAndFiles ++ classPathEntries).sorted +// Add Hadoop properties, it will not ignore configs including in Spark. Some spark +// conf starting with "spark.hadoop" may overwrite it. +val hadoopProperties = hadoopConf.asScala.map(entry => (entry.getKey, entry.getValue)).toSeq Review comment: I think `.asScala.toSeq` works directly too, no big deal This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] srowen commented on a change in pull request #23445: [SPARK-26530]Validate heartheat arguments in SparkSubmitArguments
srowen commented on a change in pull request #23445: [SPARK-26530]Validate heartheat arguments in SparkSubmitArguments URL: https://github.com/apache/spark/pull/23445#discussion_r247524504 ## File path: core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala ## @@ -301,6 +301,28 @@ private[deploy] class SparkSubmitArguments(args: Seq[String], env: Map[String, S if (proxyUser != null && principal != null) { error("Only one of --proxy-user or --principal can be provided.") } + +// Ensure heartbeat arguments is valid. +val executorTimeoutMs = + Utils.timeStringAsMs(sparkProperties.getOrElse( +"spark.storage.blockManagerSlaveTimeoutMs", +s"${Utils.timeStringAsSeconds( + sparkProperties.getOrElse("spark.network.timeout", "120s"))}s")) +val executorHeartbeatIntervalMs = Utils.timeStringAsMs( + sparkProperties.getOrElse("spark.executor.heartbeatInterval", "10s")) +val checkTimeoutIntervalMs = + Utils.timeStringAsSeconds(sparkProperties.getOrElse( +"spark.network.timeoutInterval", +s"${Utils.timeStringAsMs(sparkProperties.getOrElse( + "spark.storage.blockManagerTimeoutIntervalMs", "60s"))}ms")) * 1000 +if (checkTimeoutIntervalMs >= executorTimeoutMs) { + error(s"Incorrect heartbeat arguments, checkTimeoutIntervalMs should " + +s"less than executorTimeoutMs.") +} +if (executorHeartbeatIntervalMs >= checkTimeoutIntervalMs) { Review comment: Yeah good point; this is what makes me wonder if this is the right place to validate this. Similar checks aren't done here in SparkSubmitArguments. For example, if these properties are mainly used in `HeartbeatReceiver`, that could be the right place to validate them. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] srowen commented on a change in pull request #23447: [SPARK-26462][CORE] Use ConfigEntry for hardcoded configs for execution categories
srowen commented on a change in pull request #23447: [SPARK-26462][CORE] Use ConfigEntry for hardcoded configs for execution categories URL: https://github.com/apache/spark/pull/23447#discussion_r247528412 ## File path: core/src/main/scala/org/apache/spark/internal/config/package.scala ## @@ -187,6 +188,105 @@ package object config { .checkValue(_ >= 0, "The off-heap memory size must not be negative") .createWithDefault(0) + private[spark] val MEMORY_STORAGE_FRACTION = ConfigBuilder("spark.memory.storageFraction") +.doc("Amount of storage memory immune to eviction, expressed as a fraction of the " + + "size of the region set aside by spark.memory.fraction. The higher this is, the " + + "less working memory may be available to execution and tasks may spill to disk more " + + "often. Leaving this at the default value is recommended. ") +.doubleConf +.createWithDefault(0.5) + + private[spark] val MEMORY_FRACTION = ConfigBuilder("spark.memory.fraction") +.doc("Fraction of (heap space - 300MB) used for execution and storage. The " + + "lower this is, the more frequently spills and cached data eviction occur. " + + "The purpose of this config is to set aside memory for internal metadata, " + + "user data structures, and imprecise size estimation in the case of sparse, " + + "unusually large records. Leaving this at the default value is recommended. ") +.doubleConf +.createWithDefault(0.6) + + private[spark] val MEMORY_USE_LEGACY_MODE = ConfigBuilder("spark.memory.useLegacyMode") Review comment: This needs to be removed now; the legacy mode was deleted recently. Same with `spark.shuffle.memoryFraction`, `spark.storage.memoryFraction`, `spark.storage.unrollFraction` This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] mgaido91 commented on a change in pull request #23531: [SPARK-24497][SQL] Support recursive SQL query
mgaido91 commented on a change in pull request #23531: [SPARK-24497][SQL] Support recursive SQL query URL: https://github.com/apache/spark/pull/23531#discussion_r247539199 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala ## @@ -289,13 +294,13 @@ abstract class TreeNode[BaseType <: TreeNode[BaseType]] extends Product { /** * Returns a copy of this node where `f` has been applied to all the nodes children. */ - def mapChildren(f: BaseType => BaseType): BaseType = { -if (children.nonEmpty) { + def mapChildren(f: BaseType => BaseType, forceCopy: Boolean = false): BaseType = { Review comment: I see, my point is that this is a very critical method, so I'd prefer to leave it unchanged. Do we really need that makeDeepCopy? If so, why do we need it? I hope we can rid of it somehow... This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] mgaido91 commented on a change in pull request #23531: [SPARK-24497][SQL] Support recursive SQL query
mgaido91 commented on a change in pull request #23531: [SPARK-24497][SQL] Support recursive SQL query URL: https://github.com/apache/spark/pull/23531#discussion_r247537242 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala ## @@ -225,6 +226,57 @@ case class FilterExec(condition: Expression, child: SparkPlan) override def outputPartitioning: Partitioning = child.outputPartitioning } +/** Physical plan for RecursiveTable. */ +case class RecursiveTableExec( +name: String, +anchorTerm: SparkPlan, +recursiveTerm: SparkPlan) extends SparkPlan { + override def children: Seq[SparkPlan] = Seq(anchorTerm, recursiveTerm) + + override def output: Seq[Attribute] = anchorTerm.output + + override val doNotPrepareInAdvance: Seq[SparkPlan] = Seq(recursiveTerm) + + override protected def doExecute(): RDD[InternalRow] = { +var temp = anchorTerm.execute().map(_.copy()).cache() +var tempCount = temp.count() +var result = temp +var level = 0 +do { + if (level > conf.getConf(SQLConf.RECURSION_LEVEL_LIMIT)) { Review comment: shall we get it into a val member of the node in order to avoid getting it multiple times during execution? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] mgaido91 commented on a change in pull request #23531: [SPARK-24497][SQL] Support recursive SQL query
mgaido91 commented on a change in pull request #23531: [SPARK-24497][SQL] Support recursive SQL query URL: https://github.com/apache/spark/pull/23531#discussion_r247537386 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala ## @@ -225,6 +226,57 @@ case class FilterExec(condition: Expression, child: SparkPlan) override def outputPartitioning: Partitioning = child.outputPartitioning } +/** Physical plan for RecursiveTable. */ +case class RecursiveTableExec( +name: String, +anchorTerm: SparkPlan, +recursiveTerm: SparkPlan) extends SparkPlan { + override def children: Seq[SparkPlan] = Seq(anchorTerm, recursiveTerm) + + override def output: Seq[Attribute] = anchorTerm.output + + override val doNotPrepareInAdvance: Seq[SparkPlan] = Seq(recursiveTerm) + + override protected def doExecute(): RDD[InternalRow] = { +var temp = anchorTerm.execute().map(_.copy()).cache() +var tempCount = temp.count() +var result = temp +var level = 0 +do { + if (level > conf.getConf(SQLConf.RECURSION_LEVEL_LIMIT)) { +throw new SparkException(s"Recursion level limit reached but query hasn't exhausted, try " + Review comment: nit: no `s` This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] SparkQA commented on issue #23533: [CORE][MINOR] Fix some typos about MemoryMode
SparkQA commented on issue #23533: [CORE][MINOR] Fix some typos about MemoryMode URL: https://github.com/apache/spark/pull/23533#issuecomment-454047894 **[Test build #4513 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4513/testReport)** for PR 23533 at commit [`4975d5a`](https://github.com/apache/spark/commit/4975d5a0c1e2a2681c8244d87044eeae873bbb7c). This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] HyukjinKwon commented on a change in pull request #23534: [SPARK-26610][PYTHON] Fix inconsistency between toJSON Method in Python and Scala.
HyukjinKwon commented on a change in pull request #23534: [SPARK-26610][PYTHON] Fix inconsistency between toJSON Method in Python and Scala. URL: https://github.com/apache/spark/pull/23534#discussion_r247553642 ## File path: python/pyspark/sql/dataframe.py ## @@ -109,15 +109,18 @@ def stat(self): @ignore_unicode_prefix @since(1.3) def toJSON(self, use_unicode=True): -"""Converts a :class:`DataFrame` into a :class:`RDD` of string. +"""Converts a :class:`DataFrame` into a :class:`DataFrame` of JSON string. -Each row is turned into a JSON document as one element in the returned RDD. +Each row is turned into a JSON document as one element in the returned DataFrame. >>> df.toJSON().first() -u'{"age":2,"name":"Alice"}' +Row(value=u'{"age":2,"name":"Alice"}') """ -rdd = self._jdf.toJSON() -return RDD(rdd.toJavaRDD(), self._sc, UTF8Deserializer(use_unicode)) +jdf = self._jdf.toJSON() +if self.sql_ctx._conf.pysparkDataFrameToJSONShouldReturnDataFrame(): +return DataFrame(jdf, self.sql_ctx) +else: +return RDD(jdf.toJavaRDD(), self._sc, UTF8Deserializer(use_unicode)) Review comment: Also, returning RDD and DataFrame are different. Returning RDD was the previous way even in Scala / Java side whereas DataFrame is not. Keeping RDD as return has a legacy reason but returning DataFrame is basically we allow another incomplete way for `toJSON`. Workaround is pretty easy anyway. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] imatiach-msft commented on a change in pull request #21632: [SPARK-19591][ML][MLlib] Add sample weights to decision trees
imatiach-msft commented on a change in pull request #21632: [SPARK-19591][ML][MLlib] Add sample weights to decision trees URL: https://github.com/apache/spark/pull/21632#discussion_r246996910 ## File path: project/MimaExcludes.scala ## @@ -241,7 +241,10 @@ object MimaExcludes { // [SPARK-26216][SQL] Do not use case class as public API (UserDefinedFunction) ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.expressions.UserDefinedFunction$"), - ProblemFilters.exclude[IncompatibleTemplateDefProblem]("org.apache.spark.sql.expressions.UserDefinedFunction") + ProblemFilters.exclude[IncompatibleTemplateDefProblem]("org.apache.spark.sql.expressions.UserDefinedFunction"), + +// [SPARK-19591][ML] Add sample weights to decision trees + ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.mllib.tree.configuration.Strategy.this") Review comment: here is an example of the error: ``` [error] val strategy = new Strategy(algo, impurity, maxDepth, numClasses, maxBins, [error]^ [error] /home/ilya/ilmat-spark/mllib/src/main/scala/org/apache/spark/mllib/tree/RandomForest.scala:170: overloaded method constructor Strategy with alternatives: [error] (algo: org.apache.spark.mllib.tree.configuration.Algo.Algo,impurity: org.apache.spark.mllib.tree.impurity.Impurity,maxDepth: Int,numClasses: Int,maxBins: Int,quantileCalculationStrategy: org.apache.spark.mllib.tree.configuration.QuantileStrategy.QuantileStrategy,categoricalFeaturesInfo: Map[Int,Int],minInstancesPerNode: Int,minInfoGain: Double,maxMemoryInMB: Int,subsamplingRate: Double,useNodeIdCache: Boolean,checkpointInterval: Int)org.apache.spark.mllib.tree.configuration.Strategy [error] (algo: org.apache.spark.mllib.tree.configuration.Algo.Algo,impurity: org.apache.spark.mllib.tree.impurity.Impurity,maxDepth: Int,numClasses: Int,maxBins: Int,quantileCalculationStrategy: org.apache.spark.mllib.tree.configuration.QuantileStrategy.QuantileStrategy,categoricalFeaturesInfo: Map[Int,Int],minInstancesPerNode: Int,minInfoGain: Double,maxMemoryInMB: Int,subsamplingRate: Double,useNodeIdCache: Boolean,checkpointInterval: Int,minWeightFractionPerNode: Double)org.apache.spark.mllib.tree.configuration.Strategy [error] cannot be applied to (org.apache.spark.mllib.tree.configuration.Algo.Value, org.apache.spark.mllib.tree.impurity.Impurity, Int, Int, Int, org.apache.spark.mllib.tree.configuration.QuantileStrategy.Value, Map[Int,Int]) ``` This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins removed a comment on issue #23519: [SPARK-26601][SQL] Make broadcast-exchange thread pool configurable
AmplabJenkins removed a comment on issue #23519: [SPARK-26601][SQL] Make broadcast-exchange thread pool configurable URL: https://github.com/apache/spark/pull/23519#issuecomment-454070451 Merged build finished. Test FAILed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] gatorsmile closed pull request #23535: [SPARK-26613][SQL] Add another rename table grammar for spark sql
gatorsmile closed pull request #23535: [SPARK-26613][SQL] Add another rename table grammar for spark sql URL: https://github.com/apache/spark/pull/23535 As this is a foreign pull request (from a fork), the diff has been sent to your commit mailing list, comm...@spark.apache.org This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] gatorsmile commented on a change in pull request #23535: [SPARK-26613][SQL] Add another rename table grammar for spark sql
gatorsmile commented on a change in pull request #23535: [SPARK-26613][SQL] Add another rename table grammar for spark sql URL: https://github.com/apache/spark/pull/23535#discussion_r247579058 ## File path: sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 ## @@ -105,6 +105,7 @@ statement ADD COLUMNS '(' columns=colTypeList ')' #addTableColumns | ALTER (TABLE | VIEW) from=tableIdentifier RENAME TO to=tableIdentifier #renameTable +| RENAME (TABLE | VIEW) from=tableIdentifier TO to=tableIdentifier #renameTable Review comment: Thanks for your contribution! I know, MySQL has both this syntax support and ALTER TABLE ... RENAME... In the current stage, we are not randomly adding more syntax to Spark SQL for avoiding extra complexity. In the future, we might support different dialects. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] srowen commented on a change in pull request #23499: [SPARK-26254][CORE] Extract Hive + Kafka dependencies from Core.
srowen commented on a change in pull request #23499: [SPARK-26254][CORE] Extract Hive + Kafka dependencies from Core. URL: https://github.com/apache/spark/pull/23499#discussion_r247534502 ## File path: pom.xml ## @@ -101,6 +101,7 @@ examples repl launcher +external/kafka-0-10-token-provider Review comment: Pardon if I'm not understanding, but is the new module specific to Kafka? what depends on it? I don't see that change here. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins removed a comment on issue #23533: [CORE][MINOR] Fix some typos about MemoryMode
AmplabJenkins removed a comment on issue #23533: [CORE][MINOR] Fix some typos about MemoryMode URL: https://github.com/apache/spark/pull/23533#issuecomment-454047251 Merged build finished. Test FAILed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] srowen commented on a change in pull request #23399: [SPARK-26497][DOCS][KUBERNETES] Show users where the pre-packaged SparkR and PySpark Dockerfiles
srowen commented on a change in pull request #23399: [SPARK-26497][DOCS][KUBERNETES] Show users where the pre-packaged SparkR and PySpark Dockerfiles URL: https://github.com/apache/spark/pull/23399#discussion_r247538200 ## File path: bin/docker-image-tool.sh ## @@ -208,8 +208,16 @@ Options: -f file Dockerfile to build for JVM based Jobs. By default builds the Dockerfile shipped with Spark. -p file (Optional) Dockerfile to build for PySpark Jobs. Builds Python dependencies and ships with Spark. Skips building PySpark docker image if not specified. +A sample PySpark Dockerfile is at: +kubernetes/dockerfiles/spark/bindings/python/Dockerfile +File must be relative the build path, in dev this is copied from: Review comment: Super nit: ; or . instead of the , here This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins commented on issue #23533: [CORE][MINOR] Fix some typos about MemoryMode
AmplabJenkins commented on issue #23533: [CORE][MINOR] Fix some typos about MemoryMode URL: https://github.com/apache/spark/pull/23533#issuecomment-454047251 Merged build finished. Test FAILed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] peter-toth commented on a change in pull request #23531: [SPARK-24497][SQL] Support recursive SQL query
peter-toth commented on a change in pull request #23531: [SPARK-24497][SQL] Support recursive SQL query URL: https://github.com/apache/spark/pull/23531#discussion_r247536393 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala ## @@ -289,13 +294,13 @@ abstract class TreeNode[BaseType <: TreeNode[BaseType]] extends Product { /** * Returns a copy of this node where `f` has been applied to all the nodes children. */ - def mapChildren(f: BaseType => BaseType): BaseType = { -if (children.nonEmpty) { + def mapChildren(f: BaseType => BaseType, forceCopy: Boolean = false): BaseType = { Review comment: Well, I need this for `makeDeepCopy()` and didn't want to copy the logic. Some of the plan nodes can't be reused in iterations so I think the most straightforward way is to deep copy the tree. Any suggestion? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins removed a comment on issue #23531: [SPARK-24497][SQL] Support recursive SQL query
AmplabJenkins removed a comment on issue #23531: [SPARK-24497][SQL] Support recursive SQL query URL: https://github.com/apache/spark/pull/23531#issuecomment-454049332 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins removed a comment on issue #23531: [SPARK-24497][SQL] Support recursive SQL query
AmplabJenkins removed a comment on issue #23531: [SPARK-24497][SQL] Support recursive SQL query URL: https://github.com/apache/spark/pull/23531#issuecomment-454049348 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/7040/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] mgaido91 commented on a change in pull request #23531: [SPARK-24497][SQL] Support recursive SQL query
mgaido91 commented on a change in pull request #23531: [SPARK-24497][SQL] Support recursive SQL query URL: https://github.com/apache/spark/pull/23531#discussion_r247546282 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala ## @@ -289,13 +294,13 @@ abstract class TreeNode[BaseType <: TreeNode[BaseType]] extends Product { /** * Returns a copy of this node where `f` has been applied to all the nodes children. */ - def mapChildren(f: BaseType => BaseType): BaseType = { -if (children.nonEmpty) { + def mapChildren(f: BaseType => BaseType, forceCopy: Boolean = false): BaseType = { Review comment: ok, now I see the issue here. But I think this is not the best way to go for that. I think this is the same reason why you are adding some exclusions in the prepare method (please correct me if I am wrong). I'd rather introduce a `clean` method in `SparkPlan` which by default does nothing, but in case of broadcasts cleans the broadcast and hence force recomputation at the next step. What do you think? cc @cloud-fan @dongjoon-hyun @gatorsmile fot their opinion on this too This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] peter-toth commented on a change in pull request #23531: [SPARK-24497][SQL] Support recursive SQL query
peter-toth commented on a change in pull request #23531: [SPARK-24497][SQL] Support recursive SQL query URL: https://github.com/apache/spark/pull/23531#discussion_r247552872 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala ## @@ -289,13 +294,13 @@ abstract class TreeNode[BaseType <: TreeNode[BaseType]] extends Product { /** * Returns a copy of this node where `f` has been applied to all the nodes children. */ - def mapChildren(f: BaseType => BaseType): BaseType = { -if (children.nonEmpty) { + def mapChildren(f: BaseType => BaseType, forceCopy: Boolean = false): BaseType = { Review comment: You are right @mgaido91, `BroadcastExchangeExec` and similar nodes cause some issues there. We can't prepare them until we have the previous iteration results (or the anchor) This is why I introduced `doNotPrepareInAdvance`. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins commented on issue #23510: [SPARK-26590][CORE] make fetch-block-to-disk backward compatible
AmplabJenkins commented on issue #23510: [SPARK-26590][CORE] make fetch-block-to-disk backward compatible URL: https://github.com/apache/spark/pull/23510#issuecomment-454064133 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/101181/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins commented on issue #23510: [SPARK-26590][CORE] make fetch-block-to-disk backward compatible
AmplabJenkins commented on issue #23510: [SPARK-26590][CORE] make fetch-block-to-disk backward compatible URL: https://github.com/apache/spark/pull/23510#issuecomment-454064124 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] Deegue commented on a change in pull request #23506: [SPARK-26577][SQL] Add input optimizer when reading Hive table by SparkSQL
Deegue commented on a change in pull request #23506: [SPARK-26577][SQL] Add input optimizer when reading Hive table by SparkSQL URL: https://github.com/apache/spark/pull/23506#discussion_r247558744 ## File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ## @@ -311,6 +309,36 @@ class HadoopTableReader( // Only take the value (skip the key) because Hive works only with values. rdd.map(_._2) } + + /** + * If `spark.sql.hive.fileInputFormat.enabled` is true, this function will optimize the input + * method(including format and the size of splits) while reading Hive tables. + */ + private def getAndOptimizeInput( +inputClassName: String): Class[InputFormat[Writable, Writable]] = { + +var ifc = Utils.classForName(inputClassName) + .asInstanceOf[java.lang.Class[InputFormat[Writable, Writable]]] +if (conf.getConf(HiveUtils.HIVE_FILE_INPUT_FORMAT_ENABLED)) { + hadoopConf.set("mapreduce.input.fileinputformat.split.maxsize", Review comment: Thank you for your review! We named it as `spark.sql.hive.fileInputFormat.split.maxsize` now and it can take effect by replacing `mapreduce.input.fileinputformat.split.maxsize` of hadoopConf. Because we are not able to change the configurations of hadoop when use SparkSQL. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] SparkQA commented on issue #23519: [SPARK-26601][SQL] Make broadcast-exchange thread pool configurable
SparkQA commented on issue #23519: [SPARK-26601][SQL] Make broadcast-exchange thread pool configurable URL: https://github.com/apache/spark/pull/23519#issuecomment-454027099 **[Test build #101185 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/101185/testReport)** for PR 23519 at commit [`ac2ec82`](https://github.com/apache/spark/commit/ac2ec82c7dc152acd6b200dc14ba8afea6abf5ad). This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] gengliangwang commented on a change in pull request #23383: [SPARK-23817][SQL] Create file source V2 framework and migrate ORC read path
gengliangwang commented on a change in pull request #23383: [SPARK-23817][SQL] Create file source V2 framework and migrate ORC read path URL: https://github.com/apache/spark/pull/23383#discussion_r247515861 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/PartitionedFileUtil.scala ## @@ -0,0 +1,106 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution + +import org.apache.hadoop.fs.{BlockLocation, FileStatus, LocatedFileStatus, Path} + +import org.apache.spark.sql.SparkSession +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.execution.datasources._ + +object PartitionedFileUtil { Review comment: I have renamed `FilePartitionUtil` as `FilePartition`, also put the class `FilePartition` into the same file. I think they are much more clear now. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] ueshin commented on a change in pull request #23534: [SPARK-26610][PYTHON] Fix inconsistency between toJSON Method in Python and Scala.
ueshin commented on a change in pull request #23534: [SPARK-26610][PYTHON] Fix inconsistency between toJSON Method in Python and Scala. URL: https://github.com/apache/spark/pull/23534#discussion_r247518316 ## File path: python/pyspark/sql/dataframe.py ## @@ -109,15 +109,18 @@ def stat(self): @ignore_unicode_prefix @since(1.3) def toJSON(self, use_unicode=True): -"""Converts a :class:`DataFrame` into a :class:`RDD` of string. +"""Converts a :class:`DataFrame` into a :class:`DataFrame` of JSON string. -Each row is turned into a JSON document as one element in the returned RDD. +Each row is turned into a JSON document as one element in the returned DataFrame. >>> df.toJSON().first() -u'{"age":2,"name":"Alice"}' +Row(value=u'{"age":2,"name":"Alice"}') """ -rdd = self._jdf.toJSON() -return RDD(rdd.toJavaRDD(), self._sc, UTF8Deserializer(use_unicode)) +jdf = self._jdf.toJSON() +if self.sql_ctx._conf.pysparkDataFrameToJSONShouldReturnDataFrame(): +return DataFrame(jdf, self.sql_ctx) +else: +return RDD(jdf.toJavaRDD(), self._sc, UTF8Deserializer(use_unicode)) Review comment: That sounds interesting. Maybe we should fix `DataFrameReader.csv` and `DataFrameReader.json` to accept DataFrame of string in Python, regardless of the discussion here. Let me try. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] SparkQA commented on issue #23383: [SPARK-23817][SQL] Create file source V2 framework and migrate ORC read path
SparkQA commented on issue #23383: [SPARK-23817][SQL] Create file source V2 framework and migrate ORC read path URL: https://github.com/apache/spark/pull/23383#issuecomment-454029022 **[Test build #101186 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/101186/testReport)** for PR 23383 at commit [`7e5a158`](https://github.com/apache/spark/commit/7e5a158773bf05dec7aa837bd2ff8c9116f7da00). This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] srowen commented on a change in pull request #23445: [SPARK-26530]Validate heartheat arguments in SparkSubmitArguments
srowen commented on a change in pull request #23445: [SPARK-26530]Validate heartheat arguments in SparkSubmitArguments URL: https://github.com/apache/spark/pull/23445#discussion_r247522677 ## File path: core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala ## @@ -301,6 +301,28 @@ private[deploy] class SparkSubmitArguments(args: Seq[String], env: Map[String, S if (proxyUser != null && principal != null) { error("Only one of --proxy-user or --principal can be provided.") } + +// Ensure heartbeat arguments is valid. +val executorTimeoutMs = + Utils.timeStringAsMs(sparkProperties.getOrElse( +"spark.storage.blockManagerSlaveTimeoutMs", +s"${Utils.timeStringAsSeconds( + sparkProperties.getOrElse("spark.network.timeout", "120s"))}s")) +val executorHeartbeatIntervalMs = Utils.timeStringAsMs( + sparkProperties.getOrElse("spark.executor.heartbeatInterval", "10s")) +val checkTimeoutIntervalMs = + Utils.timeStringAsSeconds(sparkProperties.getOrElse( +"spark.network.timeoutInterval", +s"${Utils.timeStringAsMs(sparkProperties.getOrElse( + "spark.storage.blockManagerTimeoutIntervalMs", "60s"))}ms")) * 1000 Review comment: OK, yeah that's why we need to centralize the definition of the config. This should wait until the related PR to update all these configs' declaration goes in, then can use those. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins removed a comment on issue #23521: [SPARK-26604][CORE] Clean up channel registration for StreamManager
AmplabJenkins removed a comment on issue #23521: [SPARK-26604][CORE] Clean up channel registration for StreamManager URL: https://github.com/apache/spark/pull/23521#issuecomment-454032378 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/101187/ Test FAILed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins removed a comment on issue #23518: [SPARK-26600] Update spark-submit usage message
AmplabJenkins removed a comment on issue #23518: [SPARK-26600] Update spark-submit usage message URL: https://github.com/apache/spark/pull/23518#issuecomment-453457279 Can one of the admins verify this patch? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] SparkQA commented on issue #23518: [SPARK-26600] Update spark-submit usage message
SparkQA commented on issue #23518: [SPARK-26600] Update spark-submit usage message URL: https://github.com/apache/spark/pull/23518#issuecomment-454033005 **[Test build #4511 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4511/testReport)** for PR 23518 at commit [`930f242`](https://github.com/apache/spark/commit/930f242f366428515be8f747e4130e0eec036580). This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] SparkQA commented on issue #23521: [SPARK-26604][CORE] Clean up channel registration for StreamManager
SparkQA commented on issue #23521: [SPARK-26604][CORE] Clean up channel registration for StreamManager URL: https://github.com/apache/spark/pull/23521#issuecomment-454037188 **[Test build #101188 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/101188/testReport)** for PR 23521 at commit [`6b028a9`](https://github.com/apache/spark/commit/6b028a9a17e0c60fed11354d49c9dbe325202e36). This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins removed a comment on issue #23521: [SPARK-26604][CORE] Clean up channel registration for StreamManager
AmplabJenkins removed a comment on issue #23521: [SPARK-26604][CORE] Clean up channel registration for StreamManager URL: https://github.com/apache/spark/pull/23521#issuecomment-454040546 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins removed a comment on issue #23521: [SPARK-26604][CORE] Clean up channel registration for StreamManager
AmplabJenkins removed a comment on issue #23521: [SPARK-26604][CORE] Clean up channel registration for StreamManager URL: https://github.com/apache/spark/pull/23521#issuecomment-454040554 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/7039/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins commented on issue #23521: [SPARK-26604][CORE] Clean up channel registration for StreamManager
AmplabJenkins commented on issue #23521: [SPARK-26604][CORE] Clean up channel registration for StreamManager URL: https://github.com/apache/spark/pull/23521#issuecomment-454040554 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/7039/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins commented on issue #23521: [SPARK-26604][CORE] Clean up channel registration for StreamManager
AmplabJenkins commented on issue #23521: [SPARK-26604][CORE] Clean up channel registration for StreamManager URL: https://github.com/apache/spark/pull/23521#issuecomment-454040546 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins commented on issue #23447: [SPARK-26462][CORE] Use ConfigEntry for hardcoded configs for execution categories
AmplabJenkins commented on issue #23447: [SPARK-26462][CORE] Use ConfigEntry for hardcoded configs for execution categories URL: https://github.com/apache/spark/pull/23447#issuecomment-454043387 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/101179/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] SparkQA commented on issue #23531: [SPARK-24497][SQL] Support recursive SQL query
SparkQA commented on issue #23531: [SPARK-24497][SQL] Support recursive SQL query URL: https://github.com/apache/spark/pull/23531#issuecomment-454043473 **[Test build #101190 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/101190/testReport)** for PR 23531 at commit [`a69473f`](https://github.com/apache/spark/commit/a69473fa2c14880b9b6a0fd6d862e8012dd8360c). This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins removed a comment on issue #23447: [SPARK-26462][CORE] Use ConfigEntry for hardcoded configs for execution categories
AmplabJenkins removed a comment on issue #23447: [SPARK-26462][CORE] Use ConfigEntry for hardcoded configs for execution categories URL: https://github.com/apache/spark/pull/23447#issuecomment-454043374 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins removed a comment on issue #23447: [SPARK-26462][CORE] Use ConfigEntry for hardcoded configs for execution categories
AmplabJenkins removed a comment on issue #23447: [SPARK-26462][CORE] Use ConfigEntry for hardcoded configs for execution categories URL: https://github.com/apache/spark/pull/23447#issuecomment-454043387 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/101179/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins commented on issue #23447: [SPARK-26462][CORE] Use ConfigEntry for hardcoded configs for execution categories
AmplabJenkins commented on issue #23447: [SPARK-26462][CORE] Use ConfigEntry for hardcoded configs for execution categories URL: https://github.com/apache/spark/pull/23447#issuecomment-454043374 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins commented on issue #23396: [SPARK-26397][SQL] Driver side only metrics support
AmplabJenkins commented on issue #23396: [SPARK-26397][SQL] Driver side only metrics support URL: https://github.com/apache/spark/pull/23396#issuecomment-454046243 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins commented on issue #23531: [SPARK-24497][SQL] Support recursive SQL query
AmplabJenkins commented on issue #23531: [SPARK-24497][SQL] Support recursive SQL query URL: https://github.com/apache/spark/pull/23531#issuecomment-454049348 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/7040/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins commented on issue #23531: [SPARK-24497][SQL] Support recursive SQL query
AmplabJenkins commented on issue #23531: [SPARK-24497][SQL] Support recursive SQL query URL: https://github.com/apache/spark/pull/23531#issuecomment-454049332 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] peter-toth commented on a change in pull request #23531: [SPARK-24497][SQL] Support recursive SQL query
peter-toth commented on a change in pull request #23531: [SPARK-24497][SQL] Support recursive SQL query URL: https://github.com/apache/spark/pull/23531#discussion_r247552872 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala ## @@ -289,13 +294,13 @@ abstract class TreeNode[BaseType <: TreeNode[BaseType]] extends Product { /** * Returns a copy of this node where `f` has been applied to all the nodes children. */ - def mapChildren(f: BaseType => BaseType): BaseType = { -if (children.nonEmpty) { + def mapChildren(f: BaseType => BaseType, forceCopy: Boolean = false): BaseType = { Review comment: You are right @mgaido91, `BroadcastExchangeExec` and similar nodes cause some issues there. We can't prepare them until we have the result of the previous iteration (or the result of the anchor). This is why I introduced `doNotPrepareInAdvance`. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins removed a comment on issue #23510: [SPARK-26590][CORE] make fetch-block-to-disk backward compatible
AmplabJenkins removed a comment on issue #23510: [SPARK-26590][CORE] make fetch-block-to-disk backward compatible URL: https://github.com/apache/spark/pull/23510#issuecomment-454064124 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins removed a comment on issue #23510: [SPARK-26590][CORE] make fetch-block-to-disk backward compatible
AmplabJenkins removed a comment on issue #23510: [SPARK-26590][CORE] make fetch-block-to-disk backward compatible URL: https://github.com/apache/spark/pull/23510#issuecomment-454064133 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/101181/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins removed a comment on issue #23519: [SPARK-26601][SQL] Make broadcast-exchange thread pool configurable
AmplabJenkins removed a comment on issue #23519: [SPARK-26601][SQL] Make broadcast-exchange thread pool configurable URL: https://github.com/apache/spark/pull/23519#issuecomment-454070468 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/101185/ Test FAILed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] SparkQA commented on issue #23519: [SPARK-26601][SQL] Make broadcast-exchange thread pool configurable
SparkQA commented on issue #23519: [SPARK-26601][SQL] Make broadcast-exchange thread pool configurable URL: https://github.com/apache/spark/pull/23519#issuecomment-454025284 **[Test build #101184 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/101184/testReport)** for PR 23519 at commit [`9669569`](https://github.com/apache/spark/commit/9669569dddb4c4d36026cef8b0e79440c445e8ab). This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] srowen commented on a change in pull request #23538: [MINOR][BUILD] Remove binary license/notice files in a source release for branch-2.4+ only
srowen commented on a change in pull request #23538: [MINOR][BUILD] Remove binary license/notice files in a source release for branch-2.4+ only URL: https://github.com/apache/spark/pull/23538#discussion_r247520255 ## File path: dev/create-release/release-build.sh ## @@ -176,10 +176,14 @@ if [[ "$1" == "package" ]]; then # Source and binary tarballs echo "Packaging release source tarballs" cp -r spark spark-$SPARK_VERSION - # For source release, exclude copy of binary license/notice - rm spark-$SPARK_VERSION/LICENSE-binary - rm spark-$SPARK_VERSION/NOTICE-binary - rm -r spark-$SPARK_VERSION/licenses-binary + + # For source release in v2.4+, exclude copy of binary license/notice + if [[ $SPARK_VERSION > "2.4" ]]; then Review comment: This seems OK, but do the `rm`s cause a problem? if they fail or print an error, add `-f`? But this is fine too. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] viirya commented on issue #23521: [SPARK-26604][CORE] Register channel for stream request
viirya commented on issue #23521: [SPARK-26604][CORE] Register channel for stream request URL: https://github.com/apache/spark/pull/23521#issuecomment-454030949 @cloud-fan Since `StreamManager` doesn't register channel but only `OneForOneStreamManager` does it, I remove `registerChannel` from `StreamManager`. When `OneForOneStreamManager` goes to serve chunk or stream request, it will register channel for the stream. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins commented on issue #23521: [SPARK-26604][CORE] Register channel for stream request
AmplabJenkins commented on issue #23521: [SPARK-26604][CORE] Register channel for stream request URL: https://github.com/apache/spark/pull/23521#issuecomment-454031389 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/7038/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins commented on issue #23383: [SPARK-23817][SQL] Create file source V2 framework and migrate ORC read path
AmplabJenkins commented on issue #23383: [SPARK-23817][SQL] Create file source V2 framework and migrate ORC read path URL: https://github.com/apache/spark/pull/23383#issuecomment-454031436 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/7037/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] srowen commented on issue #22807: [SPARK-25811][PySpark] Raise a proper error when unsafe cast is detected by PyArrow
srowen commented on issue #22807: [SPARK-25811][PySpark] Raise a proper error when unsafe cast is detected by PyArrow URL: https://github.com/apache/spark/pull/22807#issuecomment-454036714 So, in these situations, we agree some error should occur. The current one is an overflow error -- right? or does it silently overfloat? Non-arrow UDFs seem to do the latter. Based on that, I'm not clear the Arrow behavior should be different, especially if it already matches that behavior, and is what Arrow users expect. No, this isn't a case for a flag. That is a failure to decide, punted to the user. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] SparkQA commented on issue #23396: [SPARK-26397][SQL] Driver side only metrics support
SparkQA commented on issue #23396: [SPARK-26397][SQL] Driver side only metrics support URL: https://github.com/apache/spark/pull/23396#issuecomment-454041371 **[Test build #101189 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/101189/testReport)** for PR 23396 at commit [`1019762`](https://github.com/apache/spark/commit/10197623309aac89c095f4f765f8ff1ed28a9452). This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] peter-toth commented on a change in pull request #23531: [SPARK-24497][SQL] Support recursive SQL query
peter-toth commented on a change in pull request #23531: [SPARK-24497][SQL] Support recursive SQL query URL: https://github.com/apache/spark/pull/23531#discussion_r247536393 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala ## @@ -289,13 +294,13 @@ abstract class TreeNode[BaseType <: TreeNode[BaseType]] extends Product { /** * Returns a copy of this node where `f` has been applied to all the nodes children. */ - def mapChildren(f: BaseType => BaseType): BaseType = { -if (children.nonEmpty) { + def mapChildren(f: BaseType => BaseType, forceCopy: Boolean = false): BaseType = { Review comment: Well, I need this for `makeDeepCopy()` and didn't want to copy the logic. Any suggestion? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] peter-toth commented on a change in pull request #23531: [SPARK-24497][SQL] Support recursive SQL query
peter-toth commented on a change in pull request #23531: [SPARK-24497][SQL] Support recursive SQL query URL: https://github.com/apache/spark/pull/23531#discussion_r247536528 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ## @@ -1530,6 +1689,19 @@ class Analyzer( object GlobalAggregates extends Rule[LogicalPlan] { def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperators { case Project(projectList, child) if containsAggregates(projectList) => + +def traversePlanAndCheck(plan: LogicalPlan, inRecursiveTable: Boolean = false): Unit = + plan match { +case RecursiveReference(name, _) if !inRecursiveTable => + throw new AnalysisException(s"Recursive reference ${name} can't be used in an " + +s"aggregate") +case RecursiveTable(_, _, recursiveTerm, _) => + traversePlanAndCheck(recursiveTerm, true) Review comment: thanks, fixed This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] mgaido91 commented on a change in pull request #23531: [SPARK-24497][SQL] Support recursive SQL query
mgaido91 commented on a change in pull request #23531: [SPARK-24497][SQL] Support recursive SQL query URL: https://github.com/apache/spark/pull/23531#discussion_r247442316 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ## @@ -1625,6 +1625,11 @@ object SQLConf { "a SparkConf entry.") .booleanConf .createWithDefault(true) + + val RECURSION_LEVEL_LIMIT = buildConf("spark.sql.recursion.level.limit") Review comment: nit: probably we can make this internal This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] peter-toth commented on a change in pull request #23531: [SPARK-24497][SQL] Support recursive SQL query
peter-toth commented on a change in pull request #23531: [SPARK-24497][SQL] Support recursive SQL query URL: https://github.com/apache/spark/pull/23531#discussion_r247536393 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala ## @@ -289,13 +294,13 @@ abstract class TreeNode[BaseType <: TreeNode[BaseType]] extends Product { /** * Returns a copy of this node where `f` has been applied to all the nodes children. */ - def mapChildren(f: BaseType => BaseType): BaseType = { -if (children.nonEmpty) { + def mapChildren(f: BaseType => BaseType, forceCopy: Boolean = false): BaseType = { Review comment: Well, I need this for `makeDeepCopy()` and didn't want to copy the logic. Some of the plan nodes can't be reused in iterations so I think the most straightforward way is to deep copy the subtree. Any suggestion? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] SparkQA commented on issue #23531: [SPARK-24497][SQL] Support recursive SQL query
SparkQA commented on issue #23531: [SPARK-24497][SQL] Support recursive SQL query URL: https://github.com/apache/spark/pull/23531#issuecomment-454052123 **[Test build #101191 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/101191/testReport)** for PR 23531 at commit [`f6d8c17`](https://github.com/apache/spark/commit/f6d8c17aa6eb4f406ccb326c5a530c15cfdc1a83). This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins removed a comment on issue #23531: [SPARK-24497][SQL] Support recursive SQL query
AmplabJenkins removed a comment on issue #23531: [SPARK-24497][SQL] Support recursive SQL query URL: https://github.com/apache/spark/pull/23531#issuecomment-454057027 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins commented on issue #23531: [SPARK-24497][SQL] Support recursive SQL query
AmplabJenkins commented on issue #23531: [SPARK-24497][SQL] Support recursive SQL query URL: https://github.com/apache/spark/pull/23531#issuecomment-454057027 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins removed a comment on issue #23531: [SPARK-24497][SQL] Support recursive SQL query
AmplabJenkins removed a comment on issue #23531: [SPARK-24497][SQL] Support recursive SQL query URL: https://github.com/apache/spark/pull/23531#issuecomment-454057036 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/7042/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] SparkQA commented on issue #23383: [SPARK-23817][SQL] Create file source V2 framework and migrate ORC read path
SparkQA commented on issue #23383: [SPARK-23817][SQL] Create file source V2 framework and migrate ORC read path URL: https://github.com/apache/spark/pull/23383#issuecomment-454069631 **[Test build #101192 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/101192/testReport)** for PR 23383 at commit [`01b3ca9`](https://github.com/apache/spark/commit/01b3ca97a675ff3059a9b16ebde99ee929384a2a). This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] srowen commented on a change in pull request #21632: [SPARK-19591][ML][MLlib] Add sample weights to decision trees
srowen commented on a change in pull request #21632: [SPARK-19591][ML][MLlib] Add sample weights to decision trees URL: https://github.com/apache/spark/pull/21632#discussion_r247562160 ## File path: mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala ## @@ -167,7 +167,7 @@ class DecisionTreeSuite extends SparkFunSuite with MLlibTestSparkContext { val rdd = sc.parallelize(arr) val strategy = new Strategy(Classification, Entropy, maxDepth = 3, numClasses = 2, maxBins = 100) -val metadata = DecisionTreeMetadata.buildMetadata(rdd.map(_.asML), strategy) +val metadata = DecisionTreeMetadata.buildMetadata(rdd.map(_.asML.toInstance), strategy) Review comment: Rather than call `.toInstance` everywhere this is called, why not overload `buildMetadata` to accept an RDD of either one? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] srowen commented on a change in pull request #21632: [SPARK-19591][ML][MLlib] Add sample weights to decision trees
srowen commented on a change in pull request #21632: [SPARK-19591][ML][MLlib] Add sample weights to decision trees URL: https://github.com/apache/spark/pull/21632#discussion_r247563305 ## File path: mllib/src/test/scala/org/apache/spark/ml/tree/impl/TreeTests.scala ## @@ -18,37 +18,46 @@ package org.apache.spark.ml.tree.impl import scala.collection.JavaConverters._ +import scala.util.Random import org.apache.spark.{SparkContext, SparkFunSuite} import org.apache.spark.api.java.JavaRDD import org.apache.spark.ml.attribute.{AttributeGroup, NominalAttribute, NumericAttribute} -import org.apache.spark.ml.feature.LabeledPoint +import org.apache.spark.ml.feature.{Instance, LabeledPoint} import org.apache.spark.ml.linalg.Vectors import org.apache.spark.ml.tree._ +import org.apache.spark.mllib.util.TestingUtils._ import org.apache.spark.rdd.RDD import org.apache.spark.sql.{DataFrame, SparkSession} private[ml] object TreeTests extends SparkFunSuite { /** * Convert the given data to a DataFrame, and set the features and label metadata. + * * @param data Dataset. Categorical features and labels must already have 0-based indices. * This must be non-empty. * @param categoricalFeatures Map: categorical feature index to number of distinct values * @param numClasses Number of classes label can take. If 0, mark as continuous. * @return DataFrame with metadata */ def setMetadata( - data: RDD[LabeledPoint], + data: RDD[_], Review comment: Being test code, this isn't a big deal, but this could also be overloaded with two versions that accept RDDs of different types, and then call to a common private function to proceed from there. I don't feel strongly about it, just dislike seeing `[_]` in a method signature This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins commented on issue #23383: [SPARK-23817][SQL] Create file source V2 framework and migrate ORC read path
AmplabJenkins commented on issue #23383: [SPARK-23817][SQL] Create file source V2 framework and migrate ORC read path URL: https://github.com/apache/spark/pull/23383#issuecomment-454070024 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/7043/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] srowen commented on a change in pull request #21632: [SPARK-19591][ML][MLlib] Add sample weights to decision trees
srowen commented on a change in pull request #21632: [SPARK-19591][ML][MLlib] Add sample weights to decision trees URL: https://github.com/apache/spark/pull/21632#discussion_r247563937 ## File path: mllib/src/main/scala/org/apache/spark/ml/feature/LabeledPoint.scala ## @@ -37,4 +37,9 @@ case class LabeledPoint(@Since("2.0.0") label: Double, @Since("2.0.0") features: override def toString: String = { s"($label,$features)" } + + private[spark] def toInstance(weight: Double): Instance = { +Instance(label, weight, features) Review comment: OK, I'm persuaded. Maybe One Day we can clean up the relationship between these classes a bit more but this isn't particularly worse. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins commented on issue #23521: [SPARK-26604][CORE] Register channel for stream request
AmplabJenkins commented on issue #23521: [SPARK-26604][CORE] Register channel for stream request URL: https://github.com/apache/spark/pull/23521#issuecomment-454031383 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins removed a comment on issue #23521: [SPARK-26604][CORE] Register channel for stream request
AmplabJenkins removed a comment on issue #23521: [SPARK-26604][CORE] Register channel for stream request URL: https://github.com/apache/spark/pull/23521#issuecomment-454031389 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/7038/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins removed a comment on issue #23383: [SPARK-23817][SQL] Create file source V2 framework and migrate ORC read path
AmplabJenkins removed a comment on issue #23383: [SPARK-23817][SQL] Create file source V2 framework and migrate ORC read path URL: https://github.com/apache/spark/pull/23383#issuecomment-454031436 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/7037/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins commented on issue #23383: [SPARK-23817][SQL] Create file source V2 framework and migrate ORC read path
AmplabJenkins commented on issue #23383: [SPARK-23817][SQL] Create file source V2 framework and migrate ORC read path URL: https://github.com/apache/spark/pull/23383#issuecomment-454031424 Build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins removed a comment on issue #23521: [SPARK-26604][CORE] Register channel for stream request
AmplabJenkins removed a comment on issue #23521: [SPARK-26604][CORE] Register channel for stream request URL: https://github.com/apache/spark/pull/23521#issuecomment-454031383 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] peter-toth commented on a change in pull request #23531: [SPARK-24497][SQL] Support recursive SQL query
peter-toth commented on a change in pull request #23531: [SPARK-24497][SQL] Support recursive SQL query URL: https://github.com/apache/spark/pull/23531#discussion_r247535248 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ## @@ -205,30 +206,184 @@ class Analyzer( CleanupAliases) ) + object ResolveRecursiveReferneces extends Rule[LogicalPlan] { +def apply(plan: LogicalPlan): LogicalPlan = { + val recursiveTables = plan.collect { +case rt @ RecursiveTable(name, _, _, _) if rt.anchorResolved => name -> rt + }.toMap + + plan.resolveOperatorsUp { +case UnresolvedRecursiveReference(name) if recursiveTables.contains(name) => + RecursiveReference(name, recursiveTables(name).output) +case other => other + } +} + } + /** * Analyze cte definitions and substitute child plan with analyzed cte definitions. */ object CTESubstitution extends Rule[LogicalPlan] { def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsUp { - case With(child, relations) => + case With(child, relations, allowRecursion) => substituteCTE(child, relations.foldLeft(Seq.empty[(String, LogicalPlan)]) { case (resolved, (name, relation)) => -resolved :+ name -> executeSameContext(substituteCTE(relation, resolved)) -}) +val recursiveTableName = if (allowRecursion) Some(name) else None +resolved :+ + name -> executeSameContext(substituteCTE(relation, resolved, recursiveTableName)) +}, None) case other => other } -def substituteCTE(plan: LogicalPlan, cteRelations: Seq[(String, LogicalPlan)]): LogicalPlan = { - plan resolveOperatorsDown { -case u: UnresolvedRelation => - cteRelations.find(x => resolver(x._1, u.tableIdentifier.table)) -.map(_._2).getOrElse(u) -case other => - // This cannot be done in ResolveSubquery because ResolveSubquery does not know the CTE. - other transformExpressions { -case e: SubqueryExpression => - e.withNewPlan(substituteCTE(e.plan, cteRelations)) +def substituteCTE( +plan: LogicalPlan, +cteRelations: Seq[(String, LogicalPlan)], +recursiveTableName: Option[String]): LogicalPlan = { + def substitute( + plan: LogicalPlan, + inSubQuery: Boolean = false): (LogicalPlan, Boolean) = { +val references = mutable.Set.empty[UnresolvedRecursiveReference] + +def newReference(recursiveTableName: String) = { + val recursiveReference = UnresolvedRecursiveReference(recursiveTableName) + references += recursiveReference + + recursiveReference +} + +val newPlan = plan resolveOperatorsDown { + case u: UnresolvedRelation => +val table = u.tableIdentifier.table + +val recursiveReference = recursiveTableName.find(resolver(_, table)).map { name => + if (inSubQuery) { +throw new AnalysisException( + s"Recursive reference ${name} can't be used in a subquery") + } + + newReference(name) +} + +recursiveReference + .orElse(cteRelations.find(x => resolver(x._1, table)).map(_._2)) + .getOrElse(u) + + case other => +// This cannot be done in ResolveSubquery because ResolveSubquery does not know the CTE. +other transformExpressions { + case e: SubqueryExpression => e.withNewPlan(substitute(e.plan, true)._1) +} +} + +(newPlan, !references.isEmpty) + } + + plan match { +case SubqueryAlias(name, u: Union) if recursiveTableName.isDefined => + def combineUnions(union: Union): Seq[LogicalPlan] = union.children.flatMap { +case u: Union => combineUnions(u) +case o => Seq(o) } + + val substitutedTerms = combineUnions(u).map(substitute(_)) + val (anchorTerms, recursiveTerms) = substitutedTerms.partition(!_._2) + + if (!recursiveTerms.isEmpty) { +if (anchorTerms.isEmpty) { + throw new AnalysisException(s"There should be at least 1 anchor term defined in a " + +s"recursive query $name") +} + +case class PlanTraverseStatus( Review comment: Indeed this was a bit ugly. Changed as you suggested. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] srowen commented on issue #23535: [SPARK-26613][SQL] Add another rename table grammar for spark sql
srowen commented on issue #23535: [SPARK-26613][SQL] Add another rename table grammar for spark sql URL: https://github.com/apache/spark/pull/23535#issuecomment-454044393 Is it standard SQL or something Hive does? otherwise I don't think we'd generally add it. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] srowen commented on a change in pull request #23412: [SPARK-26477][CORE] Use ConfigEntry for hardcoded configs for unsafe category
srowen commented on a change in pull request #23412: [SPARK-26477][CORE] Use ConfigEntry for hardcoded configs for unsafe category URL: https://github.com/apache/spark/pull/23412#discussion_r247537355 ## File path: core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillReader.java ## @@ -59,21 +60,22 @@ public UnsafeSorterSpillReader( File file, BlockId blockId) throws IOException { assert (file.length() > 0); +final ConfigEntry bufferSizeConfigEntry = +package$.MODULE$.UNSAFE_SORTER_SPILL_READER_BUFFER_SIZE(); Review comment: This can't be `ConfigEntry` because the Scala side has the generic type as a `long`? It seems like it ends up a `Long` anyway as you're able to `.get()` and cast it to `long`. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] peter-toth commented on a change in pull request #23531: [SPARK-24497][SQL] Support recursive SQL query
peter-toth commented on a change in pull request #23531: [SPARK-24497][SQL] Support recursive SQL query URL: https://github.com/apache/spark/pull/23531#discussion_r247537146 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala ## @@ -591,7 +591,8 @@ object FoldablePropagation extends Rule[LogicalPlan] { } else { CleanupAliases(plan.transformUp { // We can only propagate foldables for a subset of unary nodes. -case u: UnaryNode if foldableMap.nonEmpty && canPropagateFoldables(u) => +case u: UnaryNode if foldableMap.nonEmpty && canPropagateFoldables(u) Review comment: I reverted this part and changed `ResolveRecursiveReferneces` to use new attributes. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins removed a comment on issue #23396: [SPARK-26397][SQL] Driver side only metrics support
AmplabJenkins removed a comment on issue #23396: [SPARK-26397][SQL] Driver side only metrics support URL: https://github.com/apache/spark/pull/23396#issuecomment-454046255 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/7041/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins commented on issue #23396: [SPARK-26397][SQL] Driver side only metrics support
AmplabJenkins commented on issue #23396: [SPARK-26397][SQL] Driver side only metrics support URL: https://github.com/apache/spark/pull/23396#issuecomment-454046255 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/7041/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] srowen commented on issue #23507: [SPARK-26576][SQL] Broadcast hint not applied to partitioned table
srowen commented on issue #23507: [SPARK-26576][SQL] Broadcast hint not applied to partitioned table URL: https://github.com/apache/spark/pull/23507#issuecomment-454030131 It's possible it merely uncovered a different problem, in the test or main code, that only exists in 2.3. How important is the change for 2.3? given that there is a release imminent, would it be OK to revert it now in 2.3, or does that lose an important fix? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins removed a comment on issue #23383: [SPARK-23817][SQL] Create file source V2 framework and migrate ORC read path
AmplabJenkins removed a comment on issue #23383: [SPARK-23817][SQL] Create file source V2 framework and migrate ORC read path URL: https://github.com/apache/spark/pull/23383#issuecomment-454031424 Build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] SparkQA commented on issue #23521: [SPARK-26604][CORE] Clean up channel registration for StreamManager
SparkQA commented on issue #23521: [SPARK-26604][CORE] Clean up channel registration for StreamManager URL: https://github.com/apache/spark/pull/23521#issuecomment-454032343 **[Test build #101187 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/101187/testReport)** for PR 23521 at commit [`9082f01`](https://github.com/apache/spark/commit/9082f01b9fec637399a1cf43eb55141cc0dcee6b). * This patch **fails to build**. * This patch merges cleanly. * This patch adds no public classes. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins commented on issue #23521: [SPARK-26604][CORE] Clean up channel registration for StreamManager
AmplabJenkins commented on issue #23521: [SPARK-26604][CORE] Clean up channel registration for StreamManager URL: https://github.com/apache/spark/pull/23521#issuecomment-454032368 Merged build finished. Test FAILed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins commented on issue #23521: [SPARK-26604][CORE] Clean up channel registration for StreamManager
AmplabJenkins commented on issue #23521: [SPARK-26604][CORE] Clean up channel registration for StreamManager URL: https://github.com/apache/spark/pull/23521#issuecomment-454032378 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/101187/ Test FAILed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] srowen commented on a change in pull request #23534: [SPARK-26610][PYTHON] Fix inconsistency between toJSON Method in Python and Scala.
srowen commented on a change in pull request #23534: [SPARK-26610][PYTHON] Fix inconsistency between toJSON Method in Python and Scala. URL: https://github.com/apache/spark/pull/23534#discussion_r247525774 ## File path: python/pyspark/sql/dataframe.py ## @@ -109,15 +109,18 @@ def stat(self): @ignore_unicode_prefix @since(1.3) def toJSON(self, use_unicode=True): -"""Converts a :class:`DataFrame` into a :class:`RDD` of string. +"""Converts a :class:`DataFrame` into a :class:`DataFrame` of JSON string. -Each row is turned into a JSON document as one element in the returned RDD. +Each row is turned into a JSON document as one element in the returned DataFrame. >>> df.toJSON().first() -u'{"age":2,"name":"Alice"}' +Row(value=u'{"age":2,"name":"Alice"}') """ -rdd = self._jdf.toJSON() -return RDD(rdd.toJavaRDD(), self._sc, UTF8Deserializer(use_unicode)) +jdf = self._jdf.toJSON() +if self.sql_ctx._conf.pysparkDataFrameToJSONShouldReturnDataFrame(): +return DataFrame(jdf, self.sql_ctx) +else: +return RDD(jdf.toJavaRDD(), self._sc, UTF8Deserializer(use_unicode)) Review comment: I agree it's not clear which one is more consistent. I do not think we should add a config -- config options are, as someone once said, often a failure of design, that punts unresolved questions to the user. I think I'd slightly prefer keeping an RDD here if there's no compelling reason to significantly change the Pyspark API. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] SparkQA commented on issue #23447: [SPARK-26462][CORE] Use ConfigEntry for hardcoded configs for execution categories
SparkQA commented on issue #23447: [SPARK-26462][CORE] Use ConfigEntry for hardcoded configs for execution categories URL: https://github.com/apache/spark/pull/23447#issuecomment-454042522 **[Test build #101179 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/101179/testReport)** for PR 23447 at commit [`d0ecdb6`](https://github.com/apache/spark/commit/d0ecdb6b03b452eee6942b5513bb2ad220c35616). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] peter-toth commented on a change in pull request #23531: [SPARK-24497][SQL] Support recursive SQL query
peter-toth commented on a change in pull request #23531: [SPARK-24497][SQL] Support recursive SQL query URL: https://github.com/apache/spark/pull/23531#discussion_r247535564 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala ## @@ -47,6 +48,41 @@ case class Subquery(child: LogicalPlan) extends OrderPreservingUnaryNode { override def output: Seq[Attribute] = child.output } +case class RecursiveTable( +name: String, +anchorTerm: LogicalPlan, +recursiveTerm: LogicalPlan, +levelLimit: Int) extends LogicalPlan { Review comment: thanks, removed This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] SparkQA commented on issue #23416: [SPARK-26463][CORE] Use ConfigEntry for hardcoded configs for scheduler categories.
SparkQA commented on issue #23416: [SPARK-26463][CORE] Use ConfigEntry for hardcoded configs for scheduler categories. URL: https://github.com/apache/spark/pull/23416#issuecomment-454045040 **[Test build #4512 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4512/testReport)** for PR 23416 at commit [`af8e003`](https://github.com/apache/spark/commit/af8e003e7fc8a5b0f6a4db724387f26ed8f0fb43). This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins removed a comment on issue #23533: [CORE][MINOR] Fix some typos about MemoryMode
AmplabJenkins removed a comment on issue #23533: [CORE][MINOR] Fix some typos about MemoryMode URL: https://github.com/apache/spark/pull/23533#issuecomment-454047258 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/101180/ Test FAILed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] SparkQA commented on issue #23510: [SPARK-26590][CORE] make fetch-block-to-disk backward compatible
SparkQA commented on issue #23510: [SPARK-26590][CORE] make fetch-block-to-disk backward compatible URL: https://github.com/apache/spark/pull/23510#issuecomment-454063335 **[Test build #101181 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/101181/testReport)** for PR 23510 at commit [`6e1a05b`](https://github.com/apache/spark/commit/6e1a05b088cabddc986e0a3d8e126095f0f7b32b). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] srowen commented on a change in pull request #21632: [SPARK-19591][ML][MLlib] Add sample weights to decision trees
srowen commented on a change in pull request #21632: [SPARK-19591][ML][MLlib] Add sample weights to decision trees URL: https://github.com/apache/spark/pull/21632#discussion_r247562734 ## File path: mllib/src/test/scala/org/apache/spark/ml/util/MLTestingUtils.scala ## @@ -268,4 +269,20 @@ object MLTestingUtils extends SparkFunSuite { assert(newDatasetF.schema(featuresColName).dataType.equals(new ArrayType(FloatType, false))) (newDataset, newDatasetD, newDatasetF) } + + def modelPredictionEquals[M <: PredictionModel[_, M]]( Review comment: This is asserting that some subset are within tolerances... is it necessary to loosen it this much? I haven't read carefully; why might some be out of tolerances and that's OK? what about a looser tolerance instead? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] SparkQA commented on issue #23519: [SPARK-26601][SQL] Make broadcast-exchange thread pool configurable
SparkQA commented on issue #23519: [SPARK-26601][SQL] Make broadcast-exchange thread pool configurable URL: https://github.com/apache/spark/pull/23519#issuecomment-454070170 **[Test build #101185 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/101185/testReport)** for PR 23519 at commit [`ac2ec82`](https://github.com/apache/spark/commit/ac2ec82c7dc152acd6b200dc14ba8afea6abf5ad). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] srowen commented on a change in pull request #21632: [SPARK-19591][ML][MLlib] Add sample weights to decision trees
srowen commented on a change in pull request #21632: [SPARK-19591][ML][MLlib] Add sample weights to decision trees URL: https://github.com/apache/spark/pull/21632#discussion_r247561861 ## File path: mllib/src/main/scala/org/apache/spark/ml/tree/impl/DecisionTreeMetadata.scala ## @@ -115,7 +122,11 @@ private[spark] object DecisionTreeMetadata extends Logging { } require(numFeatures > 0, s"DecisionTree requires number of features > 0, " + s"but was given an empty features vector") -val numExamples = input.count() +val (numExamples, weightSum) = input.aggregate((0L, 0.0))( + seqOp = (cw, instance) => (cw._1 + 1L, cw._2 + instance.weight), Review comment: This can be taken further: `((count, weight), instance) =>` This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] srowen commented on a change in pull request #21632: [SPARK-19591][ML][MLlib] Add sample weights to decision trees
srowen commented on a change in pull request #21632: [SPARK-19591][ML][MLlib] Add sample weights to decision trees URL: https://github.com/apache/spark/pull/21632#discussion_r247560135 ## File path: project/MimaExcludes.scala ## @@ -241,7 +241,10 @@ object MimaExcludes { // [SPARK-26216][SQL] Do not use case class as public API (UserDefinedFunction) ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.expressions.UserDefinedFunction$"), - ProblemFilters.exclude[IncompatibleTemplateDefProblem]("org.apache.spark.sql.expressions.UserDefinedFunction") + ProblemFilters.exclude[IncompatibleTemplateDefProblem]("org.apache.spark.sql.expressions.UserDefinedFunction"), + +// [SPARK-19591][ML] Add sample weights to decision trees + ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.mllib.tree.configuration.Strategy.this") Review comment: Hm, yeah, the problem is that both constructors define a bunch of optional arguments, and this call doesn't specify those optional argument. So both constructors apply. The 'fix' would be to make the backwards-compatible constructor not take any optional arguments. It was already source-compatible without this additional constructor, but would make it binary-compatible too. I'm on the fence about it. It's not something apps would normally specify directly, but it's also pretty trivial to write in this additional constructor. If it works, I'd keep the backwards-compatible constructor. Little benefit, very little cost. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins commented on issue #23383: [SPARK-23817][SQL] Create file source V2 framework and migrate ORC read path
AmplabJenkins commented on issue #23383: [SPARK-23817][SQL] Create file source V2 framework and migrate ORC read path URL: https://github.com/apache/spark/pull/23383#issuecomment-454070005 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org