[GitHub] spark issue #19436: [SPARK-22206][SQL][SparkR] gapply in R can't work on emp...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19436 **[Test build #82465 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82465/testReport)** for PR 19436 at commit [`6710141`](https://github.com/apache/spark/commit/6710141767a2df92898af319bc4ef87f9110f911). * This patch **fails SparkR unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19436: [SPARK-22206][SQL][SparkR] gapply in R can't work on emp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19436 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82465/ Test FAILed.
[GitHub] spark issue #19436: [SPARK-22206][SQL][SparkR] gapply in R can't work on emp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19436 Merged build finished. Test FAILed.
[GitHub] spark pull request #19418: [SPARK-19984][SQL] Fix for ERROR codegen.CodeGene...
Github user rekhajoshm closed the pull request at: https://github.com/apache/spark/pull/19418
[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/13440 @srowen @thunterdb any more thoughts on this? how about @sethah @yanboliang @jkbradley?
[GitHub] spark issue #18732: [SPARK-20396][SQL][PySpark] groupby().apply() with panda...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18732 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82466/ Test FAILed.
[GitHub] spark issue #19418: [SPARK-19984][SQL] Fix for ERROR codegen.CodeGenerator: ...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19418 I don't think we want to add such a defensive condition just to avoid the log trace. Without it, we won't know the problem happened. We should identify the issue and fix it if it is really there, instead of hiding the error. Btw, you still don't get the correct result even by avoiding the error, right?
[GitHub] spark issue #18732: [SPARK-20396][SQL][PySpark] groupby().apply() with panda...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18732 Merged build finished. Test FAILed.
[GitHub] spark issue #18732: [SPARK-20396][SQL][PySpark] groupby().apply() with panda...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18732 **[Test build #82466 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82466/testReport)** for PR 18732 at commit [`fa88c88`](https://github.com/apache/spark/commit/fa88c881a2fa3a7bb49af882ee9c482314184ff1). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #19436: [SPARK-22206][SQL][SparkR] gapply in R can't work...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19436#discussion_r142850005 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/objects.scala --- @@ -394,7 +394,11 @@ case class FlatMapGroupsInRExec( override def producedAttributes: AttributeSet = AttributeSet(outputObjAttr) override def requiredChildDistribution: Seq[Distribution] = -ClusteredDistribution(groupingAttributes) :: Nil +if (groupingAttributes.isEmpty) { + AllTuples :: Nil --- End diff -- You mean empty grouping attributes == all tuples? Yeah, I think no grouping attributes means all tuples are in the one group.
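The semantics being agreed on above — zero grouping attributes means every row lands in a single group, which is why the fix maps an empty `groupingAttributes` to an `AllTuples` distribution — can be sketched in plain Python (not Spark; the helper name `group_rows` is hypothetical):

```python
# Minimal sketch: grouping by zero keys yields exactly one group
# containing all rows, keyed by the empty tuple.
from collections import defaultdict


def group_rows(rows, keys):
    """Group dict-rows by the given key names; no keys -> one group."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row)
    return dict(groups)


rows = [{"id": 1, "v": 1.0}, {"id": 2, "v": 3.0}, {"id": 2, "v": 5.0}]

by_id = group_rows(rows, ["id"])   # two groups: keys (1,) and (2,)
all_in_one = group_rows(rows, [])  # a single group keyed by ()
```

This mirrors why `ClusteredDistribution(Nil)` is the wrong requirement: with no keys there is nothing to cluster on, and all tuples must end up on one partition.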
[GitHub] spark issue #19418: [SPARK-19984][SQL] Fix for ERROR codegen.CodeGenerator: ...
Github user rekhajoshm commented on the issue: https://github.com/apache/spark/pull/19418 @viirya You are correct, I am on latest master and I have not reproduced it yet. This PR was to add a defensive condition: if this only happens under certain unique data/flows, this is the one place to reliably avoid the whole stack of log output. As I understand it, the slew of log output is the pain point and unnerves the user, especially since the job nonetheless finishes successfully. Thanks
[GitHub] spark pull request #19436: [SPARK-22206][SQL][SparkR] gapply in R can't work...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/19436#discussion_r142849580 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/objects.scala --- @@ -394,7 +394,11 @@ case class FlatMapGroupsInRExec( override def producedAttributes: AttributeSet = AttributeSet(outputObjAttr) override def requiredChildDistribution: Seq[Distribution] = -ClusteredDistribution(groupingAttributes) :: Nil +if (groupingAttributes.isEmpty) { + AllTuples :: Nil --- End diff -- should empty == all?
[GitHub] spark issue #19435: [WIP][SS][MINOR] "keyWithIndexToNumValues" -> "keyWithIn...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19435 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82462/ Test PASSed.
[GitHub] spark issue #19435: [WIP][SS][MINOR] "keyWithIndexToNumValues" -> "keyWithIn...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19435 Merged build finished. Test PASSed.
[GitHub] spark issue #19435: [WIP][SS][MINOR] "keyWithIndexToNumValues" -> "keyWithIn...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19435 **[Test build #82462 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82462/testReport)** for PR 19435 at commit [`c62ba65`](https://github.com/apache/spark/commit/c62ba65fcc952de4b244695f32dc41783a2c1631). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19418: [SPARK-19984][SQL] Fix for ERROR codegen.CodeGenerator: ...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19418 From the stacktrace posted in the JIRA, the problematic code is: /* 151 */ comp = agg_bufValue.compare(smj_value3); `agg_bufValue` is a `long` but `smj_value3` is a `UTF8String`. It looks strange because we should not be comparing two variables of different types at all. It looks like you don't have reproducible code for this issue, so I don't think you have found the root cause.
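The type mismatch described above — generated code comparing a `long` with a `UTF8String` — is the same category of error as comparing an `int` with a `str` in Python. A small illustration (plain Python; the `compare` helper is a hypothetical stand-in for the generated `compare` call, not Spark code):

```python
# A three-way compare that, like well-typed generated code, refuses to
# compare operands of different types instead of producing garbage.
def compare(a, b):
    if type(a) is not type(b):
        raise TypeError(
            f"cannot compare {type(a).__name__} with {type(b).__name__}"
        )
    return (a > b) - (a < b)  # -1, 0, or 1


assert compare(3, 5) == -1

try:
    compare(3, "abc")  # mirrors agg_bufValue.compare(smj_value3)
    mismatch = None
except TypeError as exc:
    mismatch = str(exc)
```

The point of viirya's comment is that the right fix is upstream — the code generator should never emit such a comparison — rather than suppressing the resulting error log.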
[GitHub] spark issue #19436: [SQL][WIP] Fix FlatMapGroupsInR's child distribution whe...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19436 @HyukjinKwon Yeah, give me a few minutes. Thanks.
[GitHub] spark issue #18732: [SPARK-20396][SQL][PySpark] groupby().apply() with panda...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18732 Merged build finished. Test FAILed.
[GitHub] spark issue #18732: [SPARK-20396][SQL][PySpark] groupby().apply() with panda...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18732 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82463/ Test FAILed.
[GitHub] spark issue #18732: [SPARK-20396][SQL][PySpark] groupby().apply() with panda...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18732 **[Test build #82463 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82463/testReport)** for PR 18732 at commit [`657942b`](https://github.com/apache/spark/commit/657942b1b4080c30fa5c60bcd700c862fb571465). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19436: [SQL][WIP] Fix FlatMapGroupsInR's child distribution whe...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19436 LGTM. Mind opening a JIRA?
[GitHub] spark pull request #19418: [SPARK-19984][SQL] Fix for ERROR codegen.CodeGene...
Github user rekhajoshm commented on a diff in the pull request: https://github.com/apache/spark/pull/19418#discussion_r142848020 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala --- @@ -697,7 +697,12 @@ class CodegenContext { } """ s"${addNewFunction(compareFunc, funcCode)}($c1, $c2)" -case other if other.isInstanceOf[AtomicType] => s"$c1.compare($c2)" +case other if other.isInstanceOf[AtomicType] => + if (s"$c1".getClass.getMethods.map(_.getName).filter(_ matches "compare").size == 1) { --- End diff -- @viirya I was putting a note to that effect :-) This would not work, but I am trying out something, especially checking the consistency between local and Jenkins test behavior. I needed "other" to work. Anyhow, I am mainly looking to check, via different attempts, whether an atomic-type conversion between long and String can happen.
[GitHub] spark pull request #19418: [SPARK-19984][SQL] Fix for ERROR codegen.CodeGene...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19418#discussion_r142847702 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala --- @@ -697,7 +697,12 @@ class CodegenContext { } """ s"${addNewFunction(compareFunc, funcCode)}($c1, $c2)" -case other if other.isInstanceOf[AtomicType] => s"$c1.compare($c2)" +case other if other.isInstanceOf[AtomicType] => + if (s"$c1".getClass.getMethods.map(_.getName).filter(_ matches "compare").size == 1) { --- End diff -- You are just getting the class of a `String` object here.
[GitHub] spark issue #19418: [SPARK-19984][SQL] Fix for ERROR codegen.CodeGenerator: ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19418 **[Test build #82471 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82471/testReport)** for PR 19418 at commit [`3f8ce9a`](https://github.com/apache/spark/commit/3f8ce9a99acc1d5ae22004f4f18d94df8966cbcd).
[GitHub] spark issue #19424: [SPARK-22197][SQL] push down operators to data source be...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19424 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82464/ Test FAILed.
[GitHub] spark issue #19424: [SPARK-22197][SQL] push down operators to data source be...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19424 Merged build finished. Test FAILed.
[GitHub] spark issue #19424: [SPARK-22197][SQL] push down operators to data source be...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19424 **[Test build #82464 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82464/testReport)** for PR 19424 at commit [`df1d1c5`](https://github.com/apache/spark/commit/df1d1c5150ba1787accd95e621d62d6adf215e60). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19436: [SQL][WIP] Fix FlatMapGroupsInR's child distribution whe...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19436 **[Test build #82470 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82470/testReport)** for PR 19436 at commit [`6710141`](https://github.com/apache/spark/commit/6710141767a2df92898af319bc4ef87f9110f911).
[GitHub] spark issue #19436: [SQL][WIP] Fix FlatMapGroupsInR's child distribution whe...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19436 Ok. The added test works to verify this is an issue. See the test result of https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82468/testReport.
[GitHub] spark issue #18732: [SPARK-20396][SQL][PySpark] groupby().apply() with panda...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18732 **[Test build #82469 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82469/testReport)** for PR 18732 at commit [`e4efb32`](https://github.com/apache/spark/commit/e4efb3281008a2b450f9013aeb8f1ac9cf4ffa9e).
[GitHub] spark issue #19436: [SQL][WIP] Fix FlatMapGroupsInR's child distribution whe...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19436 Merged build finished. Test FAILed.
[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142845456 --- Diff: python/pyspark/sql/group.py --- @@ -192,7 +193,67 @@ def pivot(self, pivot_col, values=None): jgd = self._jgd.pivot(pivot_col) else: jgd = self._jgd.pivot(pivot_col, values) -return GroupedData(jgd, self.sql_ctx) +return GroupedData(jgd, self._df) + +def apply(self, udf): +""" +Maps each group of the current :class:`DataFrame` using a pandas udf and returns the result +as a :class:`DataFrame`. + +The user-defined function should take a `pandas.DataFrame` and return another +`pandas.DataFrame`. Each group is passed as a `pandas.DataFrame` to the user-function and +the returned`pandas.DataFrame` are combined as a :class:`DataFrame`. The returned +`pandas.DataFrame` can be arbitrary length and its schema should match the returnType of +the pandas udf. + +:param udf: A wrapped function returned by `pandas_udf` + +>>> df = spark.createDataFrame( +... [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], +... ("id", "v")) +>>> @pandas_udf(returnType=df.schema) +... def normalize(pdf): +... v = pdf.v +... return pdf.assign(v=(v - v.mean()) / v.std()) +>>> df.groupby('id').apply(normalize).show() # doctest: +SKIP --- End diff -- Fixed. Thanks!
[GitHub] spark issue #19436: [SQL][WIP] Fix FlatMapGroupsInR's child distribution whe...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19436 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82468/ Test FAILed.
[GitHub] spark issue #19436: [SQL][WIP] Fix FlatMapGroupsInR's child distribution whe...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19436 **[Test build #82468 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82468/testReport)** for PR 19436 at commit [`e0af5d6`](https://github.com/apache/spark/commit/e0af5d69bc3a9c615941e96d1709f1ed58bae727). * This patch **fails SparkR unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #19418: [SPARK-19984][SQL] Fix for ERROR codegen.CodeGene...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/19418#discussion_r142844714 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala --- @@ -697,7 +697,14 @@ class CodegenContext { } """ s"${addNewFunction(compareFunc, funcCode)}($c1, $c2)" -case other if other.isInstanceOf[AtomicType] => s"$c1.compare($c2)" +case other if other.isInstanceOf[AtomicType] => + s""" +if ($c1.getClass.getMethods.map(_.getName).filter(_ matches "compare").size == 1) { + $c1.compare($c2) --- End diff -- +1
[GitHub] spark issue #19418: [SPARK-19984][SQL] Fix for ERROR codegen.CodeGenerator: ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19418 Merged build finished. Test FAILed.
[GitHub] spark issue #19418: [SPARK-19984][SQL] Fix for ERROR codegen.CodeGenerator: ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19418 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82467/ Test FAILed.
[GitHub] spark issue #19418: [SPARK-19984][SQL] Fix for ERROR codegen.CodeGenerator: ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19418 **[Test build #82467 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82467/testReport)** for PR 19418 at commit [`773b429`](https://github.com/apache/spark/commit/773b429991510864878928b2518ff14825e1f865). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #19432: [SPARK-22203][SQL]Add job description for file li...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19432
[GitHub] spark issue #19432: [SPARK-22203][SQL]Add job description for file listing S...
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/19432 Thanks! Merging to master.
[GitHub] spark issue #19436: [SQL][WIP] Fix FlatMapGroupsInR's child distribution whe...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19436 **[Test build #82468 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82468/testReport)** for PR 19436 at commit [`e0af5d6`](https://github.com/apache/spark/commit/e0af5d69bc3a9c615941e96d1709f1ed58bae727).
[GitHub] spark pull request #19418: [SPARK-19984][SQL] Fix for ERROR codegen.CodeGene...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19418#discussion_r142841674 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala --- @@ -697,7 +697,14 @@ class CodegenContext { } """ s"${addNewFunction(compareFunc, funcCode)}($c1, $c2)" -case other if other.isInstanceOf[AtomicType] => s"$c1.compare($c2)" +case other if other.isInstanceOf[AtomicType] => + s""" +if ($c1.getClass.getMethods.map(_.getName).filter(_ matches "compare").size == 1) { + $c1.compare($c2) --- End diff -- hmm, `genComp` seems to return a Java statement evaluated to a value; are you sure returning a block like this works?
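viirya's objection above is an expression-vs-statement distinction: the string `genComp` returns is spliced into generated source where an *expression* is expected, so a multi-statement `if` block cannot go there. The same rule holds in Python's compiler, which makes for a checkable analogy (plain Python; nothing here is Spark code):

```python
# In "eval" mode Python, like a Java expression context, accepts only
# expressions. A three-way compare as an expression compiles and runs...
expr_ok = eval(
    compile("a.__gt__(b) - a.__lt__(b)", "<gen>", "eval"),
    {},
    {"a": 5, "b": 3},
)

# ...but splicing an `if` statement into the same slot is a syntax error,
# which is the failure mode viirya is pointing at.
try:
    compile("if cond: a.compare(b)", "<gen>", "eval")
    block_ok = True
except SyntaxError:
    block_ok = False
```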
[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142841543 --- Diff: python/pyspark/sql/group.py --- @@ -192,7 +193,67 @@ def pivot(self, pivot_col, values=None): jgd = self._jgd.pivot(pivot_col) else: jgd = self._jgd.pivot(pivot_col, values) -return GroupedData(jgd, self.sql_ctx) +return GroupedData(jgd, self._df) + +def apply(self, udf): +""" +Maps each group of the current :class:`DataFrame` using a pandas udf and returns the result +as a :class:`DataFrame`. + +The user-defined function should take a `pandas.DataFrame` and return another +`pandas.DataFrame`. Each group is passed as a `pandas.DataFrame` to the user-function and +the returned`pandas.DataFrame` are combined as a :class:`DataFrame`. The returned +`pandas.DataFrame` can be arbitrary length and its schema should match the returnType of +the pandas udf. + +:param udf: A wrapped function returned by `pandas_udf` + +>>> df = spark.createDataFrame( +... [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], +... ("id", "v")) +>>> @pandas_udf(returnType=df.schema) +... def normalize(pdf): +... v = pdf.v +... return pdf.assign(v=(v - v.mean()) / v.std()) +>>> df.groupby('id').apply(normalize).show() # doctest: +SKIP --- End diff -- nit: the spaces around `#` are wrong.
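For readers following the diff above: the `normalize` pandas_udf computes a per-group standardization, `(v - mean) / std`. A stdlib-only sketch of the same computation (no pandas or Spark; `normalize` and the `groups` data here are illustrative, using the sample standard deviation, which matches pandas' default `ddof=1`):

```python
from statistics import mean, stdev


def normalize(values):
    """Standardize a group of values: subtract the mean, divide by the
    sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


# The same groups as the docstring example: id 1 -> [1.0, 2.0],
# id 2 -> [3.0, 5.0, 10.0].
groups = {1: [1.0, 2.0], 2: [3.0, 5.0, 10.0]}
normalized = {k: normalize(v) for k, v in groups.items()}
```

Each normalized group has mean zero, which is a quick way to sanity-check the UDF's output shape group by group.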
[GitHub] spark issue #19418: [SPARK-19984][SQL] Fix for ERROR codegen.CodeGenerator: ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19418 **[Test build #82467 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82467/testReport)** for PR 19418 at commit [`773b429`](https://github.com/apache/spark/commit/773b429991510864878928b2518ff14825e1f865).
[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142840490 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/object.scala --- @@ -519,3 +519,18 @@ case class CoGroup( outputObjAttr: Attribute, left: LogicalPlan, right: LogicalPlan) extends BinaryNode with ObjectProducer + +case class FlatMapGroupsInPandas( --- End diff -- Doc added.
[GitHub] spark issue #18732: [SPARK-20396][SQL][PySpark] groupby().apply() with panda...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18732 **[Test build #82466 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82466/testReport)** for PR 18732 at commit [`fa88c88`](https://github.com/apache/spark/commit/fa88c881a2fa3a7bb49af882ee9c482314184ff1).
[GitHub] spark issue #19436: [SQL][WIP] Fix FlatMapGroupsInR's child distribution whe...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19436 **[Test build #82465 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82465/testReport)** for PR 19436 at commit [`6710141`](https://github.com/apache/spark/commit/6710141767a2df92898af319bc4ef87f9110f911).
[GitHub] spark pull request #19436: [SQL][WIP] Fix FlatMapGroupsInR's child distribut...
GitHub user viirya opened a pull request: https://github.com/apache/spark/pull/19436 [SQL][WIP] Fix FlatMapGroupsInR's child distribution when grouping attributes are empty ## What changes were proposed in this pull request? Looks like `FlatMapGroupsInRExec.requiredChildDistribution` didn't consider empty grouping attributes. WIP: Not quite sure if this will cause problem in R side. Add test to see Jenkins test result. ## How was this patch tested? Added test. You can merge this pull request into a Git repository by running: $ git pull https://github.com/viirya/spark-1 fix-flatmapinr-distribution Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19436.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19436 commit 6710141767a2df92898af319bc4ef87f9110f911 Author: Liang-Chi Hsieh Date: 2017-10-05T03:08:00Z Fix FlatMapGroupsInR's child distribution when grouping attributes are empty.
[GitHub] spark issue #19424: [SPARK-22197][SQL] push down operators to data source be...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19424 **[Test build #82464 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82464/testReport)** for PR 19424 at commit [`df1d1c5`](https://github.com/apache/spark/commit/df1d1c5150ba1787accd95e621d62d6adf215e60).
[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142839010 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/ArrowEvalPythonExec.scala --- @@ -26,6 +26,25 @@ import org.apache.spark.sql.catalyst.expressions._ import org.apache.spark.sql.execution.SparkPlan import org.apache.spark.sql.types.StructType +private class BatchIterator[T](iter: Iterator[T], batchSize: Int) + extends Iterator[Iterator[T]] { + + override def hasNext: Boolean = iter.hasNext + + override def next(): Iterator[T] = { --- End diff -- Sorry I pushed a bit late. The comment is added now.
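The `BatchIterator` in the diff above wraps an iterator and re-exposes it in fixed-size batches. A minimal Python counterpart of that idea (an assumed sketch, not a translation of the Scala; it materializes each batch as a list rather than yielding a lazy sub-iterator as the Scala version does):

```python
from itertools import islice


def batch_iterator(it, batch_size):
    """Yield consecutive batches of at most batch_size items from it."""
    it = iter(it)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


batches = list(batch_iterator(range(7), 3))  # [[0, 1, 2], [3, 4, 5], [6]]
```

The review thread around this code (see the later comments) is about documenting a subtlety of such wrappers: the consumer must fully drain each batch before asking for the next one, since all batches share the one underlying iterator.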
[GitHub] spark pull request #19424: [SPARK-22197][SQL] push down operators to data so...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19424#discussion_r142838899 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/PushDownOperatorsToDataSource.scala --- @@ -0,0 +1,99 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources.v2 + +import org.apache.spark.sql.catalyst.expressions.{And, AttributeMap, AttributeSet, Expression, ExpressionSet} +import org.apache.spark.sql.catalyst.planning.PhysicalOperation +import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan, Project} +import org.apache.spark.sql.catalyst.rules.Rule +import org.apache.spark.sql.execution.datasources.DataSourceStrategy +import org.apache.spark.sql.sources +import org.apache.spark.sql.sources.v2.reader._ + +/** + * Pushes down various operators to the underlying data source for better performance. We classify + * operators into different layers, operators in the same layer are orderless, i.e. the query result + * won't change if we switch the operators within a layer(e.g. we can switch the order of predicates + * and required columns). 
The operators in layer N can only be pushed down if operators in layer N-1 + * that above the data source relation are all pushed down. As an example, you can't push down limit + * if a filter below limit is not pushed down. + * + * Current operator push down layers: + * layer 1: predicates, required columns. + */ +object PushDownOperatorsToDataSource extends Rule[LogicalPlan] { + override def apply(plan: LogicalPlan): LogicalPlan = { --- End diff -- yea it's an optimizer rule run before planner --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
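The layered push-down described in the comment above can be sketched in miniature (a hypothetical, stdlib-only Python toy, not the Catalyst rule itself): a recursive pass folds layer-1 operators, predicates and required columns, into the scan node, and because they are orderless within the layer the fold works the same whichever one sits on top.

```python
# Toy illustration of layered operator push-down. All names are hypothetical;
# this is not Catalyst and ignores details such as keeping filter columns readable.
from dataclasses import dataclass, field


@dataclass
class Scan:
    columns: list
    filters: list = field(default_factory=list)


@dataclass
class Filter:
    condition: str
    child: object


@dataclass
class Project:
    columns: list
    child: object


def push_down(plan):
    """Fold layer-1 operators (predicates, required columns) into the scan.

    Predicates and column pruning are "orderless" within the layer: folding
    them into the scan in either order yields the same scan node.
    """
    if isinstance(plan, Filter):
        child = push_down(plan.child)
        if isinstance(child, Scan):
            child.filters.append(plan.condition)
            return child
        return Filter(plan.condition, child)
    if isinstance(plan, Project):
        child = push_down(plan.child)
        if isinstance(child, Scan):
            child.columns = plan.columns
            return child
        return Project(plan.columns, child)
    return plan


plan = Project(["a"], Filter("b > 1", Scan(columns=["a", "b"])))
optimized = push_down(plan)
print(optimized)  # Scan(columns=['a'], filters=['b > 1'])
```

A layer-2 operator such as limit would only be folded in once both layer-1 folds above it had succeeded, which is the invariant the comment describes.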
[GitHub] spark issue #18732: [SPARK-20396][SQL][PySpark] groupby().apply() with panda...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18732 **[Test build #82463 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82463/testReport)** for PR 18732 at commit [`657942b`](https://github.com/apache/spark/commit/657942b1b4080c30fa5c60bcd700c862fb571465).
[GitHub] spark issue #18732: [SPARK-20396][SQL][PySpark] groupby().apply() with panda...
Github user icexelloss commented on the issue: https://github.com/apache/spark/pull/18732 I pushed a new commit addressing the comments. Let me scan through the comments again. I think some comments around worker.py have not been addressed yet.
[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142836611 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/ArrowEvalPythonExec.scala --- @@ -26,6 +26,25 @@ import org.apache.spark.sql.catalyst.expressions._ import org.apache.spark.sql.execution.SparkPlan import org.apache.spark.sql.types.StructType +private class BatchIterator[T](iter: Iterator[T], batchSize: Int) + extends Iterator[Iterator[T]] { + + override def hasNext: Boolean = iter.hasNext + + override def next(): Iterator[T] = { --- End diff -- I didn't see the comment added?
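For readers skimming the diff: the `BatchIterator` being reviewed wraps one iterator into an iterator of batches. A rough Python analogue (hypothetical, not the actual pyspark worker code) materializes each batch as a list, which sidesteps the caveat the reviewer is asking to have documented, namely that the lazy Scala inner iterators must be exhausted before `next()` is called again:

```python
from itertools import islice


def batch_iterator(it, batch_size):
    """Yield successive batches (as lists) of at most batch_size items.

    Mirrors the idea of the Scala BatchIterator in the diff above, except that
    each batch is materialized eagerly, so batches are independent of each
    other once yielded.
    """
    it = iter(it)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


print(list(batch_iterator(range(7), 3)))  # [[0, 1, 2], [3, 4, 5], [6]]
```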
[GitHub] spark pull request #19416: [SPARK-22187][SS] Update unsaferow format for sav...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19416
[GitHub] spark pull request #19435: [WIP][SS][MINOR] "keyWithIndexToNumValues" -> "ke...
Github user lw-lin commented on a diff in the pull request: https://github.com/apache/spark/pull/19435#discussion_r142836474 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/SymmetricHashJoinStateManager.scala --- @@ -291,7 +291,7 @@ class SymmetricHashJoinStateManager( } /** A wrapper around a [[StateStore]] that stores [(key, index) -> value]. */ - private class KeyWithIndexToValueStore extends StateStoreHandler(KeyWithIndexToValuesType) { + private class KeyWithIndexToValueStore extends StateStoreHandler(KeyWithIndexToValueType) { --- End diff -- dropped the 's' from '...Values...'
[GitHub] spark issue #19435: [WIP][SS][MINOR] "keyWithIndexToNumValues" -> "keyWithIn...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19435 **[Test build #82462 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82462/testReport)** for PR 19435 at commit [`c62ba65`](https://github.com/apache/spark/commit/c62ba65fcc952de4b244695f32dc41783a2c1631).
[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142836297 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/object.scala --- @@ -519,3 +519,18 @@ case class CoGroup( outputObjAttr: Attribute, left: LogicalPlan, right: LogicalPlan) extends BinaryNode with ObjectProducer + +case class FlatMapGroupsInPandas( +groupingAttributes: Seq[Attribute], +functionExpr: Expression, +output: Seq[Attribute], +child: LogicalPlan) extends UnaryNode { + /** + * This is needed because output attributes is considered `reference` when + * passed through the constructor. + * + * Without this, catalyst will complain that output attributes are missing + * from the input. + */ + override val producedAttributes = AttributeSet(output) --- End diff -- Ok. I see. Should be fine.
[GitHub] spark pull request #19435: [WIP][SS][MINOR] "keyWithIndexToNumValues" -> "ke...
GitHub user lw-lin opened a pull request: https://github.com/apache/spark/pull/19435 [WIP][SS][MINOR] "keyWithIndexToNumValues" -> "keyWithIndexToValue" ## What changes were proposed in this pull request? This PR changes `keyWithIndexToNumValues` to `keyWithIndexToValue`. Folders will be named after `keyWithIndexToNumValues`, so if we ever want to fix this, let's fix it now. ## How was this patch tested? Existing unit test cases. You can merge this pull request into a Git repository by running: $ git pull https://github.com/lw-lin/spark keyWithIndex Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19435.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19435 commit c62ba65fcc952de4b244695f32dc41783a2c1631 Author: Liwei Lin Date: 2017-10-05T02:21:02Z Fix
[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142836245 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/object.scala --- @@ -519,3 +519,18 @@ case class CoGroup( outputObjAttr: Attribute, left: LogicalPlan, right: LogicalPlan) extends BinaryNode with ObjectProducer + +case class FlatMapGroupsInPandas( +groupingAttributes: Seq[Attribute], +functionExpr: Expression, +output: Seq[Attribute], +child: LogicalPlan) extends UnaryNode { + /** + * This is needed because output attributes is considered `reference` when --- End diff -- nit: `reference` -> `references`.
[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142835260 --- Diff: python/pyspark/sql/group.py --- @@ -192,7 +193,66 @@ def pivot(self, pivot_col, values=None): jgd = self._jgd.pivot(pivot_col) else: jgd = self._jgd.pivot(pivot_col, values) -return GroupedData(jgd, self.sql_ctx) +return GroupedData(jgd, self) --- End diff -- `return GroupedData(jgd, self)` -> `return GroupedData(jgd, self._df)`?
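The suggestion above follows from the constructor change elsewhere in this PR: `GroupedData.__init__` now takes the originating DataFrame instead of a SQL context, so a chained `pivot` must pass `self._df`, not `self`. A stripped-down sketch of the pattern (hypothetical stand-in classes, not the real pyspark API):

```python
class DataFrame:
    """Hypothetical stand-in for pyspark.sql.DataFrame."""
    def __init__(self, name):
        self.name = name


class GroupedData:
    # The constructor now takes the originating DataFrame, not a SQL context.
    def __init__(self, jgd, df):
        self._jgd = jgd
        self._df = df

    def pivot(self, pivot_col):
        # Passing `self` here (a GroupedData) would be the bug the reviewer
        # flagged; the new GroupedData must reference the same DataFrame.
        return GroupedData(("pivot", pivot_col), self._df)


df = DataFrame("sales")
gd = GroupedData("grouped", df).pivot("year")
print(gd._df is df)  # True
```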
[GitHub] spark pull request #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should n...
Github user hhbyyh commented on a diff in the pull request: https://github.com/apache/spark/pull/18924#discussion_r142833499 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala --- @@ -462,36 +462,55 @@ final class OnlineLDAOptimizer extends LDAOptimizer { val expElogbetaBc = batch.sparkContext.broadcast(expElogbeta) val alpha = this.alpha.asBreeze val gammaShape = this.gammaShape +val optimizeDocConcentration = this.optimizeDocConcentration +// We calculate logphat in the same pass as other statistics, but we only need +// it if we are optimizing docConcentration +val logphatPartOptionBase = () => if (optimizeDocConcentration) Some(BDV.zeros[Double](k)) + else None -val stats: RDD[(BDM[Double], List[BDV[Double]])] = batch.mapPartitions { docs => +val stats: RDD[(BDM[Double], Option[BDV[Double]], Long)] = batch.mapPartitions { docs => val nonEmptyDocs = docs.filter(_._2.numNonzeros > 0) val stat = BDM.zeros[Double](k, vocabSize) - var gammaPart = List[BDV[Double]]() + val logphatPartOption = logphatPartOptionBase() + var nonEmptyDocCount : Long = 0L nonEmptyDocs.foreach { case (_, termCounts: Vector) => +nonEmptyDocCount += 1 val (gammad, sstats, ids) = OnlineLDAOptimizer.variationalTopicInference( termCounts, expElogbetaBc.value, alpha, gammaShape, k) -stat(::, ids) := stat(::, ids).toDenseMatrix + sstats -gammaPart = gammad :: gammaPart +stat(::, ids) := stat(::, ids) + sstats +logphatPartOption.foreach(_ += LDAUtils.dirichletExpectation(gammad)) } - Iterator((stat, gammaPart)) -}.persist(StorageLevel.MEMORY_AND_DISK) -val statsSum: BDM[Double] = stats.map(_._1).treeAggregate(BDM.zeros[Double](k, vocabSize))( - _ += _, _ += _) -val gammat: BDM[Double] = breeze.linalg.DenseMatrix.vertcat( - stats.map(_._2).flatMap(list => list).collect().map(_.toDenseMatrix): _*) -stats.unpersist() -expElogbetaBc.destroy(false) -val batchResult = statsSum *:* expElogbeta.t + Iterator((stat, logphatPartOption, nonEmptyDocCount)) +} + +val 
elementWiseSum = (u : (BDM[Double], Option[BDV[Double]], Long), + v : (BDM[Double], Option[BDV[Double]], Long)) => { + u._1 += v._1 + u._2.foreach(_ += v._2.get) + (u._1, u._2, u._3 + v._3) +} + +val (statsSum: BDM[Double], logphatOption: Option[BDV[Double]], nonEmptyDocsN : Long) = stats + .treeAggregate((BDM.zeros[Double](k, vocabSize), logphatPartOptionBase(), 0L))( +elementWiseSum, elementWiseSum + ) +val batchResult = statsSum *:* expElogbeta.t // Note that this is an optimization to avoid batch.count -updateLambda(batchResult, (miniBatchFraction * corpusSize).ceil.toInt) -if (optimizeDocConcentration) updateAlpha(gammat) +val batchSize = (miniBatchFraction * corpusSize).ceil.toInt +updateLambda(batchResult, batchSize) + +logphatOption.foreach(_ /= batchSize.toDouble) --- End diff -- agree.
[GitHub] spark pull request #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should n...
Github user hhbyyh commented on a diff in the pull request: https://github.com/apache/spark/pull/18924#discussion_r142833374 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala --- @@ -462,36 +462,55 @@ final class OnlineLDAOptimizer extends LDAOptimizer { val expElogbetaBc = batch.sparkContext.broadcast(expElogbeta) val alpha = this.alpha.asBreeze val gammaShape = this.gammaShape +val optimizeDocConcentration = this.optimizeDocConcentration +// We calculate logphat in the same pass as other statistics, but we only need +// it if we are optimizing docConcentration +val logphatPartOptionBase = () => if (optimizeDocConcentration) Some(BDV.zeros[Double](k)) + else None -val stats: RDD[(BDM[Double], List[BDV[Double]])] = batch.mapPartitions { docs => +val stats: RDD[(BDM[Double], Option[BDV[Double]], Long)] = batch.mapPartitions { docs => val nonEmptyDocs = docs.filter(_._2.numNonzeros > 0) val stat = BDM.zeros[Double](k, vocabSize) - var gammaPart = List[BDV[Double]]() + val logphatPartOption = logphatPartOptionBase() + var nonEmptyDocCount : Long = 0L nonEmptyDocs.foreach { case (_, termCounts: Vector) => +nonEmptyDocCount += 1 val (gammad, sstats, ids) = OnlineLDAOptimizer.variationalTopicInference( termCounts, expElogbetaBc.value, alpha, gammaShape, k) -stat(::, ids) := stat(::, ids).toDenseMatrix + sstats -gammaPart = gammad :: gammaPart +stat(::, ids) := stat(::, ids) + sstats +logphatPartOption.foreach(_ += LDAUtils.dirichletExpectation(gammad)) } - Iterator((stat, gammaPart)) -}.persist(StorageLevel.MEMORY_AND_DISK) -val statsSum: BDM[Double] = stats.map(_._1).treeAggregate(BDM.zeros[Double](k, vocabSize))( - _ += _, _ += _) -val gammat: BDM[Double] = breeze.linalg.DenseMatrix.vertcat( - stats.map(_._2).flatMap(list => list).collect().map(_.toDenseMatrix): _*) -stats.unpersist() -expElogbetaBc.destroy(false) -val batchResult = statsSum *:* expElogbeta.t + Iterator((stat, logphatPartOption, nonEmptyDocCount)) +} + +val 
elementWiseSum = (u : (BDM[Double], Option[BDV[Double]], Long), + v : (BDM[Double], Option[BDV[Double]], Long)) => { + u._1 += v._1 + u._2.foreach(_ += v._2.get) + (u._1, u._2, u._3 + v._3) +} + +val (statsSum: BDM[Double], logphatOption: Option[BDV[Double]], nonEmptyDocsN : Long) = stats + .treeAggregate((BDM.zeros[Double](k, vocabSize), logphatPartOptionBase(), 0L))( +elementWiseSum, elementWiseSum + ) +val batchResult = statsSum *:* expElogbeta.t // Note that this is an optimization to avoid batch.count -updateLambda(batchResult, (miniBatchFraction * corpusSize).ceil.toInt) -if (optimizeDocConcentration) updateAlpha(gammat) +val batchSize = (miniBatchFraction * corpusSize).ceil.toInt --- End diff -- Sure, since we're talking about consistency with old LDA. It's fine to keep using batchSize here.
[GitHub] spark issue #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should not coll...
Github user hhbyyh commented on the issue: https://github.com/apache/spark/pull/18924 Yes, I think a local test is enough for both correctness and performance. For consistency with the old LDA, some manual local testing would be sufficient. You may well just use the LDA example and switch the Spark jar files between Spark 2.2 and your branch. And I think the case with empty documents is worth special attention. The same goes for the performance test. You may just post the result after your local test. I'm OK as long as there's no noticeable regression.
[GitHub] spark pull request #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should n...
Github user hhbyyh commented on a diff in the pull request: https://github.com/apache/spark/pull/18924#discussion_r142831316 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala --- @@ -462,36 +462,55 @@ final class OnlineLDAOptimizer extends LDAOptimizer { val expElogbetaBc = batch.sparkContext.broadcast(expElogbeta) val alpha = this.alpha.asBreeze val gammaShape = this.gammaShape +val optimizeDocConcentration = this.optimizeDocConcentration +// We calculate logphat in the same pass as other statistics, but we only need +// it if we are optimizing docConcentration +val logphatPartOptionBase = () => if (optimizeDocConcentration) Some(BDV.zeros[Double](k)) + else None -val stats: RDD[(BDM[Double], List[BDV[Double]])] = batch.mapPartitions { docs => +val stats: RDD[(BDM[Double], Option[BDV[Double]], Long)] = batch.mapPartitions { docs => val nonEmptyDocs = docs.filter(_._2.numNonzeros > 0) val stat = BDM.zeros[Double](k, vocabSize) - var gammaPart = List[BDV[Double]]() + val logphatPartOption = logphatPartOptionBase() + var nonEmptyDocCount : Long = 0L nonEmptyDocs.foreach { case (_, termCounts: Vector) => +nonEmptyDocCount += 1 val (gammad, sstats, ids) = OnlineLDAOptimizer.variationalTopicInference( termCounts, expElogbetaBc.value, alpha, gammaShape, k) -stat(::, ids) := stat(::, ids).toDenseMatrix + sstats -gammaPart = gammad :: gammaPart +stat(::, ids) := stat(::, ids) + sstats +logphatPartOption.foreach(_ += LDAUtils.dirichletExpectation(gammad)) } - Iterator((stat, gammaPart)) -}.persist(StorageLevel.MEMORY_AND_DISK) -val statsSum: BDM[Double] = stats.map(_._1).treeAggregate(BDM.zeros[Double](k, vocabSize))( - _ += _, _ += _) -val gammat: BDM[Double] = breeze.linalg.DenseMatrix.vertcat( - stats.map(_._2).flatMap(list => list).collect().map(_.toDenseMatrix): _*) -stats.unpersist() -expElogbetaBc.destroy(false) -val batchResult = statsSum *:* expElogbeta.t + Iterator((stat, logphatPartOption, nonEmptyDocCount)) +} + +val 
elementWiseSum = (u : (BDM[Double], Option[BDV[Double]], Long), + v : (BDM[Double], Option[BDV[Double]], Long)) => { + u._1 += v._1 + u._2.foreach(_ += v._2.get) + (u._1, u._2, u._3 + v._3) +} + +val (statsSum: BDM[Double], logphatOption: Option[BDV[Double]], nonEmptyDocsN : Long) = stats + .treeAggregate((BDM.zeros[Double](k, vocabSize), logphatPartOptionBase(), 0L))( +elementWiseSum, elementWiseSum + ) +val batchResult = statsSum *:* expElogbeta.t // Note that this is an optimization to avoid batch.count -updateLambda(batchResult, (miniBatchFraction * corpusSize).ceil.toInt) -if (optimizeDocConcentration) updateAlpha(gammat) +val batchSize = (miniBatchFraction * corpusSize).ceil.toInt --- End diff -- this may wait.
[GitHub] spark issue #19432: [SPARK-22203][SQL]Add job description for file listing S...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19432 Merged build finished. Test PASSed.
[GitHub] spark issue #19432: [SPARK-22203][SQL]Add job description for file listing S...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19432 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82460/ Test PASSed.
[GitHub] spark issue #19432: [SPARK-22203][SQL]Add job description for file listing S...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19432 **[Test build #82460 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82460/testReport)** for PR 19432 at commit [`aaf41dd`](https://github.com/apache/spark/commit/aaf41dd02b8f2109a84d36768fdf1802f7817961). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should not coll...
Github user akopich commented on the issue: https://github.com/apache/spark/pull/18924 @jkbradley, thank you! - Correctness: in order to test the equivalence of the two versions of `submitMiniBatch` I have to bring both of them into scope... One solution would be to derive a class `OldOnlineLDAOptimizer` from `OnlineLDAOptimizer` and override `submitMiniBatch`, but the class is final. What's the preferred approach? - Sure. Sounds good. Should I add a test case reporting the CPU time, or should I rather define an `App`? Should I add the code to the PR or just report the results here?
[GitHub] spark pull request #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should n...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/18924#discussion_r142826379 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala --- @@ -462,36 +462,55 @@ final class OnlineLDAOptimizer extends LDAOptimizer { val expElogbetaBc = batch.sparkContext.broadcast(expElogbeta) val alpha = this.alpha.asBreeze val gammaShape = this.gammaShape +val optimizeDocConcentration = this.optimizeDocConcentration +// We calculate logphat in the same pass as other statistics, but we only need +// it if we are optimizing docConcentration +val logphatPartOptionBase = () => if (optimizeDocConcentration) Some(BDV.zeros[Double](k)) + else None -val stats: RDD[(BDM[Double], List[BDV[Double]])] = batch.mapPartitions { docs => +val stats: RDD[(BDM[Double], Option[BDV[Double]], Long)] = batch.mapPartitions { docs => val nonEmptyDocs = docs.filter(_._2.numNonzeros > 0) val stat = BDM.zeros[Double](k, vocabSize) - var gammaPart = List[BDV[Double]]() + val logphatPartOption = logphatPartOptionBase() + var nonEmptyDocCount : Long = 0L nonEmptyDocs.foreach { case (_, termCounts: Vector) => +nonEmptyDocCount += 1 val (gammad, sstats, ids) = OnlineLDAOptimizer.variationalTopicInference( termCounts, expElogbetaBc.value, alpha, gammaShape, k) -stat(::, ids) := stat(::, ids).toDenseMatrix + sstats -gammaPart = gammad :: gammaPart +stat(::, ids) := stat(::, ids) + sstats +logphatPartOption.foreach(_ += LDAUtils.dirichletExpectation(gammad)) } - Iterator((stat, gammaPart)) -}.persist(StorageLevel.MEMORY_AND_DISK) -val statsSum: BDM[Double] = stats.map(_._1).treeAggregate(BDM.zeros[Double](k, vocabSize))( - _ += _, _ += _) -val gammat: BDM[Double] = breeze.linalg.DenseMatrix.vertcat( - stats.map(_._2).flatMap(list => list).collect().map(_.toDenseMatrix): _*) -stats.unpersist() -expElogbetaBc.destroy(false) -val batchResult = statsSum *:* expElogbeta.t + Iterator((stat, logphatPartOption, nonEmptyDocCount)) +} + +val 
elementWiseSum = (u : (BDM[Double], Option[BDV[Double]], Long), + v : (BDM[Double], Option[BDV[Double]], Long)) => { + u._1 += v._1 + u._2.foreach(_ += v._2.get) + (u._1, u._2, u._3 + v._3) +} + +val (statsSum: BDM[Double], logphatOption: Option[BDV[Double]], nonEmptyDocsN : Long) = stats + .treeAggregate((BDM.zeros[Double](k, vocabSize), logphatPartOptionBase(), 0L))( +elementWiseSum, elementWiseSum + ) +val batchResult = statsSum *:* expElogbeta.t // Note that this is an optimization to avoid batch.count -updateLambda(batchResult, (miniBatchFraction * corpusSize).ceil.toInt) -if (optimizeDocConcentration) updateAlpha(gammat) +val batchSize = (miniBatchFraction * corpusSize).ceil.toInt +updateLambda(batchResult, batchSize) + +logphatOption.foreach(_ /= batchSize.toDouble) --- End diff -- That sounds right to me.
[GitHub] spark pull request #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should n...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/18924#discussion_r142826326 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala --- @@ -462,31 +462,54 @@ final class OnlineLDAOptimizer extends LDAOptimizer { val expElogbetaBc = batch.sparkContext.broadcast(expElogbeta) val alpha = this.alpha.asBreeze val gammaShape = this.gammaShape - -val stats: RDD[(BDM[Double], List[BDV[Double]])] = batch.mapPartitions { docs => +val optimizeDocConcentration = this.optimizeDocConcentration +// If and only if optimizeDocConcentration is set true, +// we calculate logphat in the same pass as other statistics. +// No calculation of loghat happens otherwise. +val logphatPartOptionBase = () => if (optimizeDocConcentration) { + Some(BDV.zeros[Double](k)) +} else { + None +} + +val stats: RDD[(BDM[Double], Option[BDV[Double]], Long)] = batch.mapPartitions { docs => val nonEmptyDocs = docs.filter(_._2.numNonzeros > 0) val stat = BDM.zeros[Double](k, vocabSize) - var gammaPart = List[BDV[Double]]() + val logphatPartOption = logphatPartOptionBase() + var nonEmptyDocCount : Long = 0L nonEmptyDocs.foreach { case (_, termCounts: Vector) => +nonEmptyDocCount += 1 val (gammad, sstats, ids) = OnlineLDAOptimizer.variationalTopicInference( termCounts, expElogbetaBc.value, alpha, gammaShape, k) -stat(::, ids) := stat(::, ids).toDenseMatrix + sstats -gammaPart = gammad :: gammaPart +stat(::, ids) := stat(::, ids) + sstats +logphatPartOption.foreach(_ += LDAUtils.dirichletExpectation(gammad)) } - Iterator((stat, gammaPart)) -}.persist(StorageLevel.MEMORY_AND_DISK) -val statsSum: BDM[Double] = stats.map(_._1).treeAggregate(BDM.zeros[Double](k, vocabSize))( - _ += _, _ += _) -val gammat: BDM[Double] = breeze.linalg.DenseMatrix.vertcat( - stats.map(_._2).flatMap(list => list).collect().map(_.toDenseMatrix): _*) -stats.unpersist() -expElogbetaBc.destroy(false) -val batchResult = statsSum *:* expElogbeta.t + 
Iterator((stat, logphatPartOption, nonEmptyDocCount)) +} + +val elementWiseSum = (u : (BDM[Double], Option[BDV[Double]], Long), + v : (BDM[Double], Option[BDV[Double]], Long)) => { + u._1 += v._1 + u._2.foreach(_ += v._2.get) + (u._1, u._2, u._3 + v._3) +} + +val (statsSum: BDM[Double], logphatOption: Option[BDV[Double]], nonEmptyDocsN: Long) = stats + .treeAggregate((BDM.zeros[Double](k, vocabSize), logphatPartOptionBase(), 0L))( +elementWiseSum, elementWiseSum + ) +val batchResult = statsSum *:* expElogbeta.t // Note that this is an optimization to avoid batch.count -updateLambda(batchResult, (miniBatchFraction * corpusSize).ceil.toInt) -if (optimizeDocConcentration) updateAlpha(gammat) +val batchSize = (miniBatchFraction * corpusSize).ceil.toInt +updateLambda(batchResult, batchSize) + +logphatOption.foreach(_ /= nonEmptyDocsN.toDouble) --- End diff -- Good point about dividing by 0, @hhbyyh. We should probably just check nonEmptyDocsN to see if it's 0, and if it is, skip all of these updates. That's related to but actually separate from the follow-up SPARK-22111.
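The code under review aggregates per-partition triples of (sufficient statistics, optional logphat part, non-empty-document count) with one element-wise sum, and the zero-division concern applies when that count comes back 0. A stdlib-only Python sketch of the same shape (hypothetical, standing in for the Breeze/`treeAggregate` version) including the guard being suggested:

```python
from functools import reduce


def element_wise_sum(u, v):
    """Combine two (stats, optional logphat, count) triples element-wise."""
    stats = [a + b for a, b in zip(u[0], v[0])]
    if u[1] is None:
        logphat = None
    else:
        logphat = [a + b for a, b in zip(u[1], v[1])]
    return (stats, logphat, u[2] + v[2])


def aggregate(partitions, k, optimize_doc_concentration):
    # logphat is only accumulated when docConcentration is being optimized,
    # mirroring logphatPartOptionBase in the diff.
    zero = ([0.0] * k, [0.0] * k if optimize_doc_concentration else None, 0)
    stats, logphat, n_docs = reduce(element_wise_sum, partitions, zero)
    # Guard against dividing by zero when the batch had no non-empty docs
    # (the point raised in the review: skip the update entirely in that case).
    if logphat is not None and n_docs > 0:
        logphat = [x / n_docs for x in logphat]
    return stats, logphat, n_docs


parts = [([1.0, 2.0], [0.5, 0.5], 2), ([3.0, 4.0], [1.5, 1.5], 2)]
print(aggregate(parts, k=2, optimize_doc_concentration=True))
# ([4.0, 6.0], [0.5, 0.5], 4)
```

Unlike the Scala closure, which mutates `u` in place for efficiency inside `treeAggregate`, this sketch is purely functional; the shape of the computation is the same.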
[GitHub] spark issue #19424: [SPARK-22197][SQL] push down operators to data source be...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19424 Merged build finished. Test FAILed.
[GitHub] spark issue #19424: [SPARK-22197][SQL] push down operators to data source be...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19424 **[Test build #82461 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82461/testReport)** for PR 19424 at commit [`a9f2c57`](https://github.com/apache/spark/commit/a9f2c57d1acdb10c1860aa75b9f102b7b9afbca8). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19424: [SPARK-22197][SQL] push down operators to data source be...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19424 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82461/ Test FAILed.
[GitHub] spark issue #19434: [SPARK-21785][SQL]Support create table from a parquet fi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19434 Can one of the admins verify this patch?
[GitHub] spark pull request #19434: [SPARK-21785][SQL]Support create table from a par...
GitHub user CrazyJacky opened a pull request: https://github.com/apache/spark/pull/19434 [SPARK-21785][SQL]Support create table from a parquet file schema ## Support create table from a parquet file schema As described in jira: ```sql CREATE EXTERNAL TABLE IF NOT EXISTS test LIKE 'PARQUET' '/user/test/abc/a.snappy.parquet' STORED AS PARQUET LOCATION '/user/test/def/'; ``` This is a very ugly fix, and I would like someone to help review and refine it. It only supports creating Hive tables. ## How was this patch tested? Tested by the test case below and by building the runnable distribution ```scala test("create table like parquet") { val f = getClass.getClassLoader. getResource("test-data/dec-in-fixed-len.parquet").getPath val v1 = """ |create table if not exists db1.table1 like 'parquet' """.stripMargin.concat("'" + f + "'").concat( """ |stored as sequencefile |location '/tmp/table1' """.stripMargin ) val (desc, allowExisting) = extractTableDesc(v1) assert(allowExisting) assert(desc.identifier.database == Some("db1")) assert(desc.identifier.table == "table1") assert(desc.tableType == CatalogTableType.EXTERNAL) assert(desc.schema == new StructType() .add("fixed_len_dec", "decimal(10,2)")) assert(desc.bucketSpec.isEmpty) assert(desc.viewText.isEmpty) assert(desc.viewDefaultDatabase.isEmpty) assert(desc.viewQueryColumnNames.isEmpty) assert(desc.storage.locationUri == Some(new URI("/tmp/table1"))) assert(desc.storage.inputFormat == Some("org.apache.hadoop.mapred.SequenceFileInputFormat")) assert(desc.storage.outputFormat == Some("org.apache.hadoop.mapred.SequenceFileOutputFormat")) assert(desc.storage.serde == Some("org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe")) } ``` You can merge this pull request into a Git repository by running: $ git pull https://github.com/jacshen/spark master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19434.patch To close this pull request, make a commit to your master/trunk branch with (at least) the
following in the commit message: This closes #19434 commit 6b23cb8ff5a778f4f1b4ca4f218cbe8c4e422101 Author: Shen Date: 2017-10-04T20:35:03Z Add support to create table which schema is reading from a given parquet file commit 877a57ec439db4e688c71568ddd312bdc2a50cec Author: jacshen Date: 2017-10-04T20:37:08Z Merge branch 'master' of https://github.com/apache/spark commit a22c39e795ab4a730d0277c4162cdfadd37dbf22 Author: jacshen Date: 2017-10-04T21:21:02Z Add support to create table which schema is reading from a given parquet file
[GitHub] spark issue #19412: [SPARK-22142][BUILD][STREAMING] Move Flume support behin...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/19412 I think the argument for it, if anything, is that it's a) deprecated, so should arguably be optional to build, and b) this would simply be consistent with how other external/* modules are handled. For Spark 2.x, yes, there isn't and shouldn't be an actual change to the outputs. There's a legitimate separate question here about whether it should be deprecated. My sense is yes, to leave the option to remove it in Spark 3.0, which would probably follow 2.3. I recall something about flume-ng using an old version of Netty that is blocking updating it for all of Spark, but I may be misremembering the detail there. Yeah, this passed a Maven test build as well as `dev/run-tests` now.
[GitHub] spark issue #19272: [Spark-21842][Mesos] Support Kerberos ticket renewal and...
Github user ArtRand commented on the issue: https://github.com/apache/spark/pull/19272 @kalvinnchau I'm running Hadoop 2.6 on a DC/OS cluster with Mesos 1.4.0
[GitHub] spark issue #19424: [SPARK-22197][SQL] push down operators to data source be...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19424 **[Test build #82461 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82461/testReport)** for PR 19424 at commit [`a9f2c57`](https://github.com/apache/spark/commit/a9f2c57d1acdb10c1860aa75b9f102b7b9afbca8).
[GitHub] spark issue #19392: [SPARK-22169][SQL] support byte length literal as identi...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19392 Hmm, it's not a bug fix but a nice-to-have feature; do we want this in Spark 2.2?
[GitHub] spark issue #19433: [SPARK-3162] [MLlib][WIP] Add local tree training for de...
Github user smurching commented on the issue: https://github.com/apache/spark/pull/19433 @WeichenXu123 would you be able to take an initial look at this?
[GitHub] spark pull request #19270: [SPARK-21809] : Change Stage Page to use datatabl...
Github user ajbozarth commented on a diff in the pull request: https://github.com/apache/spark/pull/19270#discussion_r142782246

--- Diff: core/src/main/resources/org/apache/spark/ui/static/taskspages.js ---
@@ -0,0 +1,474 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+$(document).ajaxStop($.unblockUI);
+$(document).ajaxStart(function () {
+    $.blockUI({message: 'Loading Tasks Page...'});
+});
+
+$.extend( $.fn.dataTable.ext.type.order, {
+    "file-size-pre": ConvertDurationString,
+
+    "file-size-asc": function ( a, b ) {
+        a = ConvertDurationString( a );
+        b = ConvertDurationString( b );
+        return ((a < b) ? -1 : ((a > b) ? 1 : 0));
+    },
+
+    "file-size-desc": function ( a, b ) {
+        a = ConvertDurationString( a );
+        b = ConvertDurationString( b );
+        return ((a < b) ? 1 : ((a > b) ? -1 : 0));
+    }
+} );
+
+function createTemplateURI(appId) {
+    var words = document.baseURI.split('/');
+    var ind = words.indexOf("proxy");
+    if (ind > 0) {
+        var baseURI = words.slice(0, ind + 1).join('/') + '/' + appId + '/static/stagespage-template.html';
+        return baseURI;
+    }
+    ind = words.indexOf("history");
+    if (ind > 0) {
+        var baseURI = words.slice(0, ind).join('/') + '/static/stagespage-template.html';
+        return baseURI;
+    }
+    return location.origin + "/static/stagespage-template.html";
+}
+
+// This function will only parse the URL under a certain format,
+// e.g. https://axonitered-jt1.red.ygrid.yahoo.com:50509/history/application_1502220952225_59143/stages/stage/?id=0=0
+function StageEndPoint(appId) {
+    var words = document.baseURI.split('/');
+    var words2 = document.baseURI.split('?');
+    var ind = words.indexOf("proxy");
+    if (ind > 0) {
+        var appId = words[ind + 1];
+        var stageIdLen = words2[1].indexOf('&');
+        var stageId = words2[1].substr(3, stageIdLen - 3);
+        var newBaseURI = words.slice(0, ind + 2).join('/');
+        return newBaseURI + "/api/v1/applications/" + appId + "/stages/" + stageId;
+    }
+    ind = words.indexOf("history");
+    if (ind > 0) {
+        var appId = words[ind + 1];
+        var attemptId = words[ind + 2];
+        var stageIdLen = words2[1].indexOf('&');
+        var stageId = words2[1].substr(3, stageIdLen - 3);
+        var newBaseURI = words.slice(0, ind).join('/');
+        if (isNaN(attemptId) || attemptId == "0") {
+            return newBaseURI + "/api/v1/applications/" + appId + "/stages/" + stageId;
+        } else {
+            return newBaseURI + "/api/v1/applications/" + appId + "/" + attemptId + "/stages/" + stageId;
+        }
+    }
+    var stageIdLen = words2[1].indexOf('&');
+    var stageId = words2[1].substr(3, stageIdLen - 3);
+    return location.origin + "/api/v1/applications/" + appId + "/stages/" + stageId;
+}
+
+function sortNumber(a,b) {
+    return a - b;
+}
+
+function quantile(array, percentile) {
+    index = percentile/100. * (array.length-1);
+    if (Math.floor(index) == index) {
+        result = array[index];
+    } else {
+        var i = Math.floor(index);
+        fraction = index - i;
+        result = array[i];
+    }
+    return result;
+}
+
+$(document).ready(function () {
+    $.extend($.fn.dataTable.defaults, {
+        stateSave: true,
+        lengthMenu: [[20, 40, 60, 100, -1], [20, 40, 60, 100, "All"]],
+        pageLength: 20
+    });
+
+    $("#showAdditionalMetrics").append(
+        "" +
+        "" +
+        " Show Additional Metrics" +
+        "" +
+        "" +
+        " Select All" +
+        " Scheduler Delay" +
+        " Task Deserialization Time" +
+        " Shuffle Read Blocked Time" +
+        " Shuffle Remote Reads" +
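As excerpted above, the `quantile` helper floors the fractional index and computes a `fraction` it never applies (the interpolation step may have been lost when the archive flattened the diff). For comparison, a linearly interpolating version — a hedged Python sketch, not code from the PR — would be:

```python
def quantile(values, percentile):
    # Percentile of a sorted sequence with linear interpolation between
    # the two nearest ranks (hypothetical sketch, not the PR's JavaScript).
    index = percentile / 100 * (len(values) - 1)
    lower = int(index)
    if lower == index:
        return values[lower]
    fraction = index - lower
    return values[lower] + fraction * (values[lower + 1] - values[lower])
```

For a sorted `[10, 20, 30, 40]`, `quantile(values, 50)` falls halfway between the second and third elements and yields `25.0`.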
[GitHub] spark pull request #19270: [SPARK-21809] : Change Stage Page to use datatabl...
Github user ajbozarth commented on a diff in the pull request: https://github.com/apache/spark/pull/19270#discussion_r142781908

--- Diff: core/src/main/resources/org/apache/spark/ui/static/utils.js ---
@@ -46,3 +46,64 @@ function formatBytes(bytes, type) {
 var i = Math.floor(Math.log(bytes) / Math.log(k));
 return parseFloat((bytes / Math.pow(k, i)).toFixed(dm)) + ' ' + sizes[i];
 }
+
+function formatLogsCells(execLogs, type) {
+    if (type !== 'display') return Object.keys(execLogs);
+    if (!execLogs) return;
+    var result = '';
+    $.each(execLogs, function (logName, logUrl) {
+        result += '' + logName + ''
+    });
+    return result;
+}
+
+function getStandAloneAppId(cb) {
--- End diff --

So I know you just copied this function over, but why doesn't this just return the appId rather than taking in a function to run on an appId? It seems to me the latter would make more sense. If changing it makes sense to you as well we should update it here.
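One reason the function takes a callback rather than returning the id: in the browser the app id comes back from an asynchronous AJAX request, so there is no value to return at call time; the function hands the result to a continuation instead. A minimal Python sketch of the pattern (all names hypothetical):

```python
def fetch_async(url, on_response):
    # Stand-in for an asynchronous $.getJSON call: in a real browser the
    # handler would run later, when the HTTP response arrives.
    on_response([{"id": "app-20171004-0001"}])

def get_standalone_app_id(callback):
    # Cannot `return` the id -- it only exists once the response arrives --
    # so the caller supplies a function to run on the id instead.
    fetch_async("/api/v1/applications", lambda apps: callback(apps[0]["id"]))

seen = []
get_standalone_app_id(seen.append)
print(seen[0])  # app-20171004-0001
```

Returning the id directly would only work if the request were made synchronous, which blocks the UI thread; that trade-off may be what motivated the original callback design.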
[GitHub] spark issue #19433: [SPARK-3162] [MLlib][WIP] Add local tree training for de...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19433 Can one of the admins verify this patch?
[GitHub] spark pull request #19433: [SPARK-3162] [MLlib][WIP] Add local tree training...
GitHub user smurching opened a pull request: https://github.com/apache/spark/pull/19433

[SPARK-3162] [MLlib][WIP] Add local tree training for decision tree regressors

## What changes were proposed in this pull request?

WIP, DO NOT MERGE

### Overview

This PR adds local tree training for decision tree regressors as a first step for addressing [SPARK-3162](https://issues.apache.org/jira/browse/SPARK-3162) (train decision trees locally when possible). See [this design doc](https://docs.google.com/document/d/1baU5KeorrmLpC4EZoqLuG-E8sUJqmdELLbr8o6wdbVM/edit) for a detailed description of the proposed changes. Distributed training logic has been refactored but only minimally modified; the local tree training implementation leverages existing distributed training logic for computing impurities and splits. This shared logic has been refactored into `...Utils` objects (e.g. `SplitUtils.scala`, `ImpurityUtils.scala`).

### How to Review

Each commit in this PR adds non-overlapping functionality, so the PR should be reviewable commit-by-commit. Changes introduced by each commit:

1. Adds new data structures for local tree training (`FeatureVector`, `TrainingInfo`) & associated unit tests (`LocalTreeDataSuite`)
2. Adds shared utility methods for computing splits/impurities (`SplitUtils`, `ImpurityUtils`, `AggUpdateUtils`), largely copied from existing distributed training code in `RandomForest.scala`.
3. Unit tests for split/impurity utility methods (`TreeSplitUtilsSuite`)
4. Updates distributed training code in `RandomForest.scala` to depend on the utility methods introduced in 2.
5. Adds local tree training logic (`LocalDecisionTree`)
6. Local tree unit/integration tests (`LocalTreeUnitSuite`, `LocalTreeIntegrationSuite`)

## How was this patch tested?

No existing tests were modified. The following new tests were added (also described above):

* Unit tests for new data structures specific to local tree training (`LocalTreeDataSuite`, `LocalTreeUtilsSuite`)
* Unit tests for impurity/split utility methods (`TreeSplitUtilsSuite`)
* Unit tests for local tree training logic (`LocalTreeUnitSuite`)
* Integration tests verifying that local & distributed tree training produce the same trees (`LocalTreeIntegrationSuite`)

Please review http://spark.apache.org/contributing.html before opening a pull request.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/smurching/spark pr-splitup

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19433.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19433

commit 219a12001383017e70f10cd7c785272e70e64b28
Author: Sid Murching
Date: 2017-10-04T20:55:35Z

    Add data structures for local tree training & associated tests (in LocalTreeDataSuite):
    * TrainingInfo: primary local tree training data structure, contains all information required to describe state of algorithm at any point during learning
    * FeatureVector: Stores data for an individual feature as an Array[Int]

commit 710714395c966f664af7f7b62226336675ec2ea7
Author: Sid Murching
Date: 2017-10-04T20:57:30Z

    Add utility methods used for impurity and split calculations during both local & distributed training:
    * AggUpdateUtils: Helper methods for updating sufficient stats for a given node
    * ImpurityUtils: Helper methods for impurity-related calculations during node split decisions
    * SplitUtils: Helper methods for choosing splits given sufficient stats
    NOTE: Both ImpurityUtils and SplitUtils primarily contain code taken from RandomForest.scala, with slight modifications. Tests for SplitUtils are contained in the next commit.

commit 49bf0ae9b275264e757de573f81b816437be77e7
Author: Sid Murching
Date: 2017-10-04T21:36:15Z

    Add test suites for utility methods used during best-split computation:
    * TreeSplitUtilsSuite: Test suite for SplitUtils
    * TreeTests: Add utility method (getMetadata) for TreeSplitUtilsSuite
    Also add methods used by these tests in LocalDecisionTree.scala, RandomForest.scala

commit bc54b165849202269b80bbac1a84afb857e87e31
Author: Sid Murching
Date: 2017-10-04T21:48:33Z

    Update RandomForest.scala to use new utility methods for impurity/split calculations

commit 6a68a5cc6a6b7087163bbe5681ad41aef5e3fd0a
Author: Sid Murching
[GitHub] spark issue #19432: [SPARK-22203][SQL]Add job description for file listing S...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19432 **[Test build #82460 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82460/testReport)** for PR 19432 at commit [`aaf41dd`](https://github.com/apache/spark/commit/aaf41dd02b8f2109a84d36768fdf1802f7817961).
[GitHub] spark pull request #19424: [SPARK-22197][SQL] push down operators to data so...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19424#discussion_r142800670

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/PushDownOperatorsToDataSource.scala ---
@@ -0,0 +1,99 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.v2
+
+import org.apache.spark.sql.catalyst.expressions.{And, AttributeMap, AttributeSet, Expression, ExpressionSet}
+import org.apache.spark.sql.catalyst.planning.PhysicalOperation
+import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan, Project}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.execution.datasources.DataSourceStrategy
+import org.apache.spark.sql.sources
+import org.apache.spark.sql.sources.v2.reader._
+
+/**
+ * Pushes down various operators to the underlying data source for better performance. We classify
+ * operators into different layers, operators in the same layer are orderless, i.e. the query result
+ * won't change if we switch the operators within a layer (e.g. we can switch the order of predicates
+ * and required columns). The operators in layer N can only be pushed down if operators in layer N-1
+ * that above the data source relation are all pushed down. As an example, you can't push down limit
+ * if a filter below limit is not pushed down.
--- End diff --

> As an example, given a LIMIT has a FILTER child, you can't push down LIMIT if FILTER is not completely pushed down. When both are pushed down, the data source should execute FILTER before LIMIT.
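The layering invariant discussed here — layer N may be pushed down only once every operator in layer N-1 above the relation has been pushed — can be illustrated with a toy sketch (hypothetical Python, not Spark's actual rule):

```python
def push_down(layers, source_supports):
    # `layers` lists operator layers bottom-up, e.g. [["filter"], ["limit"]].
    # A layer is pushed to the source only if every layer below it was fully
    # pushed; otherwise we stop, leaving the rest in the query plan.
    pushed = []
    for layer in layers:
        if all(op in source_supports for op in layer):
            pushed.extend(layer)
        else:
            break
    return pushed
```

A source that handles both operators gets `push_down([["filter"], ["limit"]], {"filter", "limit"}) == ["filter", "limit"]`, while one that cannot push the filter gets no limit pushdown either: `push_down([["filter"], ["limit"]], {"limit"}) == []` — the situation the review comment describes.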
[GitHub] spark pull request #19424: [SPARK-22197][SQL] push down operators to data so...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19424#discussion_r142801593

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ScanExec.scala ---
@@ -32,13 +32,12 @@ import org.apache.spark.sql.types.StructType
 case class DataSourceV2ScanExec(
--- End diff --

```
/**
 * Physical plan node for scanning data from a data source.
 */
```
[GitHub] spark pull request #19424: [SPARK-22197][SQL] push down operators to data so...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19424#discussion_r142806719

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/PushDownOperatorsToDataSource.scala ---
@@ -0,0 +1,99 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.v2
+
+import org.apache.spark.sql.catalyst.expressions.{And, AttributeMap, AttributeSet, Expression, ExpressionSet}
+import org.apache.spark.sql.catalyst.planning.PhysicalOperation
+import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan, Project}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.execution.datasources.DataSourceStrategy
+import org.apache.spark.sql.sources
+import org.apache.spark.sql.sources.v2.reader._
+
+/**
+ * Pushes down various operators to the underlying data source for better performance. We classify
+ * operators into different layers, operators in the same layer are orderless, i.e. the query result
+ * won't change if we switch the operators within a layer (e.g. we can switch the order of predicates
+ * and required columns). The operators in layer N can only be pushed down if operators in layer N-1
+ * that above the data source relation are all pushed down. As an example, you can't push down limit
+ * if a filter below limit is not pushed down.
+ *
+ * Current operator push down layers:
+ *   layer 1: predicates, required columns.
+ */
+object PushDownOperatorsToDataSource extends Rule[LogicalPlan] {
+  override def apply(plan: LogicalPlan): LogicalPlan = {
--- End diff --

This is an optimizer rule? The input is a `LogicalPlan` and the output is still a `LogicalPlan`?
[GitHub] spark issue #19432: [SPARK-22203][SQL]Add job description for file listing S...
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/19432 cc @gatorsmile
[GitHub] spark pull request #19432: [SPARK-22203][SQL]Add job description for file li...
GitHub user zsxwing opened a pull request: https://github.com/apache/spark/pull/19432

[SPARK-22203][SQL] Add job description for file listing Spark jobs

## What changes were proposed in this pull request?

The user may be confused by some one-task jobs. We can add a job description for these jobs so that the user can figure out what they are.

## How was this patch tested?

The new unit test.

Before: https://user-images.githubusercontent.com/1000778/31202567-f78d15c0-a917-11e7-841e-11b8bf8f0032.png
After: https://user-images.githubusercontent.com/1000778/31202576-fc01e356-a917-11e7-9c2b-7bf80b153adb.png

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zsxwing/spark SPARK-22203

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19432.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19432

commit aaf41dd02b8f2109a84d36768fdf1802f7817961
Author: Shixiong Zhu
Date: 2017-10-04T21:58:32Z

    Add job description for file listing Spark jobs
[GitHub] spark pull request #19428: [SPARK-22131][MESOS] Mesos driver secrets
Github user susanxhuynh closed the pull request at: https://github.com/apache/spark/pull/19428
[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142801623

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/ArrowEvalPythonExec.scala ---
@@ -26,6 +26,25 @@ import org.apache.spark.sql.catalyst.expressions._
 import org.apache.spark.sql.execution.SparkPlan
 import org.apache.spark.sql.types.StructType
+
+private class BatchIterator[T](iter: Iterator[T], batchSize: Int)
+  extends Iterator[Iterator[T]] {
+
+  override def hasNext: Boolean = iter.hasNext
+
+  override def next(): Iterator[T] = {
--- End diff --

Added.
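The `BatchIterator` in the diff wraps a row iterator into an iterator of fixed-size batches. A rough Python equivalent of the batching idea — eager per batch, unlike the lazy Scala version, so only a sketch — might look like:

```python
from itertools import islice

def batch_iterator(source, batch_size):
    # Yield an iterator per batch of up to `batch_size` elements,
    # loosely mirroring BatchIterator[T] from the diff above.
    source = iter(source)
    while True:
        batch = list(islice(source, batch_size))
        if not batch:
            return
        yield iter(batch)
```

For example, `[list(b) for b in batch_iterator(range(5), 2)]` gives `[[0, 1], [2, 3], [4]]` — a trailing partial batch is emitted rather than dropped.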
[GitHub] spark issue #19108: [SPARK-21898][ML] Feature parity for KolmogorovSmirnovTe...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19108 Merged build finished. Test PASSed.
[GitHub] spark issue #19108: [SPARK-21898][ML] Feature parity for KolmogorovSmirnovTe...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19108 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82459/ Test PASSed.
[GitHub] spark issue #19108: [SPARK-21898][ML] Feature parity for KolmogorovSmirnovTe...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19108

**[Test build #82459 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82459/testReport)** for PR 19108 at commit [`62a8fcd`](https://github.com/apache/spark/commit/62a8fcd29da6d81981f29dfc3f6e3cb77c7c6fc3).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142796899

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/object.scala ---
@@ -519,3 +519,18 @@ case class CoGroup(
 outputObjAttr: Attribute,
 left: LogicalPlan,
 right: LogicalPlan) extends BinaryNode with ObjectProducer
+
+case class FlatMapGroupsInPandas(
+    groupingAttributes: Seq[Attribute],
+    functionExpr: Expression,
+    output: Seq[Attribute],
+    child: LogicalPlan) extends UnaryNode {
+  /**
+   * This is needed because output attributes is considered `reference` when
+   * passed through the constructor.
+   *
+   * Without this, catalyst will complain that output attributes are missing
+   * from the input.
+   */
+  override val producedAttributes = AttributeSet(output)
--- End diff --

This is one of the tricky bits. It's because of this code: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala#L135 Because `productIterator` returns all member variables, including `output`, the `references` of the tree node will include all output attributes, and it will complain about missing input:

```
def missingInput: AttributeSet = references -- inputSet -- producedAttributes
```

I think my solution here isn't great but I don't know the best way to deal with this. If someone with deeper Catalyst knowledge can suggest an alternative, I am happy to get rid of this bit.
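The check being worked around is plain set arithmetic: declaring `output` as produced removes it from the missing-input set. A small illustration with hypothetical attribute names:

```python
# Attributes the node mentions (productIterator pulls in `output` too).
references = {"key", "value", "out"}
# Attributes available from the child plan.
input_set = {"key", "value"}
# Attributes declared via `override val producedAttributes = AttributeSet(output)`.
produced_attributes = {"out"}

# missingInput = references -- inputSet -- producedAttributes
missing_input = references - input_set - produced_attributes
print(missing_input)  # set()
```

Without the override, `produced_attributes` would be empty, `missing_input` would contain `out`, and the analyzer would report the output attributes as missing from the input — exactly the complaint described above.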
[GitHub] spark issue #18966: [SPARK-21751][SQL] CodeGeneraor.splitExpressions counts ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18966 Merged build finished. Test PASSed.
[GitHub] spark issue #18966: [SPARK-21751][SQL] CodeGeneraor.splitExpressions counts ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18966 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82455/ Test PASSed.
[GitHub] spark issue #18966: [SPARK-21751][SQL] CodeGeneraor.splitExpressions counts ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18966

**[Test build #82455 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82455/testReport)** for PR 18966 at commit [`a489938`](https://github.com/apache/spark/commit/a489938b3f128558df31c97a32e196620c9fd475).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.