[GitHub] spark issue #19436: [SPARK-22206][SQL][SparkR] gapply in R can't work on emp...

2017-10-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19436
  
**[Test build #82465 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82465/testReport)**
 for PR 19436 at commit 
[`6710141`](https://github.com/apache/spark/commit/6710141767a2df92898af319bc4ef87f9110f911).
 * This patch **fails SparkR unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19436: [SPARK-22206][SQL][SparkR] gapply in R can't work on emp...

2017-10-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19436
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82465/
Test FAILed.


---




[GitHub] spark issue #19436: [SPARK-22206][SQL][SparkR] gapply in R can't work on emp...

2017-10-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19436
  
Merged build finished. Test FAILed.


---




[GitHub] spark pull request #19418: [SPARK-19984][SQL] Fix for ERROR codegen.CodeGene...

2017-10-04 Thread rekhajoshm
Github user rekhajoshm closed the pull request at:

https://github.com/apache/spark/pull/19418


---




[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

2017-10-04 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/13440
  
@srowen @thunterdb any more thoughts on this?
How about @sethah @yanboliang @jkbradley?


---




[GitHub] spark issue #18732: [SPARK-20396][SQL][PySpark] groupby().apply() with panda...

2017-10-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18732
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82466/
Test FAILed.


---




[GitHub] spark issue #19418: [SPARK-19984][SQL] Fix for ERROR codegen.CodeGenerator: ...

2017-10-04 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/19418
  
I don't think we want to add such a defensive condition just to avoid the log 
trace. Without it, we won't know that the problem happened. We should 
identify the issue and fix it if it is really there, instead of hiding the 
error. Btw, you still don't get the correct result even by avoiding the error, 
right?


---




[GitHub] spark issue #18732: [SPARK-20396][SQL][PySpark] groupby().apply() with panda...

2017-10-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18732
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #18732: [SPARK-20396][SQL][PySpark] groupby().apply() with panda...

2017-10-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18732
  
**[Test build #82466 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82466/testReport)**
 for PR 18732 at commit 
[`fa88c88`](https://github.com/apache/spark/commit/fa88c881a2fa3a7bb49af882ee9c482314184ff1).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #19436: [SPARK-22206][SQL][SparkR] gapply in R can't work...

2017-10-04 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19436#discussion_r142850005
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/objects.scala ---
@@ -394,7 +394,11 @@ case class FlatMapGroupsInRExec(
   override def producedAttributes: AttributeSet = AttributeSet(outputObjAttr)

   override def requiredChildDistribution: Seq[Distribution] =
-    ClusteredDistribution(groupingAttributes) :: Nil
+    if (groupingAttributes.isEmpty) {
+      AllTuples :: Nil
--- End diff --

You mean empty grouping attributes == all tuples? Yeah, I think no grouping 
attributes means all tuples are in one group.
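
For reference, a sketch of what the full conditional presumably looks like, assuming the `else` branch keeps the existing behavior:

    override def requiredChildDistribution: Seq[Distribution] =
      if (groupingAttributes.isEmpty) {
        // no grouping attributes: all tuples form a single group, so the
        // child must be collapsed into one partition
        AllTuples :: Nil
      } else {
        ClusteredDistribution(groupingAttributes) :: Nil
      }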


---




[GitHub] spark issue #19418: [SPARK-19984][SQL] Fix for ERROR codegen.CodeGenerator: ...

2017-10-04 Thread rekhajoshm
Github user rekhajoshm commented on the issue:

https://github.com/apache/spark/pull/19418
  
@viirya You are correct: I am on the latest master, and I have not been able to 
reproduce it yet. This PR was meant to add a defensive condition. If this 
happens only under certain unique data/flows, this is the one place where the 
whole log trace can reliably be avoided. As I understand it, the slew of log 
output is the pain point and unnerves the user, especially since the job 
nonetheless finishes successfully. Thanks.


---




[GitHub] spark pull request #19436: [SPARK-22206][SQL][SparkR] gapply in R can't work...

2017-10-04 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/19436#discussion_r142849580
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/objects.scala ---
@@ -394,7 +394,11 @@ case class FlatMapGroupsInRExec(
   override def producedAttributes: AttributeSet = AttributeSet(outputObjAttr)

   override def requiredChildDistribution: Seq[Distribution] =
-    ClusteredDistribution(groupingAttributes) :: Nil
+    if (groupingAttributes.isEmpty) {
+      AllTuples :: Nil
--- End diff --

should empty == all?


---




[GitHub] spark issue #19435: [WIP][SS][MINOR] "keyWithIndexToNumValues" -> "keyWithIn...

2017-10-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19435
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82462/
Test PASSed.


---




[GitHub] spark issue #19435: [WIP][SS][MINOR] "keyWithIndexToNumValues" -> "keyWithIn...

2017-10-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19435
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19435: [WIP][SS][MINOR] "keyWithIndexToNumValues" -> "keyWithIn...

2017-10-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19435
  
**[Test build #82462 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82462/testReport)**
 for PR 19435 at commit 
[`c62ba65`](https://github.com/apache/spark/commit/c62ba65fcc952de4b244695f32dc41783a2c1631).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19418: [SPARK-19984][SQL] Fix for ERROR codegen.CodeGenerator: ...

2017-10-04 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/19418
  
From the stacktrace posted in the JIRA, the problematic generated code is:

 /* 151 */   comp = agg_bufValue.compare(smj_value3);

`agg_bufValue` is a `long` but `smj_value3` is a `UTF8String`. It looks 
strange because we should not be comparing two variables of different types at 
all.

It looks like you don't have reproducible code for this issue, so I don't 
think you have found the root cause.
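
To make the mismatch concrete, an illustrative reconstruction of the generated code's shape (illustration only; the declarations are assumed from the types named above, not the actual output):

    long agg_bufValue = 0L;                              // aggregation buffer value: a primitive long
    UTF8String smj_value3 = UTF8String.fromString("x");  // sort-merge-join side: a UTF8String
    int comp;
    /* 151 */   comp = agg_bufValue.compare(smj_value3); // cannot compile: a primitive long
                                                         // has no methods, let alone compare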


---




[GitHub] spark issue #19436: [SQL][WIP] Fix FlatMapGroupsInR's child distribution whe...

2017-10-04 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/19436
  
@HyukjinKwon Yeah, give me a few minutes. Thanks.


---




[GitHub] spark issue #18732: [SPARK-20396][SQL][PySpark] groupby().apply() with panda...

2017-10-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18732
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #18732: [SPARK-20396][SQL][PySpark] groupby().apply() with panda...

2017-10-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18732
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82463/
Test FAILed.


---




[GitHub] spark issue #18732: [SPARK-20396][SQL][PySpark] groupby().apply() with panda...

2017-10-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18732
  
**[Test build #82463 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82463/testReport)**
 for PR 18732 at commit 
[`657942b`](https://github.com/apache/spark/commit/657942b1b4080c30fa5c60bcd700c862fb571465).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19436: [SQL][WIP] Fix FlatMapGroupsInR's child distribution whe...

2017-10-04 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19436
  
LGTM. Mind opening a JIRA?


---




[GitHub] spark pull request #19418: [SPARK-19984][SQL] Fix for ERROR codegen.CodeGene...

2017-10-04 Thread rekhajoshm
Github user rekhajoshm commented on a diff in the pull request:

https://github.com/apache/spark/pull/19418#discussion_r142848020
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
 ---
@@ -697,7 +697,12 @@ class CodegenContext {
   }
 """
   s"${addNewFunction(compareFunc, funcCode)}($c1, $c2)"
-      case other if other.isInstanceOf[AtomicType] => s"$c1.compare($c2)"
+      case other if other.isInstanceOf[AtomicType] =>
+        if (s"$c1".getClass.getMethods.map(_.getName).filter(_ matches "compare").size == 1) {
--- End diff --

@viirya I was putting a note to that effect :-) This would not work, but I am 
trying out something, especially to check the consistency between local and 
Jenkins test behavior. I needed "other" to work. Anyhow, I am mainly looking to 
check whether the AtomicType long/String conversion can happen through 
different attempts.


---




[GitHub] spark pull request #19418: [SPARK-19984][SQL] Fix for ERROR codegen.CodeGene...

2017-10-04 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19418#discussion_r142847702
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
 ---
@@ -697,7 +697,12 @@ class CodegenContext {
   }
 """
   s"${addNewFunction(compareFunc, funcCode)}($c1, $c2)"
-      case other if other.isInstanceOf[AtomicType] => s"$c1.compare($c2)"
+      case other if other.isInstanceOf[AtomicType] =>
+        if (s"$c1".getClass.getMethods.map(_.getName).filter(_ matches "compare").size == 1) {
--- End diff --

You are just getting the class of a `String` object here.
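
In other words, a minimal sketch of the confusion (values illustrative):

    val c1 = "agg_bufValue"  // at codegen time, c1 only holds the *name* of the generated variable
    s"$c1".getClass          // classOf[String]: the Scala string itself, not the runtime
                             // type of the generated Java variable
    // The guard therefore inspects java.lang.String at code-generation time and can
    // never observe whether the generated variable actually has a compare method.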


---




[GitHub] spark issue #19418: [SPARK-19984][SQL] Fix for ERROR codegen.CodeGenerator: ...

2017-10-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19418
  
**[Test build #82471 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82471/testReport)**
 for PR 19418 at commit 
[`3f8ce9a`](https://github.com/apache/spark/commit/3f8ce9a99acc1d5ae22004f4f18d94df8966cbcd).


---




[GitHub] spark issue #19424: [SPARK-22197][SQL] push down operators to data source be...

2017-10-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19424
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82464/
Test FAILed.


---




[GitHub] spark issue #19424: [SPARK-22197][SQL] push down operators to data source be...

2017-10-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19424
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #19424: [SPARK-22197][SQL] push down operators to data source be...

2017-10-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19424
  
**[Test build #82464 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82464/testReport)**
 for PR 19424 at commit 
[`df1d1c5`](https://github.com/apache/spark/commit/df1d1c5150ba1787accd95e621d62d6adf215e60).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19436: [SQL][WIP] Fix FlatMapGroupsInR's child distribution whe...

2017-10-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19436
  
**[Test build #82470 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82470/testReport)**
 for PR 19436 at commit 
[`6710141`](https://github.com/apache/spark/commit/6710141767a2df92898af319bc4ef87f9110f911).


---




[GitHub] spark issue #19436: [SQL][WIP] Fix FlatMapGroupsInR's child distribution whe...

2017-10-04 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/19436
  
OK. The added test verifies that this is a real issue. See the test result at 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82468/testReport.


---




[GitHub] spark issue #18732: [SPARK-20396][SQL][PySpark] groupby().apply() with panda...

2017-10-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18732
  
**[Test build #82469 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82469/testReport)**
 for PR 18732 at commit 
[`e4efb32`](https://github.com/apache/spark/commit/e4efb3281008a2b450f9013aeb8f1ac9cf4ffa9e).


---




[GitHub] spark issue #19436: [SQL][WIP] Fix FlatMapGroupsInR's child distribution whe...

2017-10-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19436
  
Merged build finished. Test FAILed.


---




[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-04 Thread icexelloss
Github user icexelloss commented on a diff in the pull request:

https://github.com/apache/spark/pull/18732#discussion_r142845456
  
--- Diff: python/pyspark/sql/group.py ---
@@ -192,7 +193,67 @@ def pivot(self, pivot_col, values=None):
 jgd = self._jgd.pivot(pivot_col)
 else:
 jgd = self._jgd.pivot(pivot_col, values)
-return GroupedData(jgd, self.sql_ctx)
+return GroupedData(jgd, self._df)
+
+def apply(self, udf):
+"""
+Maps each group of the current :class:`DataFrame` using a pandas udf
+and returns the result as a :class:`DataFrame`.
+
+The user-defined function should take a `pandas.DataFrame` and return
+another `pandas.DataFrame`. Each group is passed as a `pandas.DataFrame`
+to the user-function and the returned `pandas.DataFrame`s are combined
+as a :class:`DataFrame`. The returned `pandas.DataFrame` can be of
+arbitrary length and its schema should match the returnType of the
+pandas udf.
+
+:param udf: A wrapped function returned by `pandas_udf`
+
+>>> df = spark.createDataFrame(
+... [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
+... ("id", "v"))
+>>> @pandas_udf(returnType=df.schema)
+... def normalize(pdf):
+... v = pdf.v
+... return pdf.assign(v=(v - v.mean()) / v.std())
+>>> df.groupby('id').apply(normalize).show() #  doctest: +SKIP
--- End diff --

Fixed. Thanks!
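
For readers following along, the example above should presumably print something like the following (values computed with pandas' default sample standard deviation; row order within groups may differ):

    +---+-------------------+
    | id|                  v|
    +---+-------------------+
    |  1|-0.7071067811865475|
    |  1| 0.7071067811865475|
    |  2|-0.8320502943378437|
    |  2|-0.2773500981126146|
    |  2| 1.1094003924504583|
    +---+-------------------+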


---




[GitHub] spark issue #19436: [SQL][WIP] Fix FlatMapGroupsInR's child distribution whe...

2017-10-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19436
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82468/
Test FAILed.


---




[GitHub] spark issue #19436: [SQL][WIP] Fix FlatMapGroupsInR's child distribution whe...

2017-10-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19436
  
**[Test build #82468 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82468/testReport)**
 for PR 19436 at commit 
[`e0af5d6`](https://github.com/apache/spark/commit/e0af5d69bc3a9c615941e96d1709f1ed58bae727).
 * This patch **fails SparkR unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #19418: [SPARK-19984][SQL] Fix for ERROR codegen.CodeGene...

2017-10-04 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/19418#discussion_r142844714
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
 ---
@@ -697,7 +697,14 @@ class CodegenContext {
   }
 """
   s"${addNewFunction(compareFunc, funcCode)}($c1, $c2)"
-      case other if other.isInstanceOf[AtomicType] => s"$c1.compare($c2)"
+      case other if other.isInstanceOf[AtomicType] =>
+        s"""
+          if ($c1.getClass.getMethods.map(_.getName).filter(_ matches "compare").size == 1) {
+            $c1.compare($c2)
--- End diff --

+1


---




[GitHub] spark issue #19418: [SPARK-19984][SQL] Fix for ERROR codegen.CodeGenerator: ...

2017-10-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19418
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #19418: [SPARK-19984][SQL] Fix for ERROR codegen.CodeGenerator: ...

2017-10-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19418
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82467/
Test FAILed.


---




[GitHub] spark issue #19418: [SPARK-19984][SQL] Fix for ERROR codegen.CodeGenerator: ...

2017-10-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19418
  
**[Test build #82467 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82467/testReport)**
 for PR 19418 at commit 
[`773b429`](https://github.com/apache/spark/commit/773b429991510864878928b2518ff14825e1f865).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #19432: [SPARK-22203][SQL]Add job description for file li...

2017-10-04 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19432


---




[GitHub] spark issue #19432: [SPARK-22203][SQL]Add job description for file listing S...

2017-10-04 Thread zsxwing
Github user zsxwing commented on the issue:

https://github.com/apache/spark/pull/19432
  
Thanks! Merging to master.


---




[GitHub] spark issue #19436: [SQL][WIP] Fix FlatMapGroupsInR's child distribution whe...

2017-10-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19436
  
**[Test build #82468 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82468/testReport)**
 for PR 19436 at commit 
[`e0af5d6`](https://github.com/apache/spark/commit/e0af5d69bc3a9c615941e96d1709f1ed58bae727).


---




[GitHub] spark pull request #19418: [SPARK-19984][SQL] Fix for ERROR codegen.CodeGene...

2017-10-04 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19418#discussion_r142841674
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
 ---
@@ -697,7 +697,14 @@ class CodegenContext {
   }
 """
   s"${addNewFunction(compareFunc, funcCode)}($c1, $c2)"
-      case other if other.isInstanceOf[AtomicType] => s"$c1.compare($c2)"
+      case other if other.isInstanceOf[AtomicType] =>
+        s"""
+          if ($c1.getClass.getMethods.map(_.getName).filter(_ matches "compare").size == 1) {
+            $c1.compare($c2)
--- End diff --

Hmm, `genComp` seems to return a Java statement evaluated to a boolean; are 
you sure returning a block like this works?
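
For illustration (the call-site shape below is assumed, not quoted from the source): the string `genComp` returns is spliced into an expression slot, roughly

    int comp = /* result of genComp */;  // must be a single Java expression

so a multi-line `if` block substituted there would not be valid Java; it would need to be a conditional expression (`cond ? a : b`) or a separate generated function instead.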


---




[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-04 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/18732#discussion_r142841543
  
--- Diff: python/pyspark/sql/group.py ---
@@ -192,7 +193,67 @@ def pivot(self, pivot_col, values=None):
 jgd = self._jgd.pivot(pivot_col)
 else:
 jgd = self._jgd.pivot(pivot_col, values)
-return GroupedData(jgd, self.sql_ctx)
+return GroupedData(jgd, self._df)
+
+def apply(self, udf):
+"""
+Maps each group of the current :class:`DataFrame` using a pandas udf
+and returns the result as a :class:`DataFrame`.
+
+The user-defined function should take a `pandas.DataFrame` and return
+another `pandas.DataFrame`. Each group is passed as a `pandas.DataFrame`
+to the user-function and the returned `pandas.DataFrame`s are combined
+as a :class:`DataFrame`. The returned `pandas.DataFrame` can be of
+arbitrary length and its schema should match the returnType of the
+pandas udf.
+
+:param udf: A wrapped function returned by `pandas_udf`
+
+>>> df = spark.createDataFrame(
+... [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
+... ("id", "v"))
+>>> @pandas_udf(returnType=df.schema)
+... def normalize(pdf):
+... v = pdf.v
+... return pdf.assign(v=(v - v.mean()) / v.std())
+>>> df.groupby('id').apply(normalize).show() #  doctest: +SKIP
--- End diff --

nit: the spaces around `#` are wrong.


---




[GitHub] spark issue #19418: [SPARK-19984][SQL] Fix for ERROR codegen.CodeGenerator: ...

2017-10-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19418
  
**[Test build #82467 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82467/testReport)**
 for PR 19418 at commit 
[`773b429`](https://github.com/apache/spark/commit/773b429991510864878928b2518ff14825e1f865).


---




[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-04 Thread icexelloss
Github user icexelloss commented on a diff in the pull request:

https://github.com/apache/spark/pull/18732#discussion_r142840490
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/object.scala
 ---
@@ -519,3 +519,18 @@ case class CoGroup(
 outputObjAttr: Attribute,
 left: LogicalPlan,
 right: LogicalPlan) extends BinaryNode with ObjectProducer
+
+case class FlatMapGroupsInPandas(
--- End diff --

Doc added.


---




[GitHub] spark issue #18732: [SPARK-20396][SQL][PySpark] groupby().apply() with panda...

2017-10-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18732
  
**[Test build #82466 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82466/testReport)**
 for PR 18732 at commit 
[`fa88c88`](https://github.com/apache/spark/commit/fa88c881a2fa3a7bb49af882ee9c482314184ff1).


---




[GitHub] spark issue #19436: [SQL][WIP] Fix FlatMapGroupsInR's child distribution whe...

2017-10-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19436
  
**[Test build #82465 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82465/testReport)**
 for PR 19436 at commit 
[`6710141`](https://github.com/apache/spark/commit/6710141767a2df92898af319bc4ef87f9110f911).


---




[GitHub] spark pull request #19436: [SQL][WIP] Fix FlatMapGroupsInR's child distribut...

2017-10-04 Thread viirya
GitHub user viirya opened a pull request:

https://github.com/apache/spark/pull/19436

[SQL][WIP] Fix FlatMapGroupsInR's child distribution when grouping 
attributes are empty

## What changes were proposed in this pull request?

Looks like `FlatMapGroupsInRExec.requiredChildDistribution` didn't consider 
empty grouping attributes.

WIP: Not quite sure whether this causes a problem on the R side. Adding a test 
to see the Jenkins test result.

## How was this patch tested?

Added test.
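
For reference, a sketch of the kind of SparkR call that exercises this path (illustrative; not necessarily the exact added test):

    # gapply with an empty vector of grouping columns:
    # every row should then fall into one single group
    df <- createDataFrame(data.frame(a = 1:3, b = c("x", "y", "z")))
    result <- gapply(df, c(), function(key, grouped) { grouped }, schema(df))
    collect(result)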


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/viirya/spark-1 fix-flatmapinr-distribution

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19436.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19436


commit 6710141767a2df92898af319bc4ef87f9110f911
Author: Liang-Chi Hsieh 
Date:   2017-10-05T03:08:00Z

Fix FlatMapGroupsInR's child distribution when grouping attributes are 
empty.




---




[GitHub] spark issue #19424: [SPARK-22197][SQL] push down operators to data source be...

2017-10-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19424
  
**[Test build #82464 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82464/testReport)**
 for PR 19424 at commit 
[`df1d1c5`](https://github.com/apache/spark/commit/df1d1c5150ba1787accd95e621d62d6adf215e60).


---




[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-04 Thread icexelloss
Github user icexelloss commented on a diff in the pull request:

https://github.com/apache/spark/pull/18732#discussion_r142839010
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/python/ArrowEvalPythonExec.scala
 ---
@@ -26,6 +26,25 @@ import org.apache.spark.sql.catalyst.expressions._
 import org.apache.spark.sql.execution.SparkPlan
 import org.apache.spark.sql.types.StructType
 
+private class BatchIterator[T](iter: Iterator[T], batchSize: Int)
+  extends Iterator[Iterator[T]] {
+
+  override def hasNext: Boolean = iter.hasNext
+
+  override def next(): Iterator[T] = {
--- End diff --

Sorry I pushed a bit late. The comment is added now.
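
For context, a sketch of what such a comment might convey, given the class shape in the diff (the `next()` body here is an assumed implementation, not quoted from the PR):

    // Groups the underlying iterator into batches of up to `batchSize` elements
    // without materializing a batch eagerly. Each call to next() returns an
    // iterator over the next batch; the caller must fully consume one batch
    // before requesting the next, since both views share the same source iterator.
    private class BatchIterator[T](iter: Iterator[T], batchSize: Int)
      extends Iterator[Iterator[T]] {

      override def hasNext: Boolean = iter.hasNext

      override def next(): Iterator[T] = new Iterator[T] {
        private var count = 0
        override def hasNext: Boolean = iter.hasNext && count < batchSize
        override def next(): T = {
          if (!hasNext) throw new NoSuchElementException("end of batch")
          count += 1
          iter.next()
        }
      }
    }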


---




[GitHub] spark pull request #19424: [SPARK-22197][SQL] push down operators to data so...

2017-10-04 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19424#discussion_r142838899
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/PushDownOperatorsToDataSource.scala
 ---
@@ -0,0 +1,99 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.v2
+
+import org.apache.spark.sql.catalyst.expressions.{And, AttributeMap, 
AttributeSet, Expression, ExpressionSet}
+import org.apache.spark.sql.catalyst.planning.PhysicalOperation
+import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan, 
Project}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.execution.datasources.DataSourceStrategy
+import org.apache.spark.sql.sources
+import org.apache.spark.sql.sources.v2.reader._
+
+/**
+ * Pushes down various operators to the underlying data source for better 
performance. We classify
+ * operators into different layers, operators in the same layer are 
orderless, i.e. the query result
+ * won't change if we switch the operators within a layer(e.g. we can 
switch the order of predicates
+ * and required columns). The operators in layer N can only be pushed down 
if operators in layer N-1
+ * that above the data source relation are all pushed down. As an example, 
you can't push down limit
+ * if a filter below limit is not pushed down.
+ *
+ * Current operator push down layers:
+ *   layer 1: predicates, required columns.
+ */
+object PushDownOperatorsToDataSource extends Rule[LogicalPlan] {
+  override def apply(plan: LogicalPlan): LogicalPlan = {
--- End diff --

Yeah, it's an optimizer rule that runs before the planner.
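
As a concrete illustration of the layering described in the header comment (source name and query shape assumed):

    import spark.implicits._  // for the $"..." column syntax

    // layer 1 operators (predicates, required columns) sit closest to the relation
    val q = spark.read.format("myV2Source").load()  // hypothetical v2 source
      .filter($"a" > 1)   // layer 1: predicate, candidate for push-down
      .select($"a")       // layer 1: column pruning, candidate for push-down
      .limit(10)          // a later layer: only pushable once the filter below it is pushed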


---




[GitHub] spark issue #18732: [SPARK-20396][SQL][PySpark] groupby().apply() with panda...

2017-10-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18732
  
**[Test build #82463 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82463/testReport)**
 for PR 18732 at commit 
[`657942b`](https://github.com/apache/spark/commit/657942b1b4080c30fa5c60bcd700c862fb571465).


---




[GitHub] spark issue #18732: [SPARK-20396][SQL][PySpark] groupby().apply() with panda...

2017-10-04 Thread icexelloss
Github user icexelloss commented on the issue:

https://github.com/apache/spark/pull/18732
  
I pushed a new commit addressing the comments. Let me scan through the 
comments again; I think there are some comments around worker.py that are not 
yet addressed.


---




[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-04 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/18732#discussion_r142836611
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/python/ArrowEvalPythonExec.scala
 ---
@@ -26,6 +26,25 @@ import org.apache.spark.sql.catalyst.expressions._
 import org.apache.spark.sql.execution.SparkPlan
 import org.apache.spark.sql.types.StructType
 
+private class BatchIterator[T](iter: Iterator[T], batchSize: Int)
+  extends Iterator[Iterator[T]] {
+
+  override def hasNext: Boolean = iter.hasNext
+
+  override def next(): Iterator[T] = {
--- End diff --

I didn't see the comment added?


---




[GitHub] spark pull request #19416: [SPARK-22187][SS] Update unsaferow format for sav...

2017-10-04 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19416


---




[GitHub] spark pull request #19435: [WIP][SS][MINOR] "keyWithIndexToNumValues" -> "ke...

2017-10-04 Thread lw-lin
Github user lw-lin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19435#discussion_r142836474
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/SymmetricHashJoinStateManager.scala
 ---
@@ -291,7 +291,7 @@ class SymmetricHashJoinStateManager(
   }
 
   /** A wrapper around a [[StateStore]] that stores [(key, index) -> 
value]. */
-  private class KeyWithIndexToValueStore extends 
StateStoreHandler(KeyWithIndexToValuesType) {
+  private class KeyWithIndexToValueStore extends 
StateStoreHandler(KeyWithIndexToValueType) {
--- End diff --

dropped the 's' from '...Values...'


---




[GitHub] spark issue #19435: [WIP][SS][MINOR] "keyWithIndexToNumValues" -> "keyWithIn...

2017-10-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19435
  
**[Test build #82462 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82462/testReport)**
 for PR 19435 at commit 
[`c62ba65`](https://github.com/apache/spark/commit/c62ba65fcc952de4b244695f32dc41783a2c1631).


---




[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-04 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/18732#discussion_r142836297
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/object.scala
 ---
@@ -519,3 +519,18 @@ case class CoGroup(
 outputObjAttr: Attribute,
 left: LogicalPlan,
 right: LogicalPlan) extends BinaryNode with ObjectProducer
+
+case class FlatMapGroupsInPandas(
+groupingAttributes: Seq[Attribute],
+functionExpr: Expression,
+output: Seq[Attribute],
+child: LogicalPlan) extends UnaryNode {
+  /**
+   * This is needed because output attributes is considered `reference` 
when
+   * passed through the constructor.
+   *
+   * Without this, catalyst will complain that output attributes are 
missing
+   * from the input.
+   */
+  override val producedAttributes = AttributeSet(output)
--- End diff --

Ok. I see. Should be fine.


---




[GitHub] spark pull request #19435: [WIP][SS][MINOR] "keyWithIndexToNumValues" -> "ke...

2017-10-04 Thread lw-lin
GitHub user lw-lin opened a pull request:

https://github.com/apache/spark/pull/19435

[WIP][SS][MINOR] "keyWithIndexToNumValues" -> "keyWithIndexToValue"

## What changes were proposed in this pull request?

This PR changes `keyWithIndexToNumValues` to `keyWithIndexToValue`.

There will be folders named after this `keyWithIndexToNumValues`, so if we 
ever want to fix this, let's fix it now.

## How was this patch tested?

existing unit test cases.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/lw-lin/spark keyWithIndex

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19435.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19435


commit c62ba65fcc952de4b244695f32dc41783a2c1631
Author: Liwei Lin 
Date:   2017-10-05T02:21:02Z

Fix




---




[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-04 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/18732#discussion_r142836245
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/object.scala
 ---
@@ -519,3 +519,18 @@ case class CoGroup(
 outputObjAttr: Attribute,
 left: LogicalPlan,
 right: LogicalPlan) extends BinaryNode with ObjectProducer
+
+case class FlatMapGroupsInPandas(
+groupingAttributes: Seq[Attribute],
+functionExpr: Expression,
+output: Seq[Attribute],
+child: LogicalPlan) extends UnaryNode {
+  /**
+   * This is needed because output attributes is considered `reference` 
when
--- End diff --

nit: `reference` -> `references`.


---




[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-04 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/18732#discussion_r142835260
  
--- Diff: python/pyspark/sql/group.py ---
@@ -192,7 +193,66 @@ def pivot(self, pivot_col, values=None):
 jgd = self._jgd.pivot(pivot_col)
 else:
 jgd = self._jgd.pivot(pivot_col, values)
-return GroupedData(jgd, self.sql_ctx)
+return GroupedData(jgd, self)
--- End diff --

`return GroupedData(jgd, self)` -> `return GroupedData(jgd, self._df)`?


---




[GitHub] spark pull request #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should n...

2017-10-04 Thread hhbyyh
Github user hhbyyh commented on a diff in the pull request:

https://github.com/apache/spark/pull/18924#discussion_r142833499
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -462,36 +462,55 @@ final class OnlineLDAOptimizer extends LDAOptimizer {
 val expElogbetaBc = batch.sparkContext.broadcast(expElogbeta)
 val alpha = this.alpha.asBreeze
 val gammaShape = this.gammaShape
+val optimizeDocConcentration = this.optimizeDocConcentration
+// We calculate logphat in the same pass as other statistics, but we 
only need
+// it if we are optimizing docConcentration
+val logphatPartOptionBase = () => if (optimizeDocConcentration) 
Some(BDV.zeros[Double](k))
+  else None
 
-val stats: RDD[(BDM[Double], List[BDV[Double]])] = batch.mapPartitions 
{ docs =>
+val stats: RDD[(BDM[Double], Option[BDV[Double]], Long)] = 
batch.mapPartitions { docs =>
   val nonEmptyDocs = docs.filter(_._2.numNonzeros > 0)
 
   val stat = BDM.zeros[Double](k, vocabSize)
-  var gammaPart = List[BDV[Double]]()
+  val logphatPartOption = logphatPartOptionBase()
+  var nonEmptyDocCount : Long = 0L
   nonEmptyDocs.foreach { case (_, termCounts: Vector) =>
+nonEmptyDocCount += 1
 val (gammad, sstats, ids) = 
OnlineLDAOptimizer.variationalTopicInference(
   termCounts, expElogbetaBc.value, alpha, gammaShape, k)
-stat(::, ids) := stat(::, ids).toDenseMatrix + sstats
-gammaPart = gammad :: gammaPart
+stat(::, ids) := stat(::, ids) + sstats
+logphatPartOption.foreach(_ += 
LDAUtils.dirichletExpectation(gammad))
   }
-  Iterator((stat, gammaPart))
-}.persist(StorageLevel.MEMORY_AND_DISK)
-val statsSum: BDM[Double] = 
stats.map(_._1).treeAggregate(BDM.zeros[Double](k, vocabSize))(
-  _ += _, _ += _)
-val gammat: BDM[Double] = breeze.linalg.DenseMatrix.vertcat(
-  stats.map(_._2).flatMap(list => 
list).collect().map(_.toDenseMatrix): _*)
-stats.unpersist()
-expElogbetaBc.destroy(false)
-val batchResult = statsSum *:* expElogbeta.t
+  Iterator((stat, logphatPartOption, nonEmptyDocCount))
+}
+
+val elementWiseSum = (u : (BDM[Double], Option[BDV[Double]], Long),
+ v : (BDM[Double], Option[BDV[Double]], 
Long)) => {
+  u._1 += v._1
+  u._2.foreach(_ += v._2.get)
+  (u._1, u._2, u._3 + v._3)
+}
+
+val (statsSum: BDM[Double], logphatOption: Option[BDV[Double]], 
nonEmptyDocsN : Long) = stats
+  .treeAggregate((BDM.zeros[Double](k, vocabSize), 
logphatPartOptionBase(), 0L))(
+elementWiseSum, elementWiseSum
+  )
 
+val batchResult = statsSum *:* expElogbeta.t
 // Note that this is an optimization to avoid batch.count
-updateLambda(batchResult, (miniBatchFraction * corpusSize).ceil.toInt)
-if (optimizeDocConcentration) updateAlpha(gammat)
+val batchSize = (miniBatchFraction * corpusSize).ceil.toInt
+updateLambda(batchResult, batchSize)
+
+logphatOption.foreach(_ /= batchSize.toDouble)
--- End diff --

agree.
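
To spell out the combiner's semantics from the diff above, a minimal RDD-free sketch using the same Breeze types:

    import breeze.linalg.{DenseMatrix => BDM, DenseVector => BDV}

    type Stat = (BDM[Double], Option[BDV[Double]], Long)

    // Used as both the seqOp and combOp of treeAggregate: sufficient statistics
    // are summed in place, logphat is summed only when docConcentration is being
    // optimized (both sides are Some in that case), and doc counts are added.
    val elementWiseSum = (u: Stat, v: Stat) => {
      u._1 += v._1
      u._2.foreach(_ += v._2.get)
      (u._1, u._2, u._3 + v._3)
    }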


---




[GitHub] spark pull request #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should n...

2017-10-04 Thread hhbyyh
Github user hhbyyh commented on a diff in the pull request:

https://github.com/apache/spark/pull/18924#discussion_r142833374
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -462,36 +462,55 @@ final class OnlineLDAOptimizer extends LDAOptimizer {
 val expElogbetaBc = batch.sparkContext.broadcast(expElogbeta)
 val alpha = this.alpha.asBreeze
 val gammaShape = this.gammaShape
+val optimizeDocConcentration = this.optimizeDocConcentration
+// We calculate logphat in the same pass as other statistics, but we 
only need
+// it if we are optimizing docConcentration
+val logphatPartOptionBase = () => if (optimizeDocConcentration) 
Some(BDV.zeros[Double](k))
+  else None
 
-val stats: RDD[(BDM[Double], List[BDV[Double]])] = batch.mapPartitions 
{ docs =>
+val stats: RDD[(BDM[Double], Option[BDV[Double]], Long)] = 
batch.mapPartitions { docs =>
   val nonEmptyDocs = docs.filter(_._2.numNonzeros > 0)
 
   val stat = BDM.zeros[Double](k, vocabSize)
-  var gammaPart = List[BDV[Double]]()
+  val logphatPartOption = logphatPartOptionBase()
+  var nonEmptyDocCount : Long = 0L
   nonEmptyDocs.foreach { case (_, termCounts: Vector) =>
+nonEmptyDocCount += 1
 val (gammad, sstats, ids) = 
OnlineLDAOptimizer.variationalTopicInference(
   termCounts, expElogbetaBc.value, alpha, gammaShape, k)
-stat(::, ids) := stat(::, ids).toDenseMatrix + sstats
-gammaPart = gammad :: gammaPart
+stat(::, ids) := stat(::, ids) + sstats
+logphatPartOption.foreach(_ += 
LDAUtils.dirichletExpectation(gammad))
   }
-  Iterator((stat, gammaPart))
-}.persist(StorageLevel.MEMORY_AND_DISK)
-val statsSum: BDM[Double] = 
stats.map(_._1).treeAggregate(BDM.zeros[Double](k, vocabSize))(
-  _ += _, _ += _)
-val gammat: BDM[Double] = breeze.linalg.DenseMatrix.vertcat(
-  stats.map(_._2).flatMap(list => 
list).collect().map(_.toDenseMatrix): _*)
-stats.unpersist()
-expElogbetaBc.destroy(false)
-val batchResult = statsSum *:* expElogbeta.t
+  Iterator((stat, logphatPartOption, nonEmptyDocCount))
+}
+
+val elementWiseSum = (u : (BDM[Double], Option[BDV[Double]], Long),
+ v : (BDM[Double], Option[BDV[Double]], 
Long)) => {
+  u._1 += v._1
+  u._2.foreach(_ += v._2.get)
+  (u._1, u._2, u._3 + v._3)
+}
+
+val (statsSum: BDM[Double], logphatOption: Option[BDV[Double]], 
nonEmptyDocsN : Long) = stats
+  .treeAggregate((BDM.zeros[Double](k, vocabSize), 
logphatPartOptionBase(), 0L))(
+elementWiseSum, elementWiseSum
+  )
 
+val batchResult = statsSum *:* expElogbeta.t
 // Note that this is an optimization to avoid batch.count
-updateLambda(batchResult, (miniBatchFraction * corpusSize).ceil.toInt)
-if (optimizeDocConcentration) updateAlpha(gammat)
+val batchSize = (miniBatchFraction * corpusSize).ceil.toInt
--- End diff --

Sure, since we're talking about consistency with the old LDA, it's fine to 
keep using batchSize here.


---




[GitHub] spark issue #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should not coll...

2017-10-04 Thread hhbyyh
Github user hhbyyh commented on the issue:

https://github.com/apache/spark/pull/18924
  
Yes, I think local testing is enough for both correctness and performance.

For consistency with the old LDA, some manual local testing would be 
sufficient. You may well just use the LDA example and switch the Spark jar 
files between Spark 2.2 and your branch. And I think the case with empty 
documents is worth special attention.

The same goes for the performance test. You may just post the result after 
your local test. I'm OK as long as there's no noticeable regression.





---




[GitHub] spark pull request #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should n...

2017-10-04 Thread hhbyyh
Github user hhbyyh commented on a diff in the pull request:

https://github.com/apache/spark/pull/18924#discussion_r142831316
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -462,36 +462,55 @@ final class OnlineLDAOptimizer extends LDAOptimizer {
 val expElogbetaBc = batch.sparkContext.broadcast(expElogbeta)
 val alpha = this.alpha.asBreeze
 val gammaShape = this.gammaShape
+val optimizeDocConcentration = this.optimizeDocConcentration
+// We calculate logphat in the same pass as other statistics, but we 
only need
+// it if we are optimizing docConcentration
+val logphatPartOptionBase = () => if (optimizeDocConcentration) 
Some(BDV.zeros[Double](k))
+  else None
 
-val stats: RDD[(BDM[Double], List[BDV[Double]])] = batch.mapPartitions 
{ docs =>
+val stats: RDD[(BDM[Double], Option[BDV[Double]], Long)] = 
batch.mapPartitions { docs =>
   val nonEmptyDocs = docs.filter(_._2.numNonzeros > 0)
 
   val stat = BDM.zeros[Double](k, vocabSize)
-  var gammaPart = List[BDV[Double]]()
+  val logphatPartOption = logphatPartOptionBase()
+  var nonEmptyDocCount : Long = 0L
   nonEmptyDocs.foreach { case (_, termCounts: Vector) =>
+nonEmptyDocCount += 1
 val (gammad, sstats, ids) = 
OnlineLDAOptimizer.variationalTopicInference(
   termCounts, expElogbetaBc.value, alpha, gammaShape, k)
-stat(::, ids) := stat(::, ids).toDenseMatrix + sstats
-gammaPart = gammad :: gammaPart
+stat(::, ids) := stat(::, ids) + sstats
+logphatPartOption.foreach(_ += 
LDAUtils.dirichletExpectation(gammad))
   }
-  Iterator((stat, gammaPart))
-}.persist(StorageLevel.MEMORY_AND_DISK)
-val statsSum: BDM[Double] = 
stats.map(_._1).treeAggregate(BDM.zeros[Double](k, vocabSize))(
-  _ += _, _ += _)
-val gammat: BDM[Double] = breeze.linalg.DenseMatrix.vertcat(
-  stats.map(_._2).flatMap(list => 
list).collect().map(_.toDenseMatrix): _*)
-stats.unpersist()
-expElogbetaBc.destroy(false)
-val batchResult = statsSum *:* expElogbeta.t
+  Iterator((stat, logphatPartOption, nonEmptyDocCount))
+}
+
+val elementWiseSum = (u : (BDM[Double], Option[BDV[Double]], Long),
+ v : (BDM[Double], Option[BDV[Double]], 
Long)) => {
+  u._1 += v._1
+  u._2.foreach(_ += v._2.get)
+  (u._1, u._2, u._3 + v._3)
+}
+
+val (statsSum: BDM[Double], logphatOption: Option[BDV[Double]], 
nonEmptyDocsN : Long) = stats
+  .treeAggregate((BDM.zeros[Double](k, vocabSize), 
logphatPartOptionBase(), 0L))(
+elementWiseSum, elementWiseSum
+  )
 
+val batchResult = statsSum *:* expElogbeta.t
 // Note that this is an optimization to avoid batch.count
-updateLambda(batchResult, (miniBatchFraction * corpusSize).ceil.toInt)
-if (optimizeDocConcentration) updateAlpha(gammat)
+val batchSize = (miniBatchFraction * corpusSize).ceil.toInt
--- End diff --

this may wait.


---




[GitHub] spark issue #19432: [SPARK-22203][SQL]Add job description for file listing S...

2017-10-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19432
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19432: [SPARK-22203][SQL]Add job description for file listing S...

2017-10-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19432
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82460/
Test PASSed.


---




[GitHub] spark issue #19432: [SPARK-22203][SQL]Add job description for file listing S...

2017-10-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19432
  
**[Test build #82460 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82460/testReport)**
 for PR 19432 at commit 
[`aaf41dd`](https://github.com/apache/spark/commit/aaf41dd02b8f2109a84d36768fdf1802f7817961).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should not coll...

2017-10-04 Thread akopich
Github user akopich commented on the issue:

https://github.com/apache/spark/pull/18924
  
@jkbradley, thank you!

- Correctness: in order to test the equivalence of the two versions of 
`submitMiniBatch`, I have to bring both of them into scope... One solution 
would be to derive a class `OldOnlineLDAOptimizer` from `OnlineLDAOptimizer` 
and override `submitMiniBatch`, but the class is final. What's the preferred 
approach?
- Sure, sounds good. Should I add a test case reporting the CPU time, or 
should I rather define an `App`? Should I add the code to the PR or just 
report the results here?


---




[GitHub] spark pull request #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should n...

2017-10-04 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18924#discussion_r142826379
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -462,36 +462,55 @@ final class OnlineLDAOptimizer extends LDAOptimizer {
     val expElogbetaBc = batch.sparkContext.broadcast(expElogbeta)
     val alpha = this.alpha.asBreeze
     val gammaShape = this.gammaShape
+    val optimizeDocConcentration = this.optimizeDocConcentration
+    // We calculate logphat in the same pass as other statistics, but we only need
+    // it if we are optimizing docConcentration
+    val logphatPartOptionBase = () => if (optimizeDocConcentration) Some(BDV.zeros[Double](k))
+      else None
 
-    val stats: RDD[(BDM[Double], List[BDV[Double]])] = batch.mapPartitions { docs =>
+    val stats: RDD[(BDM[Double], Option[BDV[Double]], Long)] = batch.mapPartitions { docs =>
       val nonEmptyDocs = docs.filter(_._2.numNonzeros > 0)
 
       val stat = BDM.zeros[Double](k, vocabSize)
-      var gammaPart = List[BDV[Double]]()
+      val logphatPartOption = logphatPartOptionBase()
+      var nonEmptyDocCount : Long = 0L
       nonEmptyDocs.foreach { case (_, termCounts: Vector) =>
+        nonEmptyDocCount += 1
         val (gammad, sstats, ids) = OnlineLDAOptimizer.variationalTopicInference(
           termCounts, expElogbetaBc.value, alpha, gammaShape, k)
-        stat(::, ids) := stat(::, ids).toDenseMatrix + sstats
-        gammaPart = gammad :: gammaPart
+        stat(::, ids) := stat(::, ids) + sstats
+        logphatPartOption.foreach(_ += LDAUtils.dirichletExpectation(gammad))
       }
-      Iterator((stat, gammaPart))
-    }.persist(StorageLevel.MEMORY_AND_DISK)
-    val statsSum: BDM[Double] = stats.map(_._1).treeAggregate(BDM.zeros[Double](k, vocabSize))(
-      _ += _, _ += _)
-    val gammat: BDM[Double] = breeze.linalg.DenseMatrix.vertcat(
-      stats.map(_._2).flatMap(list => list).collect().map(_.toDenseMatrix): _*)
-    stats.unpersist()
-    expElogbetaBc.destroy(false)
-    val batchResult = statsSum *:* expElogbeta.t
+      Iterator((stat, logphatPartOption, nonEmptyDocCount))
+    }
+
+    val elementWiseSum = (u : (BDM[Double], Option[BDV[Double]], Long),
+                          v : (BDM[Double], Option[BDV[Double]], Long)) => {
+      u._1 += v._1
+      u._2.foreach(_ += v._2.get)
+      (u._1, u._2, u._3 + v._3)
+    }
+
+    val (statsSum: BDM[Double], logphatOption: Option[BDV[Double]], nonEmptyDocsN : Long) = stats
+      .treeAggregate((BDM.zeros[Double](k, vocabSize), logphatPartOptionBase(), 0L))(
+        elementWiseSum, elementWiseSum
+      )
 
+    val batchResult = statsSum *:* expElogbeta.t
     // Note that this is an optimization to avoid batch.count
-    updateLambda(batchResult, (miniBatchFraction * corpusSize).ceil.toInt)
-    if (optimizeDocConcentration) updateAlpha(gammat)
+    val batchSize = (miniBatchFraction * corpusSize).ceil.toInt
+    updateLambda(batchResult, batchSize)
+
+    logphatOption.foreach(_ /= batchSize.toDouble)
--- End diff --

That sounds right to me.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should n...

2017-10-04 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18924#discussion_r142826326
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -462,31 +462,54 @@ final class OnlineLDAOptimizer extends LDAOptimizer {
     val expElogbetaBc = batch.sparkContext.broadcast(expElogbeta)
     val alpha = this.alpha.asBreeze
     val gammaShape = this.gammaShape
-
-    val stats: RDD[(BDM[Double], List[BDV[Double]])] = batch.mapPartitions { docs =>
+    val optimizeDocConcentration = this.optimizeDocConcentration
+    // If and only if optimizeDocConcentration is set true,
+    // we calculate logphat in the same pass as other statistics.
+    // No calculation of loghat happens otherwise.
+    val logphatPartOptionBase = () => if (optimizeDocConcentration) {
+      Some(BDV.zeros[Double](k))
+    } else {
+      None
+    }
+
+    val stats: RDD[(BDM[Double], Option[BDV[Double]], Long)] = batch.mapPartitions { docs =>
       val nonEmptyDocs = docs.filter(_._2.numNonzeros > 0)
 
       val stat = BDM.zeros[Double](k, vocabSize)
-      var gammaPart = List[BDV[Double]]()
+      val logphatPartOption = logphatPartOptionBase()
+      var nonEmptyDocCount : Long = 0L
       nonEmptyDocs.foreach { case (_, termCounts: Vector) =>
+        nonEmptyDocCount += 1
         val (gammad, sstats, ids) = OnlineLDAOptimizer.variationalTopicInference(
          termCounts, expElogbetaBc.value, alpha, gammaShape, k)
-        stat(::, ids) := stat(::, ids).toDenseMatrix + sstats
-        gammaPart = gammad :: gammaPart
+        stat(::, ids) := stat(::, ids) + sstats
+        logphatPartOption.foreach(_ += LDAUtils.dirichletExpectation(gammad))
      }
-      Iterator((stat, gammaPart))
-    }.persist(StorageLevel.MEMORY_AND_DISK)
-    val statsSum: BDM[Double] = stats.map(_._1).treeAggregate(BDM.zeros[Double](k, vocabSize))(
-      _ += _, _ += _)
-    val gammat: BDM[Double] = breeze.linalg.DenseMatrix.vertcat(
-      stats.map(_._2).flatMap(list => list).collect().map(_.toDenseMatrix): _*)
-    stats.unpersist()
-    expElogbetaBc.destroy(false)
-    val batchResult = statsSum *:* expElogbeta.t
+      Iterator((stat, logphatPartOption, nonEmptyDocCount))
+    }
+
+    val elementWiseSum = (u : (BDM[Double], Option[BDV[Double]], Long),
+      v : (BDM[Double], Option[BDV[Double]], Long)) => {
+      u._1 += v._1
+      u._2.foreach(_ += v._2.get)
+      (u._1, u._2, u._3 + v._3)
+    }
+
+    val (statsSum: BDM[Double], logphatOption: Option[BDV[Double]], nonEmptyDocsN: Long) = stats
+      .treeAggregate((BDM.zeros[Double](k, vocabSize), logphatPartOptionBase(), 0L))(
+        elementWiseSum, elementWiseSum
+      )
 
+    val batchResult = statsSum *:* expElogbeta.t
     // Note that this is an optimization to avoid batch.count
-    updateLambda(batchResult, (miniBatchFraction * corpusSize).ceil.toInt)
-    if (optimizeDocConcentration) updateAlpha(gammat)
+    val batchSize = (miniBatchFraction * corpusSize).ceil.toInt
+    updateLambda(batchResult, batchSize)
+
+    logphatOption.foreach(_ /= nonEmptyDocsN.toDouble)
--- End diff --

Good point about dividing by 0, @hhbyyh .  We should probably just check 
nonEmptyDocsN to see if it's 0, and if it is, skip all of these updates.  
That's related to but actually separate from the follow-up SPARK-22111.
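
A minimal sketch of that guard in terms of the names used in the diff above; the `updateAlpha` signature here is an assumption about the new code path, not the merged API:

```scala
// Skip the updates entirely when the mini-batch contained no non-empty
// documents, so neither division below can divide by zero.
if (nonEmptyDocsN > 0) {
  val batchSize = (miniBatchFraction * corpusSize).ceil.toInt
  updateLambda(batchResult, batchSize)
  logphatOption.foreach(_ /= nonEmptyDocsN.toDouble)
  logphatOption.foreach(updateAlpha(_, nonEmptyDocsN)) // assumed new signature
}
```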


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19424: [SPARK-22197][SQL] push down operators to data source be...

2017-10-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19424
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19424: [SPARK-22197][SQL] push down operators to data source be...

2017-10-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19424
  
**[Test build #82461 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82461/testReport)**
 for PR 19424 at commit 
[`a9f2c57`](https://github.com/apache/spark/commit/a9f2c57d1acdb10c1860aa75b9f102b7b9afbca8).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19424: [SPARK-22197][SQL] push down operators to data source be...

2017-10-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19424
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82461/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19434: [SPARK-21785][SQL]Support create table from a parquet fi...

2017-10-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19434
  
Can one of the admins verify this patch?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19434: [SPARK-21785][SQL]Support create table from a par...

2017-10-04 Thread CrazyJacky
GitHub user CrazyJacky opened a pull request:

https://github.com/apache/spark/pull/19434

[SPARK-21785][SQL]Support create table from a parquet file schema

## Support create table from a parquet file schema
As described in the JIRA:
```sql
CREATE EXTERNAL TABLE IF NOT EXISTS test
LIKE 'PARQUET' '/user/test/abc/a.snappy.parquet'
STORED AS PARQUET
LOCATION '/user/test/def/';
```
This is a very ugly fix and I would like someone to help review and refine it. It only supports creating Hive tables for now.

## How was this patch tested?
Tested by the test case below, and by building the runnable distribution:

```scala
test("create table like parquet") {

val f = getClass.getClassLoader.
  getResource("test-data/dec-in-fixed-len.parquet").getPath
val v1 =
  """
|create table if not exists db1.table1 like 'parquet'
  """.stripMargin.concat("'" + f + "'").concat(
  """
|stored as sequencefile
|location '/tmp/table1'
  """.stripMargin
  )

val (desc, allowExisting) = extractTableDesc(v1)

assert(allowExisting)
assert(desc.identifier.database == Some("db1"))
assert(desc.identifier.table == "table1")
assert(desc.tableType == CatalogTableType.EXTERNAL)
assert(desc.schema == new StructType()
  .add("fixed_len_dec", "decimal(10,2)"))
assert(desc.bucketSpec.isEmpty)
assert(desc.viewText.isEmpty)
assert(desc.viewDefaultDatabase.isEmpty)
assert(desc.viewQueryColumnNames.isEmpty)
assert(desc.storage.locationUri == Some(new URI("/tmp/table1")))
assert(desc.storage.inputFormat == 
Some("org.apache.hadoop.mapred.SequenceFileInputFormat"))
assert(desc.storage.outputFormat == 
Some("org.apache.hadoop.mapred.SequenceFileOutputFormat"))
assert(desc.storage.serde == 
Some("org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"))
  }
```


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jacshen/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19434.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19434


commit 6b23cb8ff5a778f4f1b4ca4f218cbe8c4e422101
Author: Shen 
Date:   2017-10-04T20:35:03Z

Add support to create table which schema is reading from a given parquet 
file

commit 877a57ec439db4e688c71568ddd312bdc2a50cec
Author: jacshen 
Date:   2017-10-04T20:37:08Z

Merge branch 'master' of https://github.com/apache/spark

commit a22c39e795ab4a730d0277c4162cdfadd37dbf22
Author: jacshen 
Date:   2017-10-04T21:21:02Z

Add support to create table which schema is reading from a given parquet 
file




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19412: [SPARK-22142][BUILD][STREAMING] Move Flume support behin...

2017-10-04 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/19412
  
I think the argument for it, if anything, is that a) it's deprecated, so it should kinda be optional to build, and b) this would simply be consistent with how the other external/* modules are handled. For Spark 2.x, yes, there isn't and shouldn't be an actual change to the outputs.

There's a legitimate separate question here about whether it should be deprecated at all. My sense is yes, to leave the option of removing it in Spark 3.0, which would probably follow 2.3. I recall something about flume-ng using an old version of Netty that blocks updating Netty for all of Spark, but I may be misremembering the details.

Yeah, this passed a Maven test build as well as `dev/run-tests` now.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19272: [Spark-21842][Mesos] Support Kerberos ticket renewal and...

2017-10-04 Thread ArtRand
Github user ArtRand commented on the issue:

https://github.com/apache/spark/pull/19272
  
@kalvinnchau I'm running Hadoop 2.6 on a DC/OS cluster with Mesos 1.4.0 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19424: [SPARK-22197][SQL] push down operators to data source be...

2017-10-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19424
  
**[Test build #82461 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82461/testReport)**
 for PR 19424 at commit 
[`a9f2c57`](https://github.com/apache/spark/commit/a9f2c57d1acdb10c1860aa75b9f102b7b9afbca8).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19392: [SPARK-22169][SQL] support byte length literal as identi...

2017-10-04 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/19392
  
Hmm, it's not a bug fix but a nice-to-have feature; do we want this in Spark 2.2?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19433: [SPARK-3162] [MLlib][WIP] Add local tree training for de...

2017-10-04 Thread smurching
Github user smurching commented on the issue:

https://github.com/apache/spark/pull/19433
  
@WeichenXu123 would you be able to take an initial look at this?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19270: [SPARK-21809] : Change Stage Page to use datatabl...

2017-10-04 Thread ajbozarth
Github user ajbozarth commented on a diff in the pull request:

https://github.com/apache/spark/pull/19270#discussion_r142782246
  
--- Diff: core/src/main/resources/org/apache/spark/ui/static/taskspages.js 
---
@@ -0,0 +1,474 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+$(document).ajaxStop($.unblockUI);
+$(document).ajaxStart(function () {
+$.blockUI({message: 'Loading Tasks Page...'});
+});
+
+$.extend( $.fn.dataTable.ext.type.order, {
+"file-size-pre": ConvertDurationString,
+
+"file-size-asc": function ( a, b ) {
+a = ConvertDurationString( a );
+b = ConvertDurationString( b );
+return ((a < b) ? -1 : ((a > b) ? 1 : 0));
+},
+
+"file-size-desc": function ( a, b ) {
+a = ConvertDurationString( a );
+b = ConvertDurationString( b );
+return ((a < b) ? 1 : ((a > b) ? -1 : 0));
+}
+} );
+
+function createTemplateURI(appId) {
+var words = document.baseURI.split('/');
+var ind = words.indexOf("proxy");
+if (ind > 0) {
+var baseURI = words.slice(0, ind + 1).join('/') + '/' + appId + '/static/stagespage-template.html';
+return baseURI;
+}
+ind = words.indexOf("history");
+if(ind > 0) {
+var baseURI = words.slice(0, ind).join('/') + '/static/stagespage-template.html';
+return baseURI;
+}
+return location.origin + "/static/stagespage-template.html";
+}
+
+// This function will only parse the URL under a certain format,
+// e.g. https://axonitered-jt1.red.ygrid.yahoo.com:50509/history/application_1502220952225_59143/stages/stage/?id=0=0
+function StageEndPoint(appId) {
+var words = document.baseURI.split('/');
+var words2 = document.baseURI.split('?');
+var ind = words.indexOf("proxy");
+if (ind > 0) {
+var appId = words[ind + 1];
+var stageIdLen = words2[1].indexOf('&');
+var stageId = words2[1].substr(3, stageIdLen - 3);
+var newBaseURI = words.slice(0, ind + 2).join('/');
+return newBaseURI + "/api/v1/applications/" + appId + "/stages/" + stageId;
+}
+ind = words.indexOf("history");
+if (ind > 0) {
+var appId = words[ind + 1];
+var attemptId = words[ind + 2];
+var stageIdLen = words2[1].indexOf('&');
+var stageId = words2[1].substr(3, stageIdLen - 3);
+var newBaseURI = words.slice(0, ind).join('/');
+if (isNaN(attemptId) || attemptId == "0") {
+return newBaseURI + "/api/v1/applications/" + appId + "/stages/" + stageId;
+} else {
+return newBaseURI + "/api/v1/applications/" + appId + "/" + attemptId + "/stages/" + stageId;
+}
+}
+var stageIdLen = words2[1].indexOf('&');
+var stageId = words2[1].substr(3, stageIdLen - 3);
+return location.origin + "/api/v1/applications/" + appId + "/stages/" + stageId;
+}
+
+function sortNumber(a,b) {
+return a - b;
+}
+
+function quantile(array, percentile) {
+index = percentile/100. * (array.length-1);
+if (Math.floor(index) == index) {
+   result = array[index];
+} else {
+var i = Math.floor(index);
+fraction = index - i;
+result = array[i];
+}
+return result;
+}
+
+$(document).ready(function () {
+$.extend($.fn.dataTable.defaults, {
+stateSave: true,
+lengthMenu: [[20, 40, 60, 100, -1], [20, 40, 60, 100, "All"]],
+pageLength: 20
+});
+
+$("#showAdditionalMetrics").append(
+"" +
+"" +
+" Show Additional Metrics" +
+"" +
+"" +
+" Select All" +
+" Scheduler Delay" +
+" Task Deserialization Time" +
+" Shuffle Read Blocked Time" +
+" Shuffle Remote Reads" +
+

[GitHub] spark pull request #19270: [SPARK-21809] : Change Stage Page to use datatabl...

2017-10-04 Thread ajbozarth
Github user ajbozarth commented on a diff in the pull request:

https://github.com/apache/spark/pull/19270#discussion_r142781908
  
--- Diff: core/src/main/resources/org/apache/spark/ui/static/utils.js ---
@@ -46,3 +46,64 @@ function formatBytes(bytes, type) {
 var i = Math.floor(Math.log(bytes) / Math.log(k));
 return parseFloat((bytes / Math.pow(k, i)).toFixed(dm)) + ' ' + 
sizes[i];
 }
+
+function formatLogsCells(execLogs, type) {
+if (type !== 'display') return Object.keys(execLogs);
+if (!execLogs) return;
+var result = '';
+$.each(execLogs, function (logName, logUrl) {
+result += '' + logName + ''
+});
+return result;
+}
+
+function getStandAloneAppId(cb) {
--- End diff --

So I know you just copied this function over, but why doesn't it just return the appId rather than taking in a function to run on the appId? It seems to me that returning the appId would make more sense. If changing it makes sense to you as well, we should update it here.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19433: [SPARK-3162] [MLlib][WIP] Add local tree training for de...

2017-10-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19433
  
Can one of the admins verify this patch?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19433: [SPARK-3162] [MLlib][WIP] Add local tree training...

2017-10-04 Thread smurching
GitHub user smurching opened a pull request:

https://github.com/apache/spark/pull/19433

[SPARK-3162] [MLlib][WIP] Add local tree training for decision tree 
regressors

## What changes were proposed in this pull request?
 WIP, DO NOT MERGE

### Overview
This PR adds local tree training for decision tree regressors as a first 
step for addressing 
[SPARK-3162](https://issues.apache.org/jira/browse/SPARK-3162) (train decision 
trees locally when possible). See [this design 
doc](https://docs.google.com/document/d/1baU5KeorrmLpC4EZoqLuG-E8sUJqmdELLbr8o6wdbVM/edit)
 for a detailed description of the proposed changes.

Distributed training logic has been refactored but only minimally modified; 
the local tree training implementation leverages existing distributed training 
logic for computing impurities and splits. This shared logic has been 
refactored into `...Utils` objects (e.g. `SplitUtils.scala`, 
`ImpurityUtils.scala`). 

### How to Review

Each commit in this PR adds non-overlapping functionality, so the PR should 
be reviewable commit-by-commit.

Changes introduced by each commit:
1. Adds new data structures for local tree training (`FeatureVector`, `TrainingInfo`) & associated unit tests (`LocalTreeDataSuite`); see the sketch after this list
2. Adds shared utility methods for computing splits/impurities 
(`SplitUtils`, `ImpurityUtils`, `AggUpdateUtils`), largely copied from existing 
distributed training code in `RandomForest.scala`.
3. Unit tests for split/impurity utility methods (`TreeSplitUtilsSuite`)
4. Updates distributed training code in `RandomForest.scala` to depend on 
the utility methods introduced in 2.
5. Adds local tree training logic (`LocalDecisionTree`) 
6. Local tree unit/integration tests (`LocalTreeUnitSuite`, 
`LocalTreeIntegrationSuite`)
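
For orientation, a hypothetical sketch of the column-major layout described in commit 1; the field names here are illustrative assumptions, not the PR's actual API:

```scala
// Sketch only: one feature's binned values for all training instances are
// stored contiguously, so a best-split scan reads a single array.
private case class FeatureVector(
    featureIndex: Int,   // which feature this column holds
    featureArity: Int,   // 0 for continuous features, > 0 for categorical
    values: Array[Int])  // binned value of this feature for each row
```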

## How was this patch tested?
No existing tests were modified. The following new tests were added (also 
described above):
* Unit tests for new data structures specific to local tree training 
(`LocalTreeDataSuite`, `LocalTreeUtilsSuite`)
* Unit tests for impurity/split utility methods (`TreeSplitUtilsSuite`)
* Unit tests for local tree training logic (`LocalTreeUnitSuite`)
* Integration tests verifying that local & distributed tree training 
produce the same trees (`LocalTreeIntegrationSuite`)



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/smurching/spark pr-splitup

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19433.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19433


commit 219a12001383017e70f10cd7c785272e70e64b28
Author: Sid Murching 
Date:   2017-10-04T20:55:35Z

Add data structures for local tree training & associated tests (in 
LocalTreeDataSuite):
* TrainingInfo: primary local tree training data structure, contains 
all information required to describe state of
algorithm at any point during learning
* FeatureVector: Stores data for an individual feature as an Array[Int]

commit 710714395c966f664af7f7b62226336675ec2ea7
Author: Sid Murching 
Date:   2017-10-04T20:57:30Z

Add utility methods used for impurity and split calculations during both 
local & distributed training:
 * AggUpdateUtils: Helper methods for updating sufficient stats for a given 
node
 * ImpurityUtils: Helper methods for impurity-related calculations during node split decisions
 * SplitUtils: Helper methods for choosing splits given sufficient stats

NOTE: Both ImpurityUtils and SplitUtils primarily contain code taken from 
RandomForest.scala, with slight modifications.
Tests for SplitUtils are contained in the next commit.

commit 49bf0ae9b275264e757de573f81b816437be77e7
Author: Sid Murching 
Date:   2017-10-04T21:36:15Z

Add test suites for utility methods used during best-split computation:
 * TreeSplitUtilsSuite: Test suite for SplitUtils
 * TreeTests: Add utility method (getMetadata) for TreeSplitUtilsSuite

 Also add methods used by these tests in LocalDecisionTree.scala, 
RandomForest.scala

commit bc54b165849202269b80bbac1a84afb857e87e31
Author: Sid Murching 
Date:   2017-10-04T21:48:33Z

 Update RandomForest.scala to use new utility methods for impurity/split 
calculations

commit 6a68a5cc6a6b7087163bbe5681ad41aef5e3fd0a
Author: Sid Murching 

[GitHub] spark issue #19432: [SPARK-22203][SQL]Add job description for file listing S...

2017-10-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19432
  
**[Test build #82460 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82460/testReport)**
 for PR 19432 at commit 
[`aaf41dd`](https://github.com/apache/spark/commit/aaf41dd02b8f2109a84d36768fdf1802f7817961).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19424: [SPARK-22197][SQL] push down operators to data so...

2017-10-04 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19424#discussion_r142800670
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/PushDownOperatorsToDataSource.scala
 ---
@@ -0,0 +1,99 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.v2
+
+import org.apache.spark.sql.catalyst.expressions.{And, AttributeMap, AttributeSet, Expression, ExpressionSet}
+import org.apache.spark.sql.catalyst.planning.PhysicalOperation
+import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan, Project}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.execution.datasources.DataSourceStrategy
+import org.apache.spark.sql.sources
+import org.apache.spark.sql.sources.v2.reader._
+
+/**
+ * Pushes down various operators to the underlying data source for better performance. We classify
+ * operators into different layers, operators in the same layer are orderless, i.e. the query result
+ * won't change if we switch the operators within a layer(e.g. we can switch the order of predicates
+ * and required columns). The operators in layer N can only be pushed down if operators in layer N-1
+ * that above the data source relation are all pushed down. As an example, you can't push down limit
+ * if a filter below limit is not pushed down.
--- End diff --

> As an example, given a LIMIT that has a FILTER child, you can't push down the LIMIT if the FILTER is not completely pushed down. When both are pushed down, the data source should execute the FILTER before the LIMIT.
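
A hedged illustration of the rule in DataFrame terms (not code from this patch; `myV2Source` is a placeholder format name, and `spark` is an assumed SparkSession):

```scala
import org.apache.spark.sql.functions.col

// The filter sits between the limit and the data source relation, so the limit
// may be pushed to the source only after the filter is completely pushed down;
// the source must then apply the filter before the limit to keep the results.
val q = spark.read.format("myV2Source").load() // placeholder v2 source
  .filter(col("x") > 5) // layer 1: predicate
  .limit(10)            // later layer: pushable only once the filter is
```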


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19424: [SPARK-22197][SQL] push down operators to data so...

2017-10-04 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19424#discussion_r142801593
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ScanExec.scala
 ---
@@ -32,13 +32,12 @@ import org.apache.spark.sql.types.StructType
 case class DataSourceV2ScanExec(
--- End diff --

```
/**
 * Physical plan node for scanning data from a data source.
 */
```


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19424: [SPARK-22197][SQL] push down operators to data so...

2017-10-04 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19424#discussion_r142806719
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/PushDownOperatorsToDataSource.scala
 ---
@@ -0,0 +1,99 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.v2
+
+import org.apache.spark.sql.catalyst.expressions.{And, AttributeMap, AttributeSet, Expression, ExpressionSet}
+import org.apache.spark.sql.catalyst.planning.PhysicalOperation
+import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan, Project}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.execution.datasources.DataSourceStrategy
+import org.apache.spark.sql.sources
+import org.apache.spark.sql.sources.v2.reader._
+
+/**
+ * Pushes down various operators to the underlying data source for better performance. We classify
+ * operators into different layers, operators in the same layer are orderless, i.e. the query result
+ * won't change if we switch the operators within a layer(e.g. we can switch the order of predicates
+ * and required columns). The operators in layer N can only be pushed down if operators in layer N-1
+ * that above the data source relation are all pushed down. As an example, you can't push down limit
+ * if a filter below limit is not pushed down.
+ *
+ * Current operator push down layers:
+ *   layer 1: predicates, required columns.
+ */
+object PushDownOperatorsToDataSource extends Rule[LogicalPlan] {
+  override def apply(plan: LogicalPlan): LogicalPlan = {
--- End diff --

This is an optimizer rule? The input is a `LogicalPlan` and the output is 
still a `LogicalPlan`?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19432: [SPARK-22203][SQL]Add job description for file listing S...

2017-10-04 Thread zsxwing
Github user zsxwing commented on the issue:

https://github.com/apache/spark/pull/19432
  
cc @gatorsmile 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19432: [SPARK-22203][SQL]Add job description for file li...

2017-10-04 Thread zsxwing
GitHub user zsxwing opened a pull request:

https://github.com/apache/spark/pull/19432

[SPARK-22203][SQL]Add job description for file listing Spark jobs

## What changes were proposed in this pull request?

The user may be confused by some one-task jobs. We can add a job description to these jobs so that the user can figure out what they are.
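
For context, a minimal sketch of the idea using the existing `SparkContext.setJobDescription` hook; the exact description string and listing code in this patch may differ, and `paths`, `numParallelism`, and `listLeafFiles` are placeholders:

```scala
// Give the parallel file-listing job a readable name so it no longer shows up
// in the web UI as an anonymous job with a handful of tasks.
sparkContext.setJobDescription(s"Listing leaf files and directories for ${paths.length} paths")
val statuses = sparkContext
  .parallelize(paths.map(_.toString), numParallelism)
  .map(path => listLeafFiles(path)) // placeholder for the actual listing logic
  .collect()
```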

## How was this patch tested?

The new unit test.

Before:
https://user-images.githubusercontent.com/1000778/31202567-f78d15c0-a917-11e7-841e-11b8bf8f0032.png

After:
https://user-images.githubusercontent.com/1000778/31202576-fc01e356-a917-11e7-9c2b-7bf80b153adb.png



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zsxwing/spark SPARK-22203

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19432.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19432


commit aaf41dd02b8f2109a84d36768fdf1802f7817961
Author: Shixiong Zhu 
Date:   2017-10-04T21:58:32Z

Add job description for file listing Spark jobs




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19428: [SPARK-22131][MESOS] Mesos driver secrets

2017-10-04 Thread susanxhuynh
Github user susanxhuynh closed the pull request at:

https://github.com/apache/spark/pull/19428


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-04 Thread icexelloss
Github user icexelloss commented on a diff in the pull request:

https://github.com/apache/spark/pull/18732#discussion_r142801623
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/python/ArrowEvalPythonExec.scala
 ---
@@ -26,6 +26,25 @@ import org.apache.spark.sql.catalyst.expressions._
 import org.apache.spark.sql.execution.SparkPlan
 import org.apache.spark.sql.types.StructType
 
+private class BatchIterator[T](iter: Iterator[T], batchSize: Int)
+  extends Iterator[Iterator[T]] {
+
+  override def hasNext: Boolean = iter.hasNext
+
+  override def next(): Iterator[T] = {
--- End diff --

Added.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19108: [SPARK-21898][ML] Feature parity for KolmogorovSmirnovTe...

2017-10-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19108
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19108: [SPARK-21898][ML] Feature parity for KolmogorovSmirnovTe...

2017-10-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19108
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82459/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19108: [SPARK-21898][ML] Feature parity for KolmogorovSmirnovTe...

2017-10-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19108
  
**[Test build #82459 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82459/testReport)**
 for PR 19108 at commit 
[`62a8fcd`](https://github.com/apache/spark/commit/62a8fcd29da6d81981f29dfc3f6e3cb77c7c6fc3).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-04 Thread icexelloss
Github user icexelloss commented on a diff in the pull request:

https://github.com/apache/spark/pull/18732#discussion_r142796899
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/object.scala
 ---
@@ -519,3 +519,18 @@ case class CoGroup(
 outputObjAttr: Attribute,
 left: LogicalPlan,
 right: LogicalPlan) extends BinaryNode with ObjectProducer
+
+case class FlatMapGroupsInPandas(
+groupingAttributes: Seq[Attribute],
+functionExpr: Expression,
+output: Seq[Attribute],
+child: LogicalPlan) extends UnaryNode {
+  /**
+   * This is needed because output attributes is considered `reference` 
when
+   * passed through the constructor.
+   *
+   * Without this, catalyst will complain that output attributes are 
missing
+   * from the input.
+   */
+  override val producedAttributes = AttributeSet(output)
--- End diff --

This is one of the tricky bits.

It's because of this code:

https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala#L135

Because `productIterator` returns all member variables, including `output`, the tree node's `references` will include all the output attributes, and the analyzer will complain about missing input:

```
def missingInput: AttributeSet = references -- inputSet -- 
producedAttributes
```

I think my solution here isn't great, but I don't know the best way to deal with this. If someone with deeper Catalyst knowledge can suggest a better approach, I am happy to get rid of this bit.
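
To spell out why the override silences the check, a simplified worked example (an illustration of the set arithmetic above, not the actual Catalyst evaluation trace):

```scala
// missingInput = references -- inputSet -- producedAttributes
//
// Without the override: `output` is a constructor member, productIterator
// exposes it, so it leaks into `references`:
//   {output attrs} -- {child attrs} -- {}             => non-empty => analysis error
//
// With `producedAttributes = AttributeSet(output)`:
//   {output attrs} -- {child attrs} -- {output attrs} => empty     => passes
```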


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18966: [SPARK-21751][SQL] CodeGeneraor.splitExpressions counts ...

2017-10-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18966
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18966: [SPARK-21751][SQL] CodeGeneraor.splitExpressions counts ...

2017-10-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18966
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82455/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18966: [SPARK-21751][SQL] CodeGeneraor.splitExpressions counts ...

2017-10-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18966
  
**[Test build #82455 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82455/testReport)**
 for PR 18966 at commit 
[`a489938`](https://github.com/apache/spark/commit/a489938b3f128558df31c97a32e196620c9fd475).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


