date:20171228

[GitHub] spark issue #19977: [SPARK-22771][SQL] Concatenate binary inputs into a bina...

2017-12-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19977
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85495/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19977: [SPARK-22771][SQL] Concatenate binary inputs into a bina...

2017-12-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19977
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #19977: [SPARK-22771][SQL] Concatenate binary inputs into a bina...

2017-12-28 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19977
  
**[Test build #85495 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85495/testReport)** for PR 19977 at commit [`57a9d1e`](https://github.com/apache/spark/commit/57a9d1e9da21d56873c97eac08797499199a0c7b).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---


[GitHub] spark issue #19683: [SPARK-21657][SQL] optimize explode quadratic memory con...

2017-12-28 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19683
  
**[Test build #85502 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85502/testReport)** for PR 19683 at commit [`1c6626a`](https://github.com/apache/spark/commit/1c6626acad080404a73519735bc1b3a0fbf6e303).


---


[GitHub] spark pull request #19683: [SPARK-21657][SQL] optimize explode quadratic mem...

2017-12-28 Thread uzadude

Github user uzadude commented on a diff in the pull request:

https://github.com/apache/spark/pull/19683#discussion_r159033662
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/GenerateExec.scala ---
@@ -85,11 +84,19 @@ case class GenerateExec(
 val numOutputRows = longMetric("numOutputRows")
 child.execute().mapPartitionsWithIndexInternal { (index, iter) =>
   val generatorNullRow = new GenericInternalRow(generator.elementSchema.length)
-  val rows = if (join) {
+  val rows = if (requiredChildOutput.nonEmpty) {
+
+val pruneChildForResult: InternalRow => InternalRow =
+  if ((child.outputSet -- requiredChildOutput).isEmpty) {
--- End diff --

Wouldn't it always return false? Or should I use `child.output == AttributeSet(requiredChildOutput)`?


---


[GitHub] spark pull request #20082: [SPARK-22897][CORE]: Expose stageAttemptId in Tas...

2017-12-28 Thread cloud-fan

Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20082#discussion_r159033482
  
--- Diff: core/src/main/scala/org/apache/spark/TaskContext.scala ---
@@ -150,6 +150,13 @@ abstract class TaskContext extends Serializable {
*/
   def stageId(): Int
 
+  /**
+   * An ID that is unique to the stage attempt that this task belongs to. It represents how many
+   * times the stage has been attempted. The first stage attempt will be assigned stageAttemptId = 0
+   * , and subsequent attempts will increasing stageAttemptId one by one.
+   */
+  def stageAttemptId(): Int
--- End diff --

My concern is that `stageAttemptId` is the name we use internally, just as we internally call `TaskContext.taskAttemptId` `taskId`. End users don't know the internal code and are more familiar with `TaskContext`, so I think the naming should be consistent with the public `TaskContext` API rather than with the internal code.


---


[GitHub] spark pull request #19683: [SPARK-21657][SQL] optimize explode quadratic mem...

2017-12-28 Thread cloud-fan

Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19683#discussion_r159033021
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/GenerateExec.scala ---
@@ -85,11 +84,19 @@ case class GenerateExec(
 val numOutputRows = longMetric("numOutputRows")
 child.execute().mapPartitionsWithIndexInternal { (index, iter) =>
   val generatorNullRow = new GenericInternalRow(generator.elementSchema.length)
-  val rows = if (join) {
+  val rows = if (requiredChildOutput.nonEmpty) {
+
+val pruneChildForResult: InternalRow => InternalRow =
+  if ((child.outputSet -- requiredChildOutput).isEmpty) {
--- End diff --

just `child.output == requiredChildOutput`?
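
For reference, the two checks under discussion differ in what they accept. A minimal sketch, using plain Python strings as stand-ins for Catalyst attributes (an assumption for illustration; the real code compares `Attribute` objects, and the helper names below are hypothetical):

```python
# Plain strings stand in for Catalyst attributes (an assumption for illustration).
def covers_all_by_set(child_output, required):
    """Model of `(child.outputSet -- requiredChildOutput).isEmpty`."""
    return not (set(child_output) - set(required))


def covers_all_by_seq(child_output, required):
    """Model of `child.output == requiredChildOutput`; order matters."""
    return list(child_output) == list(required)


child_output = ["id", "tags", "payload"]
for required in (["id", "tags", "payload"], ["payload", "id", "tags"], ["id"]):
    print(required, covers_all_by_set(child_output, required),
          covers_all_by_seq(child_output, required))
```

With these stand-ins, the set-based check treats a reordered column list as "nothing to prune", while the sequence comparison only accepts the same columns in the same order.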


---


[GitHub] spark pull request #19683: [SPARK-21657][SQL] optimize explode quadratic mem...

2017-12-28 Thread cloud-fan

Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19683#discussion_r159032990
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/GenerateExec.scala ---
@@ -47,8 +47,13 @@ private[execution] sealed case class LazyIterator(func: () => TraversableOnce[In
  * terminate().
  *
  * @param generator the generator expression
- * @param join  when true, each output row is implicitly joined with the input tuple that produced
- *  it.
+ * @param requiredChildOutput this paramter starts as Nil and gets filled by the Optimizer.
--- End diff --

We don't need to duplicate the comment here; just say `required attributes from child output`.


---


[GitHub] spark pull request #20062: [SPARK-22892] [SQL] Simplify some estimation logi...

2017-12-28 Thread asfgit

Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/20062


---


[GitHub] spark issue #20062: [SPARK-22892] [SQL] Simplify some estimation logic by us...

2017-12-28 Thread cloud-fan

Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/20062
  
thanks, merging to master!


---


[GitHub] spark issue #19683: [SPARK-21657][SQL] optimize explode quadratic memory con...

2017-12-28 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19683
  
**[Test build #85501 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85501/testReport)** for PR 19683 at commit [`8f06dda`](https://github.com/apache/spark/commit/8f06dda16c692cdc2204eddaee5ae3ba2321258d).


---


[GitHub] spark pull request #20105: [SPARK-22920][SPARKR] sql functions for current_d...

2017-12-28 Thread felixcheung

GitHub user felixcheung reopened a pull request:

https://github.com/apache/spark/pull/20105

[SPARK-22920][SPARKR] sql functions for current_date, current_timestamp, 
rtrim/ltrim/trim with trimString

## What changes were proposed in this pull request?

Add sql functions

## How was this patch tested?

manual, unit tests

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/felixcheung/spark rsqlfuncs

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20105.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20105


commit c8e118a4c6e0f9f05d3c48c3d715da82b4ebd334
Author: Felix Cheung 
Date:   2017-12-28T11:11:41Z

ltrim/rtrim/trim with trimString + current_date() + current_timestamp()

commit 284c74a7e74fb24024cf4cc6557b30e1169cf445
Author: Felix Cheung 
Date:   2017-12-28T11:12:18Z

NeedsCompilation in DESCRIPTION

commit 1f2fac3afc376fd61e54cb12b6c34f60f5522280
Author: Felix Cheung 
Date:   2017-12-28T21:44:29Z

fix example




---


[GitHub] spark pull request #20105: [SPARK-22920][SPARKR] sql functions for current_d...

2017-12-28 Thread felixcheung

Github user felixcheung closed the pull request at:

https://github.com/apache/spark/pull/20105


---


[GitHub] spark pull request #20020: [SPARK-22834][SQL] Make insertion commands have r...

2017-12-28 Thread asfgit

Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/20020


---


[GitHub] spark issue #20020: [SPARK-22834][SQL] Make insertion commands have real chi...

2017-12-28 Thread cloud-fan

Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/20020
  
thanks, merging to master!


---


[GitHub] spark issue #19683: [SPARK-21657][SQL] optimize explode quadratic memory con...

2017-12-28 Thread cloud-fan

Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/19683
  
LGTM except 2 comments


---


[GitHub] spark pull request #19683: [SPARK-21657][SQL] optimize explode quadratic mem...

2017-12-28 Thread cloud-fan

Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19683#discussion_r159031802
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/GenerateExec.scala ---
@@ -57,20 +62,19 @@ private[execution] sealed case class LazyIterator(func: () => TraversableOnce[In
  */
 case class GenerateExec(
 generator: Generator,
-join: Boolean,
+unrequiredChildIndex: Seq[Int],
--- End diff --

The physical plan can just take `requiredChildOutput`, and in the planner we can just do:
```
case g @ logical.Generate(...) => GenerateExec(..., g.requiredChildOutput)
```


---


[GitHub] spark issue #20010: [SPARK-22826][SQL] findWiderTypeForTwo Fails over Struct...

2017-12-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20010
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85497/
Test FAILed.


---


[GitHub] spark issue #20010: [SPARK-22826][SQL] findWiderTypeForTwo Fails over Struct...

2017-12-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20010
  
Merged build finished. Test FAILed.


---


[GitHub] spark issue #20010: [SPARK-22826][SQL] findWiderTypeForTwo Fails over Struct...

2017-12-28 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20010
  
**[Test build #85497 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85497/testReport)** for PR 20010 at commit [`86e1929`](https://github.com/apache/spark/commit/86e1929c490861d9e93ef34abd52c442f99f31a9).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---


[GitHub] spark pull request #19683: [SPARK-21657][SQL] optimize explode quadratic mem...

2017-12-28 Thread cloud-fan

Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19683#discussion_r159031526
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala ---
@@ -73,25 +73,32 @@ case class Project(projectList: Seq[NamedExpression], child: LogicalPlan) extend
  * their output.
  *
  * @param generator the generator expression
- * @param join  when true, each output row is implicitly joined with the input tuple that produced
- *  it.
+ * @param unrequiredChildIndex this paramter starts as Nil and gets filled by the Optimizer.
+ *  It's used as an optimization for omitting data generation that will
+ *  be discarded next by a projection.
+ *  A common use case is when we explode(array(..)) and are interested
+ *  only in the exploded data and not in the original array. before this
+ *  optimization the array got duplicated for each of its elements,
+ *  causing O(n^^2) memory consumption. (see [SPARK-21657])
  * @param outer when true, each input row will be output at least once, even if the output of the
  *  given `generator` is empty.
  * @param qualifier Qualifier for the attributes of generator(UDTF)
  * @param generatorOutput The output schema of the Generator.
  * @param child Children logical plan node
  */
 case class Generate(
-generator: Generator,
-join: Boolean,
-outer: Boolean,
-qualifier: Option[String],
-generatorOutput: Seq[Attribute],
-child: LogicalPlan)
+ generator: Generator,
+ unrequiredChildIndex: Seq[Int],
+ outer: Boolean,
+ qualifier: Option[String],
+ generatorOutput: Seq[Attribute],
+ child: LogicalPlan)
--- End diff --

wrong indentation?
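
For context, the O(n^2) claim in the `unrequiredChildIndex` doc comment quoted above can be illustrated with a rough sketch. It models rows as Python dicts and counts logical cells in the output; the function names are hypothetical, and this is not how `GenerateExec` is implemented:

```python
def explode(rows, array_col, keep_source_array):
    """Yield one output row per element of row[array_col].

    keep_source_array=True models joining each output row with its input
    row; False models dropping the source array from the output.
    """
    for row in rows:
        for element in row[array_col]:
            out = {"element": element}
            if keep_source_array:
                out[array_col] = row[array_col]
            yield out


def cells_emitted(n, keep_source_array):
    """Count logical values in the output for one input row with n elements."""
    rows = [{"values": list(range(n))}]
    total = 0
    for out in explode(rows, "values", keep_source_array):
        total += 1  # the exploded element itself
        if keep_source_array:
            total += len(out["values"])  # the array repeated on this row
    return total
```

Under this model, keeping the source array yields n * (n + 1) cells for an n-element array, while dropping it yields n.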


---


[GitHub] spark issue #20113: [SPARK-22905][ML][FollowUp] Fix GaussianMixtureModel sav...

2017-12-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20113
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85498/
Test PASSed.


---


[GitHub] spark issue #20113: [SPARK-22905][ML][FollowUp] Fix GaussianMixtureModel sav...

2017-12-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20113
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #20113: [SPARK-22905][ML][FollowUp] Fix GaussianMixtureModel sav...

2017-12-28 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20113
  
**[Test build #85498 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85498/testReport)** for PR 20113 at commit [`408bfed`](https://github.com/apache/spark/commit/408bfed88cd237e5adbf42bd5b4fd2ccf875b5bd).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---


[GitHub] spark pull request #20109: [SPARK-22891][SQL] Make hive client creation thre...

2017-12-28 Thread asfgit

Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/20109


---


[GitHub] spark issue #20109: [SPARK-22891][SQL] Make hive client creation thread safe

2017-12-28 Thread gatorsmile

Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20109
  
Thanks! Merged to master.


---


[GitHub] spark issue #19683: [SPARK-21657][SQL] optimize explode quadratic memory con...

2017-12-28 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19683
  
**[Test build #85500 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85500/testReport)** for PR 19683 at commit [`288aa73`](https://github.com/apache/spark/commit/288aa733e2dce341aca1c60d6800935564fb9843).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20114: [SPARK-22530][PYTHON][SQL] Adding Arrow support for Arra...

2017-12-28 Thread ueshin

Github user ueshin commented on the issue:

https://github.com/apache/spark/pull/20114
  
How about simply returning `false` from `ArrowVectorAccessor.isNullAt(int rowId)` when `accessor.getValueCount() > 0 && accessor.getValidityBuffer().capacity() == 0` without modifying the buffer?
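
A rough model of this suggestion, assuming the validity bitmap is a Python `bytes` object with one bit per value (least-significant bit first, a set bit meaning "valid"). The class and method names are hypothetical and are not the Spark or Arrow Java API:

```python
class ListAccessorSketch:
    """Hypothetical stand-in for an accessor over an Arrow list vector."""

    def __init__(self, value_count, validity=b""):
        self.value_count = value_count
        self.validity = validity  # empty bytes models a buffer with capacity 0

    def is_null_at(self, row_id):
        # Suggested short-circuit: values exist but no validity bits were sent,
        # so every value is treated as non-null and the buffer is left untouched.
        if self.value_count > 0 and len(self.validity) == 0:
            return False
        byte_index, bit_index = divmod(row_id, 8)
        return (self.validity[byte_index] >> bit_index) & 1 == 0
```

The point of this variant is that nothing is allocated or rewritten; the empty buffer is only interpreted differently at read time.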


---


[GitHub] spark issue #19683: [SPARK-21657][SQL] optimize explode quadratic memory con...

2017-12-28 Thread uzadude

Github user uzadude commented on the issue:

https://github.com/apache/spark/pull/19683
  
seems reasonable, let's do that.


---


[GitHub] spark issue #20020: [SPARK-22834][SQL] Make insertion commands have real chi...

2017-12-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20020
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85493/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20020: [SPARK-22834][SQL] Make insertion commands have real chi...

2017-12-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20020
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20020: [SPARK-22834][SQL] Make insertion commands have real chi...

2017-12-28 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20020
  
**[Test build #85493 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85493/testReport)** for PR 20020 at commit [`18ec016`](https://github.com/apache/spark/commit/18ec01638b9da7f8150e3ea35c2876d6d1f41f3d).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---


[GitHub] spark issue #19683: [SPARK-21657][SQL] optimize explode quadratic memory con...

2017-12-28 Thread Tagar

Github user Tagar commented on the issue:

https://github.com/apache/spark/pull/19683
  
An exception similar to the one in the failing unit tests was fixed in [SPARK-18300](https://issues.apache.org/jira/browse/SPARK-18300):

> java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.Literal cannot be cast to org.apache.spark.sql.catalyst.expressions.Attribute

https://github.com/apache/spark/pull/15892

Not sure if this is directly applicable or helpful here though.


---


[GitHub] spark issue #19892: [SPARK-22797][PySpark] Bucketizer support multi-column

2017-12-28 Thread zhengruifeng

Github user zhengruifeng commented on the issue:

https://github.com/apache/spark/pull/19892
  
ping @MLnick ?


---


[GitHub] spark issue #20114: [SPARK-22530][PYTHON][SQL] Adding Arrow support for Arra...

2017-12-28 Thread BryanCutler

Github user BryanCutler commented on the issue:

https://github.com/apache/spark/pull/20114
  
ping @ueshin @HyukjinKwon

Unfortunately, there was a bug in the Arrow 0.8.0 release on the Java side (https://issues.apache.org/jira/browse/ARROW-1948) that caused a problem here. I was able to find a workaround, but it required a change to the `ArrowVectorAccessor` class. I'm not sure if this is something you would be OK putting in, or if you would prefer to wait until the next minor release to add the ArrayType support.

The issue is that the Arrow spec states that an empty validity buffer means all the values are non-null. In Arrow 0.8.0, the C++/Python side started sending buffers this way, and the Arrow ListVector did not handle it properly, treating it instead as if there were no valid values.

The workaround I added here checks whether the ListVector has a value count > 0 and an empty validity buffer. If so, all the values are non-null, and it allocates a new validity buffer with all bits set.

For Arrow without UDFs (toPandas and createDataFrame) this only needs to be done once, but for UDFs each batch read loads new buffers into the Arrow VectorSchemaRoot, so it needs to be checked after each read. The simplest place to put the workaround to cover these cases was to allow `ArrowVectorAccessor.isNullAt(int rowId)` to be overridden. Let me know what you guys think, thanks!
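
The workaround described above, rebuilding a missing validity buffer and repeating the check for every batch, might be sketched as follows. Batches are modeled as `(value_count, validity_bytes)` tuples and the function names are hypothetical; this is an illustration under those assumptions, not the Arrow Java API or the code in this PR:

```python
def ensure_validity(value_count, validity):
    """Return a validity bitmap, allocating an all-set one if it is missing."""
    if value_count > 0 and len(validity) == 0:
        # One bit per value, rounded up to whole bytes, every bit set.
        return b"\xff" * ((value_count + 7) // 8)
    return validity


def read_batches(batches):
    """Apply the check to every batch, since each read brings new buffers."""
    for value_count, validity in batches:
        yield value_count, ensure_validity(value_count, validity)
```

Running the check inside the per-batch loop mirrors the UDF case, where a buffer fixed for one batch is replaced by the next read.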


---


[GitHub] spark issue #20114: [SPARK-22530][PYTHON][SQL] Adding Arrow support for Arra...

2017-12-28 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20114
  
**[Test build #85499 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85499/testReport)** for PR 20114 at commit [`d2c5c2b`](https://github.com/apache/spark/commit/d2c5c2b4ea803ac8d1f08a5f79af1076f9e5bd2b).


---


[GitHub] spark issue #20070: SPARK-22896 Improvement in String interpolation

2017-12-28 Thread chetkhatri

Github user chetkhatri commented on the issue:

https://github.com/apache/spark/pull/20070
  
@srowen please do re-run the build.


---


[GitHub] spark pull request #20058: [SPARK-22922][ML][PySpark] Pyspark portion of the...

2017-12-28 Thread jkbradley

Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/20058#discussion_r159028207
  
--- Diff: python/pyspark/ml/base.py ---
@@ -47,6 +86,28 @@ def _fit(self, dataset):
 """
 raise NotImplementedError()
 
+@since("2.3.0")
+def fitMultiple(self, dataset, params):
--- End diff --

Good point: we could rename "params" to be clearer. How about "paramMaps"?
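
To show how the proposed name reads at a call site, here is a toy sketch. The estimator class is hypothetical and is not the PySpark `Estimator`; yielding `(index, model)` pairs is an assumption made for illustration:

```python
class MeanEstimatorSketch:
    """Toy estimator: the 'model' is the dataset mean times a `scale` param."""

    def fit(self, dataset, paramMap=None):
        scale = (paramMap or {}).get("scale", 1.0)
        return scale * sum(dataset) / len(dataset)

    def fitMultiple(self, dataset, paramMaps):
        # One model per parameter map, paired with that map's index.
        for index, paramMap in enumerate(paramMaps):
            yield index, self.fit(dataset, paramMap)
```

With the plural `paramMaps`, it is clearer that the argument is a list of parameter maps and that one model is fitted per map.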


---


[GitHub] spark pull request #20114: [SPARK-22530][PYTHON][SQL] Adding Arrow support f...

2017-12-28 Thread BryanCutler

GitHub user BryanCutler opened a pull request:

https://github.com/apache/spark/pull/20114

[SPARK-22530][PYTHON][SQL] Adding Arrow support for ArrayType

## What changes were proposed in this pull request?

This change adds `ArrayType` support for working with Arrow in pyspark when 
creating a DataFrame, calling `toPandas()`, and using vectorized `pandas_udf`.

## How was this patch tested?

Added new Python unit tests using Array data.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/BryanCutler/spark arrow-ArrayType-support-SPARK-22530

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20114.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20114


commit 50fa54c5b04455729b019c660ab8e86c903bda44
Author: Bryan Cutler 
Date:   2017-11-15T23:44:23Z

wip, toPandas works with pyarrow 0.7.1

commit a149352d0c60882bb6692cd43d2fb60c8dddb07b
Author: Bryan Cutler 
Date:   2017-12-01T20:02:16Z

createDataFrame test now working

commit 36faab4d7a23421968e1885dc6f2f47ac20c0ce0
Author: Bryan Cutler 
Date:   2017-12-23T08:21:34Z

using is_list to check type

commit b0c79f108acf3ca91dd931bb9be45e4bbcf840a6
Author: Bryan Cutler 
Date:   2017-12-24T07:06:06Z

Using a workaround for ListVector validity buffer, ArrowTests passing

commit f1bc9a5d8ba09cf6d702269b2418697184ef5690
Author: Bryan Cutler 
Date:   2017-12-29T05:54:44Z

ArrayType working in vectorized udfs

commit d2c5c2b4ea803ac8d1f08a5f79af1076f9e5bd2b
Author: Bryan Cutler 
Date:   2017-12-29T06:04:19Z

fix import order




---


[GitHub] spark pull request #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator f...

2017-12-28 Thread jkbradley

Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/19527#discussion_r159028159
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoderEstimator.scala ---
@@ -0,0 +1,519 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.attribute._
+import org.apache.spark.ml.linalg.Vectors
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared.{HasHandleInvalid, HasInputCols, HasOutputCols}
+import org.apache.spark.ml.util._
+import org.apache.spark.sql.{DataFrame, Dataset}
+import org.apache.spark.sql.expressions.UserDefinedFunction
+import org.apache.spark.sql.functions.{col, lit, udf}
+import org.apache.spark.sql.types.{DoubleType, NumericType, StructField, StructType}
+
+/** Private trait for params and common methods for OneHotEncoderEstimator and OneHotEncoderModel */
+private[ml] trait OneHotEncoderBase extends Params with HasHandleInvalid
+with HasInputCols with HasOutputCols {
+
+  /**
+   * Param for how to handle invalid data.
+   * Options are 'keep' (invalid data presented as an extra categorical feature) or
+   * 'error' (throw an error).
+   * Default: "error"
+   * @group param
+   */
+  @Since("2.3.0")
+  override val handleInvalid: Param[String] = new Param[String](this, "handleInvalid",
+"How to handle invalid data " +
+"Options are 'keep' (invalid data presented as an extra categorical feature) " +
+"or error (throw an error).",
+ParamValidators.inArray(OneHotEncoderEstimator.supportedHandleInvalids))
+
+  setDefault(handleInvalid, OneHotEncoderEstimator.ERROR_INVALID)
+
+  /**
+   * Whether to drop the last category in the encoded vector (default: true)
+   * @group param
+   */
+  @Since("2.3.0")
+  final val dropLast: BooleanParam =
+new BooleanParam(this, "dropLast", "whether to drop the last category")
+  setDefault(dropLast -> true)
+
+  /** @group getParam */
+  @Since("2.3.0")
+  def getDropLast: Boolean = $(dropLast)
+
+  protected def validateAndTransformSchema(
+  schema: StructType, dropLast: Boolean, keepInvalid: Boolean): StructType = {
+val inputColNames = $(inputCols)
+val outputColNames = $(outputCols)
+val existingFields = schema.fields
+
+require(inputColNames.length == outputColNames.length,
+  s"The number of input columns ${inputColNames.length} must be the same as the number of " +
+s"output columns ${outputColNames.length}.")
+
+// Input columns must be NumericType.
+inputColNames.foreach(SchemaUtils.checkNumericType(schema, _))
+
+// Prepares output columns with proper attributes by examining input columns.
+val inputFields = $(inputCols).map(schema(_))
+
+val outputFields = inputFields.zip(outputColNames).map { case (inputField, outputColName) =>
+  OneHotEncoderCommon.transformOutputColumnSchema(
+inputField, outputColName, dropLast, keepInvalid)
+}
+outputFields.foldLeft(schema) { case (newSchema, outputField) =>
+  SchemaUtils.appendColumn(newSchema, outputField)
+}
+  }
+}
+
+/**
+ * A one-hot encoder that maps a column of category indices to a column of binary vectors, with
+ * at most a single one-value per row that indicates the input category index.
+ * For example with 5 categories, an input value of 2.0 would map to an output vector of
+ * `[0.0, 0.0, 1.0, 0.0]`.
+ * The last category is not included by default (configurable via `dropLast`),
+ * because it makes the vector entries sum up to one, and hence linearly dependent.
+ * So an input value of 4.0 maps to
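
The quoted diff is cut off at this point in the archive. As a rough illustration of the encoding the doc comment describes, here is a minimal sketch using plain Python lists. The function name and signature are hypothetical, and this is not the `OneHotEncoderEstimator` implementation:

```python
def one_hot(index, num_categories, drop_last=True):
    """Encode a category index as a dense 0/1 vector (illustrative only)."""
    category = int(index)
    if not 0 <= category < num_categories:
        raise ValueError(f"invalid category index: {index}")
    # With drop_last, the vector has one slot fewer and the last category
    # has no slot of its own.
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    if category < size:
        vector[category] = 1.0
    return vector
```

`one_hot(2.0, 5)` returns `[0.0, 0.0, 1.0, 0.0]`, matching the example in the doc comment. With `drop_last=False` every vector sums to one, which is the linear dependence the comment refers to.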

[GitHub] spark issue #20113: [SPARK-22905][ML][FollowUp] Fix GaussianMixtureModel sav...

2017-12-28 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20113
  
**[Test build #85498 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85498/testReport)** for PR 20113 at commit [`408bfed`](https://github.com/apache/spark/commit/408bfed88cd237e5adbf42bd5b4fd2ccf875b5bd).


---


[GitHub] spark pull request #20113: [SPARK-22905][ML][FollowUp] Fix GaussianMixtureMo...

2017-12-28 Thread zhengruifeng

GitHub user zhengruifeng opened a pull request:

https://github.com/apache/spark/pull/20113

[SPARK-22905][ML][FollowUp] Fix GaussianMixtureModel save

## What changes were proposed in this pull request?
make sure model data is stored in order.  @WeichenXu123 



## How was this patch tested?
existing tests

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhengruifeng/spark gmm_save

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20113.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20113


commit 408bfed88cd237e5adbf42bd5b4fd2ccf875b5bd
Author: Zheng RuiFeng 
Date:   2017-12-29T06:01:51Z

create pr




---


[GitHub] spark pull request #19843: [SPARK-22644][ML][TEST] Make ML testsuite support...

2017-12-28 Thread jkbradley

Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/19843#discussion_r159027996
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/util/MLTest.scala ---
@@ -0,0 +1,81 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.util
+
+import java.io.File
+
+import org.scalatest.Suite
+
+import org.apache.spark.SparkContext
+import org.apache.spark.ml.Transformer
+import org.apache.spark.sql.{DataFrame, Encoder, Row}
+import org.apache.spark.sql.execution.streaming.MemoryStream
+import org.apache.spark.sql.streaming.StreamTest
+import org.apache.spark.sql.test.TestSparkSession
+import org.apache.spark.util.Utils
+
+trait MLTest extends StreamTest with TempDirectory { self: Suite =>
+
+  @transient var sc: SparkContext = _
+  @transient var checkpointDir: String = _
+
+  protected override def createSparkSession: TestSparkSession = {
+new TestSparkSession(new SparkContext("local[2]", "MLlibUnitTest", sparkConf))
+  }
+
+  override def beforeAll(): Unit = {
+super.beforeAll()
+sc = spark.sparkContext
+checkpointDir = Utils.createDirectory(tempDir.getCanonicalPath, "checkpoints").toString
+sc.setCheckpointDir(checkpointDir)
+  }
+
+  override def afterAll() {
--- End diff --

Actually, it's worse than this.  I see a bunch of failures when I run 
multiple test suites at once, even when doing `sbt clean package` beforehand 
and without any tests which fail by themselves.  Will test on master and 
complain on the dev list if it's an issue.  (No need to respond here)


---


[GitHub] spark pull request #20111: [SPARK-22883][ML][TEST] Streaming tests for spark...

2017-12-28 Thread jkbradley

Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/20111#discussion_r159027879
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/HashingTFSuite.scala ---
@@ -37,21 +36,28 @@ class HashingTFSuite extends SparkFunSuite with 
MLlibTestSparkContext with Defau
   }
 
   test("hashingTF") {
--- End diff --

ditto: rearranged to do validity check per-row


---


[GitHub] spark pull request #20111: [SPARK-22883][ML][TEST] Streaming tests for spark...

2017-12-28 Thread jkbradley

Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/20111#discussion_r159027858
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/FeatureHasherSuite.scala ---
@@ -51,31 +48,31 @@ class FeatureHasherSuite extends SparkFunSuite
   }
 
   test("feature hashing") {
--- End diff --

Rearranged this test so it checks each row independently.
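To illustrate what checking each row independently buys in a streaming test, here is a small sketch. It is not the suite's code: `hash_index`, `hashing_tf`, `check_row`, and `transformed_rows` are hypothetical names, and `zlib.crc32` stands in for the murmur3 hash only so that the example needs nothing beyond the standard library. The point is that `check_row` uses nothing but the row it is given, so it works whether rows arrive all at once or one at a time.

```python
import zlib
from collections import Counter


def hash_index(term, num_features):
    # crc32 is a stand-in hash, chosen only to avoid extra dependencies.
    return zlib.crc32(term.encode("utf-8")) % num_features


def hashing_tf(terms, num_features):
    """Map a list of terms to a sparse {index: count} dict."""
    return dict(Counter(hash_index(t, num_features) for t in terms))


def check_row(terms, features, num_features):
    """Validate one output row using only that row's own input."""
    expected = {}
    for t in terms:
        idx = hash_index(t, num_features)
        expected[idx] = expected.get(idx, 0) + 1
    assert features == expected, (terms, features, expected)
    assert sum(features.values()) == len(terms)


def transformed_rows(rows, num_features):
    # Generator: yields rows one at a time, as a streaming sink would see them.
    for terms in rows:
        yield terms, hashing_tf(terms, num_features)


rows = [["a", "a", "b"], ["c"], []]
for terms, features in transformed_rows(rows, 16):
    check_row(terms, features, 16)
```

A check that compares the whole collected output against a whole expected dataset would not fit this shape, which is presumably why the tests were rearranged.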


---


[GitHub] spark issue #19683: [SPARK-21657][SQL] optimize explode quadratic memory con...

2017-12-28 Thread cloud-fan

Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/19683
  
Now I feel it's a little hacky to introduce 
`Generate.unrequiredChildOutput`, as the attribute may get replaced by something 
else during optimization. How about `Generate.unrequiredChildIndex`?
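A toy sketch of the concern, under assumptions: attributes are modelled as plain strings such as `"items#2"`, and `required_child_output` is a hypothetical helper, not Catalyst code. If a rule replaces an attribute, a stored attribute reference goes stale, while a stored position in the child's output still points at the right column.

```python
def required_child_output(child_output, unrequired_child_index):
    """Return the child columns to keep, given the positions to drop."""
    drop = set(unrequired_child_index)
    return [attr for i, attr in enumerate(child_output) if i not in drop]


# Child output before an optimizer rule rewrites it (illustrative names only).
before = ["id#1", "items#2"]
# Suppose some rule replaces the second attribute with a new one.
after = ["id#1", "items#7"]

# Tracking by attribute: the stored reference no longer matches anything,
# so the column that should be dropped is kept.
unrequired_by_attr = {"items#2"}
kept_by_attr = [a for a in after if a not in unrequired_by_attr]

# Tracking by position: still drops the second column after the rewrite.
unrequired_by_index = [1]
kept_by_index = required_child_output(after, unrequired_by_index)
```

Here `kept_by_attr` still contains the rewritten `items` column, whereas `kept_by_index` contains only `id#1`.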


---


[GitHub] spark pull request #20110: [SPARK-22313][PYTHON][FOLLOWUP] Explicitly import...

2017-12-28 Thread asfgit

Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/20110


---


[GitHub] spark issue #20110: [SPARK-22313][PYTHON][FOLLOWUP] Explicitly import warnin...

2017-12-28 Thread ueshin

Github user ueshin commented on the issue:

https://github.com/apache/spark/pull/20110
  
Thanks! merging to master.


---


[GitHub] spark issue #20110: [SPARK-22313][PYTHON][FOLLOWUP] Explicitly import warnin...

2017-12-28 Thread ueshin

Github user ueshin commented on the issue:

https://github.com/apache/spark/pull/20110
  
I confirmed that the test passes with the patch in my local environment.


---


[GitHub] spark issue #20010: [SPARK-22826][SQL] findWiderTypeForTwo Fails over Struct...

2017-12-28 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20010
  
**[Test build #85497 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85497/testReport)**
 for PR 20010 at commit 
[`86e1929`](https://github.com/apache/spark/commit/86e1929c490861d9e93ef34abd52c442f99f31a9).


---


[GitHub] spark pull request #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator f...

2017-12-28 Thread viirya

Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19527#discussion_r159025626
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoderEstimator.scala 
---
@@ -0,0 +1,519 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.attribute._
+import org.apache.spark.ml.linalg.Vectors
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared.{HasHandleInvalid, HasInputCols, HasOutputCols}
+import org.apache.spark.ml.util._
+import org.apache.spark.sql.{DataFrame, Dataset}
+import org.apache.spark.sql.expressions.UserDefinedFunction
+import org.apache.spark.sql.functions.{col, lit, udf}
+import org.apache.spark.sql.types.{DoubleType, NumericType, StructField, StructType}
+
+/** Private trait for params and common methods for OneHotEncoderEstimator and OneHotEncoderModel */
+private[ml] trait OneHotEncoderBase extends Params with HasHandleInvalid
+with HasInputCols with HasOutputCols {
+
+  /**
+   * Param for how to handle invalid data.
+   * Options are 'keep' (invalid data presented as an extra categorical feature) or
+   * 'error' (throw an error).
+   * Default: "error"
+   * @group param
+   */
+  @Since("2.3.0")
+  override val handleInvalid: Param[String] = new Param[String](this, "handleInvalid",
+"How to handle invalid data " +
+"Options are 'keep' (invalid data presented as an extra categorical feature) " +
+"or error (throw an error).",
+ParamValidators.inArray(OneHotEncoderEstimator.supportedHandleInvalids))
+
+  setDefault(handleInvalid, OneHotEncoderEstimator.ERROR_INVALID)
+
+  /**
+   * Whether to drop the last category in the encoded vector (default: true)
+   * @group param
+   */
+  @Since("2.3.0")
+  final val dropLast: BooleanParam =
+new BooleanParam(this, "dropLast", "whether to drop the last category")
+  setDefault(dropLast -> true)
+
+  /** @group getParam */
+  @Since("2.3.0")
+  def getDropLast: Boolean = $(dropLast)
+
+  protected def validateAndTransformSchema(
+  schema: StructType, dropLast: Boolean, keepInvalid: Boolean): StructType = {
+val inputColNames = $(inputCols)
+val outputColNames = $(outputCols)
+val existingFields = schema.fields
+
+require(inputColNames.length == outputColNames.length,
+  s"The number of input columns ${inputColNames.length} must be the same as the number of " +
+s"output columns ${outputColNames.length}.")
+
+// Input columns must be NumericType.
+inputColNames.foreach(SchemaUtils.checkNumericType(schema, _))
+
+// Prepares output columns with proper attributes by examining input columns.
+val inputFields = $(inputCols).map(schema(_))
+
+val outputFields = inputFields.zip(outputColNames).map { case (inputField, outputColName) =>
+  OneHotEncoderCommon.transformOutputColumnSchema(
+inputField, outputColName, dropLast, keepInvalid)
+}
+outputFields.foldLeft(schema) { case (newSchema, outputField) =>
+  SchemaUtils.appendColumn(newSchema, outputField)
+}
+  }
+}
+
+/**
+ * A one-hot encoder that maps a column of category indices to a column of binary vectors, with
+ * at most a single one-value per row that indicates the input category index.
+ * For example with 5 categories, an input value of 2.0 would map to an output vector of
+ * `[0.0, 0.0, 1.0, 0.0]`.
+ * The last category is not included by default (configurable via `dropLast`),
+ * because it makes the vector entries sum up to one, and hence linearly dependent.
+ * So an input value of 4.0 maps to

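The encoding described in the doc comment above can be sketched in a few lines of plain Python. This is an illustration only: `one_hot` is a hypothetical helper that returns a dense list rather than a Spark vector, and it does not model the `handleInvalid` param (any out-of-range index simply raises).

```python
def one_hot(index, num_categories, drop_last=True):
    """Encode a category index as a dense list of 0.0/1.0 values.

    With drop_last=True the vector has num_categories - 1 entries and the
    last category is encoded as all zeros.
    """
    i = int(index)
    if i != index or not 0 <= i < num_categories:
        raise ValueError("invalid category index: %r" % (index,))
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if i < size:
        vec[i] = 1.0
    return vec
```

With 5 categories, `one_hot(2.0, 5)` gives `[0.0, 0.0, 1.0, 0.0]`, matching the example in the comment, and `one_hot(4.0, 5)` gives a vector of four zeros because the last category is dropped.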
[GitHub] spark issue #20110: [SPARK-22313][PYTHON][FOLLOWUP] Explicitly import warnin...

2017-12-28 Thread ueshin

Github user ueshin commented on the issue:

https://github.com/apache/spark/pull/20110
  
LGTM for the change, but I'm not sure whether the test was indeed triggered 
or not.


---


[GitHub] spark issue #20112: [SPARK-22734][ML][PySpark] Added Python API for VectorSi...

2017-12-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20112
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85496/
Test PASSed.


---


[GitHub] spark issue #20112: [SPARK-22734][ML][PySpark] Added Python API for VectorSi...

2017-12-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20112
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #20112: [SPARK-22734][ML][PySpark] Added Python API for VectorSi...

2017-12-28 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20112
  
**[Test build #85496 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85496/testReport)**
 for PR 20112 at commit 
[`83bb7de`](https://github.com/apache/spark/commit/83bb7ded0d58d4173671904a452039b57bcbea3d).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class VectorSizeHint(JavaTransformer, HasInputCol, HasHandleInvalid, 
JavaMLReadable,`


---


[GitHub] spark issue #20111: [SPARK-22883][ML][TEST] Streaming tests for spark.ml.fea...

2017-12-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20111
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85494/
Test PASSed.


---


[GitHub] spark issue #20111: [SPARK-22883][ML][TEST] Streaming tests for spark.ml.fea...

2017-12-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20111
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #20111: [SPARK-22883][ML][TEST] Streaming tests for spark.ml.fea...

2017-12-28 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20111
  
**[Test build #85494 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85494/testReport)**
 for PR 20111 at commit 
[`12b3dcf`](https://github.com/apache/spark/commit/12b3dcf13f90ea00c2a12ec186a5f3277e812095).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class BinarizerSuite extends MLTest with DefaultReadWriteTest `
  * `class BucketedRandomProjectionLSHSuite extends MLTest with 
DefaultReadWriteTest `
  * `class BucketizerSuite extends MLTest with DefaultReadWriteTest `
  * `class ChiSqSelectorSuite extends MLTest with DefaultReadWriteTest `
  * `class CountVectorizerSuite extends MLTest with DefaultReadWriteTest `
  * `class DCTSuite extends MLTest with DefaultReadWriteTest `
  * `class ElementwiseProductSuite extends MLTest with DefaultReadWriteTest 
`
  * `class FeatureHasherSuite extends MLTest with DefaultReadWriteTest `
  * `class HashingTFSuite extends MLTest with DefaultReadWriteTest `


---


[GitHub] spark issue #20025: [SPARK-22837][SQL]Session timeout checker does not work ...

2017-12-28 Thread liufengdb

Github user liufengdb commented on the issue:

https://github.com/apache/spark/pull/20025
  
My understanding is that reflection was used because we might run against a 
different version of Hive, so we didn't control what was done inside 
`super.init`. However, now that we have inlined the Hive code, it is safe to call 
`super.init` directly. This is a cleaner way to fix the referenced bug and other 
potential bugs, IMO.


---


[GitHub] spark issue #20097: [SPARK-22912] v2 data source support in MicroBatchExecut...

2017-12-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20097
  
Merged build finished. Test FAILed.


---


[GitHub] spark issue #20097: [SPARK-22912] v2 data source support in MicroBatchExecut...

2017-12-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20097
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85491/
Test FAILed.


---


[GitHub] spark issue #20097: [SPARK-22912] v2 data source support in MicroBatchExecut...

2017-12-28 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20097
  
**[Test build #85491 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85491/testReport)**
 for PR 20097 at commit 
[`9ffb92c`](https://github.com/apache/spark/commit/9ffb92c28014a0469cb8e3f77bea2d7100a9416f).
 * This patch **fails SparkR unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---


[GitHub] spark issue #20112: [SPARK-22734][ML][PySpark] Added Python API for VectorSi...

2017-12-28 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20112
  
**[Test build #85496 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85496/testReport)**
 for PR 20112 at commit 
[`83bb7de`](https://github.com/apache/spark/commit/83bb7ded0d58d4173671904a452039b57bcbea3d).


---


[GitHub] spark pull request #20112: [SPARK-22734][ML][PySpark] Added Python API for V...

2017-12-28 Thread MrBago

GitHub user MrBago opened a pull request:

https://github.com/apache/spark/pull/20112

[SPARK-22734][ML][PySpark] Added Python API for VectorSizeHint.

(Please fill in changes proposed in this fix)

Python API for VectorSizeHint Transformer.

(Please explain how this patch was tested. E.g. unit tests, integration 
tests, manual tests)

doc-tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/MrBago/spark vectorSizeHint-PythonAPI

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20112.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20112


commit 83bb7ded0d58d4173671904a452039b57bcbea3d
Author: Bago Amirbekian 
Date:   2017-12-29T03:05:53Z

Added Python API for VectorSizeHint.




---


[GitHub] spark pull request #20058: [SPARK-22922][ML][PySpark] Pyspark portion of the...

2017-12-28 Thread MrBago

Github user MrBago commented on a diff in the pull request:

https://github.com/apache/spark/pull/20058#discussion_r159024163
  
--- Diff: python/pyspark/ml/base.py ---
@@ -47,6 +86,28 @@ def _fit(self, dataset):
 """
 raise NotImplementedError()
 
+@since("2.3.0")
+def fitMultiple(self, dataset, params):
--- End diff --

We couldn't use `fit` because it would have the same signature as the 
existing `fit` method but return a different type (`Iterator[(Int, Model)]` 
instead of `Seq[Model]`). I was trying to be consistent with `Estimator.fit`, which 
uses the name `params`, even though that differs from the name of the same argument in 
Scala :/. Happy to change it.


---


[GitHub] spark pull request #20058: [SPARK-22126][ML][PySpark] Pyspark portion of the...

2017-12-28 Thread MrBago

Github user MrBago commented on a diff in the pull request:

https://github.com/apache/spark/pull/20058#discussion_r159023958
  
--- Diff: python/pyspark/ml/base.py ---
@@ -47,6 +86,28 @@ def _fit(self, dataset):
 """
 raise NotImplementedError()
 
+@since("2.3.0")
+def fitMultiple(self, dataset, params):
+"""
+Fits a model to the input dataset for each param map in params.
+
+:param dataset: input dataset, which is an instance of :py:class:`pyspark.sql.DataFrame`.
+:param params: A Sequence of param maps.
+:return: A thread safe iterable which contains one model for each param map. Each
+ call to `next(modelIterator)` will return `(index, model)` where model was fit
+ using `params[index]`. Params maps may be fit in an order different than their
+ order in params.
+
+.. note:: DeveloperApi
+.. note:: Experimental
+"""
+estimator = self.copy()
+
+def fitSingleModel(index):
+return estimator.fit(dataset, params[index])
+
+return FitMultipleIterator(fitSingleModel, len(params))
--- End diff --

The idea is that you should be able to do something like this:

```
pool = ...  # any pool with imap_unordered, e.g. a thread pool
modelIter = estimator.fitMultiple(dataset, params)
rng = range(len(params))
for index, model in pool.imap_unordered(lambda _: next(modelIter), rng):
    pass
```
That's pretty much how I've set up cross validator to use it, 
https://github.com/apache/spark/pull/20058/files/fe3d6bddc3e9e50febf706d7f22007b1e0d58de3#diff-cbc8c36bfdd245e4e4d5bd27f9b95359R292

The reason for setting it up this way is so that, when appropriate, Estimators 
can implement their own optimized `fitMultiple` methods that just need to 
return an "iterator", i.e. a class with `__iter__` and `__next__`. Models that 
use `maxIter` and `maxDepth` params are examples.
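For readers following along, here is a minimal sketch of what such a thread-safe iterator could look like. It is my guess at the shape, not the code in the PR: the class name matches the `FitMultipleIterator` referenced in the diff, but the body, the `fit_all` helper, and its `parallelism` argument are assumptions made for illustration.

```python
import threading
from multiprocessing.pool import ThreadPool


class FitMultipleIterator(object):
    """Thread-safe iterator that fits one model per call to next()."""

    def __init__(self, fit_single_model, num_models):
        self._fit_single_model = fit_single_model
        self._num_models = num_models
        self._counter = 0
        self._lock = threading.Lock()

    def __iter__(self):
        return self

    def __next__(self):
        # Only handing out the index is locked; the fit itself runs outside
        # the lock so several threads can fit models concurrently.
        with self._lock:
            index = self._counter
            if index >= self._num_models:
                raise StopIteration("No models remaining.")
            self._counter += 1
        return index, self._fit_single_model(index)


def fit_all(fit_single_model, num_models, parallelism=4):
    """Hypothetical driver: fit every model on a thread pool, keyed by index."""
    model_iter = FitMultipleIterator(fit_single_model, num_models)
    pool = ThreadPool(processes=parallelism)
    try:
        pairs = list(pool.imap_unordered(lambda _: next(model_iter),
                                         range(num_models)))
    finally:
        pool.close()
        pool.join()
    return dict(pairs)
```

Because each result carries its index, the caller can match models back to `params[index]` even though completion order is not guaranteed.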


---


[GitHub] spark issue #19977: [SPARK-22771][SQL] Concatenate binary inputs into a bina...

2017-12-28 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19977
  
**[Test build #85495 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85495/testReport)**
 for PR 19977 at commit 
[`57a9d1e`](https://github.com/apache/spark/commit/57a9d1e9da21d56873c97eac08797499199a0c7b).


---


[GitHub] spark issue #20082: [SPARK-22897][CORE]: Expose stageAttemptId in TaskContex...

2017-12-28 Thread advancedxy

Github user advancedxy commented on the issue:

https://github.com/apache/spark/pull/20082
  
ping @cloud-fan @jiangxb1987 @zsxwing, I think it's ready for merging.


---


[GitHub] spark issue #20111: [SPARK-22883][ML][TEST] Streaming tests for spark.ml.fea...

2017-12-28 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20111
  
**[Test build #85494 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85494/testReport)**
 for PR 20111 at commit 
[`12b3dcf`](https://github.com/apache/spark/commit/12b3dcf13f90ea00c2a12ec186a5f3277e812095).


---


[GitHub] spark pull request #20111: [SPARK-22883][ML][TEST] Streaming tests for spark...

2017-12-28 Thread jkbradley

Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/20111#discussion_r159022777
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/FeatureHasherSuite.scala ---
@@ -17,26 +17,23 @@
 
 package org.apache.spark.ml.feature
 
-import org.apache.spark.SparkFunSuite
 import org.apache.spark.ml.attribute.AttributeGroup
 import org.apache.spark.ml.linalg.{Vector, Vectors}
 import org.apache.spark.ml.param.ParamsSuite
-import org.apache.spark.ml.util.DefaultReadWriteTest
+import org.apache.spark.ml.util.{DefaultReadWriteTest, MLTest}
 import org.apache.spark.ml.util.TestingUtils._
-import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.sql.Row
 import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
 import org.apache.spark.sql.functions.col
 import org.apache.spark.sql.types._
 
-class FeatureHasherSuite extends SparkFunSuite
-  with MLlibTestSparkContext
-  with DefaultReadWriteTest {
+class FeatureHasherSuite extends MLTest with DefaultReadWriteTest {
 
   import testImplicits._
 
   import HashingTFSuite.murmur3FeatureIdx
 
-  implicit private val vectorEncoder = ExpressionEncoder[Vector]()
+  implicit private val vectorEncoder: ExpressionEncoder[Vector] = ExpressionEncoder[Vector]()
--- End diff --

scala style


---


[GitHub] spark pull request #20111: [SPARK-22883][ML][TEST] Streaming tests for spark...

2017-12-28 Thread jkbradley

Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/20111#discussion_r159022766
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/ElementwiseProductSuite.scala 
---
@@ -17,13 +17,31 @@
 
 package org.apache.spark.ml.feature
 
-import org.apache.spark.SparkFunSuite
-import org.apache.spark.ml.linalg.Vectors
-import org.apache.spark.ml.util.DefaultReadWriteTest
-import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.ml.linalg.{Vector, Vectors}
+import org.apache.spark.ml.util.{DefaultReadWriteTest, MLTest}
+import org.apache.spark.ml.util.TestingUtils._
+import org.apache.spark.sql.Row
 
-class ElementwiseProductSuite
-  extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest {
+class ElementwiseProductSuite extends MLTest with DefaultReadWriteTest {
+
+  import testImplicits._
+
+  test("streaming transform") {
--- End diff --

No existing unit test to use


---


[GitHub] spark pull request #20111: [SPARK-22883][ML][TEST] Streaming tests for spark...

2017-12-28 Thread jkbradley

Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/20111#discussion_r159022677
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/ChiSqSelectorSuite.scala ---
@@ -163,18 +162,19 @@ class ChiSqSelectorSuite extends SparkFunSuite with 
MLlibTestSparkContext
 assert(expected.selectedFeatures === actual.selectedFeatures)
   }
   }
-}
 
-object ChiSqSelectorSuite {
-
-  private def testSelector(selector: ChiSqSelector, dataset: Dataset[_]): ChiSqSelectorModel = {
-val selectorModel = selector.fit(dataset)
-selectorModel.transform(dataset).select("filtered", "topFeature").collect()
-  .foreach { case Row(vec1: Vector, vec2: Vector) =>
+  private def testSelector(selector: ChiSqSelector, data: Dataset[_]): ChiSqSelectorModel = {
--- End diff --

Moved from object to class b/c this needed testTransformer from the MLTest 
mix-in


---


[GitHub] spark pull request #20111: [SPARK-22883][ML][TEST] Streaming tests for spark...

2017-12-28 Thread jkbradley

Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/20111#discussion_r159022657
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/BucketedRandomProjectionLSHSuite.scala
 ---
@@ -98,6 +97,21 @@ class BucketedRandomProjectionLSHSuite
 MLTestingUtils.checkCopyAndUids(brp, brpModel)
   }
 
+  test("BucketedRandomProjectionLSH: streaming transform") {
--- End diff --

No existing test to use


---


[GitHub] spark pull request #20111: [SPARK-22883][ML][TEST] Streaming tests for spark...

2017-12-28 Thread jkbradley

GitHub user jkbradley opened a pull request:

https://github.com/apache/spark/pull/20111

[SPARK-22883][ML][TEST] Streaming tests for spark.ml.feature, from A to H

## What changes were proposed in this pull request?

Adds structured streaming tests using testTransformer for these suites:
* BinarizerSuite
* BucketedRandomProjectionLSHSuite
* BucketizerSuite
* ChiSqSelectorSuite
* CountVectorizerSuite
* DCTSuite.scala
* ElementwiseProductSuite
* FeatureHasherSuite
* HashingTFSuite

## How was this patch tested?

It tests itself because it is a bunch of tests!

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jkbradley/spark SPARK-22883-streaming-featureAM

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20111.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20111


commit 12b3dcf13f90ea00c2a12ec186a5f3277e812095
Author: Joseph K. Bradley 
Date:   2017-12-29T03:31:17Z

added streaming tests for first quarter of spark.ml.feature




---


[GitHub] spark issue #20010: [SPARK-22826][SQL] findWiderTypeForTwo Fails over Struct...

2017-12-28 Thread gczsjdy

Github user gczsjdy commented on the issue:

https://github.com/apache/spark/pull/20010
  
This doesn't look like a regular error. 
@bdrillard Maybe you can push a commit to trigger the test again.


---


[GitHub] spark issue #20020: [SPARK-22834][SQL] Make insertion commands have real chi...

2017-12-28 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20020
  
**[Test build #85493 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85493/testReport)**
 for PR 20020 at commit 
[`18ec016`](https://github.com/apache/spark/commit/18ec01638b9da7f8150e3ea35c2876d6d1f41f3d).


---


[GitHub] spark issue #20109: [SPARK-22891][SQL] Make hive client creation thread safe

2017-12-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20109
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #20109: [SPARK-22891][SQL] Make hive client creation thread safe

2017-12-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20109
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85490/
Test PASSed.


---

[GitHub] spark issue #20109: [SPARK-22891][SQL] Make hive client creation thread safe

2017-12-28 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20109
  
**[Test build #85490 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85490/testReport)**
 for PR 20109 at commit 
[`163d344`](https://github.com/apache/spark/commit/163d3443681af2c5ff246ecc546355934c0f6dbb).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

[GitHub] spark pull request #20058: [SPARK-22126][ML][PySpark] Pyspark portion of the...

2017-12-28 Thread jkbradley

Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/20058#discussion_r159021297
  
--- Diff: python/pyspark/ml/base.py ---
@@ -47,6 +86,28 @@ def _fit(self, dataset):
         """
         raise NotImplementedError()
 
+    @since("2.3.0")
+    def fitMultiple(self, dataset, params):
--- End diff --

Check out the discussion on the JIRA and the linked design doc.  Basically, 
we need the same argument types but different return types from what the 
current fit() method provides.


---

[GitHub] spark pull request #20058: [SPARK-22126][ML][PySpark] Pyspark portion of the...

2017-12-28 Thread jkbradley

Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/20058#discussion_r159021231
  
--- Diff: python/pyspark/ml/base.py ---
@@ -47,6 +74,24 @@ def _fit(self, dataset):
         """
         raise NotImplementedError()
 
+    @since("2.3.0")
+    def fitMultiple(self, dataset, params):
+        """
+        Fits a model to the input dataset for each param map in params.
+
+        :param dataset: input dataset, which is an instance of :py:class:`pyspark.sql.DataFrame`.
+        :param params: A list/tuple of param maps.
--- End diff --

Is there another Sequence type this could be other than list or tuple?
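
To make the question concrete, here is a minimal sketch of which built-in containers count as a `Sequence` in Python's abstract base classes. The helper name `is_param_map_sequence` and the decision to exclude strings are assumptions for illustration, not part of the PR.

```python
from collections.abc import Sequence


def is_param_map_sequence(params):
    """Hypothetical check: True for sized, indexable containers.

    Strings and bytes are technically Sequences too, so they are
    excluded here by choice (an assumption of this sketch).
    """
    return isinstance(params, Sequence) and not isinstance(params, (str, bytes))
```

Under this check a `list`, a `tuple`, and a `range` all qualify, while a generator or a `set` does not, because neither supports `len()` together with integer indexing.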


---

[GitHub] spark issue #20025: [SPARK-22837][SQL]Session timeout checker does not work ...

2017-12-28 Thread zuotingbing

Github user zuotingbing commented on the issue:

https://github.com/apache/spark/pull/20025
  
@rxin Could you please review this? Thanks.

In my opinion, we can open a new or follow-up PR if refactoring is necessary. 
This PR fixes the bug where the session timeout checker currently does not 
work.


---

[GitHub] spark issue #20110: [SPARK-22313][PYTHON][FOLLOWUP] Explicitly import warnin...

2017-12-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20110
  
Merged build finished. Test PASSed.


---

[GitHub] spark issue #20110: [SPARK-22313][PYTHON][FOLLOWUP] Explicitly import warnin...

2017-12-28 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20110
  
**[Test build #85492 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85492/testReport)**
 for PR 20110 at commit 
[`6b73dd8`](https://github.com/apache/spark/commit/6b73dd8b2d47f8ae218bbab4eeb696684cdac138).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

[GitHub] spark issue #20110: [SPARK-22313][PYTHON][FOLLOWUP] Explicitly import warnin...

2017-12-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20110
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85492/
Test PASSed.


---

[GitHub] spark issue #19979: [SPARK-22881][ML][TEST] ML regression package testsuite ...

2017-12-28 Thread WeichenXu123

Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/19979
  
@jkbradley 
> When there has been a shuffle, it is likely the Rows will not follow a 
fixed order.

Agreed. But we can make sure the order is fixed from the last shuffle 
position in the physical plan's RDD lineage onward. For models that work like 
a `map` transformation, I think we can make sure the output row order is 
exactly the same as the input row order.

> test statistics (such as min/max ) on global transformer output

This is also used in some tests, such as the "predictRaw and 
predictProbability" test case in `DecisionTreeClassifierSuite`.

> For comparing results with expected values, I much prefer for those 
values to be in a column in the original input dataset.

Agreed.
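
A minimal sketch of that last point in plain Python, without Spark: when each input row carries its own expected value, the check does not depend on row order. The row layout and the `transform` stand-in are assumptions for illustration only.

```python
import random


def make_rows():
    # Each row carries its own expected value alongside the feature.
    return [{"feature": x, "expected": 2.0 * x} for x in range(10)]


def transform(row):
    # Hypothetical stand-in for a map-like model transform.
    out = dict(row)
    out["prediction"] = 2.0 * row["feature"]
    return out


def check_rows(rows, seed=0):
    # Shuffle to mimic an arbitrary row order after a shuffle stage.
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    output = [transform(row) for row in shuffled]
    # Row-wise comparison, so the ordering never matters.
    return all(row["prediction"] == row["expected"] for row in output)
```

Because the comparison is done per row, the same check passes for any ordering of the output.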



---

[GitHub] spark pull request #20058: [SPARK-22126][ML][PySpark] Pyspark portion of the...

2017-12-28 Thread holdenk

Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/20058#discussion_r159020468
  
--- Diff: python/pyspark/ml/base.py ---
@@ -47,6 +86,28 @@ def _fit(self, dataset):
         """
         raise NotImplementedError()
 
+    @since("2.3.0")
+    def fitMultiple(self, dataset, params):
+        """
+        Fits a model to the input dataset for each param map in params.
+
+        :param dataset: input dataset, which is an instance of :py:class:`pyspark.sql.DataFrame`.
+        :param params: A Sequence of param maps.
+        :return: A thread safe iterable which contains one model for each param map. Each
+                 call to `next(modelIterator)` will return `(index, model)` where model was fit
+                 using `params[index]`. Params maps may be fit in an order different than their
+                 order in params.
+
+        .. note:: DeveloperApi
+        .. note:: Experimental
+        """
+        estimator = self.copy()
+
+        def fitSingleModel(index):
+            return estimator.fit(dataset, params[index])
+
+        return FitMultipleIterator(fitSingleModel, len(params))
--- End diff --

So what's the benefit of `FitMultipleIterator` vs. using `imap_unordered`?
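
For comparison, here is a minimal sketch of the two approaches in plain Python. The class `SketchFitMultipleIterator` and both helper functions are hypothetical; the body of the PR's `FitMultipleIterator` is not shown in this thread, so this only assumes the behaviour described in the docstring above (a thread-safe iterator yielding `(index, model)`).

```python
import threading
from multiprocessing.pool import ThreadPool


class SketchFitMultipleIterator:
    """Hypothetical thread-safe iterator yielding (index, model) pairs."""

    def __init__(self, fit_single_model, num_models):
        self._fit_single_model = fit_single_model
        self._num_models = num_models
        self._next_index = 0
        self._lock = threading.Lock()

    def __iter__(self):
        return self

    def __next__(self):
        # Only handing out the index is serialized; fitting runs outside the lock.
        with self._lock:
            index = self._next_index
            if index >= self._num_models:
                raise StopIteration
            self._next_index += 1
        return index, self._fit_single_model(index)


def fit_with_shared_iterator(fit_single_model, num_models, parallelism=2):
    """The caller owns the threads and pulls models from one shared iterator."""
    iterator = SketchFitMultipleIterator(fit_single_model, num_models)
    models = [None] * num_models

    def drain(_):
        for index, model in iterator:
            models[index] = model

    pool = ThreadPool(parallelism)
    try:
        pool.map(drain, range(parallelism))
    finally:
        pool.close()
        pool.join()
    return models


def fit_with_imap_unordered(fit_single_model, num_models, parallelism=2):
    """The producer owns the pool and hands back results as they finish."""
    pool = ThreadPool(parallelism)
    try:
        pairs = pool.imap_unordered(
            lambda index: (index, fit_single_model(index)), range(num_models))
        return dict(pairs)
    finally:
        pool.close()
        pool.join()
```

One possible difference, stated as an assumption rather than the PR's rationale: with a shared iterator the consumer decides how many threads call `next()`, whereas `imap_unordered` ties the degree of parallelism to a pool created by the producer.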


---

[GitHub] spark pull request #20058: [SPARK-22126][ML][PySpark] Pyspark portion of the...

2017-12-28 Thread holdenk

Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/20058#discussion_r159020312
  
--- Diff: python/pyspark/ml/base.py ---
@@ -47,6 +86,28 @@ def _fit(self, dataset):
 """
 raise NotImplementedError()
 
+@since("2.3.0")
+def fitMultiple(self, dataset, params):
--- End diff --

So in Scala Spark we use the `fit` function rather than separate functions. 
Also the `params` name is different than the Scala one. Any reason for the 
difference?


---

[GitHub] spark issue #20025: [SPARK-22837][SQL]Session timeout checker does not work ...

2017-12-28 Thread zuotingbing

Github user zuotingbing commented on the issue:

https://github.com/apache/spark/pull/20025
  
@liufengdb I think the class `SessionManager.java` was originally merged 
from Hive, and in Spark we redesigned it by adding 
`SparkSQLSessionManager.scala` without affecting `SessionManager.java`:
`val sparkSqlSessionManager = new SparkSQLSessionManager(hiveServer, sqlContext)`
`setSuperField(this, "sessionManager", sparkSqlSessionManager)`


---

[GitHub] spark issue #19977: [SPARK-22771][SQL] Concatenate binary inputs into a bina...

2017-12-28 Thread maropu

Github user maropu commented on the issue:

https://github.com/apache/spark/pull/19977
  
ah, ok. good catch. I'll fix soon.


---

[GitHub] spark issue #20107: [SPARK-22921][PROJECT-INFRA] Choices for Assigning Jira ...

2017-12-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20107
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85485/
Test PASSed.


---

[GitHub] spark issue #20107: [SPARK-22921][PROJECT-INFRA] Choices for Assigning Jira ...

2017-12-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20107
  
Merged build finished. Test PASSed.


---

[GitHub] spark issue #20107: [SPARK-22921][PROJECT-INFRA] Choices for Assigning Jira ...

2017-12-28 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20107
  
**[Test build #85485 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85485/testReport)**
 for PR 20107 at commit 
[`a335000`](https://github.com/apache/spark/commit/a335000475f71eff0055ccee91e9d486f50288fd).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

[GitHub] spark pull request #19535: [SPARK-22313][PYTHON] Mark/print deprecation warn...

2017-12-28 Thread HyukjinKwon

Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19535#discussion_r159020029
  
--- Diff: python/pyspark/streaming/flume.py ---
@@ -54,8 +54,13 @@ def createStream(ssc, hostname, port,
         :param bodyDecoder:  A function used to decode body (default is utf8_decoder)
         :return: A DStream object
 
-        .. note:: Deprecated in 2.3.0
+        .. note:: Deprecated in 2.3.0. Flume support is deprecated as of Spark 2.3.0.
+            See SPARK-22142.
         """
+        warnings.warn(
--- End diff --

Sure, I took a quick look and I think this one is actually not being 
tested, which seems to be why. I will double check and take a closer look 
tonight (KST).

I have seen a few mistakes like this so far, and I am working on Python 
coverage BTW - https://issues.apache.org/jira/browse/SPARK-7721

Anyway, it was my stupid mistake. Thanks.


---

[GitHub] spark issue #20110: [SPARK-22313][PYTHON][FOLLOWUP] Explicitly import warnin...

2017-12-28 Thread yhuai

Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/20110
  
Thank you! Let's also check the build result to make sure 
`pyspark.streaming.tests.FlumePollingStreamTests` is indeed triggered (I hit 
this issue while running this test). 


---

[GitHub] spark issue #20110: [SPARK-22313][PYTHON][FOLLOWUP] Explicitly import warnin...

2017-12-28 Thread HyukjinKwon

Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20110
  
cc @ueshin too.


---

[GitHub] spark pull request #19535: [SPARK-22313][PYTHON] Mark/print deprecation warn...

2017-12-28 Thread yhuai

Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/19535#discussion_r159019845
  
--- Diff: python/pyspark/streaming/flume.py ---
@@ -54,8 +54,13 @@ def createStream(ssc, hostname, port,
         :param bodyDecoder:  A function used to decode body (default is utf8_decoder)
         :return: A DStream object
 
-        .. note:: Deprecated in 2.3.0
+        .. note:: Deprecated in 2.3.0. Flume support is deprecated as of Spark 2.3.0.
+            See SPARK-22142.
         """
+        warnings.warn(
--- End diff --

Thank you :) It would also be good to check why the master build does not 
fail, since Python should complain about it.
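
One general Python fact that is relevant here: names inside a function body are resolved when the function is called, not when the module is imported. A minimal sketch follows, where the function names and the deliberately undefined `warnings_not_imported` are hypothetical.

```python
def create_stream_without_import():
    # `warnings_not_imported` is deliberately never imported or defined.
    # Defining this function is fine; only calling it raises NameError.
    warnings_not_imported.warn("deprecated", DeprecationWarning)


def call_and_report():
    """Call the function and return the NameError message, or None."""
    try:
        create_stream_without_import()
    except NameError as exc:
        return str(exc)
    return None
```

So a missing import like this one only surfaces in a build if some test actually calls the affected function.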


---

[GitHub] spark issue #20110: [SPARK-22313][PYTHON][FOLLOWUP] Explicitly import warnin...

2017-12-28 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20110
  
**[Test build #85492 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85492/testReport)**
 for PR 20110 at commit 
[`6b73dd8`](https://github.com/apache/spark/commit/6b73dd8b2d47f8ae218bbab4eeb696684cdac138).


---

[GitHub] spark issue #20110: [SPARK-22313][PYTHON][FOLLOWUP] Explicitly import warnin...

2017-12-28 Thread HyukjinKwon

Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20110
  
cc @yhuai. Thank you for catching this.


---

[GitHub] spark pull request #20110: [SPARK-22313][PYTHON][FOLLOWUP] Explicitly import...

2017-12-28 Thread HyukjinKwon

GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/20110

[SPARK-22313][PYTHON][FOLLOWUP] Explicitly import warnings namespace in flume.py

## What changes were proposed in this pull request?

This PR explicitly imports the missing `warnings` in `flume.py`. 

## How was this patch tested?

Manually tested.

```python
>>> import warnings
>>> warnings.simplefilter('always', DeprecationWarning)
>>> from pyspark.streaming import flume
>>> flume.FlumeUtils.createStream(None, None, None)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/streaming/flume.py", line 60, in createStream
    warnings.warn(
NameError: global name 'warnings' is not defined
```

```python
>>> import warnings
>>> warnings.simplefilter('always', DeprecationWarning)
>>> from pyspark.streaming import flume
>>> flume.FlumeUtils.createStream(None, None, None)
/.../spark/python/pyspark/streaming/flume.py:65: DeprecationWarning: Deprecated in 2.3.0. Flume support is deprecated as of Spark 2.3.0. See SPARK-22142.
  DeprecationWarning)
```
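
The same check can be automated without a Spark installation. The sketch below is an assumption-laden stand-in: `create_stream_sketch` is a hypothetical function that only reproduces the `warnings.warn` call and message from the diff, and `capture_deprecation` records the warning with the standard `warnings.catch_warnings` context manager.

```python
import warnings


def create_stream_sketch(hostname, port):
    # Hypothetical stand-in for a deprecated API; the message text is
    # taken from the diff under review.
    warnings.warn(
        "Deprecated in 2.3.0. Flume support is deprecated as of Spark 2.3.0. "
        "See SPARK-22142.",
        DeprecationWarning)
    return hostname, port


def capture_deprecation():
    """Record every warning raised by one call to the deprecated function."""
    with warnings.catch_warnings(record=True) as caught:
        # DeprecationWarning is often hidden by default, so always show it.
        warnings.simplefilter("always", DeprecationWarning)
        create_stream_sketch("localhost", 9999)
    return caught
```

A unit test built this way would have failed with `NameError` before the fix, which is one way to keep the deprecation path covered.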

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HyukjinKwon/spark SPARK-22313-followup

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20110.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20110


commit 6b73dd8b2d47f8ae218bbab4eeb696684cdac138
Author: hyukjinkwon 
Date:   2017-12-29T02:27:15Z

Explicitly import warnings in flume.py




---
