[GitHub] spark issue #15218: [SPARK-17637][Scheduler]Packed scheduling for Spark task...

2016-10-08 Thread zhzhan
Github user zhzhan commented on the issue:

https://github.com/apache/spark/pull/15218
  
@mridulm  You are right. This patch is mainly for jobs with multiple
stages, which are very common in production pipelines. As you mentioned, when a
shuffle is involved, getLocationsWithLargestOutputs in MapOutputTracker
typically returns None for ShuffledRowRDD and ShuffledRDD because of the
REDUCER_PREF_LOCS_FRACTION threshold (20%).

A ShuffledRowRDD/ShuffledRDD can easily have more than 10 partitions (even
hundreds) in a real production pipeline, so the patch helps a lot with CPU
reservation time.
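To make the threshold concrete, here is a minimal, self-contained sketch of the behavior described above (illustrative names only, not the actual MapOutputTracker source): preferred locations are reported for a reducer only when a single host already holds at least the configured fraction of its map-output bytes.

```scala
// Illustrative sketch, not the real MapOutputTracker: a reducer gets
// preferred locations only if some host holds at least `fractionThreshold`
// (0.2 here, mirroring REDUCER_PREF_LOCS_FRACTION) of its map-output bytes.
def locationsWithLargestOutputs(
    bytesByHost: Map[String, Long],
    fractionThreshold: Double = 0.2): Option[Seq[String]] = {
  val total = bytesByHost.values.sum.toDouble
  if (total <= 0) return None
  val qualifying = bytesByHost.collect {
    case (host, bytes) if bytes / total >= fractionThreshold => host
  }.toSeq
  if (qualifying.nonEmpty) Some(qualifying) else None
}

// With output spread evenly across many hosts, no host crosses 20%, the
// result is None, and the scheduler sees no locality preference -- exactly
// the multi-stage case this patch targets:
// locationsWithLargestOutputs((1 to 10).map(i => s"h$i" -> 10L).toMap) == None
```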

 





[GitHub] spark issue #15389: [SPARK-17817][PySpark] PySpark RDD Repartitioning Result...

2016-10-08 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/15389
  
@holdenk @dusenberrymw @HyukjinKwon Thanks for the review!







[GitHub] spark issue #15178: [SPARK-17556][SQL] Executor side broadcast for broadcast...

2016-10-08 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/15178
  
ping @rxin @JoshRosen Can you review this? Thanks!





[GitHub] spark issue #15389: [SPARK-17817][PySpark] PySpark RDD Repartitioning Result...

2016-10-08 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/15389
  
@holdenk Thank you for cc'ing me. It looks okay to me as targeted, but I
feel we need a sign-off.





[GitHub] spark issue #15388: [SPARK-17821][SQL] Support And and Or in Expression Cano...

2016-10-08 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/15388
  
ping @hvanhovell @rxin Can you take a look again? Thanks!





[GitHub] spark issue #15314: [SPARK-17747][ML] WeightCol support non-double datatypes

2016-10-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15314
  
Merged build finished. Test FAILed.





[GitHub] spark issue #15314: [SPARK-17747][ML] WeightCol support non-double datatypes

2016-10-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15314
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66595/
Test FAILed.





[GitHub] spark issue #15314: [SPARK-17747][ML] WeightCol support non-double datatypes

2016-10-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15314
  
**[Test build #66595 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66595/consoleFull)**
 for PR 15314 at commit 
[`fabe3c6`](https://github.com/apache/spark/commit/fabe3c65838a2b4c7e5ff227c8d585ad2f05ccee).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #15319: [SPARK-17733][SQL] InferFiltersFromConstraints rule neve...

2016-10-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15319
  
**[Test build #66597 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66597/consoleFull)**
 for PR 15319 at commit 
[`1558d4c`](https://github.com/apache/spark/commit/1558d4c2f9190691239e9b27e9517714c2af2bcc).





[GitHub] spark pull request #10307: [SPARK-12334][SQL][PYSPARK] Support read from mul...

2016-10-08 Thread zjffdu
GitHub user zjffdu reopened a pull request:

https://github.com/apache/spark/pull/10307

[SPARK-12334][SQL][PYSPARK] Support read from multiple input paths for orc 
file in DataFrameReader.orc



Besides the issue in the Spark API, this also fixes two minor issues in PySpark
(the intended call shape is sketched after this list):
* support reading from multiple input paths for orc
* support reading from multiple input paths for text
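A hedged usage sketch (hypothetical paths) of the call shape this change enables, shown with the Scala DataFrameReader for reference; the PySpark orc()/text() readers gain the same multi-path form:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical paths; one read call accepts several inputs for orc and text.
val spark = SparkSession.builder().appName("multi-path-read").getOrCreate()
val orcDf = spark.read.orc("/data/events/day1", "/data/events/day2")
val txtDf = spark.read.text("/logs/a.txt", "/logs/b.txt")
```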

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zjffdu/spark SPARK-12334

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/10307.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #10307


commit 3dd3452236156bb7ef36e9d290217e23556f6b6e
Author: Jeff Zhang 
Date:   2015-12-15T10:00:47Z

[SPARK-12334][SQL][PYSPARK] Support read from multiple input paths for orc 
file in DataFrameReader.orc

commit b6a26e946fcf4331fb62382537ce2b0964a5b90e
Author: Jeff Zhang 
Date:   2015-12-16T01:41:29Z

Update doc

commit 24a8f4f70cb9da2d83a39836e8517b55a9238e70
Author: Jeff Zhang 
Date:   2016-04-19T10:36:24Z

address code style

commit 6ac05805391f13dcd0530f1ecedbd837befcfb20
Author: Jeff Zhang 
Date:   2016-10-09T03:53:41Z

resolve conflicts







[GitHub] spark pull request #10307: [SPARK-12334][SQL][PYSPARK] Support read from mul...

2016-10-08 Thread zjffdu
Github user zjffdu closed the pull request at:

https://github.com/apache/spark/pull/10307





[GitHub] spark pull request #15319: [SPARK-17733][SQL] InferFiltersFromConstraints ru...

2016-10-08 Thread jiangxb1987
Github user jiangxb1987 commented on a diff in the pull request:

https://github.com/apache/spark/pull/15319#discussion_r82515583
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala 
---
@@ -74,14 +74,26 @@ abstract class QueryPlan[PlanType <: 
QueryPlan[PlanType]] extends TreeNode[PlanT
* additional constraint of the form `b = 5`
*/
   private def inferAdditionalConstraints(constraints: Set[Expression]): 
Set[Expression] = {
+// Collect aliases from expressions to avoid producing a non-converging 
set of constraints
+// for recursive functions.
+//
+// Don't apply the transform on constraints if the attribute used to 
replace is an alias,
+// because then both `QueryPlan.inferAdditionalConstraints` and
+// `UnaryNode.getAliasedConstraints` apply and may produce a 
non-converging set of
+// constraints.
+// For more details, refer to 
https://issues.apache.org/jira/browse/SPARK-17733
+val aliasMap = AttributeMap((expressions ++ 
children.flatMap(_.expressions)).collect {
--- End diff --

Yes, `AttributeSet` is a better choice here.
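A toy, self-contained illustration of that point (assumed stand-in types, not Catalyst's): the rewrite only has to answer "was this attribute produced by an alias?", which is a membership test, so a set of alias outputs suffices and no attribute-to-expression map is needed.

```scala
// Stand-in types for Catalyst's Attribute/Alias (illustrative only).
case class Attribute(name: String)
case class Alias(child: String, output: Attribute)

// Membership is all the rewrite needs, so a Set is the natural structure.
def aliasOutputs(exprs: Seq[Any]): Set[Attribute] =
  exprs.collect { case a: Alias => a.output }.toSet

def canInferThrough(attr: Attribute, aliases: Set[Attribute]): Boolean =
  !aliases.contains(attr) // skip aliases so the constraint set converges
```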





[GitHub] spark issue #15403: [SPARK-17832][SQL] TableIdentifier.quotedString creates ...

2016-10-08 Thread jiangxb1987
Github user jiangxb1987 commented on the issue:

https://github.com/apache/spark/pull/15403
  
@hvanhovell nvm about the `catalog.getTable` issue; it turned out to be my
mistake. Sorry about that...





[GitHub] spark issue #15403: [SPARK-17832][SQL] TableIdentifier.quotedString creates ...

2016-10-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15403
  
**[Test build #66596 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66596/consoleFull)**
 for PR 15403 at commit 
[`59a1f0e`](https://github.com/apache/spark/commit/59a1f0e42bde93975b3a13447e626dcfbebc0d80).





[GitHub] spark issue #15292: [SPARK-17719][SPARK-17776][SQL] Unify and tie up options...

2016-10-08 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/15292
  
LGTM except 2 minor comments, thanks for working on it!





[GitHub] spark issue #15314: [SPARK-17747][ML] WeightCol support non-double datatypes

2016-10-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15314
  
**[Test build #66595 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66595/consoleFull)**
 for PR 15314 at commit 
[`fabe3c6`](https://github.com/apache/spark/commit/fabe3c65838a2b4c7e5ff227c8d585ad2f05ccee).





[GitHub] spark issue #11211: [SPARK-13330][PYSPARK] PYTHONHASHSEED is not propgated t...

2016-10-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/11211
  
Merged build finished. Test PASSed.





[GitHub] spark issue #11211: [SPARK-13330][PYSPARK] PYTHONHASHSEED is not propgated t...

2016-10-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/11211
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66590/
Test PASSed.





[GitHub] spark issue #8318: [SPARK-1267][PYSPARK] Adds pip installer for pyspark

2016-10-08 Thread mateiz
Github user mateiz commented on the issue:

https://github.com/apache/spark/pull/8318
  
Cool, good to know that there's another ASF project that does it. We should 
go for it then.





[GitHub] spark issue #11211: [SPARK-13330][PYSPARK] PYTHONHASHSEED is not propgated t...

2016-10-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/11211
  
**[Test build #66590 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66590/consoleFull)**
 for PR 11211 at commit 
[`b761b85`](https://github.com/apache/spark/commit/b761b858391fd96e18b074b16763cfa46284917a).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #15292: [SPARK-17719][SPARK-17776][SQL] Unify and tie up ...

2016-10-08 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/15292#discussion_r82515285
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala ---
@@ -229,13 +229,9 @@ class DataFrameReader private[sql](sparkSession: 
SparkSession) extends Logging {
   table: String,
   parts: Array[Partition],
   connectionProperties: Properties): DataFrame = {
-val props = new Properties()
-extraOptions.foreach { case (key, value) =>
-  props.put(key, value)
-}
-// connectionProperties should override settings in extraOptions
--- End diff --

should we still keep this comment?
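For context, a minimal sketch of the override order the deleted comment documented (illustrative helper, not the actual DataFrameReader code): later puts into a java.util.Properties win, so connectionProperties overrides extraOptions on duplicate keys.

```scala
import java.util.Properties

// Illustrative helper: keys present in both inputs resolve to the
// connectionProperties value because those entries are put last.
def mergedProps(extraOptions: Map[String, String],
                connectionProperties: Properties): Properties = {
  val props = new Properties()
  extraOptions.foreach { case (k, v) => props.put(k, v) }
  props.putAll(connectionProperties) // overrides duplicate keys
  props
}
```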





[GitHub] spark pull request #14861: [SPARK-17287] [PYSPARK] Add recursive kwarg to Py...

2016-10-08 Thread jpiper
Github user jpiper closed the pull request at:

https://github.com/apache/spark/pull/14861





[GitHub] spark issue #14861: [SPARK-17287] [PYSPARK] Add recursive kwarg to Python Sp...

2016-10-08 Thread jpiper
Github user jpiper commented on the issue:

https://github.com/apache/spark/pull/14861
  
Looks like this was actually added in #15140, so we can close this :)





[GitHub] spark issue #15159: [SPARK-17605][SPARK_SUBMIT] Add option spark.usePython a...

2016-10-08 Thread zjffdu
Github user zjffdu commented on the issue:

https://github.com/apache/spark/pull/15159
  
@holdenk that's correct. 





[GitHub] spark issue #15159: [SPARK-17605][SPARK_SUBMIT] Add option spark.usePython a...

2016-10-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15159
  
**[Test build #66594 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66594/consoleFull)**
 for PR 15159 at commit 
[`522e3e8`](https://github.com/apache/spark/commit/522e3e85143235a57c94cdc618133e66715264de).





[GitHub] spark issue #15218: [SPARK-17637][Scheduler]Packed scheduling for Spark task...

2016-10-08 Thread mridulm
Github user mridulm commented on the issue:

https://github.com/apache/spark/pull/15218
  
@zhzhan I am curious why this is the case for the jobs being mentioned.
This PR should only have an impact if the locality preference of the taskset
being run is fairly suboptimal to begin with, no?

If the tasks have a PROCESS_LOCAL or NODE_LOCAL locality preference, that
will take precedence, and attempts to spread the load (or reduce spread)
across nodes as envisioned here will not work.

So the target here seems to be RACK_LOCAL or ANY locality preference, which
should be fairly uncommon, unless I am missing something w.r.t. the jobs
being run.
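A compact sketch of the precedence being described (assumed enum, not the actual TaskSetManager): levels are tried from most to least specific, so a pack/spread policy only becomes relevant once a task's effective level has relaxed to RACK_LOCAL or ANY.

```scala
// Illustrative locality ladder, most to least specific (not Spark's enum).
object LocalityLevel extends Enumeration {
  val PROCESS_LOCAL, NODE_LOCAL, NO_PREF, RACK_LOCAL, ANY = Value
}

// The most specific preferred level wins; packing can only matter once
// scheduling has fallen through to RACK_LOCAL or ANY.
def effectiveLevel(preferred: Set[LocalityLevel.Value]): LocalityLevel.Value =
  LocalityLevel.values.find(preferred.contains).getOrElse(LocalityLevel.ANY)
```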





[GitHub] spark issue #14215: [SPARK-16544][SQL][WIP] Support for conversion from comp...

2016-10-08 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/14215
  
@wgtmac I hope this one is merged into 2.1, but I believe I am not the one to
decide that. In any case, I will take out the vectorized one described in the
PR.





[GitHub] spark issue #14861: [SPARK-17287] [PYSPARK] Add recursive kwarg to Python Sp...

2016-10-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14861
  
**[Test build #66593 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66593/consoleFull)**
 for PR 14861 at commit 
[`190c63b`](https://github.com/apache/spark/commit/190c63b6fad4588533237fdc83d7e9e8d7b8de7f).





[GitHub] spark issue #13493: [SPARK-15750][MLLib][PYSPARK] Constructing FPGrowth fail...

2016-10-08 Thread zjffdu
Github user zjffdu commented on the issue:

https://github.com/apache/spark/pull/13493
  
The PR is updated, @holdenk @jkbradley.





[GitHub] spark issue #13493: [SPARK-15750][MLLib][PYSPARK] Constructing FPGrowth fail...

2016-10-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/13493
  
Merged build finished. Test PASSed.





[GitHub] spark issue #13493: [SPARK-15750][MLLib][PYSPARK] Constructing FPGrowth fail...

2016-10-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13493
  
**[Test build #66589 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66589/consoleFull)**
 for PR 13493 at commit 
[`e8fefc0`](https://github.com/apache/spark/commit/e8fefc05e0125974ed224f1de3acadbbbf3d98c8).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #13493: [SPARK-15750][MLLib][PYSPARK] Constructing FPGrowth fail...

2016-10-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/13493
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66589/
Test PASSed.





[GitHub] spark issue #10307: [SPARK-12334][SQL][PYSPARK] Support read from multiple i...

2016-10-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/10307
  
**[Test build #66592 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66592/consoleFull)**
 for PR 10307 at commit 
[`6ac0580`](https://github.com/apache/spark/commit/6ac05805391f13dcd0530f1ecedbd837befcfb20).
 * This patch **fails Python style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #10307: [SPARK-12334][SQL][PYSPARK] Support read from multiple i...

2016-10-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/10307
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66592/
Test FAILed.





[GitHub] spark issue #10307: [SPARK-12334][SQL][PYSPARK] Support read from multiple i...

2016-10-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/10307
  
Merged build finished. Test FAILed.





[GitHub] spark issue #10307: [SPARK-12334][SQL][PYSPARK] Support read from multiple i...

2016-10-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/10307
  
**[Test build #66592 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66592/consoleFull)**
 for PR 10307 at commit 
[`6ac0580`](https://github.com/apache/spark/commit/6ac05805391f13dcd0530f1ecedbd837befcfb20).





[GitHub] spark issue #10307: [SPARK-12334][SQL][PYSPARK] Support read from multiple i...

2016-10-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/10307
  
Merged build finished. Test FAILed.





[GitHub] spark issue #10307: [SPARK-12334][SQL][PYSPARK] Support read from multiple i...

2016-10-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/10307
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66591/
Test FAILed.





[GitHub] spark issue #10307: [SPARK-12334][SQL][PYSPARK] Support read from multiple i...

2016-10-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/10307
  
**[Test build #66591 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66591/consoleFull)**
 for PR 10307 at commit 
[`727b35a`](https://github.com/apache/spark/commit/727b35a6024adad61d89d2d515c3a1561df51cd2).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #10307: [SPARK-12334][SQL][PYSPARK] Support read from multiple i...

2016-10-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/10307
  
**[Test build #66591 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66591/consoleFull)**
 for PR 10307 at commit 
[`727b35a`](https://github.com/apache/spark/commit/727b35a6024adad61d89d2d515c3a1561df51cd2).





[GitHub] spark issue #15389: [SPARK-17817][PySpark] PySpark RDD Repartitioning Result...

2016-10-08 Thread holdenk
Github user holdenk commented on the issue:

https://github.com/apache/spark/pull/15389
  
Maybe @hyukjinkwon could also do a review pass while we wait for @davies or 
someone with commit privileges to come by and do a final review.





[GitHub] spark issue #15404: Branch 2.0

2016-10-08 Thread holdenk
Github user holdenk commented on the issue:

https://github.com/apache/spark/pull/15404
  
Can you close this pull request? If it's an attempt at backporting, you can
just make a new PR once you get it sorted out.





[GitHub] spark issue #14650: [SPARK-17062][MESOS] add conf option to mesos dispatcher

2016-10-08 Thread skonto
Github user skonto commented on the issue:

https://github.com/apache/spark/pull/14650
  
WIP





[GitHub] spark issue #11211: [SPARK-13330][PYSPARK] PYTHONHASHSEED is not propgated t...

2016-10-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/11211
  
**[Test build #66590 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66590/consoleFull)**
 for PR 11211 at commit 
[`b761b85`](https://github.com/apache/spark/commit/b761b858391fd96e18b074b16763cfa46284917a).





[GitHub] spark issue #11211: [SPARK-13330][PYSPARK] PYTHONHASHSEED is not propgated t...

2016-10-08 Thread zjffdu
Github user zjffdu commented on the issue:

https://github.com/apache/spark/pull/11211
  
Conflicts are resolved.





[GitHub] spark issue #13493: [SPARK-15750][MLLib][PYSPARK] Constructing FPGrowth fail...

2016-10-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13493
  
**[Test build #66589 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66589/consoleFull)**
 for PR 13493 at commit 
[`e8fefc0`](https://github.com/apache/spark/commit/e8fefc05e0125974ed224f1de3acadbbbf3d98c8).





[GitHub] spark issue #14959: [SPARK-17387][PYSPARK] Creating SparkContext() from pyth...

2016-10-08 Thread holdenk
Github user holdenk commented on the issue:

https://github.com/apache/spark/pull/14959
  
Awesome, thanks for updating. I'm at PyData this weekend so will be a bit 
slow on my end.





[GitHub] spark issue #14959: [SPARK-17387][PYSPARK] Creating SparkContext() from pyth...

2016-10-08 Thread zjffdu
Github user zjffdu commented on the issue:

https://github.com/apache/spark/pull/14959
  
@vanzin @holdenk @BryanCutler  The PR is updated; please help review.





[GitHub] spark issue #15399: [SPARK-17819][SQL] Support default database in connectio...

2016-10-08 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/15399
  
Hi, @gatorsmile .
Could you review this PR when you have some time?





[GitHub] spark issue #14959: [SPARK-17387][PYSPARK] Creating SparkContext() from pyth...

2016-10-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14959
  
Merged build finished. Test PASSed.





[GitHub] spark issue #14959: [SPARK-17387][PYSPARK] Creating SparkContext() from pyth...

2016-10-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14959
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66588/
Test PASSed.





[GitHub] spark issue #14959: [SPARK-17387][PYSPARK] Creating SparkContext() from pyth...

2016-10-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14959
  
**[Test build #66588 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66588/consoleFull)**
 for PR 14959 at commit 
[`1972714`](https://github.com/apache/spark/commit/19727142d633f19a348658cf1c45993f45867fa4).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14375: [SPARK-15194] [ML] Add Python ML API for MultivariateGau...

2016-10-08 Thread praveendareddy21
Github user praveendareddy21 commented on the issue:

https://github.com/apache/spark/pull/14375
  
@MechCoder 
Can you review and merge this PR?
Refer to https://github.com/apache/spark/pull/13248 for the discussion.





[GitHub] spark issue #14959: [SPARK-17387][PYSPARK] Creating SparkContext() from pyth...

2016-10-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14959
  
**[Test build #66588 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66588/consoleFull)**
 for PR 14959 at commit 
[`1972714`](https://github.com/apache/spark/commit/19727142d633f19a348658cf1c45993f45867fa4).





[GitHub] spark issue #14959: [SPARK-17387][PYSPARK] Creating SparkContext() from pyth...

2016-10-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14959
  
Merged build finished. Test FAILed.





[GitHub] spark issue #14959: [SPARK-17387][PYSPARK] Creating SparkContext() from pyth...

2016-10-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14959
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66587/
Test FAILed.





[GitHub] spark issue #14959: [SPARK-17387][PYSPARK] Creating SparkContext() from pyth...

2016-10-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14959
  
**[Test build #66587 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66587/consoleFull)**
 for PR 14959 at commit 
[`dbc4bb4`](https://github.com/apache/spark/commit/dbc4bb49a44569fd55e3ede895ab06d2e959da82).
 * This patch **fails Python style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14959: [SPARK-17387][PYSPARK] Creating SparkContext() from pyth...

2016-10-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14959
  
**[Test build #66587 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66587/consoleFull)**
 for PR 14959 at commit 
[`dbc4bb4`](https://github.com/apache/spark/commit/dbc4bb49a44569fd55e3ede895ab06d2e959da82).





[GitHub] spark pull request #13690: [SPARK-15767][R][ML] Decision Tree Regression wra...

2016-10-08 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/13690#discussion_r82513101
  
--- Diff: R/pkg/R/mllib.R ---
@@ -1427,3 +1447,185 @@ print.summary.KSTest <- function(x, ...) {
   cat(summaryStr, "\n")
   invisible(x)
 }
+
+#' Decision Tree Model for Regression and Classification
+#'
+#' \code{spark.decisionTree} fits a Decision Tree Regression model or 
Classification model on
+#' a SparkDataFrame. Users can call \code{summary} to get a summary of the 
fitted Decision Tree
+#' model, \code{predict} to make predictions on new data, and 
\code{write.ml}/\code{read.ml} to
+#' save/load fitted models.
+#' For more details, see 
\href{https://en.wikipedia.org/wiki/Decision_tree_learning}{Decision Tree}
+#'
+#' @param data a SparkDataFrame for training.
+#' @param formula a symbolic description of the model to be fitted. 
Currently only a few formula
+#'operators are supported, including '~', ':', '+', and 
'-'.
+#' @param type type of model to fit
+#' @param maxDepth Maximum depth of the tree (>= 0).
+#' @param maxBins Maximum number of bins used for discretizing continuous 
features and for choosing
+#'how to split on features at each node. More bins give 
higher granularity. Must be
+#'>= 2 and >= number of categories in any categorical 
feature. (default = 32)
+#' @param ... additional arguments passed to the method.
--- End diff --

Or a future PR?





[GitHub] spark pull request #13690: [SPARK-15767][R][ML] Decision Tree Regression wra...

2016-10-08 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/13690#discussion_r82512650
  
--- Diff: R/pkg/inst/tests/testthat/test_mllib.R ---
@@ -791,4 +791,59 @@ test_that("spark.kstest", {
   expect_match(capture.output(stats)[1], "Kolmogorov-Smirnov test 
summary:")
 })
 
+test_that("spark.decisionTree Regression", {
+  data <- suppressWarnings(createDataFrame(longley))
--- End diff --

please add a test for print (see spark.glm)





[GitHub] spark pull request #13690: [SPARK-15767][R][ML] Decision Tree Regression wra...

2016-10-08 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/13690#discussion_r82512708
  
--- Diff: R/pkg/R/mllib.R ---
@@ -1427,3 +1447,185 @@ print.summary.KSTest <- function(x, ...) {
   cat(summaryStr, "\n")
   invisible(x)
 }
+
+#' Decision Tree Model for Regression and Classification
+#'
+#' \code{spark.decisionTree} fits a Decision Tree Regression model or 
Classification model on
+#' a SparkDataFrame. Users can call \code{summary} to get a summary of the 
fitted Decision Tree
+#' model, \code{predict} to make predictions on new data, and 
\code{write.ml}/\code{read.ml} to
+#' save/load fitted models.
+#' For more details, see 
\href{https://en.wikipedia.org/wiki/Decision_tree_learning}{Decision Tree}
+#'
+#' @param data a SparkDataFrame for training.
+#' @param formula a symbolic description of the model to be fitted. 
Currently only a few formula
+#'operators are supported, including '~', ':', '+', and 
'-'.
+#' @param type type of model to fit
--- End diff --

please add the types supported, e.g. `one of "regression" or 
"classification" as the type of model`





[GitHub] spark pull request #13690: [SPARK-15767][R][ML] Decision Tree Regression wra...

2016-10-08 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/13690#discussion_r82512794
  
--- Diff: R/pkg/R/mllib.R ---
@@ -1427,3 +1447,185 @@ print.summary.KSTest <- function(x, ...) {
   cat(summaryStr, "\n")
   invisible(x)
 }
+
+#' Decision Tree Model for Regression and Classification
+#'
+#' \code{spark.decisionTree} fits a Decision Tree Regression model or 
Classification model on
+#' a SparkDataFrame. Users can call \code{summary} to get a summary of the 
fitted Decision Tree
+#' model, \code{predict} to make predictions on new data, and 
\code{write.ml}/\code{read.ml} to
+#' save/load fitted models.
+#' For more details, see 
\href{https://en.wikipedia.org/wiki/Decision_tree_learning}{Decision Tree}
+#'
+#' @param data a SparkDataFrame for training.
+#' @param formula a symbolic description of the model to be fitted. 
Currently only a few formula
+#'operators are supported, including '~', ':', '+', and 
'-'.
+#' @param type type of model to fit
+#' @param maxDepth Maximum depth of the tree (>= 0).
+#' @param maxBins Maximum number of bins used for discretizing continuous 
features and for choosing
+#'how to split on features at each node. More bins give 
higher granularity. Must be
+#'>= 2 and >= number of categories in any categorical 
feature. (default = 32)
+#' @param ... additional arguments passed to the method.
+#' @aliases spark.decisionTree,SparkDataFrame,formula-method
+#' @return \code{spark.decisionTree} returns a fitted Decision Tree model.
+#' @rdname spark.decisionTree
+#' @name spark.decisionTree
+#' @export
+#' @examples
+#' \dontrun{
+#' df <- createDataFrame(longley)
+#'
+#' # fit a Decision Tree Regression Model
+#' model <- spark.decisionTree(data, Employed ~ ., type = "regression", 
maxDepth = 5, maxBins = 16)
+#'
+#' # get the summary of the model
+#' summary(model)
+#'
+#' # make predictions
+#' predictions <- predict(model, df)
+#'
+#' # save and load the model
+#' path <- "path/to/model"
+#' write.ml(model, path)
+#' savedModel <- read.ml(path)
+#' summary(savedModel)
+#' }
+#' @note spark.decisionTree since 2.1.0
+setMethod("spark.decisionTree", signature(data = "SparkDataFrame", formula 
= "formula"),
+  function(data, formula, type = c("regression", "classification"),
+   maxDepth = 5, maxBins = 32 ) {
+type <- match.arg(type)
+formula <- paste(deparse(formula), collapse = "")
+switch(type,
+   regression =  {
+ jobj <- 
callJStatic("org.apache.spark.ml.r.DecisionTreeRegressorWrapper",
+ "fit", data@sdf, formula, 
as.integer(maxDepth),
+ as.integer(maxBins))
+ new("DecisionTreeRegressionModel", jobj = jobj)
+   },
+   classification = {
+ jobj <- 
callJStatic("org.apache.spark.ml.r.DecisionTreeClassifierWrapper",
+ "fit", data@sdf, formula, 
as.integer(maxDepth),
+ as.integer(maxBins))
+ new("DecisionTreeClassificationModel", jobj = jobj)
+   }
+)
+  })
+
+# Makes predictions from a Decision Tree Regression model or
+# a model produced by spark.decisionTree()
+
+#' @param newData a SparkDataFrame for testing.
+#' @return \code{predict} returns a SparkDataFrame containing predicted 
labeled in a column named
+#' "prediction"
+#' @rdname spark.decisionTree
+#' @export
--- End diff --

add @aliases
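
Following the pattern already used elsewhere in this file (e.g. the 
`write.ml` alias), it would presumably look something like:

    #' @aliases predict,DecisionTreeRegressionModel-method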


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13690: [SPARK-15767][R][ML] Decision Tree Regression wra...

2016-10-08 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/13690#discussion_r82513082
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/r/DecisionTreeRegressorWrapper.scala 
---
@@ -0,0 +1,127 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.r
+
+import org.apache.hadoop.fs.Path
+import org.json4s._
+import org.json4s.JsonDSL._
+import org.json4s.jackson.JsonMethods._
+
+import org.apache.spark.ml.{Pipeline, PipelineModel}
+import org.apache.spark.ml.attribute.AttributeGroup
+import org.apache.spark.ml.feature.RFormula
+import org.apache.spark.ml.regression.{DecisionTreeRegressionModel, 
DecisionTreeRegressor}
+import org.apache.spark.ml.util._
+import org.apache.spark.sql.{DataFrame, Dataset}
+
+private[r] class DecisionTreeRegressorWrapper private (
+  val pipeline: PipelineModel,
+  val features: Array[String],
+  val maxDepth: Int,
+  val maxBins: Int) extends MLWritable {
+
+  private val DTModel: DecisionTreeRegressionModel =
+pipeline.stages(1).asInstanceOf[DecisionTreeRegressionModel]
+
+  lazy val depth: Int = DTModel.depth
+  lazy val numNodes: Int = DTModel.numNodes
+
+  def summary: String = DTModel.toDebugString
+
+  def transform(dataset: Dataset[_]): DataFrame = {
+pipeline.transform(dataset)
+  .drop(DTModel.getFeaturesCol)
+  }
+
+  override def write: MLWriter = new
+  DecisionTreeRegressorWrapper.DecisionTreeRegressorWrapperWriter(this)
+}
+
+private[r] object DecisionTreeRegressorWrapper extends 
MLReadable[DecisionTreeRegressorWrapper] {
+  def fit(data: DataFrame,
+  formula: String,
+  maxDepth: Int,
+  maxBins: Int): DecisionTreeRegressorWrapper = {
+
+val rFormula = new RFormula()
+  .setFormula(formula)
+  .setFeaturesCol("features")
--- End diff --

ditto


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15334: [SPARK-10367][SQL][WIP] Support Parquet logical type INT...

2016-10-08 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/15334
  
I saw that @viirya submitted a PR for the same issue: 
https://github.com/apache/spark/pull/7793


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13690: [SPARK-15767][R][ML] Decision Tree Regression wra...

2016-10-08 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/13690#discussion_r82512809
  
--- Diff: R/pkg/R/mllib.R ---
@@ -1427,3 +1447,185 @@ print.summary.KSTest <- function(x, ...) {
   cat(summaryStr, "\n")
   invisible(x)
 }
+
+#' Decision Tree Model for Regression and Classification
+#'
+#' \code{spark.decisionTree} fits a Decision Tree Regression model or 
Classification model on
+#' a SparkDataFrame. Users can call \code{summary} to get a summary of the 
fitted Decision Tree
+#' model, \code{predict} to make predictions on new data, and 
\code{write.ml}/\code{read.ml} to
+#' save/load fitted models.
+#' For more details, see 
\href{https://en.wikipedia.org/wiki/Decision_tree_learning}{Decision Tree}
+#'
+#' @param data a SparkDataFrame for training.
+#' @param formula a symbolic description of the model to be fitted. 
Currently only a few formula
+#'operators are supported, including '~', ':', '+', and 
'-'.
+#' @param type type of model to fit
+#' @param maxDepth Maximum depth of the tree (>= 0).
+#' @param maxBins Maximum number of bins used for discretizing continuous 
features and for choosing
+#'how to split on features at each node. More bins give 
higher granularity. Must be
+#'>= 2 and >= number of categories in any categorical 
feature. (default = 32)
+#' @param ... additional arguments passed to the method.
+#' @aliases spark.decisionTree,SparkDataFrame,formula-method
+#' @return \code{spark.decisionTree} returns a fitted Decision Tree model.
+#' @rdname spark.decisionTree
+#' @name spark.decisionTree
+#' @export
+#' @examples
+#' \dontrun{
+#' df <- createDataFrame(longley)
+#'
+#' # fit a Decision Tree Regression Model
+#' model <- spark.decisionTree(data, Employed ~ ., type = "regression", 
maxDepth = 5, maxBins = 16)
+#'
+#' # get the summary of the model
+#' summary(model)
+#'
+#' # make predictions
+#' predictions <- predict(model, df)
+#'
+#' # save and load the model
+#' path <- "path/to/model"
+#' write.ml(model, path)
+#' savedModel <- read.ml(path)
+#' summary(savedModel)
+#' }
+#' @note spark.decisionTree since 2.1.0
+setMethod("spark.decisionTree", signature(data = "SparkDataFrame", formula 
= "formula"),
+  function(data, formula, type = c("regression", "classification"),
+   maxDepth = 5, maxBins = 32 ) {
+type <- match.arg(type)
+formula <- paste(deparse(formula), collapse = "")
+switch(type,
+   regression =  {
+ jobj <- 
callJStatic("org.apache.spark.ml.r.DecisionTreeRegressorWrapper",
+ "fit", data@sdf, formula, 
as.integer(maxDepth),
+ as.integer(maxBins))
+ new("DecisionTreeRegressionModel", jobj = jobj)
+   },
+   classification = {
+ jobj <- 
callJStatic("org.apache.spark.ml.r.DecisionTreeClassifierWrapper",
+ "fit", data@sdf, formula, 
as.integer(maxDepth),
+ as.integer(maxBins))
+ new("DecisionTreeClassificationModel", jobj = jobj)
+   }
+)
+  })
+
+# Makes predictions from a Decision Tree Regression model or
+# a model produced by spark.decisionTree()
+
+#' @param newData a SparkDataFrame for testing.
+#' @return \code{predict} returns a SparkDataFrame containing predicted 
labeled in a column named
+#' "prediction"
+#' @rdname spark.decisionTree
+#' @export
+#' @note predict(decisionTreeRegressionModel) since 2.1.0
+setMethod("predict", signature(object = "DecisionTreeRegressionModel"),
+  function(object, newData) {
+predict_internal(object, newData)
+  })
+
+#' @rdname spark.decisionTree
+#' @export
+#' @note predict(decisionTreeClassificationModel) since 2.1.0
+setMethod("predict", signature(object = "DecisionTreeClassificationModel"),
+  function(object, newData) {
+predict_internal(object, newData)
+  })
+
+#' Save the Decision Tree Regression model to the input path.
+#'
+#' @param object A fitted Decision tree regression model
+#' @param path The directory where the model is saved
+#' @param overwrite Overwrites or not if the output path already exists. 
Default is FALSE
+#'  which means throw exception if the output path exists.
+#'
+#' @aliases write.ml,DecisionTreeRegressionModel,character-method
+#' @rdname spark.deci

[GitHub] spark pull request #13690: [SPARK-15767][R][ML] Decision Tree Regression wra...

2016-10-08 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/13690#discussion_r82512689
  
--- Diff: R/pkg/R/mllib.R ---
@@ -1427,3 +1447,185 @@ print.summary.KSTest <- function(x, ...) {
   cat(summaryStr, "\n")
   invisible(x)
 }
+
+#' Decision Tree Model for Regression and Classification
+#'
+#' \code{spark.decisionTree} fits a Decision Tree Regression model or 
Classification model on
+#' a SparkDataFrame. Users can call \code{summary} to get a summary of the 
fitted Decision Tree
+#' model, \code{predict} to make predictions on new data, and 
\code{write.ml}/\code{read.ml} to
+#' save/load fitted models.
+#' For more details, see 
\href{https://en.wikipedia.org/wiki/Decision_tree_learning}{Decision Tree}
--- End diff --

Could you point this URL to the Spark programming guide instead, like 
http://spark.apache.org/docs/latest/ml-classification-regression.html
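
i.e., the roxygen link could look something like this (a sketch, reusing 
the guide URL above):

    #' For more details, see
    #' \href{http://spark.apache.org/docs/latest/ml-classification-regression.html}{Decision Trees}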


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13690: [SPARK-15767][R][ML] Decision Tree Regression wra...

2016-10-08 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/13690#discussion_r82512723
  
--- Diff: R/pkg/R/mllib.R ---
@@ -1427,3 +1447,185 @@ print.summary.KSTest <- function(x, ...) {
   cat(summaryStr, "\n")
   invisible(x)
 }
+
+#' Decision Tree Model for Regression and Classification
+#'
+#' \code{spark.decisionTree} fits a Decision Tree Regression model or 
Classification model on
+#' a SparkDataFrame. Users can call \code{summary} to get a summary of the 
fitted Decision Tree
+#' model, \code{predict} to make predictions on new data, and 
\code{write.ml}/\code{read.ml} to
+#' save/load fitted models.
+#' For more details, see 
\href{https://en.wikipedia.org/wiki/Decision_tree_learning}{Decision Tree}
+#'
+#' @param data a SparkDataFrame for training.
+#' @param formula a symbolic description of the model to be fitted. 
Currently only a few formula
+#'operators are supported, including '~', ':', '+', and 
'-'.
+#' @param type type of model to fit
+#' @param maxDepth Maximum depth of the tree (>= 0).
+#' @param maxBins Maximum number of bins used for discretizing continuous 
features and for choosing
+#'how to split on features at each node. More bins give 
higher granularity. Must be
+#'>= 2 and >= number of categories in any categorical 
feature. (default = 32)
+#' @param ... additional arguments passed to the method.
+#' @aliases spark.decisionTree,SparkDataFrame,formula-method
+#' @return \code{spark.decisionTree} returns a fitted Decision Tree model.
+#' @rdname spark.decisionTree
+#' @name spark.decisionTree
+#' @export
+#' @examples
+#' \dontrun{
+#' df <- createDataFrame(longley)
+#'
+#' # fit a Decision Tree Regression Model
+#' model <- spark.decisionTree(data, Employed ~ ., type = "regression", 
maxDepth = 5, maxBins = 16)
+#'
--- End diff --

Could we add an example for "classification" too?
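
Something along these lines, perhaps (a sketch only; using `iris` as the 
example dataset is my assumption, and it assumes an active SparkR session, 
where the dots in the `iris` column names are replaced with underscores):

    # fit a Decision Tree Classification Model
    irisDF <- createDataFrame(iris)
    clfModel <- spark.decisionTree(irisDF, Species ~ Petal_Length + Petal_Width,
                                   type = "classification", maxDepth = 5, maxBins = 32)
    summary(clfModel)
    clfPredictions <- predict(clfModel, irisDF)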


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13690: [SPARK-15767][R][ML] Decision Tree Regression wra...

2016-10-08 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/13690#discussion_r82512914
  
--- Diff: R/pkg/R/mllib.R ---
@@ -1427,3 +1447,185 @@ print.summary.KSTest <- function(x, ...) {
   cat(summaryStr, "\n")
   invisible(x)
 }
+
+#' Decision Tree Model for Regression and Classification
+#'
+#' \code{spark.decisionTree} fits a Decision Tree Regression model or 
Classification model on
+#' a SparkDataFrame. Users can call \code{summary} to get a summary of the 
fitted Decision Tree
+#' model, \code{predict} to make predictions on new data, and 
\code{write.ml}/\code{read.ml} to
+#' save/load fitted models.
+#' For more details, see 
\href{https://en.wikipedia.org/wiki/Decision_tree_learning}{Decision Tree}
+#'
+#' @param data a SparkDataFrame for training.
+#' @param formula a symbolic description of the model to be fitted. 
Currently only a few formula
+#'operators are supported, including '~', ':', '+', and 
'-'.
+#' @param type type of model to fit
+#' @param maxDepth Maximum depth of the tree (>= 0).
+#' @param maxBins Maximum number of bins used for discretizing continuous 
features and for choosing
+#'how to split on features at each node. More bins give 
higher granularity. Must be
+#'>= 2 and >= number of categories in any categorical 
feature. (default = 32)
+#' @param ... additional arguments passed to the method.
+#' @aliases spark.decisionTree,SparkDataFrame,formula-method
+#' @return \code{spark.decisionTree} returns a fitted Decision Tree model.
+#' @rdname spark.decisionTree
+#' @name spark.decisionTree
+#' @export
+#' @examples
+#' \dontrun{
+#' df <- createDataFrame(longley)
+#'
+#' # fit a Decision Tree Regression Model
+#' model <- spark.decisionTree(data, Employed ~ ., type = "regression", 
maxDepth = 5, maxBins = 16)
+#'
+#' # get the summary of the model
+#' summary(model)
+#'
+#' # make predictions
+#' predictions <- predict(model, df)
+#'
+#' # save and load the model
+#' path <- "path/to/model"
+#' write.ml(model, path)
+#' savedModel <- read.ml(path)
+#' summary(savedModel)
+#' }
+#' @note spark.decisionTree since 2.1.0
+setMethod("spark.decisionTree", signature(data = "SparkDataFrame", formula 
= "formula"),
+  function(data, formula, type = c("regression", "classification"),
+   maxDepth = 5, maxBins = 32 ) {
+type <- match.arg(type)
+formula <- paste(deparse(formula), collapse = "")
+switch(type,
+   regression =  {
+ jobj <- 
callJStatic("org.apache.spark.ml.r.DecisionTreeRegressorWrapper",
+ "fit", data@sdf, formula, 
as.integer(maxDepth),
+ as.integer(maxBins))
+ new("DecisionTreeRegressionModel", jobj = jobj)
+   },
+   classification = {
+ jobj <- 
callJStatic("org.apache.spark.ml.r.DecisionTreeClassifierWrapper",
+ "fit", data@sdf, formula, 
as.integer(maxDepth),
+ as.integer(maxBins))
+ new("DecisionTreeClassificationModel", jobj = jobj)
+   }
+)
+  })
+
+# Makes predictions from a Decision Tree Regression model or
+# a model produced by spark.decisionTree()
+
+#' @param newData a SparkDataFrame for testing.
+#' @return \code{predict} returns a SparkDataFrame containing predicted 
labeled in a column named
+#' "prediction"
+#' @rdname spark.decisionTree
+#' @export
+#' @note predict(decisionTreeRegressionModel) since 2.1.0
+setMethod("predict", signature(object = "DecisionTreeRegressionModel"),
+  function(object, newData) {
+predict_internal(object, newData)
+  })
+
+#' @rdname spark.decisionTree
+#' @export
+#' @note predict(decisionTreeClassificationModel) since 2.1.0
+setMethod("predict", signature(object = "DecisionTreeClassificationModel"),
+  function(object, newData) {
+predict_internal(object, newData)
+  })
+
+#' Save the Decision Tree Regression model to the input path.
+#'
+#' @param object A fitted Decision tree regression model
+#' @param path The directory where the model is saved
+#' @param overwrite Overwrites or not if the output path already exists. 
Default is FALSE
+#'  which means throw exception if the output path exists.
+#'
+#' @aliases write.ml,DecisionTreeRegressionModel,character-method
+#' @rdname spark.deci

[GitHub] spark pull request #13690: [SPARK-15767][R][ML] Decision Tree Regression wra...

2016-10-08 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/13690#discussion_r82512937
  
--- Diff: R/pkg/R/mllib.R ---
@@ -1427,3 +1447,185 @@ print.summary.KSTest <- function(x, ...) {
   cat(summaryStr, "\n")
   invisible(x)
 }
+
+#' Decision Tree Model for Regression and Classification
+#'
+#' \code{spark.decisionTree} fits a Decision Tree Regression model or 
Classification model on
+#' a SparkDataFrame. Users can call \code{summary} to get a summary of the 
fitted Decision Tree
+#' model, \code{predict} to make predictions on new data, and 
\code{write.ml}/\code{read.ml} to
+#' save/load fitted models.
+#' For more details, see 
\href{https://en.wikipedia.org/wiki/Decision_tree_learning}{Decision Tree}
+#'
+#' @param data a SparkDataFrame for training.
+#' @param formula a symbolic description of the model to be fitted. 
Currently only a few formula
+#'operators are supported, including '~', ':', '+', and 
'-'.
+#' @param type type of model to fit
+#' @param maxDepth Maximum depth of the tree (>= 0).
+#' @param maxBins Maximum number of bins used for discretizing continuous 
features and for choosing
+#'how to split on features at each node. More bins give 
higher granularity. Must be
+#'>= 2 and >= number of categories in any categorical 
feature. (default = 32)
+#' @param ... additional arguments passed to the method.
+#' @aliases spark.decisionTree,SparkDataFrame,formula-method
+#' @return \code{spark.decisionTree} returns a fitted Decision Tree model.
+#' @rdname spark.decisionTree
+#' @name spark.decisionTree
+#' @export
+#' @examples
+#' \dontrun{
+#' df <- createDataFrame(longley)
+#'
+#' # fit a Decision Tree Regression Model
+#' model <- spark.decisionTree(data, Employed ~ ., type = "regression", 
maxDepth = 5, maxBins = 16)
+#'
+#' # get the summary of the model
+#' summary(model)
+#'
+#' # make predictions
+#' predictions <- predict(model, df)
+#'
+#' # save and load the model
+#' path <- "path/to/model"
+#' write.ml(model, path)
+#' savedModel <- read.ml(path)
+#' summary(savedModel)
+#' }
+#' @note spark.decisionTree since 2.1.0
+setMethod("spark.decisionTree", signature(data = "SparkDataFrame", formula 
= "formula"),
+  function(data, formula, type = c("regression", "classification"),
+   maxDepth = 5, maxBins = 32 ) {
+type <- match.arg(type)
+formula <- paste(deparse(formula), collapse = "")
+switch(type,
+   regression =  {
+ jobj <- 
callJStatic("org.apache.spark.ml.r.DecisionTreeRegressorWrapper",
+ "fit", data@sdf, formula, 
as.integer(maxDepth),
+ as.integer(maxBins))
+ new("DecisionTreeRegressionModel", jobj = jobj)
+   },
+   classification = {
+ jobj <- 
callJStatic("org.apache.spark.ml.r.DecisionTreeClassifierWrapper",
+ "fit", data@sdf, formula, 
as.integer(maxDepth),
+ as.integer(maxBins))
+ new("DecisionTreeClassificationModel", jobj = jobj)
+   }
+)
+  })
+
+# Makes predictions from a Decision Tree Regression model or
+# a model produced by spark.decisionTree()
+
+#' @param newData a SparkDataFrame for testing.
+#' @return \code{predict} returns a SparkDataFrame containing predicted 
labeled in a column named
+#' "prediction"
+#' @rdname spark.decisionTree
+#' @export
+#' @note predict(decisionTreeRegressionModel) since 2.1.0
+setMethod("predict", signature(object = "DecisionTreeRegressionModel"),
+  function(object, newData) {
+predict_internal(object, newData)
+  })
+
+#' @rdname spark.decisionTree
+#' @export
+#' @note predict(decisionTreeClassificationModel) since 2.1.0
+setMethod("predict", signature(object = "DecisionTreeClassificationModel"),
+  function(object, newData) {
+predict_internal(object, newData)
+  })
+
+#' Save the Decision Tree Regression model to the input path.
+#'
+#' @param object A fitted Decision tree regression model
+#' @param path The directory where the model is saved
+#' @param overwrite Overwrites or not if the output path already exists. 
Default is FALSE
+#'  which means throw exception if the output path exists.
+#'
+#' @aliases write.ml,DecisionTreeRegressionModel,character-method
+#' @rdname spark.deci

[GitHub] spark pull request #13690: [SPARK-15767][R][ML] Decision Tree Regression wra...

2016-10-08 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/13690#discussion_r82512799
  
--- Diff: R/pkg/R/mllib.R ---
@@ -1427,3 +1447,185 @@ print.summary.KSTest <- function(x, ...) {
   cat(summaryStr, "\n")
   invisible(x)
 }
+
+#' Decision Tree Model for Regression and Classification
+#'
+#' \code{spark.decisionTree} fits a Decision Tree Regression model or 
Classification model on
+#' a SparkDataFrame. Users can call \code{summary} to get a summary of the 
fitted Decision Tree
+#' model, \code{predict} to make predictions on new data, and 
\code{write.ml}/\code{read.ml} to
+#' save/load fitted models.
+#' For more details, see 
\href{https://en.wikipedia.org/wiki/Decision_tree_learning}{Decision Tree}
+#'
+#' @param data a SparkDataFrame for training.
+#' @param formula a symbolic description of the model to be fitted. 
Currently only a few formula
+#'operators are supported, including '~', ':', '+', and 
'-'.
+#' @param type type of model to fit
+#' @param maxDepth Maximum depth of the tree (>= 0).
+#' @param maxBins Maximum number of bins used for discretizing continuous 
features and for choosing
+#'how to split on features at each node. More bins give 
higher granularity. Must be
+#'>= 2 and >= number of categories in any categorical 
feature. (default = 32)
+#' @param ... additional arguments passed to the method.
+#' @aliases spark.decisionTree,SparkDataFrame,formula-method
+#' @return \code{spark.decisionTree} returns a fitted Decision Tree model.
+#' @rdname spark.decisionTree
+#' @name spark.decisionTree
+#' @export
+#' @examples
+#' \dontrun{
+#' df <- createDataFrame(longley)
+#'
+#' # fit a Decision Tree Regression Model
+#' model <- spark.decisionTree(data, Employed ~ ., type = "regression", 
maxDepth = 5, maxBins = 16)
+#'
+#' # get the summary of the model
+#' summary(model)
+#'
+#' # make predictions
+#' predictions <- predict(model, df)
+#'
+#' # save and load the model
+#' path <- "path/to/model"
+#' write.ml(model, path)
+#' savedModel <- read.ml(path)
+#' summary(savedModel)
+#' }
+#' @note spark.decisionTree since 2.1.0
+setMethod("spark.decisionTree", signature(data = "SparkDataFrame", formula 
= "formula"),
+  function(data, formula, type = c("regression", "classification"),
+   maxDepth = 5, maxBins = 32 ) {
+type <- match.arg(type)
+formula <- paste(deparse(formula), collapse = "")
+switch(type,
+   regression =  {
+ jobj <- 
callJStatic("org.apache.spark.ml.r.DecisionTreeRegressorWrapper",
+ "fit", data@sdf, formula, 
as.integer(maxDepth),
+ as.integer(maxBins))
+ new("DecisionTreeRegressionModel", jobj = jobj)
+   },
+   classification = {
+ jobj <- 
callJStatic("org.apache.spark.ml.r.DecisionTreeClassifierWrapper",
+ "fit", data@sdf, formula, 
as.integer(maxDepth),
+ as.integer(maxBins))
+ new("DecisionTreeClassificationModel", jobj = jobj)
+   }
+)
+  })
+
+# Makes predictions from a Decision Tree Regression model or
+# a model produced by spark.decisionTree()
+
+#' @param newData a SparkDataFrame for testing.
+#' @return \code{predict} returns a SparkDataFrame containing predicted 
labeled in a column named
+#' "prediction"
+#' @rdname spark.decisionTree
+#' @export
+#' @note predict(decisionTreeRegressionModel) since 2.1.0
+setMethod("predict", signature(object = "DecisionTreeRegressionModel"),
+  function(object, newData) {
+predict_internal(object, newData)
+  })
+
+#' @rdname spark.decisionTree
+#' @export
+#' @note predict(decisionTreeClassificationModel) since 2.1.0
--- End diff --

add `@aliases`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13690: [SPARK-15767][R][ML] Decision Tree Regression wra...

2016-10-08 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/13690#discussion_r82512787
  
--- Diff: R/pkg/R/mllib.R ---
@@ -1427,3 +1447,185 @@ print.summary.KSTest <- function(x, ...) {
   cat(summaryStr, "\n")
   invisible(x)
 }
+
+#' Decision Tree Model for Regression and Classification
+#'
+#' \code{spark.decisionTree} fits a Decision Tree Regression model or 
Classification model on
+#' a SparkDataFrame. Users can call \code{summary} to get a summary of the 
fitted Decision Tree
+#' model, \code{predict} to make predictions on new data, and 
\code{write.ml}/\code{read.ml} to
+#' save/load fitted models.
+#' For more details, see 
\href{https://en.wikipedia.org/wiki/Decision_tree_learning}{Decision Tree}
+#'
+#' @param data a SparkDataFrame for training.
+#' @param formula a symbolic description of the model to be fitted. 
Currently only a few formula
+#'operators are supported, including '~', ':', '+', and 
'-'.
+#' @param type type of model to fit
+#' @param maxDepth Maximum depth of the tree (>= 0).
+#' @param maxBins Maximum number of bins used for discretizing continuous 
features and for choosing
+#'how to split on features at each node. More bins give 
higher granularity. Must be
+#'>= 2 and >= number of categories in any categorical 
feature. (default = 32)
+#' @param ... additional arguments passed to the method.
+#' @aliases spark.decisionTree,SparkDataFrame,formula-method
+#' @return \code{spark.decisionTree} returns a fitted Decision Tree model.
+#' @rdname spark.decisionTree
+#' @name spark.decisionTree
+#' @export
+#' @examples
+#' \dontrun{
+#' df <- createDataFrame(longley)
+#'
+#' # fit a Decision Tree Regression Model
+#' model <- spark.decisionTree(data, Employed ~ ., type = "regression", 
maxDepth = 5, maxBins = 16)
+#'
+#' # get the summary of the model
+#' summary(model)
+#'
+#' # make predictions
+#' predictions <- predict(model, df)
+#'
+#' # save and load the model
+#' path <- "path/to/model"
+#' write.ml(model, path)
+#' savedModel <- read.ml(path)
+#' summary(savedModel)
+#' }
+#' @note spark.decisionTree since 2.1.0
+setMethod("spark.decisionTree", signature(data = "SparkDataFrame", formula 
= "formula"),
+  function(data, formula, type = c("regression", "classification"),
+   maxDepth = 5, maxBins = 32 ) {
+type <- match.arg(type)
+formula <- paste(deparse(formula), collapse = "")
+switch(type,
+   regression =  {
+ jobj <- 
callJStatic("org.apache.spark.ml.r.DecisionTreeRegressorWrapper",
+ "fit", data@sdf, formula, 
as.integer(maxDepth),
+ as.integer(maxBins))
+ new("DecisionTreeRegressionModel", jobj = jobj)
+   },
+   classification = {
+ jobj <- 
callJStatic("org.apache.spark.ml.r.DecisionTreeClassifierWrapper",
+ "fit", data@sdf, formula, 
as.integer(maxDepth),
+ as.integer(maxBins))
+ new("DecisionTreeClassificationModel", jobj = jobj)
+   }
+)
+  })
+
+# Makes predictions from a Decision Tree Regression model or
+# a model produced by spark.decisionTree()
--- End diff --

Isn't the `Decision Tree Regression model` itself produced by 
`spark.decisionTree()`? Could you clarify this comment?
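
For instance, the comment could read (suggested wording only):

    # Makes predictions from a Decision Tree Regression or Classification model
    # produced by spark.decisionTree()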


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13690: [SPARK-15767][R][ML] Decision Tree Regression wra...

2016-10-08 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/13690#discussion_r82512727
  
--- Diff: R/pkg/R/mllib.R ---
@@ -1427,3 +1447,185 @@ print.summary.KSTest <- function(x, ...) {
   cat(summaryStr, "\n")
   invisible(x)
 }
+
+#' Decision Tree Model for Regression and Classification
+#'
+#' \code{spark.decisionTree} fits a Decision Tree Regression model or 
Classification model on
+#' a SparkDataFrame. Users can call \code{summary} to get a summary of the 
fitted Decision Tree
+#' model, \code{predict} to make predictions on new data, and 
\code{write.ml}/\code{read.ml} to
+#' save/load fitted models.
+#' For more details, see 
\href{https://en.wikipedia.org/wiki/Decision_tree_learning}{Decision Tree}
+#'
+#' @param data a SparkDataFrame for training.
+#' @param formula a symbolic description of the model to be fitted. 
Currently only a few formula
+#'operators are supported, including '~', ':', '+', and 
'-'.
+#' @param type type of model to fit
+#' @param maxDepth Maximum depth of the tree (>= 0).
+#' @param maxBins Maximum number of bins used for discretizing continuous 
features and for choosing
+#'how to split on features at each node. More bins give 
higher granularity. Must be
+#'>= 2 and >= number of categories in any categorical 
feature. (default = 32)
+#' @param ... additional arguments passed to the method.
+#' @aliases spark.decisionTree,SparkDataFrame,formula-method
+#' @return \code{spark.decisionTree} returns a fitted Decision Tree model.
+#' @rdname spark.decisionTree
+#' @name spark.decisionTree
+#' @export
+#' @examples
+#' \dontrun{
+#' df <- createDataFrame(longley)
+#'
+#' # fit a Decision Tree Regression Model
+#' model <- spark.decisionTree(data, Employed ~ ., type = "regression", 
maxDepth = 5, maxBins = 16)
+#'
+#' # get the summary of the model
+#' summary(model)
+#'
+#' # make predictions
+#' predictions <- predict(model, df)
+#'
+#' # save and load the model
+#' path <- "path/to/model"
+#' write.ml(model, path)
+#' savedModel <- read.ml(path)
+#' summary(savedModel)
+#' }
+#' @note spark.decisionTree since 2.1.0
+setMethod("spark.decisionTree", signature(data = "SparkDataFrame", formula 
= "formula"),
+  function(data, formula, type = c("regression", "classification"),
+   maxDepth = 5, maxBins = 32 ) {
--- End diff --

nit: extra space after `32 )`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13690: [SPARK-15767][R][ML] Decision Tree Regression wra...

2016-10-08 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/13690#discussion_r82512770
  
--- Diff: R/pkg/R/mllib.R ---
@@ -1427,3 +1447,185 @@ print.summary.KSTest <- function(x, ...) {
   cat(summaryStr, "\n")
   invisible(x)
 }
+
+#' Decision Tree Model for Regression and Classification
+#'
+#' \code{spark.decisionTree} fits a Decision Tree Regression model or 
Classification model on
+#' a SparkDataFrame. Users can call \code{summary} to get a summary of the 
fitted Decision Tree
+#' model, \code{predict} to make predictions on new data, and 
\code{write.ml}/\code{read.ml} to
+#' save/load fitted models.
+#' For more details, see 
\href{https://en.wikipedia.org/wiki/Decision_tree_learning}{Decision Tree}
+#'
+#' @param data a SparkDataFrame for training.
+#' @param formula a symbolic description of the model to be fitted. 
Currently only a few formula
+#'operators are supported, including '~', ':', '+', and 
'-'.
+#' @param type type of model to fit
+#' @param maxDepth Maximum depth of the tree (>= 0).
+#' @param maxBins Maximum number of bins used for discretizing continuous 
features and for choosing
+#'how to split on features at each node. More bins give 
higher granularity. Must be
+#'>= 2 and >= number of categories in any categorical 
feature. (default = 32)
+#' @param ... additional arguments passed to the method.
--- End diff --

Should it support other parameters, like numClasses, features, or impurity?
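
As a rough illustration only (hypothetical; `impurity` and `seed` are 
illustrative parameter names I made up for the sketch, not the actual API), 
the argument handling could grow along these lines, with per-type defaults:

    # Hypothetical sketch of extended argument handling for spark.decisionTree
    decisionTreeArgs <- function(type = c("regression", "classification"),
                                 maxDepth = 5, maxBins = 32,
                                 impurity = NULL, seed = NULL) {
      type <- match.arg(type)
      if (is.null(impurity)) {
        # e.g. "variance" for regression, "gini" for classification
        impurity <- if (type == "regression") "variance" else "gini"
      }
      list(type = type, maxDepth = maxDepth, maxBins = maxBins,
           impurity = impurity, seed = seed)
    }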


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13690: [SPARK-15767][R][ML] Decision Tree Regression wra...

2016-10-08 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/13690#discussion_r82513063
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/r/DecisionTreeClassifierWrapper.scala 
---
@@ -0,0 +1,129 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.r
+
+import org.apache.hadoop.fs.Path
+import org.json4s._
+import org.json4s.JsonDSL._
+import org.json4s.jackson.JsonMethods._
+
+import org.apache.spark.ml.{Pipeline, PipelineModel}
+import org.apache.spark.ml.attribute.AttributeGroup
+import 
org.apache.spark.ml.classification.{DecisionTreeClassificationModel, 
DecisionTreeClassifier}
+import org.apache.spark.ml.feature.RFormula
+import org.apache.spark.ml.util._
+import org.apache.spark.sql.{DataFrame, Dataset}
+
+private[r] class DecisionTreeClassifierWrapper private (
+  val pipeline: PipelineModel,
+  val features: Array[String],
+  val maxDepth: Int,
+  val maxBins: Int) extends MLWritable {
+
+  private val DTModel: DecisionTreeClassificationModel =
+pipeline.stages(1).asInstanceOf[DecisionTreeClassificationModel]
+
+  lazy val depth: Int = DTModel.depth
+  lazy val numNodes: Int = DTModel.numNodes
+  lazy val numClasses: Int = DTModel.numClasses
+
+  def summary: String = DTModel.toDebugString
+
+  def transform(dataset: Dataset[_]): DataFrame = {
+pipeline.transform(dataset)
+  .drop(DTModel.getFeaturesCol)
+  }
+
+  override def write: MLWriter = new
+  
DecisionTreeClassifierWrapper.DecisionTreeClassifierWrapperWriter(this)
+}
+
+private[r] object DecisionTreeClassifierWrapper extends 
MLReadable[DecisionTreeClassifierWrapper] {
+  def fit(data: DataFrame,
+  formula: String,
+  maxDepth: Int,
+  maxBins: Int): DecisionTreeClassifierWrapper = {
+
+val rFormula = new RFormula()
+  .setFormula(formula)
+  .setFeaturesCol("features")
--- End diff --

Could you take a look at another model wrapper (like NaiveBayesWrapper) and 
`RWrapperUtils` to see how to handle the DataFrame column name? It shouldn't 
be hardcoded here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13690: [SPARK-15767][R][ML] Decision Tree Regression wra...

2016-10-08 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/13690#discussion_r82512912
  
--- Diff: R/pkg/R/mllib.R ---
@@ -1427,3 +1447,185 @@ print.summary.KSTest <- function(x, ...) {
   cat(summaryStr, "\n")
   invisible(x)
 }
+
+#' Decision Tree Model for Regression and Classification
+#'
+#' \code{spark.decisionTree} fits a Decision Tree Regression model or 
Classification model on
+#' a SparkDataFrame. Users can call \code{summary} to get a summary of the 
fitted Decision Tree
+#' model, \code{predict} to make predictions on new data, and 
\code{write.ml}/\code{read.ml} to
+#' save/load fitted models.
+#' For more details, see 
\href{https://en.wikipedia.org/wiki/Decision_tree_learning}{Decision Tree}
+#'
+#' @param data a SparkDataFrame for training.
+#' @param formula a symbolic description of the model to be fitted. 
Currently only a few formula
+#'operators are supported, including '~', ':', '+', and 
'-'.
+#' @param type type of model to fit
+#' @param maxDepth Maximum depth of the tree (>= 0).
+#' @param maxBins Maximum number of bins used for discretizing continuous 
features and for choosing
+#'how to split on features at each node. More bins give 
higher granularity. Must be
+#'>= 2 and >= number of categories in any categorical 
feature. (default = 32)
+#' @param ... additional arguments passed to the method.
+#' @aliases spark.decisionTree,SparkDataFrame,formula-method
+#' @return \code{spark.decisionTree} returns a fitted Decision Tree model.
+#' @rdname spark.decisionTree
+#' @name spark.decisionTree
+#' @export
+#' @examples
+#' \dontrun{
+#' df <- createDataFrame(longley)
+#'
+#' # fit a Decision Tree Regression Model
+#' model <- spark.decisionTree(data, Employed ~ ., type = "regression", 
maxDepth = 5, maxBins = 16)
+#'
+#' # get the summary of the model
+#' summary(model)
+#'
+#' # make predictions
+#' predictions <- predict(model, df)
+#'
+#' # save and load the model
+#' path <- "path/to/model"
+#' write.ml(model, path)
+#' savedModel <- read.ml(path)
+#' summary(savedModel)
+#' }
+#' @note spark.decisionTree since 2.1.0
+setMethod("spark.decisionTree", signature(data = "SparkDataFrame", formula 
= "formula"),
+  function(data, formula, type = c("regression", "classification"),
+   maxDepth = 5, maxBins = 32 ) {
+type <- match.arg(type)
+formula <- paste(deparse(formula), collapse = "")
+switch(type,
+   regression =  {
+ jobj <- 
callJStatic("org.apache.spark.ml.r.DecisionTreeRegressorWrapper",
+ "fit", data@sdf, formula, 
as.integer(maxDepth),
+ as.integer(maxBins))
+ new("DecisionTreeRegressionModel", jobj = jobj)
+   },
+   classification = {
+ jobj <- 
callJStatic("org.apache.spark.ml.r.DecisionTreeClassifierWrapper",
+ "fit", data@sdf, formula, 
as.integer(maxDepth),
+ as.integer(maxBins))
+ new("DecisionTreeClassificationModel", jobj = jobj)
+   }
+)
+  })
+
+# Makes predictions from a Decision Tree Regression model or
+# a model produced by spark.decisionTree()
+
+#' @param newData a SparkDataFrame for testing.
+#' @return \code{predict} returns a SparkDataFrame containing predicted 
labeled in a column named
+#' "prediction"
+#' @rdname spark.decisionTree
+#' @export
+#' @note predict(decisionTreeRegressionModel) since 2.1.0
+setMethod("predict", signature(object = "DecisionTreeRegressionModel"),
+  function(object, newData) {
+predict_internal(object, newData)
+  })
+
+#' @rdname spark.decisionTree
+#' @export
+#' @note predict(decisionTreeClassificationModel) since 2.1.0
+setMethod("predict", signature(object = "DecisionTreeClassificationModel"),
+  function(object, newData) {
+predict_internal(object, newData)
+  })
+
+#' Save the Decision Tree Regression model to the input path.
+#'
+#' @param object A fitted Decision tree regression model
+#' @param path The directory where the model is saved
+#' @param overwrite Overwrites or not if the output path already exists. 
Default is FALSE
+#'  which means throw exception if the output path exists.
+#'
+#' @aliases write.ml,DecisionTreeRegressionModel,character-method
+#' @rdname spark.deci

[GitHub] spark pull request #14959: [SPARK-17387][PYSPARK] Creating SparkContext() fr...

2016-10-08 Thread zjffdu
Github user zjffdu commented on a diff in the pull request:

https://github.com/apache/spark/pull/14959#discussion_r82512997
  
--- Diff: python/pyspark/conf.py ---
@@ -101,13 +101,25 @@ def __init__(self, loadDefaults=True, _jvm=None, 
_jconf=None):
 self._jconf = _jconf
 else:
 from pyspark.context import SparkContext
-SparkContext._ensure_initialized()
 _jvm = _jvm or SparkContext._jvm
-self._jconf = _jvm.SparkConf(loadDefaults)
+
+if _jvm:
+# JVM is created, so create self._jconf directly through 
JVM
+self._jconf = _jvm.SparkConf(loadDefaults)
+self._conf = None
+else:
+# JVM is not created, so store data in self._conf first
+self._jconf = None
+self._conf = {}
 
 def set(self, key, value):
 """Set a configuration property."""
-self._jconf.set(key, unicode(value))
+# Try to set self._jconf first if JVM is created, set self._conf 
if JVM is not created yet.
+if self._jconf:
+self._jconf.set(key, unicode(value))
+else:
+# Don't use unicode for self._conf, otherwise we will get 
exception when launching jvm.
+self._conf[key] = value
--- End diff --

Fixed. It might have been an issue introduced by my previous commits.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14959: [SPARK-17387][PYSPARK] Creating SparkContext() fr...

2016-10-08 Thread zjffdu
Github user zjffdu commented on a diff in the pull request:

https://github.com/apache/spark/pull/14959#discussion_r82512981
  
--- Diff: python/pyspark/conf.py ---
@@ -118,28 +130,28 @@ def setIfMissing(self, key, value):
 
 def setMaster(self, value):
 """Set master URL to connect to."""
-self._jconf.setMaster(value)
+self.set("spark.master", value)
 return self
 
 def setAppName(self, value):
 """Set application name."""
-self._jconf.setAppName(value)
+self.set("spark.app.name", value)
 return self
 
 def setSparkHome(self, value):
 """Set path where Spark is installed on worker nodes."""
-self._jconf.setSparkHome(value)
+self.set("spark.home", value)
 return self
 
 def setExecutorEnv(self, key=None, value=None, pairs=None):
 """Set an environment variable to be passed to executors."""
 if (key is not None and pairs is not None) or (key is None and 
pairs is None):
 raise Exception("Either pass one key-value pair or a list of 
pairs")
 elif key is not None:
-self._jconf.setExecutorEnv(key, value)
+self.set("spark.executorEnv." + key, value)
--- End diff --

Fixed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15404: Branch 2.0

2016-10-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15404
  
Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15404: Branch 2.0

2016-10-08 Thread yintengfei
GitHub user yintengfei opened a pull request:

https://github.com/apache/spark/pull/15404

Branch 2.0

## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)


## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration 
tests, manual tests)


(If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/apache/spark branch-2.0

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15404.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15404


commit 5735b8bd769c64e2b0e0fae75bad794cde3edc99
Author: Reynold Xin 
Date:   2016-08-18T08:37:25Z

[SPARK-16391][SQL] Support partial aggregation for reduceGroups

## What changes were proposed in this pull request?
This patch introduces a new private ReduceAggregator interface that is a 
subclass of Aggregator. ReduceAggregator only requires a single associative and 
commutative reduce function. ReduceAggregator is also used to implement 
KeyValueGroupedDataset.reduceGroups in order to support partial aggregation.

Note that the pull request was initially done by viirya.

## How was this patch tested?
Covered by original tests for reduceGroups, as well as a new test suite for 
ReduceAggregator.

Author: Reynold Xin 
Author: Liang-Chi Hsieh 

Closes #14576 from rxin/reduceAggregator.

(cherry picked from commit 1748f824101870b845dbbd118763c6885744f98a)
Signed-off-by: Wenchen Fan 

commit ec5f157a32f0c65b5f93bdde7a6334e982b3b83c
Author: petermaxlee 
Date:   2016-08-18T11:44:13Z

[SPARK-17117][SQL] 1 / NULL should not fail analysis

## What changes were proposed in this pull request?
This patch fixes the problem described in SPARK-17117, i.e. "SELECT 1 / 
NULL" throws an analysis exception:

```
org.apache.spark.sql.AnalysisException: cannot resolve '(1 / NULL)' due to 
data type mismatch: differing types in '(1 / NULL)' (int and null).
```

The problem is that division type coercion did not take null type into 
account.

## How was this patch tested?
A unit test for the type coercion, and a few end-to-end test cases using 
SQLQueryTestSuite.

Author: petermaxlee 

Closes #14695 from petermaxlee/SPARK-17117.

(cherry picked from commit 68f5087d2107d6afec5d5745f0cb0e9e3bdd6a0b)
Signed-off-by: Herman van Hovell 

commit 176af17a7213a4c2847a04f715137257657f2961
Author: Xin Ren 
Date:   2016-08-10T07:49:06Z

[MINOR][SPARKR] R API documentation for "coltypes" is confusing

## What changes were proposed in this pull request?

R API documentation for "coltypes" is confusing, found when working on 
another ticket.

Current version http://spark.apache.org/docs/2.0.0/api/R/coltypes.html, 
where parameters have 2 "x" which is a duplicate, and also the example is not 
very clear


![current](https://cloud.githubusercontent.com/assets/3925641/17386808/effb98ce-59a2-11e6-9657-d477d258a80c.png)

![screen shot 2016-08-03 at 5 56 00 
pm](https://cloud.githubusercontent.com/assets/3925641/17386884/91831096-59a3-11e6-84af-39890b3d45d8.png)

## How was this patch tested?

Tested manually on local machine. And the screenshots are like below:

![screen shot 2016-08-07 at 11 29 20 
pm](https://cloud.githubusercontent.com/assets/3925641/17471144/df36633c-5cf6-11e6-8238-4e32ead0e529.png)

![screen shot 2016-08-03 at 5 56 22 
pm](https://cloud.githubusercontent.com/assets/3925641/17386896/9d36cb26-59a3-11e6-9619-6dae29f7ab17.png)

Author: Xin Ren 

Closes #14489 from keypointt/rExample.

(cherry picked from commit 1203c8415cd11540f79a235e66a2f241ca6c71e4)
Signed-off-by: Shivaram Venkataraman 

commit ea684b69cd6934bc093f4a5a8b0d8470e92157cd
Author: Eric Liang 
Date:   2016-08-18T11:33:55Z

[SPARK-17069] Expose spark.range() as table-valued function in SQL

This adds analyzer rules for resolving table-valued functions, and adds one 
builtin implementation for range(). The arguments for range() are the same as 
those of `spark.range()`.

Unit tests.

cc hvanhovell

Author: Eric Liang 

Closes #14656 from ericl/sc-4309.

(cherry picked from commit 412dba63b511474a6db3c43c8618d803e604bc6b)
Signed-off-by: Reynold Xin 

commit c180d637a3caca0d4e46f4980c10d1005eb453bc
Author: petermaxlee 
Date:   2016-08-19T01:19:47Z

[SPARK-16947][SQL] Support type coercion and foldable expression for inline 
tables

This patch improves inline table support with the follow

[GitHub] spark pull request #14959: [SPARK-17387][PYSPARK] Creating SparkContext() fr...

2016-10-08 Thread zjffdu
Github user zjffdu commented on a diff in the pull request:

https://github.com/apache/spark/pull/14959#discussion_r82512714
  
--- Diff: python/pyspark/conf.py ---
@@ -149,35 +161,53 @@ def setAll(self, pairs):
 :param pairs: list of key-value pairs to set
 """
 for (k, v) in pairs:
-self._jconf.set(k, v)
+self.set(k, v)
 return self
 
 def get(self, key, defaultValue=None):
 """Get the configured value for some key, or return a default 
otherwise."""
 if defaultValue is None:   # Py4J doesn't call the right get() if 
we pass None
-if not self._jconf.contains(key):
-return None
-return self._jconf.get(key)
+if self._jconf:
+if not self._jconf.contains(key):
+return None
+return self._jconf.get(key)
+else:
+if key not in self._conf:
+return None
+return self._conf[key]
 else:
-return self._jconf.get(key, defaultValue)
+if self._jconf:
+return self._jconf.get(key, defaultValue)
+else:
+return self._conf.get(key, defaultValue)
 
 def getAll(self):
 """Get all values as a list of key-value pairs."""
 pairs = []
-for elem in self._jconf.getAll():
-pairs.append((elem._1(), elem._2()))
+if self._jconf:
+for elem in self._jconf.getAll():
+pairs.append((elem._1(), elem._2()))
+else:
+for k, v in self._conf.items():
+pairs.append((k, v))
 return pairs
 
 def contains(self, key):
 """Does this configuration contain a given key?"""
-return self._jconf.contains(key)
+if self._jconf:
+return self._jconf.contains(key)
+else:
+return key in self._conf
 
 def toDebugString(self):
 """
 Returns a printable version of the configuration, as a list of
 key=value pairs, one per line.
 """
-return self._jconf.toDebugString()
+if self._jconf:
+return self._jconf.toDebugString()
+else:
+return '\n'.join('%s=%s' % (k, v) for k, v in 
self._conf.items())
--- End diff --

They may be different, because `_jconf` includes the extra configuration from 
the JVM side (like spark-defaults.conf), while `self._conf` only has the 
configuration set on the Python side.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15360: [SPARK-17073] [SQL] [FOLLOWUP] generate column-level sta...

2016-10-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15360
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66586/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15360: [SPARK-17073] [SQL] [FOLLOWUP] generate column-level sta...

2016-10-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15360
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15360: [SPARK-17073] [SQL] [FOLLOWUP] generate column-level sta...

2016-10-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15360
  
**[Test build #66586 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66586/consoleFull)**
 for PR 15360 at commit 
[`2ee4252`](https://github.com/apache/spark/commit/2ee4252c785848873fa422ec49b697154b703133).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #9287: SPARK-11326: Split networking in standalone mode

2016-10-08 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/9287
  
+1 for closing.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #9287: SPARK-11326: Split networking in standalone mode

2016-10-08 Thread tnachen
Github user tnachen commented on the issue:

https://github.com/apache/spark/pull/9287
  
This has been stale for a while; we should close it if there is no update 
here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #12933: [Spark-15155][Mesos] Optionally ignore default role reso...

2016-10-08 Thread tnachen
Github user tnachen commented on the issue:

https://github.com/apache/spark/pull/12933
  
@hellertime Are you able to rebase?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #13077: [SPARK-10748] [Mesos] Log error instead of crashing Spar...

2016-10-08 Thread tnachen
Github user tnachen commented on the issue:

https://github.com/apache/spark/pull/13077
  
@devaraj-kavali Are you still able to update this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #13713: [SPARK-15994] [MESOS] Allow enabling Mesos fetch cache i...

2016-10-08 Thread tnachen
Github user tnachen commented on the issue:

https://github.com/apache/spark/pull/13713
  
@drcrallen Are you still planning to update this? It's quite a useful 
feature, so I am hoping it can get in. Also, since fine-grained mode is 
deprecated, I don't think we need to update it there as well.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15334: [SPARK-10367][SQL][WIP] Support Parquet logical type INT...

2016-10-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15334
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15334: [SPARK-10367][SQL][WIP] Support Parquet logical type INT...

2016-10-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15334
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66585/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15334: [SPARK-10367][SQL][WIP] Support Parquet logical type INT...

2016-10-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15334
  
**[Test build #66585 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66585/consoleFull)**
 for PR 15334 at commit 
[`72bc930`](https://github.com/apache/spark/commit/72bc93033b47266fd9661d72dcadbbd8ba906b4b).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #12691: [Spark-14761][SQL][WIP] Reject invalid join methods when...

2016-10-08 Thread bkpathak
Github user bkpathak commented on the issue:

https://github.com/apache/spark/pull/12691
  
Hi @holdenk, I am still interested in working on this, but it looks like I 
pulled and merged the master branch instead of rebasing. Should I close this 
and open another pull request, or how should I proceed?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15044: [SQL][SPARK-17490] Optimize SerializeFromObject() for a ...

2016-10-08 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/15044
  
@hvanhovell, after my investigation, I have added code to generate 
`UnsafeArrayData` in two code paths. Could you please review this again?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #12691: [Spark-14761][SQL][WIP] Reject invalid join metho...

2016-10-08 Thread bkpathak
GitHub user bkpathak reopened a pull request:

https://github.com/apache/spark/pull/12691

[Spark-14761][SQL][WIP] Reject invalid join methods  when join columns are 
not specified in PySpark DataFrame join.

## What changes were proposed in this pull request?

In PySpark, an invalid join type does not raise an error for the following 
join:
```df1.join(df2, how='not-a-valid-join-type')```

The signature of join is:
```def join(self, other, on=None, how=None):```
The existing code completely ignores the `how` parameter when `on` is 
`None`. This patch processes the arguments passed to join and passes them to 
the JVM Spark SQL Analyzer, which validates the join type, as sketched below.
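
A minimal sketch of the guard, assuming a hypothetical `_VALID_HOW` 
whitelist; the actual patch forwards `how` to the JVM analyzer rather than 
validating it in Python:

```python
# Hypothetical whitelist; the real validation happens in the JVM analyzer.
_VALID_HOW = {"inner", "outer", "full", "full_outer", "left", "left_outer",
              "right", "right_outer", "left_semi", "left_anti"}

def join(self, other, on=None, how=None):
    # Reject unknown join types up front instead of silently ignoring `how`
    # when `on` is None.
    if how is not None and how.lower() not in _VALID_HOW:
        raise ValueError("Unsupported join type: %r" % how)
    ...  # proceed with the existing join logic
```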

## How was this patch tested?
Tested manually and with the existing test suites.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/bkpathak/spark spark-14761

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/12691.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #12691


commit c76baff0cc4775c2191d075cc9a8176e4915fec8
Author: Bryan Cutler 
Date:   2016-09-11T09:19:39Z

[SPARK-17336][PYSPARK] Fix appending multiple times to PYTHONPATH from 
spark-config.sh

## What changes were proposed in this pull request?
During startup of Spark standalone, the script file spark-config.sh appends 
to the PYTHONPATH and can be sourced many times, causing duplicates in the 
path. This change adds an env flag that is set when the PYTHONPATH is 
appended, so the append happens only once.
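
The same guard, rendered in Python for illustration; the real fix is a few 
lines of shell in spark-config.sh, and the flag name and path below are 
assumptions:

```python
import os
import sys

# Append to the path only once, then set a flag so repeated sourcing
# (or re-running this block) is a no-op.
if os.environ.get("PYSPARK_PYTHONPATH_SET") is None:
    sys.path.insert(0, "/opt/spark/python")  # hypothetical install path
    os.environ["PYSPARK_PYTHONPATH_SET"] = "1"
```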

## How was this patch tested?
Manually started standalone master/worker and verified PYTHONPATH has no 
duplicate entries.

Author: Bryan Cutler 

Closes #15028 from BryanCutler/fix-duplicate-pythonpath-SPARK-17336.

commit 883c7631847a95684534222c1b6cfed8e62710c8
Author: Yanbo Liang 
Date:   2016-09-11T12:47:13Z

[SPARK-17389][FOLLOW-UP][ML] Change KMeans k-means|| default init steps 
from 5 to 2.

## What changes were proposed in this pull request?
#14956 reduced the default k-means|| init steps from 5 to 2 only for the 
spark.mllib package; we should make the same change for spark.ml and PySpark.
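
For callers who depend on the old behavior, a hedged PySpark example that 
pins the value explicitly instead of relying on the default:

```python
from pyspark.ml.clustering import KMeans

# Pinning initSteps makes results independent of the default change (5 -> 2).
kmeans = KMeans(k=10, initMode="k-means||", initSteps=2, seed=1)
# model = kmeans.fit(training_df)  # training_df: DataFrame with "features" column
```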

## How was this patch tested?
Existing tests.

Author: Yanbo Liang 

Closes #15050 from yanboliang/spark-17389.

commit 767d48076971f6f1e2c93ee540a9b2e5e465631b
Author: Sameer Agarwal 
Date:   2016-09-11T15:35:27Z

[SPARK-17415][SQL] Better error message for driver-side broadcast join OOMs

## What changes were proposed in this pull request?

This is a trivial patch that catches any `OutOfMemoryError` raised while 
building the broadcast hash relation and rethrows it wrapped in a clearer 
error message, as sketched below.
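
The wrap-and-rethrow pattern, sketched in Python terms; the actual change is 
Scala code on the driver side, and `build_relation` is a hypothetical 
stand-in:

```python
def build_relation(rows):
    # Hypothetical stand-in for building the broadcast hash relation.
    return {key: row for key, row in rows}

try:
    relation = build_relation([(1, "a"), (2, "b")])
except MemoryError as oom:
    # Rethrow with an actionable message instead of a bare OOM.
    raise RuntimeError(
        "Not enough memory to build and broadcast the table to all worker "
        "nodes; as a workaround, set spark.sql.autoBroadcastJoinThreshold "
        "to -1 to disable broadcast joins."
    ) from oom
```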

## How was this patch tested?

Existing Tests

Author: Sameer Agarwal 

Closes #14979 from sameeragarwal/broadcast-join-error.

commit 72eec70bdbf6fb67c977463db5d8d95dd3040ae8
Author: Josh Rosen 
Date:   2016-09-12T04:51:22Z

[SPARK-17486] Remove unused TaskMetricsUIData.updatedBlockStatuses field

The `TaskMetricsUIData.updatedBlockStatuses` field is assigned to but never 
read, increasing the memory consumption of the web UI. We should remove this 
field.

Author: Josh Rosen 

Closes #15038 from 
JoshRosen/remove-updated-block-statuses-from-TaskMetricsUIData.

commit cc87280fcd065b01667ca7a59a1a32c7ab757355
Author: cenyuhai 
Date:   2016-09-12T10:52:56Z

[SPARK-17171][WEB UI] DAG will list all partitions in the graph

## What changes were proposed in this pull request?
The DAG view lists every partition in the graph, which is slow and makes the 
whole graph hard to read. Usually we don't want to see all partitions, just 
the relations in the DAG graph, so I show only 2 root nodes for RDDs.

Before this PR, the DAG graph looks like 
[dag1.png](https://issues.apache.org/jira/secure/attachment/12824702/dag1.png) and 
[dag3.png](https://issues.apache.org/jira/secure/attachment/12825456/dag3.png); 
after this PR, it looks like 
[dag2.png](https://issues.apache.org/jira/secure/attachment/12824703/dag2.png) and 
[dag4.png](https://issues.apache.org/jira/secure/attachment/12825457/dag4.png).

Author: cenyuhai 
Author: 岑玉海 <261810...@qq.com>

Closes #14737 from cenyuhai/SPARK-17171.

commit 4efcdb7feae24e41d8120b59430f8b77cc2106a6
Author: codlife <1004910...@qq.com>
Date:   2016-09-12T11:10:46Z

[SPARK-17447] Performance improvement in Partitioner.defaultPartitioner 
without sortBy

## What changes were proposed in this pull request?

If there are many RDDs, in some situations the sort severely hurts 
performance. We don't actually need to sort the RDDs; we can just scan them 
once to gain
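
The single-scan idea, sketched in PySpark terms; the actual change is in 
Scala's Partitioner.defaultPartitioner, so this is an illustration, not the 
patch:

```python
# O(n) scan for the RDD with the most partitions, instead of sorting all
# candidate RDDs just to pick the largest one.
def largest_by_partitions(rdds):
    return max(rdds, key=lambda r: r.getNumPartitions())
```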

[GitHub] spark pull request #12691: [Spark-14761][SQL][WIP] Reject invalid join metho...

2016-10-08 Thread bkpathak
Github user bkpathak closed the pull request at:

https://github.com/apache/spark/pull/12691


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15360: [SPARK-17073] [SQL] [FOLLOWUP] generate column-level sta...

2016-10-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15360
  
**[Test build #66586 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66586/consoleFull)**
 for PR 15360 at commit 
[`2ee4252`](https://github.com/apache/spark/commit/2ee4252c785848873fa422ec49b697154b703133).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15360: [SPARK-17073] [SQL] [FOLLOWUP] generate column-level sta...

2016-10-08 Thread wzhfy
Github user wzhfy commented on the issue:

https://github.com/apache/spark/pull/15360
  
retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15371: [SPARK-17816] [Core] Fix ConcurrentModificationException...

2016-10-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15371
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


