[GitHub] spark issue #15413: [SPARK-17847] [ML] Copy GaussianMixture implementation f...

2016-10-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15413
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66625/
Test FAILed.





[GitHub] spark issue #15413: [SPARK-17847] [ML] Copy GaussianMixture implementation f...

2016-10-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15413
  
Merged build finished. Test FAILed.





[GitHub] spark issue #15413: [SPARK-17847] [ML] Copy GaussianMixture implementation f...

2016-10-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15413
  
**[Test build #66625 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66625/consoleFull)** for PR 15413 at commit [`5a8de4a`](https://github.com/apache/spark/commit/5a8de4a7289700d20e240dcf82b61552c213dcf8).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #15414: [SPARK-17848][ML] Move LabelCol datatype cast into Predi...

2016-10-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15414
  
**[Test build #66629 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66629/consoleFull)** for PR 15414 at commit [`5cb06fc`](https://github.com/apache/spark/commit/5cb06fcd7987d1889b42a47f38bff89a47161123).





[GitHub] spark pull request #15414: [SPARK-17848][ML] Move LabelCol datatype cast int...

2016-10-09 Thread zhengruifeng
GitHub user zhengruifeng opened a pull request:

https://github.com/apache/spark/pull/15414

[SPARK-17848][ML] Move LabelCol datatype cast into Predictor.fit

## What changes were proposed in this pull request?
1. Move the label-column cast into `Predictor`.
2. Then remove the now-unnecessary casts from the individual learners (a sketch of the idea follows below).
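
A minimal sketch of the idea, using a hypothetical `castLabel` helper (the real change would live inside `Predictor.fit`):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

// Hedged sketch: cast the label column to DoubleType once, up front, so the
// individual learners no longer need to repeat the cast themselves.
// `labelColName` is an illustrative stand-in for getLabelCol.
def castLabel(dataset: DataFrame, labelColName: String): DataFrame =
  dataset.withColumn(labelColName, col(labelColName).cast(DoubleType))
```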

## How was this patch tested?
Existing tests.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhengruifeng/spark move_cast

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15414.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15414


commit 5cb06fcd7987d1889b42a47f38bff89a47161123
Author: Zheng RuiFeng 
Date:   2016-10-10T05:44:47Z

create pr







[GitHub] spark issue #15292: [SPARK-17719][SPARK-17776][SQL] Unify and tie up options...

2016-10-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15292
  
**[Test build #66628 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66628/consoleFull)** for PR 15292 at commit [`350f55d`](https://github.com/apache/spark/commit/350f55da303c6ccc876a4f6d5a1e455dd3337343).





[GitHub] spark issue #15292: [SPARK-17719][SPARK-17776][SQL] Unify and tie up options...

2016-10-09 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/15292
  
retest this please





[GitHub] spark pull request #15316: [SPARK-17751] [SQL] Remove spark.sql.eagerAnalysi...

2016-10-09 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/15316#discussion_r82545437
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/AnalysisException.scala ---
@@ -43,6 +43,11 @@ class AnalysisException protected[sql] (
   }
 
   override def getMessage: String = {
+val planAnnotation = plan.map(p => s";\n$p").getOrElse("")
--- End diff --

Why do we need a separate method here?
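
For context, a simplified, hedged rendering of the idea in the hunk above (field names follow the snippet; the real class has more parameters and carries a `LogicalPlan`):

```scala
// Simplified sketch, not the PR's exact code: append the plan, when one is
// attached, to the base error message. Option[String] stands in for the
// real Option[LogicalPlan].
class AnalysisExceptionSketch(
    val message: String,
    val plan: Option[String] = None) extends Exception {
  override def getMessage: String = {
    val planAnnotation = plan.map(p => s";\n$p").getOrElse("")
    message + planAnnotation
  }
}
```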





[GitHub] spark issue #15371: [SPARK-17816] [Core] Fix ConcurrentModificationException...

2016-10-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15371
  
**[Test build #66627 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66627/consoleFull)** for PR 15371 at commit [`da2311a`](https://github.com/apache/spark/commit/da2311a5a0cee356169bed1a940bcb5bb0c87b26).





[GitHub] spark issue #15371: [SPARK-17816] [Core] Fix ConcurrentModificationException...

2016-10-09 Thread seyfe
Github user seyfe commented on the issue:

https://github.com/apache/spark/pull/15371
  
@zsxwing, I think that is a good idea. I searched, and that is the only place we use `BlockStatusesAccumulator`. Let me remove it.





[GitHub] spark issue #14136: [SPARK-16282][SQL] Implement percentile SQL function.

2016-10-09 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/14136
  
We definitely need this as a native implementation. One thing we should think about is memory management: `collect_list`, `collect_set`, and `percentile` are examples of functions that are very memory-heavy and can easily OOM.
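
For illustration, this is the kind of query where the aggregation buffer must hold every value of a group in memory before producing a result, so a skewed key can exhaust an executor (standard DataFrame API; data is illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.collect_list

val spark = SparkSession.builder().master("local[*]").appName("demo").getOrCreate()
import spark.implicits._

// collect_list buffers all values of each group before emitting the result,
// which is what makes this family of functions memory-heavy.
val df = Seq(("a", 1), ("a", 2), ("b", 3)).toDF("key", "value")
df.groupBy("key").agg(collect_list("value")).show()
```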





[GitHub] spark pull request #14788: [SPARK-17174][SQL] Add the support for TimestampT...

2016-10-09 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/14788#discussion_r82544981
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
@@ -2548,16 +2548,20 @@ object functions {
   def to_date(e: Column): Column = withExpr { ToDate(e.expr) }
 
   /**
-   * Returns date truncated to the unit specified by the format.
+   * Returns timestamp truncated to the unit specified by the format.
--- End diff --

doesn't this actually change the data type returned?





[GitHub] spark pull request #14788: [SPARK-17174][SQL] Add the support for TimestampT...

2016-10-09 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/14788#discussion_r82544965
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
@@ -2374,14 +2374,14 @@ object functions {
* @group datetime_funcs
* @since 1.5.0
*/
-  def date_add(start: Column, days: Int): Column = withExpr { 
DateAdd(start.expr, Literal(days)) }
+  def date_add(start: Column, days: Int): Column = withExpr { 
AddDays(start.expr, Literal(days)) }
--- End diff --

why change the name of these expressions?





[GitHub] spark issue #14788: [SPARK-17174][SQL] Add the support for TimestampType for...

2016-10-09 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/14788
  
Actually, can we avoid renaming these expressions? I don't see the point of renaming DateSub to SubDays; it just makes it more annoying to link the user-facing API with the internal expressions.






[GitHub] spark issue #15292: [SPARK-17719][SPARK-17776][SQL] Unify and tie up options...

2016-10-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15292
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66622/
Test FAILed.





[GitHub] spark issue #15292: [SPARK-17719][SPARK-17776][SQL] Unify and tie up options...

2016-10-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15292
  
Merged build finished. Test FAILed.





[GitHub] spark issue #15292: [SPARK-17719][SPARK-17776][SQL] Unify and tie up options...

2016-10-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15292
  
**[Test build #66622 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66622/consoleFull)** for PR 15292 at commit [`350f55d`](https://github.com/apache/spark/commit/350f55da303c6ccc876a4f6d5a1e455dd3337343).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14897: [SPARK-17338][SQL] add global temp view

2016-10-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14897
  
Merged build finished. Test PASSed.





[GitHub] spark issue #14788: [SPARK-17174][SQL] Add the support for TimestampType for...

2016-10-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14788
  
**[Test build #66626 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66626/consoleFull)** for PR 14788 at commit [`8c50b2c`](https://github.com/apache/spark/commit/8c50b2cecc8c69bac19206969bf33133779c6337).





[GitHub] spark issue #14897: [SPARK-17338][SQL] add global temp view

2016-10-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14897
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66620/
Test PASSed.





[GitHub] spark issue #14897: [SPARK-17338][SQL] add global temp view

2016-10-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14897
  
**[Test build #66620 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66620/consoleFull)** for PR 14897 at commit [`29e292a`](https://github.com/apache/spark/commit/29e292a954f1b07d80d03d0fd6c4ad4605b41ab7).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #15371: [SPARK-17816] [Core] Fix ConcurrentModificationException...

2016-10-09 Thread zsxwing
Github user zsxwing commented on the issue:

https://github.com/apache/spark/pull/15371
  
@seyfe I think we can remove `BlockStatusesAccumulator` and just use `private val _updatedBlockStatuses = new CollectionAccumulator[(BlockId, BlockStatus)]` instead. `BlockStatusesAccumulator` doesn't provide any more functionality than `CollectionAccumulator`. Sorry that I didn't notice that earlier.
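
A minimal sketch of the suggested replacement (the container class is illustrative, not the real `TaskMetrics`):

```scala
import org.apache.spark.storage.{BlockId, BlockStatus}
import org.apache.spark.util.CollectionAccumulator

// Sketch: a plain CollectionAccumulator already carries (BlockId, BlockStatus)
// pairs, so the BlockStatusesAccumulator subclass adds nothing and can go.
class TaskMetricsSketch {
  private val _updatedBlockStatuses =
    new CollectionAccumulator[(BlockId, BlockStatus)]

  def record(id: BlockId, status: BlockStatus): Unit =
    _updatedBlockStatuses.add((id, status)) // same add API as before
}
```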





[GitHub] spark issue #14788: [SPARK-17174][SQL] Add the support for TimestampType for...

2016-10-09 Thread hvanhovell
Github user hvanhovell commented on the issue:

https://github.com/apache/spark/pull/14788
  
LGTM - I'll merge as soon as tests complete successfully






[GitHub] spark issue #14788: [SPARK-17174][SQL] Add the support for TimestampType for...

2016-10-09 Thread hvanhovell
Github user hvanhovell commented on the issue:

https://github.com/apache/spark/pull/14788
  
retest this please





[GitHub] spark pull request #15371: [SPARK-17816] [Core] Fix ConcurrentModificationEx...

2016-10-09 Thread zsxwing
Github user zsxwing commented on a diff in the pull request:

https://github.com/apache/spark/pull/15371#discussion_r82544528
  
--- Diff: core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala ---
@@ -444,7 +444,9 @@ class CollectionAccumulator[T] extends AccumulatorV2[T, 
java.util.List[T]] {
 
   override def copy(): CollectionAccumulator[T] = {
 val newAcc = new CollectionAccumulator[T]
-newAcc._list.addAll(_list)
+_list.synchronized {
--- End diff --

Good catch
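
For reference, the fixed method reconstructed from the hunk above; holding the list's monitor during `addAll` keeps a concurrent `add` from a task thread from racing with the copy:

```scala
override def copy(): CollectionAccumulator[T] = {
  val newAcc = new CollectionAccumulator[T]
  // Synchronize on the backing list so a concurrent add() cannot trigger
  // a ConcurrentModificationException while addAll iterates it.
  _list.synchronized {
    newAcc._list.addAll(_list)
  }
  newAcc
}
```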





[GitHub] spark issue #15314: [SPARK-17747][ML] WeightCol support non-double datatypes

2016-10-09 Thread zhengruifeng
Github user zhengruifeng commented on the issue:

https://github.com/apache/spark/pull/15314
  
@sethah OK, I will open a new JIRA about labelCol. 





[GitHub] spark issue #15413: [SPARK-17847] [ML] Copy GaussianMixture implementation f...

2016-10-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15413
  
**[Test build #66625 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66625/consoleFull)** for PR 15413 at commit [`5a8de4a`](https://github.com/apache/spark/commit/5a8de4a7289700d20e240dcf82b61552c213dcf8).





[GitHub] spark pull request #15413: [SPARK-17847] [ML] Copy GaussianMixture implement...

2016-10-09 Thread yanboliang
GitHub user yanboliang opened a pull request:

https://github.com/apache/spark/pull/15413

[SPARK-17847] [ML] Copy GaussianMixture implementation from mllib to ml

## What changes were proposed in this pull request?
Copy the ```GaussianMixture``` implementation from mllib to ml so that we can add new features to it.
I left the mllib ```GaussianMixture``` untouched, rather than wrapping the ml implementation as some other algorithms do, for the following reasons:
* mllib ```GaussianMixture``` allows k == 1, but ml does not.
* mllib ```GaussianMixture``` supports setting an initial model, which ml does not support yet. (We will definitely add this feature for ml in the future.)

Meanwhile, we made some improvements to handle sparse data more efficiently, using ```ml.linalg``` as the underlying data structure rather than the old breeze dense vector.

Todo:
 - [ ] Performance test.

## How was this patch tested?
Existing tests and added new tests.
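
For readers less familiar with the ml side, a minimal usage sketch of the class this PR reimplements (toy data; standard `ml.clustering` API):

```scala
import org.apache.spark.ml.clustering.GaussianMixture
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("gmm").getOrCreate()
import spark.implicits._

// Two obvious clusters; "features" is the default features column name.
val dataset = Seq(
  Vectors.dense(0.0, 0.1),
  Vectors.dense(0.2, 0.0),
  Vectors.dense(9.0, 9.1),
  Vectors.dense(9.2, 9.0)
).map(Tuple1.apply).toDF("features")

val model = new GaussianMixture().setK(2).setMaxIter(50).fit(dataset)
model.gaussians.foreach(println) // per-component mean and covariance
```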



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/yanboliang/spark spark-17847

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15413.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15413


commit 5a8de4a7289700d20e240dcf82b61552c213dcf8
Author: Yanbo Liang 
Date:   2016-10-10T05:00:53Z

Copy GaussianMixture implementation from mllib to ml







[GitHub] spark pull request #15346: [SPARK-17741][SQL] Grammar to parse top level and...

2016-10-09 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/15346





[GitHub] spark issue #15346: [SPARK-17741][SQL] Grammar to parse top level and nested...

2016-10-09 Thread hvanhovell
Github user hvanhovell commented on the issue:

https://github.com/apache/spark/pull/15346
  
LGTM - merging to master. Thanks!





[GitHub] spark pull request #15389: [SPARK-17817][PySpark] PySpark RDD Repartitioning...

2016-10-09 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/15389#discussion_r82544055
  
--- Diff: python/pyspark/rdd.py ---
@@ -2029,7 +2028,11 @@ def coalesce(self, numPartitions, shuffle=False):
 >>> sc.parallelize([1, 2, 3, 4, 5], 3).coalesce(1).glom().collect()
 [[1, 2, 3, 4, 5]]
 """
-jrdd = self._jrdd.coalesce(numPartitions, shuffle)
+if shuffle:
+data_java_rdd = self._to_java_object_rdd().coalesce(numPartitions, shuffle)
--- End diff --

It would be great to add an inline comment explaining why this is necessary; otherwise somebody could come along 6 months from now and change this back to `jrdd = self._jrdd.coalesce(numPartitions, shuffle)`.





[GitHub] spark pull request #15403: [SPARK-17832][SQL] TableIdentifier.quotedString c...

2016-10-09 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/15403





[GitHub] spark issue #15403: [SPARK-17832][SQL] TableIdentifier.quotedString creates ...

2016-10-09 Thread hvanhovell
Github user hvanhovell commented on the issue:

https://github.com/apache/spark/pull/15403
  
LGTM - merging to master/2.0. Thanks!





[GitHub] spark issue #12775: [SPARK-14958][Core] Failed task not handled when there's...

2016-10-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/12775
  
**[Test build #66624 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66624/consoleFull)** for PR 12775 at commit [`699730b`](https://github.com/apache/spark/commit/699730b592e8d913e728e0097e140c710c201dce).





[GitHub] spark issue #12775: [SPARK-14958][Core] Failed task not handled when there's...

2016-10-09 Thread lirui-intel
Github user lirui-intel commented on the issue:

https://github.com/apache/spark/pull/12775
  
Thanks for the review. Updated the patch to address the comments.





[GitHub] spark issue #14426: [SPARK-16475][SQL] Broadcast Hint for SQL Queries

2016-10-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14426
  
Merged build finished. Test PASSed.





[GitHub] spark issue #14426: [SPARK-16475][SQL] Broadcast Hint for SQL Queries

2016-10-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14426
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66619/
Test PASSed.





[GitHub] spark issue #14426: [SPARK-16475][SQL] Broadcast Hint for SQL Queries

2016-10-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14426
  
**[Test build #66619 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66619/consoleFull)** for PR 14426 at commit [`57adfd3`](https://github.com/apache/spark/commit/57adfd33b84bee03c9f0a302d9981f226437c2e3).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `case class Hint(name: String, parameters: Seq[String], child: LogicalPlan) extends UnaryNode `





[GitHub] spark issue #15397: [SPARK-17834][SQL]Fetch the earliest offsets manually in...

2016-10-09 Thread zsxwing
Github user zsxwing commented on the issue:

https://github.com/apache/spark/pull/15397
  
> How is this going to work with assign? It seems like it's just avoiding 
the problem, not fixing it.

We can seek to the offsets provided by the user.





[GitHub] spark issue #15387: [SPARK-17782][STREAMING][KAFKA] eliminate race condition...

2016-10-09 Thread zsxwing
Github user zsxwing commented on the issue:

https://github.com/apache/spark/pull/15387
  
> During the original implementation I had verified that calling pause 
kills the internal message buffer, which is one of the complications leading to 
a cached consumer per partition.

I observed the same behavior while debugging. I found that the first `poll(0)` always sends a request to prefetch data; pausing partitions just prevents the second `poll(0)` from returning anything here:
https://github.com/apache/kafka/blob/0.10.0.1/clients/src/main/java/org/apache/kafka/clients/consumer/internals/Fetcher.java#L527

> You don't want poll consuming messages, it's not just about offset
correctness, the driver shouldn't be spending time or bandwidth doing that.

I think you have agreed that this is impossible via the current KafkaConsumer APIs as well.

However, what is still unknown to me is whether the first `poll(0)` can itself return data. I saw that the first `poll(0)` always sends the fetch request, but I'm not sure whether the response can also be processed within that same first `poll(0)`. If that can happen, pausing partitions will not help in that case, since pausing is done after the first `poll(0)`. In addition, since the javadoc is unclear on this point, the behavior could change in the future. That's why I decided to manually seek to the beginning in #15397.
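
A minimal sketch of the seek-based approach referenced here, against the plain Kafka 0.10 consumer API (topic, partition, and configs are illustrative):

```scala
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092") // illustrative
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("group.id", "demo")

val consumer = new KafkaConsumer[String, String](props)
val tp = new TopicPartition("topic", 0)
consumer.assign(Collections.singletonList(tp))

// poll(0) updates metadata but may also prefetch data (see the discussion
// above); seeking afterwards makes the starting offset deterministic
// regardless of what that first poll did.
consumer.poll(0)
consumer.seekToBeginning(Collections.singletonList(tp))
val earliestOffset = consumer.position(tp)
```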







[GitHub] spark issue #15376: [SPARK-17796][SQL] Support wildcard character in filenam...

2016-10-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15376
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66618/
Test PASSed.





[GitHub] spark issue #15376: [SPARK-17796][SQL] Support wildcard character in filenam...

2016-10-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15376
  
Merged build finished. Test PASSed.





[GitHub] spark issue #15376: [SPARK-17796][SQL] Support wildcard character in filenam...

2016-10-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15376
  
**[Test build #66618 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66618/consoleFull)** for PR 15376 at commit [`f328f3a`](https://github.com/apache/spark/commit/f328f3a2c0936555226a7c381625d3d5b8127302).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #15412: [SPARK-17844] Simplify DataFrame API for defining frame ...

2016-10-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15412
  
**[Test build #66623 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66623/consoleFull)** for PR 15412 at commit [`e141868`](https://github.com/apache/spark/commit/e14186836a6aecbc58839edffa57213869271b91).





[GitHub] spark issue #15412: [SPARK-17844] Simplify DataFrame API for defining frame ...

2016-10-09 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/15412
  
Sure, I can fix those in this pull request too. Thanks for the reminder.






[GitHub] spark issue #15412: [SPARK-17844] Simplify DataFrame API for defining frame ...

2016-10-09 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/15412
  
Hi @rxin, I just happened to look at this PR and want to leave a gentle reminder: there are [SPARK-17656](https://issues.apache.org/jira/browse/SPARK-17656) and two more cases in `./sql/core/src/main/scala/org/apache/spark/sql/expressions/udaf.scala`. (This may not be directly relevant to this PR, but the changes here rang a bell, so I wanted to let you know just in case.)





[GitHub] spark issue #15399: [SPARK-17819][SQL] Support default database in connectio...

2016-10-09 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/15399
  
First, I am not familiar with the code in this component, so I am not the right person to review it.

Second, while going over the pending JIRA list, I found many bugs reported against the Thrift Server.

Third, I do not know what the current strategy is for supporting `beeline` and `spark-sql`. If this is a focus area, I would expect at least 50+ bug fixes in it.





[GitHub] spark issue #15396: [SPARK-14804][Spark][Graphx] Fix checkpointing of Vertex...

2016-10-09 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/15396
  
Can you reference the earlier pull requests here too?






[GitHub] spark issue #15314: [SPARK-17747][ML] WeightCol support non-double datatypes

2016-10-09 Thread sethah
Github user sethah commented on the issue:

https://github.com/apache/spark/pull/15314
  
Maybe we can solve the label column issue first? Would you mind opening a 
new Jira/PR? I'm happy to hear other opinions as well :)





[GitHub] spark issue #15292: [SPARK-17719][SPARK-17776][SQL] Unify and tie up options...

2016-10-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15292
  
**[Test build #66622 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66622/consoleFull)** for PR 15292 at commit [`350f55d`](https://github.com/apache/spark/commit/350f55da303c6ccc876a4f6d5a1e455dd3337343).





[GitHub] spark issue #15408: [SPARK-17839][CORE] UnsafeSorterSpillReader should use N...

2016-10-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15408
  
**[Test build #66621 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66621/consoleFull)** for PR 15408 at commit [`856593a`](https://github.com/apache/spark/commit/856593ac4d54c4981f79b7a4b09c94cc66b5c63b).





[GitHub] spark issue #15408: [SPARK-17839][CORE] UnsafeSorterSpillReader should use N...

2016-10-09 Thread sitalkedia
Github user sitalkedia commented on the issue:

https://github.com/apache/spark/pull/15408
  
>> Can you also expand on what the 7% means? Is it some workload end-to-end 
that's been improved by 7%, or the sorting itself improves by 7%?

The perf improvement was end-to-end, which means the sorting improvement itself is definitely more than that.





[GitHub] spark pull request #15408: [SPARK-17839][CORE] UnsafeSorterSpillReader shoul...

2016-10-09 Thread sitalkedia
Github user sitalkedia commented on a diff in the pull request:

https://github.com/apache/spark/pull/15408#discussion_r82541750
  
--- Diff: core/src/main/java/org/apache/spark/io/NioBasedBufferedFileInputStream.java ---
@@ -0,0 +1,91 @@
+/*
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.io;
+
+import java.io.File;
+import java.io.FileInputStream;
+import java.io.IOException;
+import java.io.InputStream;
+import java.nio.ByteBuffer;
+import java.nio.channels.FileChannel;
+import java.nio.file.StandardOpenOption;
+
+/**
+ * {@link InputStream} implementation which uses direct buffer
+ * to read a file to avoid extra copy of data between Java and
+ * native memory which happens when using {@link java.io.BufferedInputStream}.
+ * Unfortunately, this is not something already available in JDK,
+ * {@link sun.nio.ch.ChannelInputStream} supports reading a file using nio,
+ * but does not support buffering.
+ *
+ */
+public final class NioBasedBufferedFileInputStream extends InputStream {
--- End diff --

Added a test suite for this.





[GitHub] spark pull request #15408: [SPARK-17839][CORE] UnsafeSorterSpillReader shoul...

2016-10-09 Thread sitalkedia
Github user sitalkedia commented on a diff in the pull request:

https://github.com/apache/spark/pull/15408#discussion_r82541747
  
--- Diff: core/src/main/java/org/apache/spark/io/NioBasedBufferedFileInputStream.java ---
@@ -0,0 +1,91 @@
+/*
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.io;
+
+import java.io.File;
+import java.io.FileInputStream;
+import java.io.IOException;
+import java.io.InputStream;
+import java.nio.ByteBuffer;
+import java.nio.channels.FileChannel;
+import java.nio.file.StandardOpenOption;
+
+/**
+ * {@link InputStream} implementation which uses direct buffer
+ * to read a file to avoid extra copy of data between Java and
+ * native memory which happens when using {@link java.io.BufferedInputStream}.
+ * Unfortunately, this is not something already available in JDK,
+ * {@link sun.nio.ch.ChannelInputStream} supports reading a file using nio,
+ * but does not support buffering.
+ *
+ */
+public final class NioBasedBufferedFileInputStream extends InputStream {
+
+  private static int DEFAULT_BUFFER_SIZE = 8192;
+
+  private final ByteBuffer bb;
+
+  private final FileChannel ch;
+
+  public NioBasedBufferedFileInputStream(File file, int bufferSize) throws IOException {
+bb = ByteBuffer.allocateDirect(bufferSize);
+ch = FileChannel.open(file.toPath(), StandardOpenOption.READ);
+ch.read(bb);
+bb.flip();
--- End diff --

removed, thanks!
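
For context, a hedged sketch (not the PR's code) of the lazy read path such a stream needs once the eager read/flip is dropped from the constructor:

```scala
import java.io.File
import java.nio.ByteBuffer
import java.nio.channels.FileChannel
import java.nio.file.StandardOpenOption

// Illustrative Scala rendering: refill the direct buffer lazily on read()
// instead of eagerly in the constructor.
class NioBufferedReadSketch(file: File, bufferSize: Int) {
  private val bb = ByteBuffer.allocateDirect(bufferSize)
  private val ch = FileChannel.open(file.toPath, StandardOpenOption.READ)
  bb.flip() // start empty: position == limit forces a refill on first read

  private def refill(): Boolean = {
    if (!bb.hasRemaining) {
      bb.clear()
      val n = ch.read(bb)
      bb.flip()
      n > 0
    } else true
  }

  def read(): Int = if (!refill()) -1 else bb.get() & 0xFF
}
```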





[GitHub] spark issue #15314: [SPARK-17747][ML] WeightCol support non-double datatypes

2016-10-09 Thread zhengruifeng
Github user zhengruifeng commented on the issue:

https://github.com/apache/spark/pull/15314
  
@sethah OK, I will revert this PR to focus only on: 1) adding a test for WeightCol in MLTestingUtils.checkNumericTypes; 2) adding a cast for WeightCol in each algo; 3) adding a cast in `getNumClasses` to avoid test failures.
What about this?





[GitHub] spark pull request #15292: [SPARK-17719][SPARK-17776][SQL] Unify and tie up ...

2016-10-09 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/15292#discussion_r82541651
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala ---
@@ -17,47 +17,132 @@
 
 package org.apache.spark.sql.execution.datasources.jdbc
 
+import java.sql.{Connection, DriverManager}
+import java.util.Properties
+
 /**
  * Options for the JDBC data source.
  */
 class JDBCOptions(
 @transient private val parameters: Map[String, String])
   extends Serializable {
 
+  import JDBCOptions._
+
+  def this(url: String, table: String, parameters: Map[String, String]) = {
+this(parameters ++ Map(
+  JDBCOptions.JDBC_URL -> url,
+  JDBCOptions.JDBC_TABLE_NAME -> table))
+  }
+
+  val asConnectionProperties: Properties = {
+val properties = new Properties()
+// We should avoid to pass the options into properties. See SPARK-17776.
+parameters.filterKeys(!jdbcOptionNames.contains(_))
+  .foreach { case (k, v) => properties.setProperty(k, v) }
+properties
+  }
+
   // 
   // Required parameters
   // 
-  require(parameters.isDefinedAt("url"), "Option 'url' is required.")
-  require(parameters.isDefinedAt("dbtable"), "Option 'dbtable' is 
required.")
+  require(parameters.isDefinedAt(JDBC_URL), s"Option '$JDBC_URL' is 
required.")
+  require(parameters.isDefinedAt(JDBC_TABLE_NAME), s"Option 
'$JDBC_TABLE_NAME' is required.")
   // a JDBC URL
-  val url = parameters("url")
+  val url = parameters(JDBC_URL)
   // name of table
-  val table = parameters("dbtable")
+  val table = parameters(JDBC_TABLE_NAME)
+
+  // 
+  // Optional parameters
+  // 
+  val driverClass = {
+val userSpecifiedDriverClass = parameters.get(JDBC_DRIVER_CLASS)
+userSpecifiedDriverClass.foreach(DriverRegistry.register)
+
+// Performing this part of the logic on the driver guards against the 
corner-case where the
+// driver returned for a URL is different on the driver and executors 
due to classpath
+// differences.
+userSpecifiedDriverClass.getOrElse {
+  DriverManager.getDriver(url).getClass.getCanonicalName
+}
+  }
 
   // 
-  // Optional parameter list
+  // Optional parameters only for reading
   // 
   // the column used to partition
-  val partitionColumn = parameters.getOrElse("partitionColumn", null)
+  val partitionColumn = parameters.getOrElse(JDBC_PARTITION_COLUMN, null)
   // the lower bound of partition column
-  val lowerBound = parameters.getOrElse("lowerBound", null)
+  val lowerBound = parameters.getOrElse(JDBC_LOWER_BOUND, null)
   // the upper bound of the partition column
-  val upperBound = parameters.getOrElse("upperBound", null)
+  val upperBound = parameters.getOrElse(JDBC_UPPER_BOUND, null)
   // the number of partitions
-  val numPartitions = parameters.getOrElse("numPartitions", null)
-
+  val numPartitions = parameters.getOrElse(JDBC_NUM_PARTITIONS, null)
   require(partitionColumn == null ||
 (lowerBound != null && upperBound != null && numPartitions != null),
-"If 'partitionColumn' is specified then 'lowerBound', 'upperBound'," +
-  " and 'numPartitions' are required.")
+s"If '$JDBC_PARTITION_COLUMN' is specified then '$JDBC_LOWER_BOUND', 
'$JDBC_UPPER_BOUND'," +
+  s" and '$JDBC_NUM_PARTITIONS' are required.")
+  val fetchSize = {
+val size = parameters.getOrElse(JDBC_BATCH_FETCH_SIZE, "0").toInt
+require(size >= 0,
+  s"Invalid value `${size.toString}` for parameter " +
+s"`$JDBC_BATCH_FETCH_SIZE`. The minimum value is 0. When the value 
is 0, " +
+"the JDBC driver ignores the value and does the estimates.")
+size
+  }
 
   // 
-  // The options for DataFrameWriter
+  // Optional parameters only for writing
   // 
   // if to truncate the table from the JDBC database
-  val isTruncate = parameters.getOrElse("truncate", "false").toBoolean
+  val isTruncate = parameters.getOrElse(JDBC_TRUNCATE, "false").toBoolean
   // the create table option , which can be table_options or 
partition_options.
   // E.g., "CREATE TABLE t (name string) 
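For context, here is a hedged usage sketch (not part of the patch) showing how the option keys discussed in this diff surface through the public DataFrameReader API; the JDBC URL, table name, and values are placeholders, and `spark` is assumed to be an existing SparkSession:

```scala
// Sketch: a JDBC read exercising the options named above.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost/test")  // required
  .option("dbtable", "public.events")                 // required
  // Read-side partitioning: these four options must be set together.
  .option("partitionColumn", "id")
  .option("lowerBound", "0")
  .option("upperBound", "1000000")
  .option("numPartitions", "8")
  .option("fetchsize", "1000")  // 0 (the default) lets the driver estimate
  .load()
```

Any key that is not a recognized option is forwarded to the driver as a plain connection property, which is what the `asConnectionProperties` filtering above implements.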

[GitHub] spark issue #15412: [SPARK-17844] Simplify DataFrame API for defining frame ...

2016-10-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15412
  
Merged build finished. Test PASSed.





[GitHub] spark issue #15412: [SPARK-17844] Simplify DataFrame API for defining frame ...

2016-10-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15412
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66616/
Test PASSed.





[GitHub] spark issue #15412: [SPARK-17844] Simplify DataFrame API for defining frame ...

2016-10-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15412
  
**[Test build #66616 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66616/consoleFull)**
 for PR 15412 at commit 
[`4d02864`](https://github.com/apache/spark/commit/4d02864d2b023bec501578de86b68478feae05c6).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #15314: [SPARK-17747][ML] WeightCol support non-double datatypes

2016-10-09 Thread sethah
Github user sethah commented on the issue:

https://github.com/apache/spark/pull/15314
  
I strongly prefer to move the issue with the label column into its own JIRA/PR. They are different changes, and I think the label column issues are large enough to warrant separate consideration.





[GitHub] spark issue #15412: [SPARK-17844] Simplify DataFrame API for defining frame ...

2016-10-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15412
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66615/
Test FAILed.





[GitHub] spark issue #15412: [SPARK-17844] Simplify DataFrame API for defining frame ...

2016-10-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15412
  
Merged build finished. Test FAILed.





[GitHub] spark issue #15412: [SPARK-17844] Simplify DataFrame API for defining frame ...

2016-10-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15412
  
**[Test build #66615 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66615/consoleFull)**
 for PR 15412 at commit 
[`98b77a7`](https://github.com/apache/spark/commit/98b77a7c660e0064353b1fa98e2e47bc2d971bea).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #15388: [SPARK-17821][SQL] Support And and Or in Expression Cano...

2016-10-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15388
  
Merged build finished. Test PASSed.





[GitHub] spark issue #15388: [SPARK-17821][SQL] Support And and Or in Expression Cano...

2016-10-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15388
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66614/
Test PASSed.





[GitHub] spark issue #15388: [SPARK-17821][SQL] Support And and Or in Expression Cano...

2016-10-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15388
  
**[Test build #66614 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66614/consoleFull)**
 for PR 15388 at commit 
[`25f5d4d`](https://github.com/apache/spark/commit/25f5d4d068509d93630d56db2155f11cc2a9b301).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #15399: [SPARK-17819][SQL] Support default database in connectio...

2016-10-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15399
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66617/
Test PASSed.





[GitHub] spark issue #15399: [SPARK-17819][SQL] Support default database in connectio...

2016-10-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15399
  
Merged build finished. Test PASSed.





[GitHub] spark issue #15399: [SPARK-17819][SQL] Support default database in connectio...

2016-10-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15399
  
**[Test build #66617 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66617/consoleFull)**
 for PR 15399 at commit 
[`d027421`](https://github.com/apache/spark/commit/d027421d0396f971976b18ef2d44ddda97dd5810).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14897: [SPARK-17338][SQL] add global temp view

2016-10-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14897
  
**[Test build #66620 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66620/consoleFull)**
 for PR 14897 at commit 
[`29e292a`](https://github.com/apache/spark/commit/29e292a954f1b07d80d03d0fd6c4ad4605b41ab7).





[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-09 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/15148#discussion_r82539368
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala ---
@@ -0,0 +1,339 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import scala.util.Random
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.linalg.{Vector, VectorUDT}
+import org.apache.spark.ml.param.{IntParam, ParamMap, ParamValidators}
+import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
+import org.apache.spark.ml.util.SchemaUtils
+import org.apache.spark.sql._
+import org.apache.spark.sql.expressions.UserDefinedFunction
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+
+/**
+ * Params for [[LSH]].
+ */
+@Experimental
+@Since("2.1.0")
+private[ml] trait LSHParams extends HasInputCol with HasOutputCol {
+  /**
+   * Param for the dimension of LSH OR-amplification.
+   *
+   * In this implementation, we use LSH OR-amplification to reduce the false negative rate. The
+   * higher the dimension is, the lower the false negative rate.
+   * @group param
+   */
+  @Since("2.1.0")
+  final val outputDim: IntParam = new IntParam(this, "outputDim", "output dimension, where" +
+    "increasing dimensionality lowers the false negative rate, and decreasing dimensionality" +
--- End diff --

No. Since we are implementing OR-amplification, increasing dimensionality lowers the false negative rate.

In AND-amplification, increasing dimensionality would lower the false positive rate.
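
For readers who want the arithmetic behind this (standard LSH amplification identities, not text from the patch): let p be the probability that one hash function collides on a similar pair, q the corresponding probability for a dissimilar pair, and d the number of independent hash functions.

```latex
% OR-amplification: a pair is a candidate if ANY of the d hashes collide.
% AND-amplification: a pair is a candidate only if ALL d hashes collide.
\mathrm{FN}_{\mathrm{OR}}  = (1-p)^d, \qquad \mathrm{FP}_{\mathrm{OR}}  = 1-(1-q)^d
\mathrm{FN}_{\mathrm{AND}} = 1-p^d,   \qquad \mathrm{FP}_{\mathrm{AND}} = q^d
```

So with OR-amplification, growing the dimension d drives the false negative rate down (at the cost of more false positives), which is exactly the trade-off the param doc above describes.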






[GitHub] spark issue #14897: [SPARK-17338][SQL] add global temp view

2016-10-09 Thread yhuai
Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/14897
  
LGTM. Let's make a small change according to https://github.com/apache/spark/pull/14897#discussion_r82536096 and we can merge this PR. Thanks!





[GitHub] spark issue #15399: [SPARK-17819][SQL] Support default database in connectio...

2016-10-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15399
  
**[Test build #66617 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66617/consoleFull)**
 for PR 15399 at commit 
[`d027421`](https://github.com/apache/spark/commit/d027421d0396f971976b18ef2d44ddda97dd5810).





[GitHub] spark issue #15376: [SPARK-17796][SQL] Support wildcard character in filenam...

2016-10-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15376
  
**[Test build #66618 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66618/consoleFull)**
 for PR 15376 at commit 
[`f328f3a`](https://github.com/apache/spark/commit/f328f3a2c0936555226a7c381625d3d5b8127302).





[GitHub] spark issue #14426: [SPARK-16475][SQL] Broadcast Hint for SQL Queries

2016-10-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14426
  
**[Test build #66619 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66619/consoleFull)**
 for PR 14426 at commit 
[`57adfd3`](https://github.com/apache/spark/commit/57adfd33b84bee03c9f0a302d9981f226437c2e3).





[GitHub] spark pull request #14527: [SPARK-16938][SQL] `drop/dropDuplicate` should ha...

2016-10-09 Thread dongjoon-hyun
Github user dongjoon-hyun closed the pull request at:

https://github.com/apache/spark/pull/14527





[GitHub] spark pull request #14897: [SPARK-17338][SQL] add global temp view

2016-10-09 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/14897#discussion_r82538273
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/GlobalTempViewSuite.scala ---
@@ -0,0 +1,168 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution
+
+import org.apache.spark.sql.{AnalysisException, QueryTest, Row}
+import org.apache.spark.sql.catalog.Table
+import org.apache.spark.sql.catalyst.TableIdentifier
+import org.apache.spark.sql.catalyst.analysis.NoSuchTableException
+import org.apache.spark.sql.test.SharedSQLContext
+import org.apache.spark.sql.types.StructType
+
+class GlobalTempViewSuite extends QueryTest with SharedSQLContext {
+  import testImplicits._
+
+  override protected def beforeAll(): Unit = {
+    super.beforeAll()
+    globalTempDB = spark.sharedState.globalTempViewManager.database
+  }
+
+  private var globalTempDB: String = _
+
+  test("basic semantic") {
+    sql("CREATE GLOBAL TEMP VIEW src AS SELECT 1, 'a'")
+
+    // If there is no database in table name, we should try local temp view first, if not found,
+    // try table/view in current database, which is "default" in this case. So we expect
+    // NoSuchTableException here.
+    intercept[NoSuchTableException](spark.table("src"))
+
+    // Use qualified name to refer to the global temp view explicitly.
+    checkAnswer(spark.table(s"$globalTempDB.src"), Row(1, "a"))
+
+    // Table name without database will never refer to a global temp view.
+    intercept[NoSuchTableException](sql("DROP VIEW src"))
+
+    sql(s"DROP VIEW $globalTempDB.src")
+    // The global temp view should be dropped successfully.
+    intercept[NoSuchTableException](spark.table(s"$globalTempDB.src"))
+
+    // We can also use Dataset API to create global temp view
+    Seq(1 -> "a").toDF("i", "j").createGlobalTempView("src")
+    checkAnswer(spark.table(s"$globalTempDB.src"), Row(1, "a"))
+
+    // Use qualified name to rename a global temp view.
+    sql(s"ALTER VIEW $globalTempDB.src RENAME TO src2")
--- End diff --

i see. Thanks for the explanation!





[GitHub] spark pull request #15292: [SPARK-17719][SPARK-17776][SQL] Unify and tie up ...

2016-10-09 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/15292#discussion_r82538242
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala ---

[GitHub] spark issue #15219: [SPARK-14098][SQL] Generate Java code to build CachedCol...

2016-10-09 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/15219
  
I see.
@davies, would it be possible to review this?





[GitHub] spark pull request #15408: [SPARK-17839][CORE] UnsafeSorterSpillReader shoul...

2016-10-09 Thread witgo
Github user witgo commented on a diff in the pull request:

https://github.com/apache/spark/pull/15408#discussion_r82537791
  
--- Diff: core/src/main/java/org/apache/spark/io/NioBasedBufferedFileInputStream.java ---
@@ -0,0 +1,91 @@
+/*
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.io;
+
+import java.io.File;
+import java.io.FileInputStream;
+import java.io.IOException;
+import java.io.InputStream;
+import java.nio.ByteBuffer;
+import java.nio.channels.FileChannel;
+import java.nio.file.StandardOpenOption;
+
+/**
+ * {@link InputStream} implementation which uses direct buffer
+ * to read a file to avoid extra copy of data between Java and
+ * native memory which happens when using {@link java.io.BufferedInputStream}.
+ * Unfortunately, this is not something already available in JDK,
+ * {@link sun.nio.ch.ChannelInputStream} supports reading a file using nio,
+ * but does not support buffering.
+ *
+ */
+public final class NioBasedBufferedFileInputStream extends InputStream {
+
+  private static int DEFAULT_BUFFER_SIZE = 8192;
+
+  private final ByteBuffer bb;
+
+  private final FileChannel ch;
+
+  public NioBasedBufferedFileInputStream(File file, int bufferSize) throws IOException {
+    bb = ByteBuffer.allocateDirect(bufferSize);
+    ch = FileChannel.open(file.toPath(), StandardOpenOption.READ);
+    ch.read(bb);
+    bb.flip();
--- End diff --

The `ch.read(bb);` call in the constructor can be removed.





[GitHub] spark pull request #15292: [SPARK-17719][SPARK-17776][SQL] Unify and tie up ...

2016-10-09 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/15292#discussion_r82537727
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala ---

[GitHub] spark pull request #15408: [SPARK-17839][CORE] UnsafeSorterSpillReader shoul...

2016-10-09 Thread witgo
Github user witgo commented on a diff in the pull request:

https://github.com/apache/spark/pull/15408#discussion_r82537694
  
--- Diff: core/src/main/java/org/apache/spark/io/NioBasedBufferedFileInputStream.java ---
@@ -0,0 +1,77 @@
+/*
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.io;
+
+import java.io.File;
+import java.io.FileInputStream;
+import java.io.IOException;
+import java.io.InputStream;
+import java.nio.ByteBuffer;
+import java.nio.channels.FileChannel;
+
+/**
+ * {@link InputStream} implementation which uses direct buffer
+ * to read a file to avoid extra copy of data between Java and
+ * native memory which happens when using {@link java.io.BufferedInputStream}
+ *
+ */
+public final class NioBasedBufferedFileInputStream extends InputStream {
+
+  ByteBuffer bb;
+
+  FileChannel ch;
+
+  public NioBasedBufferedFileInputStream(File file, int bufferSize) throws IOException {
+    bb = ByteBuffer.allocateDirect(bufferSize);
+    FileInputStream f = new FileInputStream(file);
+    ch = f.getChannel();
+    ch.read(bb);
+    bb.flip();
+  }
+
+  public boolean refill() throws IOException {
+    if (!bb.hasRemaining()) {
+      bb.clear();
+      int nRead = ch.read(bb);
+      if (nRead == -1) {
+        return false;
+      }
+      bb.flip();
+    }
+    return true;
+  }
+
+  @Override
+  public int read() throws IOException {
+    if (!refill()) {
+      return -1;
+    }
+    return bb.get();
+  }
+
+  @Override
+  public int read(byte[] b, int off, int len) throws IOException {
+    if (!refill()) {
+      return -1;
+    }
+    len = Math.min(len, bb.remaining());
+    bb.get(b, off, len);
+    return len;
+  }
+
+  @Override
+  public void close() throws IOException {
+    ch.close();
+  }
+}
--- End diff --

`skip()` in `InputStream` will call `read()`; this is not the optimal solution.
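
For concreteness, a hypothetical sketch of the seek-based `skip()` being suggested, written here in Scala (field names mirror the quoted Java code; this is not the patch itself): consume whatever is already buffered, then move the channel position directly instead of reading and discarding bytes.

```scala
import java.io.{File, InputStream}
import java.nio.ByteBuffer
import java.nio.channels.FileChannel
import java.nio.file.StandardOpenOption

// Minimal buffered NIO stream with a seek-based skip(), mirroring the class
// under review. Names (bb, ch, refill) follow the quoted Java code.
class BufferedNioFileInputStream(file: File, bufferSize: Int = 8192) extends InputStream {
  private val bb = ByteBuffer.allocateDirect(bufferSize)
  private val ch = FileChannel.open(file.toPath, StandardOpenOption.READ)
  bb.flip() // start with an empty (fully consumed) buffer

  private def refill(): Boolean = {
    if (!bb.hasRemaining) {
      bb.clear()
      if (ch.read(bb) == -1) return false
      bb.flip()
    }
    true
  }

  override def read(): Int =
    if (!refill()) -1 else bb.get() & 0xff // mask so bytes >= 0x80 stay unsigned

  override def skip(n: Long): Long = {
    if (n <= 0) return 0L
    if (bb.remaining() >= n) {
      // The whole skip fits inside the data that is already buffered.
      bb.position(bb.position() + n.toInt)
      n
    } else {
      // Drain the buffer, then seek the channel instead of copying bytes.
      val buffered = bb.remaining().toLong
      bb.position(bb.limit())
      val target = math.min(ch.position() + (n - buffered), ch.size())
      val seeked = target - ch.position()
      ch.position(target)
      buffered + seeked
    }
  }

  override def close(): Unit = ch.close()
}
```

Seeking keeps `skip(n)` constant-time regardless of n, whereas the default `InputStream.skip` loops over `read()` into a scratch buffer. The sketch also masks `bb.get()` with `0xff`, since `read()` must return an unsigned byte value.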





[GitHub] spark issue #15412: [SPARK-17844] Simplify DataFrame API for defining frame ...

2016-10-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15412
  
**[Test build #66616 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66616/consoleFull)**
 for PR 15412 at commit 
[`4d02864`](https://github.com/apache/spark/commit/4d02864d2b023bec501578de86b68478feae05c6).





[GitHub] spark issue #15409: [Spark-14761][SQL] Reject invalid join methods when join...

2016-10-09 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/15409
  
Oh well, the test cases have issues. You can run them with `python/run-tests --module pyspark-sql`.





[GitHub] spark issue #15406: [Spark-17745][ml][PySpark] update NB python api

2016-10-09 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/15406
  
@sethah OK, and I'm checking whether there is anything else that needs updating...





[GitHub] spark issue #15411: Updated master url

2016-10-09 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/15411
  
LGTM - can you clean up the PR description to remove the messages from the template?





[GitHub] spark issue #15412: [SPARK-17844] Simplify DataFrame API for defining frame ...

2016-10-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15412
  
**[Test build #66615 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66615/consoleFull)**
 for PR 15412 at commit 
[`98b77a7`](https://github.com/apache/spark/commit/98b77a7c660e0064353b1fa98e2e47bc2d971bea).





[GitHub] spark pull request #15412: [SPARK-17844] Simplify DataFrame API for defining...

2016-10-09 Thread rxin
GitHub user rxin opened a pull request:

https://github.com/apache/spark/pull/15412

[SPARK-17844] Simplify DataFrame API for defining frame boundaries in 
window functions

## What changes were proposed in this pull request?
When I was creating the example code for SPARK-10496, I realized it was 
pretty convoluted to define the frame boundaries for window functions when 
there is no partition column or ordering column. The reason is that we don't 
provide a way to create a WindowSpec directly with the frame boundaries. We can 
trivially improve this by adding rowsBetween and rangeBetween to Window object.

As an example, to compute a cumulative sum, before this PR:
```
df.select('key, 
sum("value").over(Window.partitionBy(lit(1)).rowsBetween(Long.MinValue, 0)))
```

After this PR:
```
df.select('key, sum("value").over(Window.rowsBetween(Long.MinValue, 0)))
```
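
For reference, a minimal sketch (assumed, not quoted from the patch) of how such factory methods can be added to the `Window` object: build a `WindowSpec` with no partitioning and no ordering, then delegate to its existing instance methods.

```scala
// Sketch of the proposed factories. WindowSpec's constructor is internal to
// org.apache.spark.sql.expressions, so this would live in that package;
// UnspecifiedFrame comes from the Catalyst window frame definitions.
object Window {
  def rowsBetween(start: Long, end: Long): WindowSpec =
    spec.rowsBetween(start, end)

  def rangeBetween(start: Long, end: Long): WindowSpec =
    spec.rangeBetween(start, end)

  private[sql] def spec: WindowSpec =
    new WindowSpec(Seq.empty, Seq.empty, UnspecifiedFrame)
}
```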

## How was this patch tested?
Added test cases to compute cumulative sum in DataFrameWindowSuite for 
Scala/Java and tests.py for Python.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rxin/spark SPARK-17844

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15412.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15412


commit 98b77a7c660e0064353b1fa98e2e47bc2d971bea
Author: Reynold Xin 
Date:   2016-10-10T01:15:15Z

[SPARK-17844] Simplify DataFrame API for defining frame boundaries in 
window functions







[GitHub] spark issue #15411: Updated master url

2016-10-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15411
  
Can one of the admins verify this patch?





[GitHub] spark pull request #15411: Updated master url

2016-10-09 Thread getintouchapp
GitHub user getintouchapp opened a pull request:

https://github.com/apache/spark/pull/15411

Updated master url

## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)
This is the Spark Scala example, which was missing a master URL in its SparkSession.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration 
tests, manual tests)

Unit tested. Changes affect examples and documentation only

(If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)


The master URL needs to be set on the SparkSession for the example to run.
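
A hedged sketch of the fix being described (`local[*]` is an illustrative master URL, not necessarily the one used in the patch):

```scala
import org.apache.spark.sql.SparkSession

// Sketch: the example's SparkSession with an explicit master URL set.
val spark = SparkSession.builder()
  .appName("Example")
  .master("local[*]")  // required when the example runs outside spark-submit
  .getOrCreate()
```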

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/getintouchapp/spark patch-1

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15411.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15411


commit 532403476678a8161d18a30ef12b21bffb4d5f92
Author: Ganesh Krishnan 
Date:   2016-10-10T01:13:40Z

Updated master url

Need to set master url to SparkSession for the example to run







[GitHub] spark issue #15409: [Spark-14761][SQL] Reject invalid join methods when join...

2016-10-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15409
  
**[Test build #3302 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3302/consoleFull)**
 for PR 15409 at commit 
[`cec8ec4`](https://github.com/apache/spark/commit/cec8ec48de5f51f40ff4b929da0c0496fcc0a662).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-09 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/15148#discussion_r82536925
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala ---
@@ -0,0 +1,339 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import scala.util.Random
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.linalg.{Vector, VectorUDT}
+import org.apache.spark.ml.param.{IntParam, ParamMap, ParamValidators}
+import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
+import org.apache.spark.ml.util.SchemaUtils
+import org.apache.spark.sql._
+import org.apache.spark.sql.expressions.UserDefinedFunction
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+
+/**
+ * Params for [[LSH]].
+ */
+@Experimental
+@Since("2.1.0")
+private[ml] trait LSHParams extends HasInputCol with HasOutputCol {
+  /**
+   * Param for the dimension of LSH OR-amplification.
+   *
+   * In this implementation, we use LSH OR-amplification to reduce the false negative rate. The
+   * higher the dimension is, the lower the false negative rate.
+   * @group param
+   */
+  @Since("2.1.0")
+  final val outputDim: IntParam = new IntParam(this, "outputDim", "output dimension, where" +
+    "increasing dimensionality lowers the false negative rate, and decreasing dimensionality" +
--- End diff --

Does increasing dimensionality lower the false negative rate?
I think increasing dimensionality should lower the false positive rate, right?





[GitHub] spark issue #15388: [SPARK-17821][SQL] Support And and Or in Expression Cano...

2016-10-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15388
  
**[Test build #66614 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66614/consoleFull)**
 for PR 15388 at commit 
[`25f5d4d`](https://github.com/apache/spark/commit/25f5d4d068509d93630d56db2155f11cc2a9b301).





[GitHub] spark issue #15409: [Spark-14761][SQL] Reject invalid join methods when join...

2016-10-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15409
  
**[Test build #3302 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3302/consoleFull)**
 for PR 15409 at commit 
[`cec8ec4`](https://github.com/apache/spark/commit/cec8ec48de5f51f40ff4b929da0c0496fcc0a662).





[GitHub] spark issue #15409: [Spark-14761][SQL] Reject invalid join methods when join...

2016-10-09 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/15409
  
The change itself LGTM, but also cc @srinathshankar.

Right now Python behavior differs from Scala with respect to how crossJoin 
is handled.





[GitHub] spark pull request #15409: [Spark-14761][SQL] Reject invalid join methods wh...

2016-10-09 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/15409#discussion_r82536538
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -1508,6 +1508,23 @@ def test_toDF_with_schema_string(self):
         self.assertEqual(df.schema.simpleString(), "struct<key:int>")
         self.assertEqual(df.collect(), [Row(key=i) for i in range(100)])
 
+    # Regression test for invalid join methods when on is None, Spark-14761
+    def test_invalid_join_method(self):
+        df1 = self.sqlCtx.createDataFrame([("Alice", 5), ("Bob", 8)], ["name", "age"])
+        df2 = self.sqlCtx.createDataFrame([("Alice", 80), ("Bob", 90)], ["name", "height"])
+        self.assertRaises(AnalysisException, lambda: df1.join(df2, how="invalid-join-type"))
+
+        result = df1.join(df2, how="inner").select(df1.name, df2.height).collect()
--- End diff --

Can we remove everything from this test onward? They are no longer testing invalid join methods.






[GitHub] spark issue #15388: [SPARK-17821][SQL] Support And and Or in Expression Cano...

2016-10-09 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/15388
  
@rxin Agree. Sorry for that. Will be more careful in the future. 





[GitHub] spark issue #15388: [SPARK-17821][SQL] Support And and Or in Expression Cano...

2016-10-09 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/15388
  
@cloud-fan / @gatorsmile 

I left some comments on improving clarity. It's pretty important to the 
maintenance of the project.






[GitHub] spark pull request #14897: [SPARK-17338][SQL] add global temp view

2016-10-09 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/14897#discussion_r82536096
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/internal/SharedState.scala ---
@@ -94,6 +69,47 @@ private[sql] class SharedState(val sparkContext: SparkContext) extends Logging {
   }
 
   /**
+   * Class for caching query results reused in future executions.
+   */
+  val cacheManager: CacheManager = new CacheManager
+
+  /**
+   * A listener for SQL-specific [[org.apache.spark.scheduler.SparkListenerEvent]]s.
+   */
+  val listener: SQLListener = createListenerAndUI(sparkContext)
+
+  /**
+   * A catalog that interacts with external systems.
+   */
+  val externalCatalog: ExternalCatalog =
+    SharedState.reflect[ExternalCatalog, SparkConf, Configuration](
+      SharedState.externalCatalogClassName(sparkContext.conf),
+      sparkContext.conf,
+      sparkContext.hadoopConfiguration)
+
+  /**
+   * A manager for global temporary views.
+   */
+  val globalTempViewManager = {
+    // System preserved database should not exists in metastore. However it's hard to guarantee it
+    // for every session, because case-sensitivity differs. Here we always lowercase it to make our
+    // life easier.
+    val globalTempDB = sparkContext.conf.get(GLOBAL_TEMP_DATABASE).toLowerCase
+    if (externalCatalog.databaseExists(globalTempDB)) {
+      throw new SparkException(
+        s"$globalTempDB is a system preserved database, please rename your existing database " +
+          "to resolve the name conflict and launch your Spark application again.")
--- End diff --

Oh no. I think it is fine to hide that conf. But if the user hits this exception, it seems better to let them know there is another workaround (renaming the existing db may not be easy).
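
A sketch of that alternative workaround (the config key is assumed from the GLOBAL_TEMP_DATABASE entry referenced above; whether to surface it is exactly what is being discussed):

```scala
import org.apache.spark.sql.SparkSession

// Sketch: point the reserved global-temp-view database at a name that does
// not collide with any existing metastore database, instead of renaming the
// user's database. Key name assumed from GLOBAL_TEMP_DATABASE.
val spark = SparkSession.builder()
  .config("spark.sql.globalTempDatabase", "spark_global_temp")
  .getOrCreate()
```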





[GitHub] spark pull request #15388: [SPARK-17821][SQL] Support And and Or in Expressi...

2016-10-09 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/15388#discussion_r82536075
  
--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ExpressionSetSuite.scala ---
@@ -80,6 +80,59 @@ class ExpressionSetSuite extends SparkFunSuite {
   setTest(1, Not(aUpper >= 1), aUpper < 1, Not(Literal(1) <= aUpper), Literal(1) > aUpper)
   setTest(1, Not(aUpper <= 1), aUpper > 1, Not(Literal(1) >= aUpper), Literal(1) < aUpper)
 
+  setTest(1, aUpper > bUpper && aUpper <= 10, aUpper <= 10 && aUpper > bUpper)
+  setTest(1,
+    aUpper > bUpper && bUpper > 100 && aUpper <= 10,
+    bUpper > 100 && aUpper <= 10 && aUpper > bUpper)
+
+  setTest(1, aUpper > bUpper || aUpper <= 10, aUpper <= 10 || aUpper > bUpper)
+  setTest(1,
+    aUpper > bUpper || bUpper > 100 || aUpper <= 10,
+    bUpper > 100 || aUpper <= 10 || aUpper > bUpper)
+
+  setTest(1,
+    bUpper > 100 || aUpper <= 10 && aUpper > bUpper,
+    bUpper > 100 || (aUpper <= 10 && aUpper > bUpper))
+
+  setTest(1,
+    aUpper > 10 && bUpper < 10 || aUpper >= bUpper,
+    (bUpper < 10 && aUpper > 10) || aUpper >= bUpper)
+
+  setTest(1,
--- End diff --

For these few test cases, they are getting so complicated that a human won't be able to tell immediately what each case is testing for. We should add some comments explaining what is being tested.
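
As a concrete illustration of the property these cases exercise (a sketch reusing the suite's `aUpper`/`bUpper` fixtures; not text from the patch): predicates that differ only in the commutative ordering of `And`/`Or` operands should canonicalize to the same expression, so an `ExpressionSet` keeps a single copy.

```scala
// Hypothetical spelled-out version of what setTest(1, ...) asserts:
val set = ExpressionSet(Seq(aUpper > bUpper && aUpper <= 10))
assert(set.contains(aUpper <= 10 && aUpper > bUpper))        // same canonical form
assert((set + (aUpper <= 10 && aUpper > bUpper)).size == 1)  // no duplicate added
```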





