[GitHub] spark issue #16308: [SPARK-18936][SQL] Infrastructure for session local time...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16308 **[Test build #70396 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70396/testReport)** for PR 16308 at commit [`4b6900c`](https://github.com/apache/spark/commit/4b6900cf6d182d87a545d736d320c6229fb8251d). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16347: [SPARK-18934][SQL] Writing to dynamic partitions does no...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/16347 Thanks for submitting the ticket. In general I don't think the sortWithinPartitions property can carry over to writing out data, because one partition actually corresponds to more than one file. Can your use case be satisfied by adding an explicit sortBy? ``` df.write.sortBy(col).parquet(...) ```
[GitHub] spark issue #16308: [SPARK-18936][SQL] Infrastructure for session local time...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16308 Merged build finished. Test PASSed.
[GitHub] spark issue #16308: [SPARK-18936][SQL] Infrastructure for session local time...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16308 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70396/
[GitHub] spark pull request #16349: [Doc] bucketing is applicable to all file-based d...
GitHub user rxin opened a pull request: https://github.com/apache/spark/pull/16349 [Doc] bucketing is applicable to all file-based data sources ## What changes were proposed in this pull request? Starting with Spark 2.1.0, the bucketing feature is available for all file-based data sources. This patch fixes some function docs that haven't yet been updated to reflect that. ## How was this patch tested? N/A You can merge this pull request into a Git repository by running: $ git pull https://github.com/rxin/spark ds-doc Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16349.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16349 commit c8f1b42ec15af791de36a3e4311de424d2dd99de Author: Reynold Xin Date: 2016-12-20T08:02:48Z [Doc] bucketing is applicable to all file-based data sources
[GitHub] spark issue #16296: [SPARK-18885][SQL] unify CREATE TABLE syntax for data so...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16296 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70398/
[GitHub] spark issue #12775: [SPARK-14958][Core] Failed task not handled when there's...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/12775 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70397/
[GitHub] spark issue #12775: [SPARK-14958][Core] Failed task not handled when there's...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/12775 Merged build finished. Test FAILed.
[GitHub] spark issue #16296: [SPARK-18885][SQL] unify CREATE TABLE syntax for data so...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16296 Merged build finished. Test FAILed.
[GitHub] spark issue #16349: [Doc] bucketing is applicable to all file-based data sou...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16349 **[Test build #70399 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70399/testReport)** for PR 16349 at commit [`c8f1b42`](https://github.com/apache/spark/commit/c8f1b42ec15af791de36a3e4311de424d2dd99de).
[GitHub] spark issue #15018: [SPARK-17455][MLlib] Improve PAVA implementation in Isot...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/15018 For the zero-weight values, can we do something similar to scikit-learn and remove them, as in https://github.com/amueller/scikit-learn/commit/2415100f79293bbbf52c12c36d63a6cf602cf3c4?
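The removal viirya suggests can be sketched in plain Scala; the `Point` shape and the sample data are hypothetical stand-ins for MLlib's (label, feature, weight) input, not the actual PAVA code:

```scala
// Hypothetical stand-in for MLlib's (label, feature, weight) triples.
case class Point(label: Double, feature: Double, weight: Double)

// Drop zero-weight points before running PAVA, mirroring the linked
// scikit-learn change: zero-weight points cannot affect any weighted
// mean, so removing them up front simplifies the pooling loop.
def dropZeroWeights(points: Seq[Point]): Seq[Point] =
  points.filter(_.weight > 0.0)

val kept = dropZeroWeights(Seq(
  Point(1.0, 0.0, 1.0), Point(2.0, 1.0, 0.0), Point(3.0, 2.0, 2.0)))
```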
[GitHub] spark issue #16232: [SPARK-18800][SQL] Fix UnsafeKVExternalSorter by correct...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16232 **[Test build #70400 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70400/testReport)** for PR 16232 at commit [`e70692d`](https://github.com/apache/spark/commit/e70692dd060ee137842e1fd16e49826967114060).
[GitHub] spark issue #16347: [SPARK-18934][SQL] Writing to dynamic partitions does no...
Github user junegunn commented on the issue: https://github.com/apache/spark/pull/16347 Thanks for the comment. I was trying to implement the following Hive QL in Spark SQL/API: ```sql set hive.exec.dynamic.partition.mode=nonstrict; set hive.mapred.mode = nonstrict; insert overwrite table target_table partition (day) select * from source_table distribute by day sort by id; ``` In Hive, `distribute by day` ensures that the records with the same "day" go to the same reducer, and `sort by id` ensures that the input to each reducer is sorted by "id". It works as expected: the number of reducers is no more than the cardinality of the "day" column, and I could confirm that the generated ORC file in each partition is sorted by "id". However, if I run the same query or its equivalent Spark code ([`repartition('day)` for `distribute by day`](https://github.com/apache/spark/blob/bfeccd80ef032cab3525037be3d3e42519619493/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2423) and [`sortWithinPartitions('id)` for `sort by id`](https://github.com/apache/spark/blob/bfeccd80ef032cab3525037be3d3e42519619493/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L990)) on Spark, we have the right number of writer tasks, one for each partition, and each task generates a single output file, but the generated ORC file is not properly sorted by "id", making the ORC index ineffective. > Can your use case be satisfied by adding an explicit sortBy? `sortBy` is for bucketed tables and requires `bucketBy`, so I'm not sure it's related to this issue regarding Hive compatibility.
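To make the ordering under discussion concrete, here is a plain-Scala sketch (not Spark code; the records are hypothetical) of what `repartition('day)` followed by `sortWithinPartitions('id)` is expected to produce: one group per day, each group sorted by id.

```scala
case class Record(day: String, id: Int)

// Each Seq in the result models the input one writer task receives:
// groupBy is the analogue of repartition('day), the per-group sortBy
// is the analogue of sortWithinPartitions('id).
def partitionAndSort(records: Seq[Record]): Map[String, Seq[Record]] =
  records.groupBy(_.day).map { case (d, rs) => d -> rs.sortBy(_.id) }

val parts = partitionAndSort(Seq(
  Record("d1", 3), Record("d2", 1), Record("d1", 1), Record("d2", 2)))
```

The reported bug is that the per-group ordering this sketch guarantees is not preserved in the files the writer tasks actually emit.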
[GitHub] spark issue #16232: [SPARK-18800][SQL] Correct the assert in UnsafeKVExterna...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16232 **[Test build #70401 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70401/testReport)** for PR 16232 at commit [`5a31e37`](https://github.com/apache/spark/commit/5a31e378e3de301dad768eee776dcb88a404).
[GitHub] spark issue #12775: [SPARK-14958][Core] Failed task not handled when there's...
Github user lirui-intel commented on the issue: https://github.com/apache/spark/pull/12775 The new test passed locally and I can't find any failures in the Jenkins test report. Not sure what failed exactly.
[GitHub] spark issue #16336: [SPARK-18923][DOC][BUILD] Support skipping R/Python API ...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/16336 I think it's fine to make this change for consistency and convenience. It's minor. It'd be nice to document them in the README.md, briefly.
[GitHub] spark issue #16349: [Doc] bucketing is applicable to all file-based data sou...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16349 **[Test build #70399 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70399/testReport)** for PR 16349 at commit [`c8f1b42`](https://github.com/apache/spark/commit/c8f1b42ec15af791de36a3e4311de424d2dd99de). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16232: [SPARK-18800][SQL] Correct the assert in UnsafeKVExterna...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16232 **[Test build #70400 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70400/testReport)** for PR 16232 at commit [`e70692d`](https://github.com/apache/spark/commit/e70692dd060ee137842e1fd16e49826967114060). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16232: [SPARK-18800][SQL] Correct the assert in UnsafeKVExterna...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16232 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70400/
[GitHub] spark issue #16232: [SPARK-18800][SQL] Correct the assert in UnsafeKVExterna...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16232 Merged build finished. Test PASSed.
[GitHub] spark issue #16232: [SPARK-18800][SQL] Correct the assert in UnsafeKVExterna...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16232 **[Test build #70401 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70401/testReport)** for PR 16232 at commit [`5a31e37`](https://github.com/apache/spark/commit/5a31e378e3de301dad768eee776dcb88a404). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #16350: [SPARK-18700][SQL][BACKPORT-2.0] Add StripedLock ...
GitHub user xuanyuanking opened a pull request: https://github.com/apache/spark/pull/16350 [SPARK-18700][SQL][BACKPORT-2.0] Add StripedLock for each table's relation in cache ## What changes were proposed in this pull request? Backport of #16135 to branch-2.0 ## How was this patch tested? Because of the diff between branch-2.0 and master/2.1, this backport adds a multi-thread table-access test in `HiveMetadataCacheSuite` and checks that the relation is loaded only once, using metrics in `HiveCatalogMetrics`. You can merge this pull request into a Git repository by running: $ git pull https://github.com/xuanyuanking/spark SPARK-18700-2.0 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16350.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16350 commit 132d12ee1457c41a0bec56516ab5a41d36d8ac1f Author: xuanyuanking Date: 2016-12-20T10:50:03Z SPARK-18700: Add StripedLock for each table's relation in cache
[GitHub] spark issue #16232: [SPARK-18800][SQL] Correct the assert in UnsafeKVExterna...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16232 Merged build finished. Test PASSed.
[GitHub] spark issue #16232: [SPARK-18800][SQL] Correct the assert in UnsafeKVExterna...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16232 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70401/
[GitHub] spark issue #16350: [SPARK-18700][SQL][BACKPORT-2.0] Add StripedLock for eac...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16350 **[Test build #70402 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70402/consoleFull)** for PR 16350 at commit [`132d12e`](https://github.com/apache/spark/commit/132d12ee1457c41a0bec56516ab5a41d36d8ac1f).
[GitHub] spark issue #16135: [SPARK-18700][SQL] Add StripedLock for each table's rela...
Github user xuanyuanking commented on the issue: https://github.com/apache/spark/pull/16135 @hvanhovell Sure, I opened a new BACKPORT-2.0 PR. There's a small diff in branch-2.0: the unit test in this patch relies on `HiveCatalogMetrics`, which is not present in 2.0, so I also added the metric the test needs. Thanks for checking.
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r93215941 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -592,6 +629,59 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine } /** +* Tweedie exponential family distribution. +* The default link for the Tweedie family is the log link. +*/ + private[regression] object Tweedie extends Family("tweedie") { + +val defaultLink: Link = Log + +var variancePower: Double = 1.5 + +override def initialize(y: Double, weight: Double): Double = { + if (variancePower > 1.0 && variancePower < 2.0) { +require(y >= 0.0, "The response variable of the specified Tweedie distribution " + + s"should be non-negative, but got $y") +math.max(y, 0.1) + } else { +require(y > 0.0, "The response variable of the specified Tweedie distribution " + + s"should be non-negative, but got $y") +y + } +} + +override def variance(mu: Double): Double = math.pow(mu, variancePower) + +private def yp(y: Double, mu: Double, p: Double): Double = { + (math.pow(y, p) - math.pow(mu, p)) / p +} + +// Force y >= 0.1 for deviance to work for (1 - variancePower). see tweedie()$dev.resid +override def deviance(y: Double, mu: Double, weight: Double): Double = { + 2.0 * weight * +(y * yp(math.max(y, 0.1), mu, 1.0 - variancePower) - yp(y, mu, 2.0 - variancePower)) +} + +// This depends on the density of the tweedie distribution. Not yet implemented. +override def aic( +predictions: RDD[(Double, Double, Double)], +deviance: Double, +numInstances: Double, +weightSum: Double): Double = { + 0.0 +} + +override def project(mu: Double): Double = { + if (mu < epsilon) { +epsilon + } else if (mu.isInfinity) { +Double.MaxValue --- End diff -- Out of curiosity is this meaningful to "cap" at Double.MaxValue? By the time you get there a lot of stuff is going to be infinite or not meaningful. 
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r93216003 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -592,6 +629,59 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine } /** +* Tweedie exponential family distribution. +* The default link for the Tweedie family is the log link. +*/ + private[regression] object Tweedie extends Family("tweedie") { + +val defaultLink: Link = Log + +var variancePower: Double = 1.5 --- End diff -- This is a global shared variable -- we really can't do this.
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r93215641 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -592,6 +629,59 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine } /** +* Tweedie exponential family distribution. +* The default link for the Tweedie family is the log link. +*/ + private[regression] object Tweedie extends Family("tweedie") { + +val defaultLink: Link = Log + +var variancePower: Double = 1.5 + +override def initialize(y: Double, weight: Double): Double = { + if (variancePower > 1.0 && variancePower < 2.0) { +require(y >= 0.0, "The response variable of the specified Tweedie distribution " + + s"should be non-negative, but got $y") +math.max(y, 0.1) --- End diff -- If we're going to use this magic 0.1 constant in many places, factor out a constant? 0.1 seems quite large as an 'epsilon' but I guess that's what R's implementation uses for whatever reason?
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r93215688 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -592,6 +629,59 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine } /** +* Tweedie exponential family distribution. +* The default link for the Tweedie family is the log link. +*/ + private[regression] object Tweedie extends Family("tweedie") { + +val defaultLink: Link = Log + +var variancePower: Double = 1.5 + +override def initialize(y: Double, weight: Double): Double = { + if (variancePower > 1.0 && variancePower < 2.0) { +require(y >= 0.0, "The response variable of the specified Tweedie distribution " + + s"should be non-negative, but got $y") +math.max(y, 0.1) + } else { +require(y > 0.0, "The response variable of the specified Tweedie distribution " + + s"should be non-negative, but got $y") +y + } +} + +override def variance(mu: Double): Double = math.pow(mu, variancePower) + +private def yp(y: Double, mu: Double, p: Double): Double = { + (math.pow(y, p) - math.pow(mu, p)) / p +} + +// Force y >= 0.1 for deviance to work for (1 - variancePower). see tweedie()$dev.resid +override def deviance(y: Double, mu: Double, weight: Double): Double = { + 2.0 * weight * +(y * yp(math.max(y, 0.1), mu, 1.0 - variancePower) - yp(y, mu, 2.0 - variancePower)) +} + +// This depends on the density of the tweedie distribution. Not yet implemented. +override def aic( +predictions: RDD[(Double, Double, Double)], +deviance: Double, +numInstances: Double, +weightSum: Double): Double = { + 0.0 --- End diff -- Throw an UnsupportedOperationException?
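As a hedged illustration of the shared-`var` concern raised above, the variance and deviance can take `variancePower` as an explicit argument; the expressions follow the diff, but the function shapes below are a sketch, not the merged API:

```scala
// variancePower threaded through as a parameter instead of living in a
// mutable object-level var, so concurrent fits cannot interfere.
def variance(mu: Double, variancePower: Double): Double =
  math.pow(mu, variancePower)

def yp(y: Double, mu: Double, p: Double): Double =
  (math.pow(y, p) - math.pow(mu, p)) / p

// Same deviance expression as the diff; y is floored at 0.1 so the
// (1 - variancePower) power stays defined at y = 0.
def deviance(y: Double, mu: Double, weight: Double, variancePower: Double): Double =
  2.0 * weight *
    (y * yp(math.max(y, 0.1), mu, 1.0 - variancePower) - yp(y, mu, 2.0 - variancePower))
```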
[GitHub] spark issue #16350: [SPARK-18700][SQL][BACKPORT-2.0] Add StripedLock for eac...
Github user hvanhovell commented on the issue: https://github.com/apache/spark/pull/16350 Maybe we should just drop the UT (so we don't have to add the metrics). cc @ericl WDYT?
[GitHub] spark pull request #16351: [SPARK-18943][SQL] Avoid per-record type dispatch...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/16351

[SPARK-18943][SQL] Avoid per-record type dispatch in CSV when reading

## What changes were proposed in this pull request?

`CSVRelation.csvParser` does type dispatch for each value in each row. We can avoid this because the schema is already kept in `CSVRelation`. So, this PR proposes that converters be created first according to the schema, and then applied to each record.

## How was this patch tested?

Tests in `CSVTypeCastSuite` and `CSVRelation`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark type-dispatch

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16351.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #16351

commit e72d1bc419dfd6da7f6e298d5b5412dba69eb5ad
Author: hyukjinkwon
Date: 2016-12-20T11:54:05Z

    Avoid per-record type dispatch in CSV when reading
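The PR's core idea (build per-column converters once from the schema, then reuse them for every record) can be illustrated without Spark. The toy column types and object names below are assumptions for illustration, not the PR's actual code:

```scala
// Build one converter per column up front, instead of matching on the
// column type for every field of every record.
object CsvConvertSketch {
  type ValueConverter = String => Any

  // Toy schema types standing in for Spark's DataType hierarchy.
  sealed trait ColType
  case object IntCol extends ColType
  case object DoubleCol extends ColType
  case object StringCol extends ColType

  // Type dispatch happens here, exactly once per column.
  def makeConverters(schema: Seq[ColType]): Array[ValueConverter] =
    schema.map { ct =>
      val conv: ValueConverter = ct match {
        case IntCol    => s => s.toInt
        case DoubleCol => s => s.toDouble
        case StringCol => s => s
      }
      conv
    }.toArray

  // Per record: just index into the prebuilt array, no per-value dispatch.
  def convertRow(converters: Array[ValueConverter], tokens: Array[String]): Array[Any] =
    tokens.zipWithIndex.map { case (t, i) => converters(i)(t) }
}
```

The per-record hot path is reduced to array indexing plus a function call, which is the saving the PR description claims.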
[GitHub] spark issue #16351: [SPARK-18943][SQL] Avoid per-record type dispatch in CSV...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16351

**[Test build #70403 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70403/testReport)** for PR 16351 at commit [`e72d1bc`](https://github.com/apache/spark/commit/e72d1bc419dfd6da7f6e298d5b5412dba69eb5ad).
[GitHub] spark pull request #15018: [SPARK-17455][MLlib] Improve PAVA implementation ...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/15018#discussion_r93229282

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala ---

```diff
@@ -328,74 +336,69 @@ class IsotonicRegression private (private var isotonic: Boolean) extends Seriali
       return Array.empty
     }

-    // Pools sub array within given bounds assigning weighted average value to all elements.
-    def pool(input: Array[(Double, Double, Double)], start: Int, end: Int): Unit = {
-      val poolSubArray = input.slice(start, end + 1)
-      val weightedSum = poolSubArray.map(lp => lp._1 * lp._3).sum
-      val weight = poolSubArray.map(_._3).sum
+    // Keeps track of the start and end indices of the blocks. blockBounds(start) gives the
+    // index of the end of the block and blockBounds(end) gives the index of the start of the
+    // block. Entries that are not the start or end of the block are meaningless. The idea is that
```
--- End diff --

I'm still not sure about this comment -- how can `blockBounds(x)` be both the start and end of a block? The implementation below is identical.
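One way to read the contested comment: for a block spanning indices `[s, e]`, `blockBounds(s) == e` and `blockBounds(e) == s`, so the same array answers both "where does this block end?" (queried at a start index) and "where does it start?" (queried at an end index). A minimal, hypothetical sketch of that encoding (not the PR's implementation):

```scala
// Toy illustration of the blockBounds encoding discussed in the review.
object BlockBoundsSketch {
  // Start with n singleton blocks: each index is both its own start and end.
  def init(n: Int): Array[Int] = Array.tabulate(n)(identity)

  def blockStart(b: Array[Int], end: Int): Int = b(end)
  def blockEnd(b: Array[Int], start: Int): Int = b(start)

  // Merge the block starting at s1 with the adjacent block starting at s2.
  // Only the two outer entries need updating; interior entries go stale,
  // which is fine because they are never queried.
  def merge(b: Array[Int], s1: Int, s2: Int): Unit = {
    val e = blockEnd(b, s2)
    b(s1) = e   // new end of the merged block
    b(e) = s1   // new start of the merged block
  }
}
```

Merging is O(1) regardless of block size, which is what makes a pool-adjacent-violators pass over the array linear.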
[GitHub] spark pull request #16351: [SPARK-18943][SQL] Avoid per-record type dispatch...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/16351#discussion_r93230247

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala ---

```diff
@@ -215,84 +215,133 @@ private[csv] object CSVInferSchema {
 }

 private[csv] object CSVTypeCast {
+  // A `ValueConverter` is responsible for converting the given value to a desired type.
+  private type ValueConverter = String => Any

   /**
-   * Casts given string datum to specified type.
-   * Currently we do not support complex types (ArrayType, MapType, StructType).
+   * Create converters which cast each given string datum to each specified type in given schema.
+   * Currently, we do not support complex types (`ArrayType`, `MapType`, `StructType`).
    *
-   * For string types, this is simply the datum. For other types.
+   * For string types, this is simply the datum.
+   * For other types, this is converted into the value according to the type.
    * For other nullable types, returns null if it is null or equals to the value specified
    * in `nullValue` option.
    *
-   * @param datum string value
-   * @param name field name in schema.
-   * @param castType data type to cast `datum` into.
-   * @param nullable nullability for the field.
+   * @param schema schema that contains data types to cast the given value into.
    * @param options CSV options.
    */
-  def castTo(
+  private[sql] def makeConverters(
```
--- End diff --

Oops, I can remove this access modifier. Will remove it soon, and the one below too.
[GitHub] spark issue #16350: [SPARK-18700][SQL][BACKPORT-2.0] Add StripedLock for eac...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16350

**[Test build #70402 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70402/consoleFull)** for PR 16350 at commit [`132d12e`](https://github.com/apache/spark/commit/132d12ee1457c41a0bec56516ab5a41d36d8ac1f).

* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #16350: [SPARK-18700][SQL][BACKPORT-2.0] Add StripedLock for eac...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16350

Merged build finished. Test FAILed.
[GitHub] spark issue #16350: [SPARK-18700][SQL][BACKPORT-2.0] Add StripedLock for eac...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16350

Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70402/

Test FAILed.
[GitHub] spark issue #16233: [SPARK-18801][SQL] Add `View` operator to help resolve a...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/16233

We need a way to isolate the analysis of view text with a different context. Using a wrapper is one solution. My proposal doesn't introduce a wrapper; instead it applies the context in place, i.e., when we parse the view text in `SessionCatalog.lookupRelation`, we set the database of `UnresolvedRelation` right away, according to the view context (which only contains `currentDatabase` in the first version; we can add more information in the future).
[GitHub] spark issue #16233: [SPARK-18801][SQL] Add `View` operator to help resolve a...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/16233

Hmm, it seems hard to apply the view context in place, considering things like CTEs. I think it's better to introduce an analysis context, which can easily limit the max depth of stacked views.
[GitHub] spark issue #16351: [SPARK-18943][SQL] Avoid per-record type dispatch in CSV...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16351

**[Test build #70404 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70404/testReport)** for PR 16351 at commit [`22c9a8a`](https://github.com/apache/spark/commit/22c9a8a9bb812eaa557aec09f3cf0ab25e97b3bf).
[GitHub] spark issue #16329: [SPARK-16046][DOCS] Aggregations in the Spark SQL progra...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16329

**[Test build #70405 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70405/testReport)** for PR 16329 at commit [`2c1f182`](https://github.com/apache/spark/commit/2c1f1829677e74dc2dd2d7d67233b142f14007e8).
[GitHub] spark pull request #16323: [SPARK-18911] [SQL] Define CatalogStatistics to i...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/16323#discussion_r93240303

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala ---

```diff
@@ -198,6 +200,10 @@ case class CatalogTable(
       locationUri, inputFormat, outputFormat, serde, compressed, properties))
   }

+  def withStats(cboStatsEnabled: Boolean): CatalogTable = {
```
--- End diff --

Yes, I also think that's better, but as @cloud-fan said, we can't get the config in `def statistics`; we would have to modify many places to support this. I'm about to do such modifications. Do you have any advice on minimizing the changes?
[GitHub] spark issue #16350: [SPARK-18700][SQL][BACKPORT-2.0] Add StripedLock for eac...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16350

**[Test build #70406 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70406/consoleFull)** for PR 16350 at commit [`80b8664`](https://github.com/apache/spark/commit/80b86646e0f1af8fb99d78aaf3f16dc7e752a99d).
[GitHub] spark issue #16329: [SPARK-16046][DOCS] Aggregations in the Spark SQL progra...
Github user aokolnychyi commented on the issue: https://github.com/apache/spark/pull/16329

@marmbrus I have updated the pull request. The compiled docs can be found [here](https://aokolnychyi.github.io/spark-docs/sql-programming-guide.html). I did not manage to build the Java API docs. I believe the problem is in my local installation. Therefore, I checked each url manually, they should work once the API docs are compiled. I will verify everything one more time in the nightly build.
[GitHub] spark issue #15996: [SPARK-18567][SQL] Simplify CreateDataSourceTableAsSelec...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15996

**[Test build #70407 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70407/testReport)** for PR 15996 at commit [`28f88ef`](https://github.com/apache/spark/commit/28f88ef7b4796c6d07c80cf7fa942b27103937dd).
[GitHub] spark issue #16329: [SPARK-16046][DOCS] Aggregations in the Spark SQL progra...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16329

**[Test build #70405 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70405/testReport)** for PR 16329 at commit [`2c1f182`](https://github.com/apache/spark/commit/2c1f1829677e74dc2dd2d7d67233b142f14007e8).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * ` public static class Employee implements Serializable `
  * ` public static class MyAverage extends Aggregator `
  * ` case class Employee(name: String, salary: Long)`
  * ` case class Average(var sum: Long, var count: Long)`
[GitHub] spark issue #16329: [SPARK-16046][DOCS] Aggregations in the Spark SQL progra...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16329

Merged build finished. Test PASSed.
[GitHub] spark issue #16329: [SPARK-16046][DOCS] Aggregations in the Spark SQL progra...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16329

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70405/

Test PASSed.
[GitHub] spark issue #12775: [SPARK-14958][Core] Failed task not handled when there's...
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/12775

retest this please
[GitHub] spark issue #15996: [SPARK-18567][SQL] Simplify CreateDataSourceTableAsSelec...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15996

**[Test build #70408 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70408/testReport)** for PR 15996 at commit [`97dc307`](https://github.com/apache/spark/commit/97dc3079650e24d8412f093ad4184077ddf37c26).
[GitHub] spark issue #12775: [SPARK-14958][Core] Failed task not handled when there's...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/12775

**[Test build #70409 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70409/testReport)** for PR 12775 at commit [`9778cef`](https://github.com/apache/spark/commit/9778cefce3e152d559e53cd4e2f5a113e561f0ff).
[GitHub] spark issue #16296: [SPARK-18885][SQL] unify CREATE TABLE syntax for data so...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16296

**[Test build #70410 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70410/testReport)** for PR 16296 at commit [`a553366`](https://github.com/apache/spark/commit/a553366e9828c2a68a25023181beb9acbf908aa0).
[GitHub] spark pull request #16352: [SPARK-18947][SQL] SQLContext.tableNames should n...
GitHub user cloud-fan opened a pull request: https://github.com/apache/spark/pull/16352

[SPARK-18947][SQL] SQLContext.tableNames should not call Catalog.listTables

## What changes were proposed in this pull request?

It's a huge waste to call `Catalog.listTables` in `SQLContext.tableNames`, which only needs the table names, while `Catalog.listTables` will get the table metadata for each table name.

## How was this patch tested?

N/A

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloud-fan/spark minor

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16352.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #16352

commit f12dc7924dd3e4847578992d20d36a26d3d02792
Author: Wenchen Fan
Date: 2016-12-20T14:27:37Z

    SQLContext.tableNames should not call Catalog.listTables
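The cost difference the PR describes can be illustrated with a toy catalog. The names below are hypothetical stand-ins, not Spark's actual `Catalog` API:

```scala
// Toy catalog contrasting "names only" with "full metadata per table".
object TableNamesSketch {
  final case class TableMeta(name: String, owner: String, createTime: Long)

  // Counts how many expensive per-table metadata fetches happened.
  var metadataLookups = 0

  private val tables = Map(
    "t1" -> TableMeta("t1", "a", 0L),
    "t2" -> TableMeta("t2", "b", 0L))

  // Expensive path: materializes full metadata for every table,
  // analogous to Catalog.listTables in the PR description.
  def listTables(): Seq[TableMeta] = tables.keys.toSeq.sorted.map { n =>
    metadataLookups += 1
    tables(n)
  }

  // Cheap path: names only, no per-table metadata fetch,
  // which is all SQLContext.tableNames needs.
  def tableNames(): Seq[String] = tables.keys.toSeq.sorted
}
```

With a metastore-backed catalog, each metadata fetch is a remote call, so skipping them matters far more than this in-memory toy suggests.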
[GitHub] spark issue #16352: [SPARK-18947][SQL] SQLContext.tableNames should not call...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/16352

cc @yhuai @gatorsmile
[GitHub] spark issue #16352: [SPARK-18947][SQL] SQLContext.tableNames should not call...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16352

**[Test build #70411 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70411/testReport)** for PR 16352 at commit [`f12dc79`](https://github.com/apache/spark/commit/f12dc7924dd3e4847578992d20d36a26d3d02792).
[GitHub] spark issue #16351: [SPARK-18943][SQL] Avoid per-record type dispatch in CSV...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16351

**[Test build #70403 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70403/testReport)** for PR 16351 at commit [`e72d1bc`](https://github.com/apache/spark/commit/e72d1bc419dfd6da7f6e298d5b5412dba69eb5ad).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #16351: [SPARK-18943][SQL] Avoid per-record type dispatch in CSV...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16351

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70403/

Test PASSed.
[GitHub] spark issue #16351: [SPARK-18943][SQL] Avoid per-record type dispatch in CSV...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16351

Merged build finished. Test PASSed.
[GitHub] spark issue #16351: [SPARK-18943][SQL] Avoid per-record type dispatch in CSV...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/16351

cc @cloud-fan, could I please ask you to take a look? I remember a similar PR was reviewed by you before.
[GitHub] spark pull request #16329: [SPARK-16046][DOCS] Aggregations in the Spark SQL...
Github user jnh5y commented on a diff in the pull request: https://github.com/apache/spark/pull/16329#discussion_r93261268

--- Diff: docs/sql-programming-guide.md ---

```diff
@@ -382,6 +382,52 @@ For example:

+## Aggregations
+
+The [built-in DataFrames functions](api/scala/index.html#org.apache.spark.sql.functions$) mentioned
+before provide such common aggregations as `count()`, `countDistinct()`, `avg()`, `max()`, `min()`, etc.
```
--- End diff --

As a suggestion, I'd change this to read: "The [built-in DataFrames functions](api/scala/index.html#org.apache.spark.sql.functions$) provide common aggregations such as `count()`, `countDistinct()`, `avg()`, `max()`, and `min()`."
[GitHub] spark issue #16352: [SPARK-18947][SQL] SQLContext.tableNames should not call...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/16352

The same issue also exists in [getTableNames](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala#L278). Could we also fix it there?
[GitHub] spark pull request #16329: [SPARK-16046][DOCS] Aggregations in the Spark SQL...
Github user jnh5y commented on a diff in the pull request: https://github.com/apache/spark/pull/16329#discussion_r93262242

--- Diff: docs/sql-programming-guide.md ---

```diff
@@ -382,6 +382,52 @@ For example:

+## Aggregations
+
+The [built-in DataFrames functions](api/scala/index.html#org.apache.spark.sql.functions$) mentioned
+before provide such common aggregations as `count()`, `countDistinct()`, `avg()`, `max()`, `min()`, etc.
+While those functions are designed for DataFrames, Spark SQL also has type-safe versions for some of them in
+[Scala](api/scala/index.html#org.apache.spark.sql.expressions.scalalang.typed$) and
+[Java](api/java/org/apache/spark/sql/expressions/javalang/typed.html) to work with strongly typed Datasets.
+Moreover, users are not limited to the predefined aggregate functions and can create their own.
```
--- End diff --

I think it'd be worth showing a Spark SQL example using the included/pre-defined functions. Since your example implements `avg`, maybe use `min`/`max`? Alternatively, the example could be added to the SQL statements in the main driver for the `UserDefinedAggregateFunction` implementations.
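For reference, the typed aggregator the docs example implements follows a zero/reduce/merge/finish shape. A Spark-free sketch of that shape, using the `Employee` and `Average` case classes the docs patch adds (the method layout here is an assumption for illustration, not the docs' exact code):

```scala
// Plain-Scala sketch of the typed "average" aggregator idea from the docs.
object TypedAvgSketch {
  final case class Employee(name: String, salary: Long)
  final case class Average(var sum: Long, var count: Long)

  // Empty accumulator.
  def zero: Average = Average(0L, 0L)

  // Fold one record into the accumulator.
  def reduce(b: Average, e: Employee): Average = {
    b.sum += e.salary
    b.count += 1
    b
  }

  // Combine two partial accumulators (what Spark does across partitions).
  def merge(b1: Average, b2: Average): Average =
    Average(b1.sum + b2.sum, b1.count + b2.count)

  // Turn the accumulator into the final result.
  def finish(b: Average): Double = b.sum.toDouble / b.count
}
```

The merge step is what distinguishes a distributed aggregator from a simple fold: partial results from different partitions must combine associatively.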
[GitHub] spark issue #16352: [SPARK-18947][SQL] SQLContext.tableNames should not call...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/16352

LGTM except the above comment
[GitHub] spark issue #16232: [SPARK-18800][SQL] Correct the assert in UnsafeKVExterna...
Github user davies commented on the issue: https://github.com/apache/spark/pull/16232 lgtm
[GitHub] spark issue #16351: [SPARK-18943][SQL] Avoid per-record type dispatch in CSV...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16351 **[Test build #70404 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70404/testReport)** for PR 16351 at commit [`22c9a8a`](https://github.com/apache/spark/commit/22c9a8a9bb812eaa557aec09f3cf0ab25e97b3bf). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16351: [SPARK-18943][SQL] Avoid per-record type dispatch in CSV...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16351 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70404/ Test PASSed.
[GitHub] spark issue #16296: [SPARK-18885][SQL] unify CREATE TABLE syntax for data so...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16296 **[Test build #70410 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70410/testReport)** for PR 16296 at commit [`a553366`](https://github.com/apache/spark/commit/a553366e9828c2a68a25023181beb9acbf908aa0). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class DetermineHiveSerde(conf: SQLConf) extends Rule[LogicalPlan] `
[GitHub] spark issue #16296: [SPARK-18885][SQL] unify CREATE TABLE syntax for data so...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16296 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70410/ Test FAILed.
[GitHub] spark issue #16296: [SPARK-18885][SQL] unify CREATE TABLE syntax for data so...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16296 Merged build finished. Test FAILed.
[GitHub] spark issue #16351: [SPARK-18943][SQL] Avoid per-record type dispatch in CSV...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16351 Merged build finished. Test PASSed.
[GitHub] spark pull request #15018: [SPARK-17455][MLlib] Improve PAVA implementation ...
Github user neggert commented on a diff in the pull request: https://github.com/apache/spark/pull/15018#discussion_r9328 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala --- @@ -328,74 +336,69 @@ class IsotonicRegression private (private var isotonic: Boolean) extends Seriali return Array.empty } -// Pools sub array within given bounds assigning weighted average value to all elements. -def pool(input: Array[(Double, Double, Double)], start: Int, end: Int): Unit = { - val poolSubArray = input.slice(start, end + 1) - val weightedSum = poolSubArray.map(lp => lp._1 * lp._3).sum - val weight = poolSubArray.map(_._3).sum +// Keeps track of the start and end indices of the blocks. blockBounds(start) gives the +// index of the end of the block and blockBounds(end) gives the index of the start of the +// block. Entries that are not the start or end of the block are meaningless. The idea is that --- End diff -- It relies on knowing ahead of time whether `x` is a start or an end index[1]. If it's a start index, `blockBounds(x)` gives the ending index of that block. If it's an end index, `blockBounds(x)` gives the starting index of the block. So yes, the implementations of `blockStart` and `blockEnd` are identical. I just have two different functions because it makes the code more readable. Maybe the comment from scikit-learn (where I borrowed this idea from) explains it better? (their `target` = my `blockBounds`) > target describes a list of blocks. At any time, if [i..j] (inclusive) is > an active block, then blockBounds[i] := j and blockBounds[j] := i. The trick is just in maintaining the array so that the above property is always true. At initialization, it's trivially true because all blocks have only one element, and `blockBounds(x)` = `x`.
After initialization, `blockBounds` is only modified by the merge function, which is set up to modify `blockBounds` so that this property is preserved, and then return the starting index of the newly-merged block. This is admittedly a bit tricky, but it's a lot faster than the implementation where I created a doubly-linked list of `Block` case classes. I'm open to suggestions on how to explain it better. [1] You could actually figure out whether you have a start or an end index by comparing `blockBounds(x)` to `x`. The lesser value will be the start index.
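The invariant neggert describes can be sketched in a few lines outside Spark. This is an illustration only, with assumed names (`block_bounds`, `merge`), not the PR's actual MLlib code:

```python
# Invariant: for every active block [start..end],
# block_bounds[start] == end and block_bounds[end] == start.
# Entries strictly inside a block are never read.

def block_end(block_bounds, start):
    return block_bounds[start]

def block_start(block_bounds, end):
    # Same lookup as block_end; two names only for readability, as in the PR.
    return block_bounds[end]

def merge(block_bounds, start1, start2):
    """Merge the block starting at start1 with the adjacent block starting at
    start2, preserving the invariant, and return the merged block's start."""
    end = block_end(block_bounds, start2)
    block_bounds[start1] = end
    block_bounds[end] = start1
    return start1

# At initialization every element is its own block, so block_bounds[x] == x.
bounds = list(range(5))
s = merge(bounds, 1, 2)   # [1..1] + [2..2] -> [1..2]
assert s == 1 and block_end(bounds, 1) == 2 and block_start(bounds, 2) == 1
merge(bounds, 1, 3)       # [1..2] + [3..3] -> [1..3]
assert block_end(bounds, 1) == 3 and block_start(bounds, 3) == 1
```

Only `merge` ever writes to the array, which is why the property is preserved after initialization.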
[GitHub] spark issue #16240: [SPARK-16792][SQL] Dataset containing a Case Class with ...
Github user michalsenkyr commented on the issue: https://github.com/apache/spark/pull/16240 None of them. The compilation will fail. That is why I had to provide those additional implicits. ``` scala> class Test[T] defined class Test scala> implicit def test1[T <: Seq[String]]: Test[T] = null test1: [T <: Seq[String]]=> Test[T] scala> implicit def test2[T <: Product]: Test[T] = null test2: [T <: Product]=> Test[T] scala> def test[T : Test](t: T) = null test: [T](t: T)(implicit evidence$1: Test[T])Null scala> test(List("abc")) <console>:31: error: ambiguous implicit values: both method test1 of type [T <: Seq[String]]=> Test[T] and method test2 of type [T <: Product]=> Test[T] match expected type Test[List[String]] test(List("abc")) ```
[GitHub] spark issue #15018: [SPARK-17455][MLlib] Improve PAVA implementation in Isot...
Github user neggert commented on the issue: https://github.com/apache/spark/pull/15018 @viirya Better to remove them, or throw an error? Personally, I'd rather be alerted that I'm passing invalid input, rather than have it "fixed" for me.
[GitHub] spark issue #16350: [SPARK-18700][SQL][BACKPORT-2.0] Add StripedLock for eac...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16350 **[Test build #70406 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70406/consoleFull)** for PR 16350 at commit [`80b8664`](https://github.com/apache/spark/commit/80b86646e0f1af8fb99d78aaf3f16dc7e752a99d). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16350: [SPARK-18700][SQL][BACKPORT-2.0] Add StripedLock for eac...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16350 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70406/ Test PASSed.
[GitHub] spark issue #16350: [SPARK-18700][SQL][BACKPORT-2.0] Add StripedLock for eac...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16350 Merged build finished. Test PASSed.
[GitHub] spark issue #15996: [SPARK-18567][SQL] Simplify CreateDataSourceTableAsSelec...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15996 **[Test build #70407 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70407/testReport)** for PR 15996 at commit [`28f88ef`](https://github.com/apache/spark/commit/28f88ef7b4796c6d07c80cf7fa942b27103937dd). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #15996: [SPARK-18567][SQL] Simplify CreateDataSourceTableAsSelec...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15996 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70407/ Test PASSed.
[GitHub] spark issue #15996: [SPARK-18567][SQL] Simplify CreateDataSourceTableAsSelec...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15996 Merged build finished. Test PASSed.
[GitHub] spark issue #12775: [SPARK-14958][Core] Failed task not handled when there's...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/12775 **[Test build #70409 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70409/testReport)** for PR 12775 at commit [`9778cef`](https://github.com/apache/spark/commit/9778cefce3e152d559e53cd4e2f5a113e561f0ff). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #12775: [SPARK-14958][Core] Failed task not handled when there's...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/12775 Merged build finished. Test FAILed.
[GitHub] spark issue #12775: [SPARK-14958][Core] Failed task not handled when there's...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/12775 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70409/ Test FAILed.
[GitHub] spark pull request #16353: [SPARK-18948][MLlib] Add Mean Percentile Rank met...
GitHub user daniloascione opened a pull request: https://github.com/apache/spark/pull/16353 [SPARK-18948][MLlib] Add Mean Percentile Rank metric for ranking algorithms ## What changes were proposed in this pull request? This PR adds the implementation of the Mean Percentile Rank (MPR) metric in mllib.evaluation, as described in the paper "Collaborative Filtering for Implicit Feedback Datasets" (Hu, Y., Y. Koren, and C. Volinsky, doi:10.1109/ICDM.2008.22). This metric is useful to evaluate recommendations given by the ALS with implicit feedback. ## How was this patch tested? Additional test cases have been added to test Mean Percentile Rank (MPR). You can merge this pull request into a Git repository by running: $ git pull https://github.com/daniloascione/spark SPARK-18948 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16353.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16353 commit ed66bb09eddf776e932b29a7e4889128aa775946 Author: Danilo Ascione Date: 2016-12-20T16:23:28Z [SPARK-18948][MLlib] Add Mean Percentile Rank metric for ranking algorithms
[GitHub] spark issue #15996: [SPARK-18567][SQL] Simplify CreateDataSourceTableAsSelec...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15996 **[Test build #70408 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70408/testReport)** for PR 15996 at commit [`97dc307`](https://github.com/apache/spark/commit/97dc3079650e24d8412f093ad4184077ddf37c26). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #15996: [SPARK-18567][SQL] Simplify CreateDataSourceTableAsSelec...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15996 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70408/ Test PASSed.
[GitHub] spark issue #15996: [SPARK-18567][SQL] Simplify CreateDataSourceTableAsSelec...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15996 Merged build finished. Test PASSed.
[GitHub] spark issue #16353: [SPARK-18948][MLlib] Add Mean Percentile Rank metric for...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16353 Can one of the admins verify this patch?
[GitHub] spark issue #16352: [SPARK-18947][SQL] SQLContext.tableNames should not call...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16352 **[Test build #70411 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70411/testReport)** for PR 16352 at commit [`f12dc79`](https://github.com/apache/spark/commit/f12dc7924dd3e4847578992d20d36a26d3d02792). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16352: [SPARK-18947][SQL] SQLContext.tableNames should not call...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16352 Merged build finished. Test PASSed.
[GitHub] spark issue #16352: [SPARK-18947][SQL] SQLContext.tableNames should not call...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16352 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70411/ Test PASSed.
[GitHub] spark issue #16353: [SPARK-18948][MLlib] Add Mean Percentile Rank metric for...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/16353 This is pretty specific to ALS, and relies on the r_ui strength value in the paper. I'm not sure it is that general. Without this weight, it's somewhat related to simple existing metrics like mean reciprocal rank.
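For readers unfamiliar with the metric being discussed, the following is an illustrative sketch (not the PR's implementation) of the Mean Percentile Rank from Hu, Koren & Volinsky: MPR = sum(r_ui * rank_ui) / sum(r_ui), where rank_ui is the percentile position (0.0 = top of the list, 1.0 = bottom) of item i in user u's ranked recommendations, and r_ui is the observed implicit-feedback strength srowen refers to. All names here are assumptions for illustration:

```python
def mean_percentile_rank(users):
    """users: list of (ranked_items, observed) pairs, where observed is a
    list of (item, strength) tuples with strength = r_ui from the paper."""
    num = 0.0
    den = 0.0
    for ranked, observed in users:
        n = len(ranked)
        for item, r in observed:
            if item in ranked:
                # Percentile rank in [0, 1]: 0.0 for the top recommendation.
                rank = ranked.index(item) / max(n - 1, 1)
                num += r * rank
                den += r
    return num / den if den else 0.0

# One user, items ranked a, b, c; equal-strength feedback on the top item
# (rank 0.0) and the bottom item (rank 1.0) averages to 0.5.
print(mean_percentile_rank([(["a", "b", "c"], [("a", 1.0), ("c", 1.0)])]))  # 0.5
```

Lower values are better; dropping the r_ui weight (setting all strengths equal) reduces this to a plain average of percentile positions, which is the sense in which it resembles simpler rank metrics.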
[GitHub] spark pull request #16354: [SPARK-18886][Scheduler][WIP] Adjust Delay schedu...
GitHub user squito opened a pull request: https://github.com/apache/spark/pull/16354 [SPARK-18886][Scheduler][WIP] Adjust Delay scheduling to prevent under-utilization of cluster ## What changes were proposed in this pull request? This is a significant change to delay scheduling to avoid under-utilization of cluster resources when there are locality preferences for a subset of resources. The main change here is that the delay is no longer reset when any task is scheduled at a tighter locality constraint. Instead, each task set starts the locality timer the first time it fails to utilize a resource offer due to locality constraints. One task set *never* tightens the locality constraints, even if subsequent offers are made that utilize tighter constraints. A more complete description of the issues w/ the previous scheduling method can be found under the jira. ## How was this patch tested? Added unit test for original issue. Ran all unit tests in o.a.s.scheduler.* manually. Full tests via jenkins. TODO * [ ] add more unit tests, especially for recompute locality levels. 
* [ ] code cleanup (especially all the logging added) You can merge this pull request into a Git repository by running: $ git pull https://github.com/squito/spark delay_sched-SPARK-18886 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16354.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16354 commit 8629823dcb61f67207ee5b6a6a1789a4c38e898f Author: Imran Rashid Date: 2016-12-15T17:28:48Z failing test case commit 348a9f44a6f34e6ac15f4ece70b0178d134d0cc3 Author: Imran Rashid Date: 2016-12-20T03:35:55Z "working" version -- but this is actually a significant departure from old delay scheduling commit 8b7fd1adf510ef15a7c30aebdd4f029e71e2e50f Author: Imran Rashid Date: 2016-12-20T03:57:44Z test update commit 22086999a9644086d6787fc4db5b2367e6ba70fe Author: Imran Rashid Date: 2016-12-20T04:18:40Z fix condition commit af88dd8f8942b12edda2a466b292dd3bccdfbc4e Author: Imran Rashid Date: 2016-12-20T04:19:05Z update tests to reflect change in delay scheduling behavior commit 27983a9a6d3f7d675bbfa83eb116e8329869aed7 Author: Imran Rashid Date: 2016-12-20T04:19:19Z logging commit 647bf400a0963ff8f5381e47f895e3cc606aa854 Author: Imran Rashid Date: 2016-12-20T17:13:59Z fix other test cases, more fixes to recomputeLocality() commit 2e5307f971767c0d5a228e3058db31244351da2d Author: Imran Rashid Date: 2016-12-20T17:47:02Z Merge branch 'master' into delay_sched-SPARK-18886 commit 449ba20c9c642884f5dcc5feccfa64cb1da833f2 Author: Imran Rashid Date: 2016-12-20T17:47:11Z remove TODO
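The behavioral change the PR describes can be sketched as a tiny state machine. The class and method names below are hypothetical, not the actual TaskSetManager code: the locality-wait timer starts the first time an offer is rejected for locality reasons, and scheduling a task at the preferred level no longer resets it.

```python
class DelaySchedulingSketch:
    def __init__(self, locality_wait_ms, clock):
        self.locality_wait_ms = locality_wait_ms
        self.clock = clock            # injectable clock for testing
        self.timer_start = None

    def offer_rejected_for_locality(self):
        # Start the timer only once, on the first rejected offer.
        if self.timer_start is None:
            self.timer_start = self.clock()

    def task_scheduled_at_preferred_locality(self):
        # The old behavior would reset the timer here; the new one does not.
        pass

    def may_relax_locality(self):
        return (self.timer_start is not None
                and self.clock() - self.timer_start >= self.locality_wait_ms)

now = [0]
s = DelaySchedulingSketch(3000, lambda: now[0])
assert not s.may_relax_locality()          # timer not started yet
s.offer_rejected_for_locality()            # first rejection starts the timer
now[0] = 2000
s.task_scheduled_at_preferred_locality()   # does NOT reset the timer
assert not s.may_relax_locality()          # still within the wait
now[0] = 3000
assert s.may_relax_locality()              # wait elapsed -> relax locality
```

Under the old reset-on-schedule behavior, a steady trickle of local tasks could keep the timer from ever expiring, leaving the rest of the cluster idle; never resetting is what avoids that under-utilization.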
[GitHub] spark issue #16337: [SPARK-18871][SQL] New test cases for IN/NOT IN subquery
Github user nsyca commented on the issue: https://github.com/apache/spark/pull/16337 I have tested a few runs on `SQLQueryTestSuite` to confirm it allows sub-directories under `sql/core/src/test/resources/sql-tests/[inputs|results]` to group test files further. By reading the code, I'm pretty sure it supports multi-level sub-directories. The only requirement is that the name of the file needs to be globally unique under inputs/. With that knowledge, I propose we have this structure under the directory `sql/core/src/test/resources/sql-tests/inputs`: ``` subquery/ subquery/in-subquery/ subquery/in-subquery/simple-in.sql subquery/in-subquery/simple-not-in.sql subquery/in-subquery/in-group-by.sql (in parent side, subquery, and both) subquery/in-subquery/not-in-group-by.sql subquery/in-subquery/in-order-by.sql subquery/in-subquery/in-limit.sql subquery/in-subquery/in-having.sql subquery/in-subquery/in-joins.sql subquery/in-subquery/not-in-joins.sql subquery/in-subquery/in-set-operations.sql subquery/in-subquery/in-with-cte.sql subquery/in-subquery/not-in-with-cte.sql subquery/in-subquery/in-multiple-columns.sql … subquery/exists-subquery/ subquery/scalar-subquery/ ``` Each test file will contain approximately 10-20 test cases. Some of the test cases can be classified in multiple ways. In that case, the tester will use his/her best judgement to classify them, or maybe we can entertain the idea of "complex" in the file name (I personally don't like it, as such terms are subjective: to someone, a SQL statement joining a few tables is complex; others may see it as not.) We can run a single test file in `.../inputs` or any sub-directory under it with the following command: `build/sbt "~sql/test-only *SQLQueryTestSuite -- -z .sql"` The downside is that it needs to be the exact name. It does not allow wildcard characters at this point. Perhaps this is something we can enhance in the future. Lastly, we will break up the in-subquery test cases into multiple PRs.
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r93289668 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -592,6 +629,59 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine } /** +* Tweedie exponential family distribution. +* The default link for the Tweedie family is the log link. +*/ + private[regression] object Tweedie extends Family("tweedie") { + +val defaultLink: Link = Log + +var variancePower: Double = 1.5 --- End diff -- Would you please suggest a better way to set the variancePower? I want to be consistent with the existing code to have the `Family` objects, but I need to also pass on the input `variancePower` to the `Tweedie` object which is used to compute the variance function. Any suggestion will be highly appreciated. @srowen @yanboliang
[GitHub] spark issue #16354: [SPARK-18886][Scheduler][WIP] Adjust Delay scheduling to...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16354 **[Test build #70412 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70412/testReport)** for PR 16354 at commit [`449ba20`](https://github.com/apache/spark/commit/449ba20c9c642884f5dcc5feccfa64cb1da833f2).
[GitHub] spark issue #16354: [SPARK-18886][Scheduler][WIP] Adjust Delay scheduling to...
Github user squito commented on the issue: https://github.com/apache/spark/pull/16354 @mridulm @markhamstra @kayousterhout This is *not* ready to merge -- it needs some cleanup and more tests -- but I thought that seeing an implementation might help think through the design. I think the discussion should still center on the overall approach, and that discussion should probably happen on the jira, not here. (I'll happily fix code issues if that would help.) Obviously, this is a pretty big change to the way delay scheduling works; I'd gladly consider alternative approaches that don't involve such a large change in behavior, but I don't see them. IMO this problem is serious enough that it merits the large behavior change.
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r93290858
```
--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
@@ -592,6 +629,59 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine
 }

 /**
+ * Tweedie exponential family distribution.
+ * The default link for the Tweedie family is the log link.
+ */
+ private[regression] object Tweedie extends Family("tweedie") {
+
+   val defaultLink: Link = Log
+
+   var variancePower: Double = 1.5
+
+   override def initialize(y: Double, weight: Double): Double = {
+     if (variancePower > 1.0 && variancePower < 2.0) {
+       require(y >= 0.0, "The response variable of the specified Tweedie distribution " +
+         s"should be non-negative, but got $y")
+       math.max(y, 0.1)
```
--- End diff -- I have not seen a formal justification for the choice of 0.1 in R. This seminal [paper](http://users.du.se/~lrn/StatMod10/HomeExercise2/Nelder_Pregibon.pdf) suggests 1/6 (about 0.17) to be the best constant. I would prefer to be consistent with R so that we can make comparisons. Using a constant is a good idea.
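[Editorial sketch, not part of the original thread.] The initialization rule being discussed can be sketched as a standalone function (hypothetical names; the `floor` parameter stands in for the constant under debate, 0.1 in R versus the 1/6 suggested by Nelder and Pregibon): for a variance power strictly between 1 and 2, the response must be non-negative, and zeros are lifted to a small positive constant so the log-link mean is well defined.

```scala
// Hypothetical standalone version of the initialization logic under review.
// `floor` is the small positive constant that replaces zero responses.
def initializeTweedie(y: Double, variancePower: Double, floor: Double = 0.1): Double = {
  if (variancePower > 1.0 && variancePower < 2.0) {
    require(y >= 0.0,
      s"The response variable of the specified Tweedie distribution " +
      s"should be non-negative, but got $y")
    math.max(y, floor)  // lift zeros so log(mu) is defined
  } else {
    y  // other variance powers are passed through unchanged in this sketch
  }
}
```

Swapping `floor = 0.1` for `1.0 / 6.0` is then a one-argument change, which makes it easy to compare against R.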
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r93291042
```
--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
@@ -592,6 +629,59 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine
 }

 /**
+ * Tweedie exponential family distribution.
+ * The default link for the Tweedie family is the log link.
+ */
+ private[regression] object Tweedie extends Family("tweedie") {
+
+   val defaultLink: Link = Log
+
+   var variancePower: Double = 1.5
```
--- End diff -- I think the Tweedie implementation needs to be able to access parameters of the GLM, to read off `variancePower`. As it is, this is a global variable, and two jobs would overwrite each other's values.
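[Editorial sketch, not part of the original thread.] One way to avoid the shared mutable `var` that this comment flags is to make `Tweedie` a class whose variance power is fixed per instance. The sketch below uses simplified stand-ins for the `Family` hierarchy in `GeneralizedLinearRegression.scala` (the real traits have more members); the point is only that two jobs with different powers each hold their own value.

```scala
// Simplified stand-in for the GLM Family hierarchy (hypothetical, for
// illustration): each instance carries its own parameters.
abstract class Family(val name: String) {
  def variance(mu: Double): Double
}

// variancePower is a constructor parameter instead of a global var, so
// concurrent jobs cannot overwrite each other's values.
case class Tweedie(variancePower: Double) extends Family("tweedie") {
  // Tweedie variance function: V(mu) = mu^p
  override def variance(mu: Double): Double = math.pow(mu, variancePower)
}
```

With this shape, `Family.fromName` (or the caller) would construct `Tweedie(power)` once per model fit rather than mutating shared state.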
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r93290854
```
--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
@@ -242,7 +275,7 @@ class GeneralizedLinearRegression @Since("2.0.0") (@Since("2.0.0") override val
 def setLinkPredictionCol(value: String): this.type = set(linkPredictionCol, value)

 override protected def train(dataset: Dataset[_]): GeneralizedLinearRegressionModel = {
-  val familyObj = Family.fromName($(family))
+  val familyObj = Family.fromName($(family), $(variancePower))
```
--- End diff -- I don't think we can do this either. `variancePower` is specific to one family, not a property of all of them.