[GitHub] spark pull request #15054: [SPARK-17502] [SQL] Fix Multiple Bugs in DDL Stat...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/15054#discussion_r79540609 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalogSuite.scala --- @@ -444,7 +444,7 @@ class SessionCatalogSuite extends SparkFunSuite { assert(!catalog.tableExists(TableIdentifier("view1", Some("default")))) } - test("getTableMetadata on temporary views") { + test("getTableMetadata and getTempViewOrPermanentTableMetadata on temporary views") { --- End diff -- looks like it's unnecessary to test `getTableMetadata` on temporary views, let's just test `getTempViewOrPermanentTableMetadata` here. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15135: [pyspark][group]pyspark GroupedData can't apply agg func...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/15135 I understand the reasons why you want to add this -- but I feel this is too esoteric and if we add this one, there are also a lot of other cases that can be added and I don't know where we would stop.
[GitHub] spark pull request #15054: [SPARK-17502] [SQL] Fix Multiple Bugs in DDL Stat...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/15054#discussion_r79540494 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala --- @@ -282,6 +271,24 @@ class SessionCatalog( } /** + * Retrieve the metadata of an existing temporary view or permanent table/view. + * If the temporary view does not exist, tries to get the metadata of an existing permanent + * table/view. If no database is specified, assume the table/view is in the current database. + * If the specified table/view is not found in the database then a [[NoSuchTableException]] is + * thrown. + */ + def getTempViewOrPermanentTableMetadata(name: TableIdentifier): CatalogTable = synchronized { --- End diff -- it should just take a string.
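The lookup order described in the quoted doc comment — check the temporary views first, then fall back to the permanent catalog with the current database as the default — can be sketched with toy dictionaries. This is an illustrative sketch only (hypothetical names and structures, not the SessionCatalog API):

```python
class NoSuchTableException(Exception):
    """Raised when neither a temp view nor a catalog entry matches."""

# Toy stand-ins for the session's temp-view registry and the permanent catalog.
temp_views = {"v1": {"name": "v1", "kind": "temp view"}}
catalog = {("default", "t1"): {"name": "t1", "kind": "table"}}

def get_temp_view_or_permanent_table_metadata(name, db=None, current_db="default"):
    # Temporary views shadow permanent tables only when no database is given.
    if db is None and name in temp_views:
        return temp_views[name]
    # Otherwise resolve against the permanent catalog, defaulting the database.
    key = (db or current_db, name)
    if key in catalog:
        return catalog[key]
    raise NoSuchTableException(key)

print(get_temp_view_or_permanent_table_metadata("v1")["kind"])  # -> temp view
print(get_temp_view_or_permanent_table_metadata("t1")["kind"])  # -> table
```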
[GitHub] spark issue #14959: [SPARK-17387][PYSPARK] Creating SparkContext() from pyth...
Github user zjffdu commented on the issue: https://github.com/apache/spark/pull/14959 ``` The internal SparkConf of the context will not be the same instance as conf. ``` This is the existing implementation, where python differs from scala, but I think it is correct. I guess the reason why, in scala, the internal SparkConf of the SparkContext is not the same instance as conf is to ensure that changing the SparkConf after the SparkContext is created does not take effect. pyspark is the same in this respect: although in pyspark the internal SparkConf of the SparkContext is the same instance as conf, changing conf after the SparkContext is created does not take effect, as that is guaranteed on the scala side.
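The Scala-side behavior described above amounts to a defensive copy: the context snapshots the configuration at construction time, so later mutations of the caller's conf have no effect. A minimal plain-Python sketch of that pattern (toy names, not the pyspark API):

```python
class Context:
    """Toy stand-in for SparkContext: snapshots its config at construction."""

    def __init__(self, conf):
        # Defensive copy: the internal conf is a separate instance,
        # so mutating the caller's conf afterwards has no effect here.
        self._conf = dict(conf)

    def get(self, key):
        return self._conf.get(key)

conf = {"spark.app.name": "demo"}
ctx = Context(conf)
conf["spark.app.name"] = "changed-later"  # mutation after construction
print(ctx.get("spark.app.name"))  # -> demo
```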
[GitHub] spark issue #13513: [SPARK-15698][SQL][Streaming] Add the ability to remove ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13513 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65631/ Test PASSed.
[GitHub] spark issue #13513: [SPARK-15698][SQL][Streaming] Add the ability to remove ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13513 Merged build finished. Test PASSed.
[GitHub] spark issue #13513: [SPARK-15698][SQL][Streaming] Add the ability to remove ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13513 **[Test build #65631 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65631/consoleFull)** for PR 13513 at commit [`84d3d27`](https://github.com/apache/spark/commit/84d3d27490556dc1de4e4bce3b6b19a75691f52e). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #15102: [SPARK-17346][SQL] Add Kafka source for Structured Strea...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15102 **[Test build #65636 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65636/consoleFull)** for PR 15102 at commit [`f5c57f5`](https://github.com/apache/spark/commit/f5c57f51f675002298c833edb486451642735221).
[GitHub] spark issue #15054: [SPARK-17502] [SQL] Fix Multiple Bugs in DDL Statements ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15054 **[Test build #65637 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65637/consoleFull)** for PR 15054 at commit [`48ce44e`](https://github.com/apache/spark/commit/48ce44e20a3db290c4c563b4d45ec5bfb6a86195).
[GitHub] spark issue #15135: [pyspark][group]pyspark GroupedData can't apply agg func...
Github user citoubest commented on the issue: https://github.com/apache/spark/pull/15135 @rxin @davies @srowen
[GitHub] spark issue #14808: [SPARK-17156][ML][EXAMPLE] Add multiclass logistic regre...
Github user sethah commented on the issue: https://github.com/apache/spark/pull/14808 https://github.com/apache/spark/pull/14834 is merged now. We did not implement a new API, but we can still update the logistic regression examples to show the new multiclass functionality.
[GitHub] spark issue #15147: [SPARK-17545] [SQL] Handle additional time offset format...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/15147 I mean that the problem in the JIRA is not reproducible on the master branch, and therefore I believe we need another JIRA to describe supporting other time formats in the same way as the cast operation, as you suggested.
[GitHub] spark pull request #15054: [SPARK-17502] [SQL] Fix Multiple Bugs in DDL Stat...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/15054#discussion_r79539702 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala --- @@ -357,6 +346,21 @@ class SessionCatalog( tempTables.remove(formatTableName(name)) } + /** + * Retrieve the metadata of an existing temporary view. + * If the temporary view does not exist, return None. + */ + def getTempViewMetadataOption(name: String): Option[CatalogTable] = synchronized { --- End diff -- Yeah, we can combine them. Let me do it. Thanks!
[GitHub] spark pull request #14959: [SPARK-17387][PYSPARK] Creating SparkContext() fr...
Github user zjffdu commented on a diff in the pull request: https://github.com/apache/spark/pull/14959#discussion_r79539542 --- Diff: python/pyspark/java_gateway.py --- @@ -50,13 +50,18 @@ def launch_gateway(): # proper classpath and settings from spark-env.sh on_windows = platform.system() == "Windows" script = "./bin/spark-submit.cmd" if on_windows else "./bin/spark-submit" +command = [os.path.join(SPARK_HOME, script)] +if conf and conf.getAll(): +conf_items = [['--conf', '%s=%s' % (k, v)] for k, v in conf.getAll()] --- End diff -- Correct, will fix it.
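For context, the diff above builds a nested list of `['--conf', 'k=v']` pairs that then needs to be flattened into the spark-submit command line. A standalone sketch of that flattening step (illustrative values, not the actual `launch_gateway` code):

```python
import itertools

# Illustrative stand-in for conf.getAll(): a list of (key, value) pairs.
conf_all = [("spark.executor.memory", "2g"), ("spark.ui.enabled", "false")]

# One ['--conf', 'k=v'] pair per entry, then flattened into a single arg list.
conf_items = [["--conf", "%s=%s" % (k, v)] for k, v in conf_all]
command = ["./bin/spark-submit"] + list(itertools.chain.from_iterable(conf_items))
print(command)
# -> ['./bin/spark-submit', '--conf', 'spark.executor.memory=2g',
#     '--conf', 'spark.ui.enabled=false']
```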
[GitHub] spark issue #14452: [SPARK-16849][SQL][WIP] Improve subquery execution by de...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/14452 @davies Thanks for the comment. In our initial benchmark of the TPC-DS queries using CTE (13 in total), this PR helps about half (6) of them, 5 queries are not affected, and 2 queries are regressed. I might say it would help CTE queries in most cases according to the results. I agree that a 500+ LOC change looks a bit big for this improvement. I would like to refactor and tailor part of the changes and break them down into small pieces for you to review; I hope I can reduce the LOC changes in the end. I would like to run Q64 with the push down disabled; may I ask the purpose of that?
[GitHub] spark issue #14852: [WIP][SPARK-17138][ML][MLib] Add Python API for multinom...
Github user sethah commented on the issue: https://github.com/apache/spark/pull/14852 Now that https://github.com/apache/spark/pull/14834 has been merged, we can make the updates to the Python API. There is no new interface to implement, but it would be great if this PR could take care of updating the Python side to reflect that LOR supports multiclass now.
[GitHub] spark issue #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15148 A few high-level comments/questions: * Should this go into the `feature` package as a feature estimator/transformer? That is where other dimensionality reduction techniques have gone and I'm not sure we should create a new package for this. * Could you please point me to a specific section of a specific paper that documents the approaches used here? AFAICT, this patch implements something different than most of the approximate-nearest-neighbors-via-LSH algorithms found in papers. For instance, the method in section 2 [here](http://cseweb.ucsd.edu/~dasgupta/254-embeddings/lawrence.pdf) as well as the method on Wikipedia [here](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#LSH_algorithm_for_nearest_neighbor_search) are different than the implementation in this pr. Also, the spark package [`spark-neighbors`](https://github.com/sethah/spark-neighbors) employs those approaches. I'm not an expert in LSH so I was just hoping for some clarification. * The implementation of the `RandomProjections` class actually follows the implementation of the "2-stable" (or more generically, "p-stable") LSH algorithm, and not the "Random Projection" algorithm in the paper that is referenced. At the very least, we should clarify this. Potentially, we should think of a better name. @karlhigley Would you mind taking a look at the patch, or providing your input on the comments?
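For reference, the "2-stable" hash family sethah mentions draws a Gaussian vector `a` and an offset `b` and computes `h(v) = floor((a·v + b) / w)`. A minimal plain-Python sketch of one such hash function (illustrative parameter choices, not the PR's implementation):

```python
import math
import random

def make_pstable_hash(dim, w, seed=0):
    """Return one hash function h(v) = floor((a.v + b) / w) from the
    2-stable (Gaussian) LSH family: a ~ N(0,1)^dim, b ~ U[0, w)."""
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = rng.uniform(0.0, w)

    def h(v):
        dot = sum(ai * vi for ai, vi in zip(a, v))
        return math.floor((dot + b) / w)

    return h

h = make_pstable_hash(dim=3, w=4.0, seed=42)
# Identical points always land in the same bucket; nearby points tend to.
assert h([1.0, 2.0, 3.0]) == h([1.0, 2.0, 3.0])
```

In practice several such functions are concatenated into a composite bucket key, and multiple independent tables are queried to trade precision against recall.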
[GitHub] spark issue #14803: [SPARK-17153][SQL] Should read partition data when readi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14803 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65627/ Test PASSed.
[GitHub] spark issue #14803: [SPARK-17153][SQL] Should read partition data when readi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14803 Merged build finished. Test PASSed.
[GitHub] spark issue #14803: [SPARK-17153][SQL] Should read partition data when readi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14803 **[Test build #65627 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65627/consoleFull)** for PR 14803 at commit [`541dfdc`](https://github.com/apache/spark/commit/541dfdc637b5373c384249a601d0a3e8486adb07). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #13705: [SPARK-15472][SQL] Add support for writing in `csv` form...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13705 **[Test build #65635 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65635/consoleFull)** for PR 13705 at commit [`9869f98`](https://github.com/apache/spark/commit/9869f9885e4fdc7364cd46ab05b1f332921ff8d7).
[GitHub] spark pull request #15134: [SPARK-17580][CORE]Add random UUID as app name wh...
Github user phalodi closed the pull request at: https://github.com/apache/spark/pull/15134
[GitHub] spark pull request #15133: [SPARK-17578][Docs] Add spark.app.name default va...
Github user phalodi closed the pull request at: https://github.com/apache/spark/pull/15133
[GitHub] spark pull request #15126: [SPARK-17513][SQL] Make StreamExecution garbage-c...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/15126
[GitHub] spark issue #15126: [SPARK-17513][SQL] Make StreamExecution garbage-collect ...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/15126 Merging in master/2.0.
[GitHub] spark issue #15067: [SPARK-17513] [STREAMING] [SQL] Make StreamExecution gar...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/15067 @frreiss can you close this now?
[GitHub] spark issue #15126: [SPARK-17513][SQL] Make StreamExecution garbage-collect ...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/15126 Since @frreiss hasn't updated the pr yet, I'm going to merge this one and assign the jira ticket to Fred.
[GitHub] spark issue #14634: [SPARK-17051][SQL] we should use hadoopConf in InsertInt...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14634 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65629/ Test PASSed.
[GitHub] spark issue #14634: [SPARK-17051][SQL] we should use hadoopConf in InsertInt...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14634 Merged build finished. Test PASSed.
[GitHub] spark issue #14634: [SPARK-17051][SQL] we should use hadoopConf in InsertInt...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14634 **[Test build #65629 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65629/consoleFull)** for PR 14634 at commit [`90fbe4e`](https://github.com/apache/spark/commit/90fbe4e7bc8e80d7601eb020d428055a1a44797a). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class HiveQuerySuite extends HiveComparisonTest with SQLTestUtils with BeforeAndAfter `
[GitHub] spark issue #15157: Revert "[SPARK-17549][SQL] Only collect table size stat ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15157 Merged build finished. Test PASSed.
[GitHub] spark issue #15157: Revert "[SPARK-17549][SQL] Only collect table size stat ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15157 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65624/ Test PASSed.
[GitHub] spark issue #15157: Revert "[SPARK-17549][SQL] Only collect table size stat ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15157 **[Test build #65624 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65624/consoleFull)** for PR 15157 at commit [`5b73205`](https://github.com/apache/spark/commit/5b732058ac911b6cb52a8639281681c3ee9d9dae). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #15158: [SPARK-17603] [SQL] Utilize Hive-generated Statistics Fo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15158 **[Test build #65634 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65634/consoleFull)** for PR 15158 at commit [`061e60b`](https://github.com/apache/spark/commit/061e60b3af819f235e531b1de24f136a431dc23c).
[GitHub] spark issue #15147: [SPARK-17545] [SQL] Handle additional time offset format...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/15147 @nbeyer Thanks for your investigation. I think that sounds reasonable, though it might be arguable, because adding more cases effectively means more time and computation to parse/infer the schema (specifically in the case of `TimestampType` in CSV), and there is already an option to specify the date format; however, it makes sense that users don't really want to specify any option when they just want to read time data. I'd follow the committers' lead. Anyway, this might not be related to this JIRA anymore. How about creating another one describing the current state and the suggestion?
[GitHub] spark issue #14803: [SPARK-17153][SQL] Should read partition data when readi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14803 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65625/ Test PASSed.
[GitHub] spark issue #14803: [SPARK-17153][SQL] Should read partition data when readi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14803 Merged build finished. Test PASSed.
[GitHub] spark issue #14803: [SPARK-17153][SQL] Should read partition data when readi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14803 **[Test build #65625 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65625/consoleFull)** for PR 14803 at commit [`5b101ab`](https://github.com/apache/spark/commit/5b101aba62efd34077495eb55159ec1b93d2c90e). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #15158: [SPARK-17603] [SQL] Utilize Hive-generated Statis...
GitHub user gatorsmile opened a pull request: https://github.com/apache/spark/pull/15158 [SPARK-17603] [SQL] Utilize Hive-generated Statistics For Partitioned Tables ### What changes were proposed in this pull request? For non-partitioned tables, Hive-generated statistics are stored in table properties. However, for partitioned tables, Hive-generated statistics are stored in partition properties, so we are currently unable to utilize them for partitioned tables. Also, statistics might not be gathered for all the partitions in Hive; when collection is only partial, we will not utilize the Hive-generated statistics. ### How was this patch tested? Added test cases. You can merge this pull request into a Git repository by running: $ git pull https://github.com/gatorsmile/spark partitionedTableStatistics Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/15158.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #15158 commit 061e60b3af819f235e531b1de24f136a431dc23c Author: gatorsmile Date: 2016-09-20T04:49:51Z fix.
[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics
Github user wzhfy commented on the issue: https://github.com/apache/spark/pull/15090 This pr has been updated based on all the above comments; changes are as follows:
1. Modify the analyze syntax a little bit: `identifierSeq` is now non-optional, i.e. users must specify column names after `FOR COLUMNS`.
2. Check column correctness based on case sensitivity.
3. Deduplicate columns when checking correctness.
4. Support analyzing columns independently, i.e. when analyzing new columns, we no longer remove stats of columns which were analyzed before.
5. Rename `BasicColStats` to `ColumnStats`.
6. Use 3 × standard deviation to check the ndv result in test cases.
7. Code refactoring based on comments.
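As a hedged sketch of point 1 above, the resulting syntax with mandatory column names after `FOR COLUMNS` would be invoked roughly like this; the table and column names are hypothetical, and persisting the statistics assumes Hive support is enabled:

```scala
// Sketch of the ANALYZE syntax described above. Table and column
// names are hypothetical.
import org.apache.spark.sql.SparkSession

object AnalyzeColumnsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("analyze-columns-sketch")
      .enableHiveSupport()
      .getOrCreate()
    // Column names are now mandatory after FOR COLUMNS; omitting
    // them would be a parse error under the updated grammar.
    spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS price, quantity")
    spark.stop()
  }
}
```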
[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15090 **[Test build #65633 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65633/consoleFull)** for PR 15090 at commit [`0f974c0`](https://github.com/apache/spark/commit/0f974c019401ac5cef1be1b1e69a523ee2287101).
[GitHub] spark pull request #14834: [SPARK-17163][ML] Unified LogisticRegression inte...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/14834
[GitHub] spark issue #14834: [SPARK-17163][ML] Unified LogisticRegression interface
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/14834 Merged into master. Thanks.
[GitHub] spark issue #15147: [SPARK-17545] [SQL] Handle additional time offset format...
Github user nbeyer commented on the issue: https://github.com/apache/spark/pull/15147 @HyukjinKwon Based on my further reading of the code, I'd like to suggest adding a deprecation to the stringToTime method and then updating the stringToTimestamp method, specifically here https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L356, to handle the "no colon" case. It is the stringToTimestamp method that is used by the 'cast'.
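A minimal sketch of the kind of normalization being suggested above — this is illustrative only, not the actual DateTimeUtils patch: rewrite a trailing "no colon" zone offset (e.g. `+0100`) into the colon form (e.g. `+01:00`) that a colon-only parser already accepts.

```scala
// Illustrative sketch, not the actual DateTimeUtils change: normalize
// a trailing "no colon" offset like "+0100" into "+01:00" before
// handing the string to a parser that only understands the latter.
def normalizeOffset(s: String): String = {
  // sign followed by exactly four digits at the end of the string
  val noColon = "^(.*)([+-])(\\d{2})(\\d{2})$".r
  s match {
    case noColon(prefix, sign, hh, mm) => s"$prefix$sign$hh:$mm"
    case _ => s // already has a colon, or no offset at all
  }
}
```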
[GitHub] spark issue #14803: [SPARK-17153][SQL] Should read partition data when readi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14803 Merged build finished. Test PASSed.
[GitHub] spark issue #14803: [SPARK-17153][SQL] Should read partition data when readi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14803 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65623/ Test PASSed.
[GitHub] spark issue #14803: [SPARK-17153][SQL] Should read partition data when readi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14803 **[Test build #65623 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65623/consoleFull)** for PR 14803 at commit [`23ba9a2`](https://github.com/apache/spark/commit/23ba9a23ab835987ed326a9320cf8632a0783885). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #15150: [SPARK-17595] [MLLib] Use a bounded priority queue to fi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15150 Merged build finished. Test FAILed.
[GitHub] spark issue #15150: [SPARK-17595] [MLLib] Use a bounded priority queue to fi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15150 **[Test build #65632 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65632/consoleFull)** for PR 15150 at commit [`f7311a2`](https://github.com/apache/spark/commit/f7311a22d78b1875446e86aa53ad9f15892df7e2). * This patch **fails MiMa tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #15150: [SPARK-17595] [MLLib] Use a bounded priority queue to fi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15150 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65632/ Test FAILed.
[GitHub] spark issue #15082: [SPARK-17528][SQL] MutableProjection should not cache co...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/15082 I re-targeted it to 2.1 only.
[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r79533403 --- Diff: mllib/src/main/scala/org/apache/spark/ml/lsh/LSH.scala --- @@ -0,0 +1,270 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.lsh + +import scala.util.Random + +import org.apache.spark.ml.{Estimator, Model} +import org.apache.spark.ml.linalg.{Vector, VectorUDT} +import org.apache.spark.ml.param.{IntParam, ParamMap, ParamValidators} +import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol} +import org.apache.spark.sql._ +import org.apache.spark.sql.expressions.UserDefinedFunction +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ + +/** + * Params for [[LSH]]. + */ +private[ml] trait LSHParams extends HasInputCol with HasOutputCol { + /** + * Param for output dimension. 
+ * + * @group param + */ + final val outputDim: IntParam = new IntParam(this, "outputDim", "output dimension", +ParamValidators.gt(0)) + + /** @group getParam */ + final def getOutputDim: Int = $(outputDim) + + setDefault(outputDim -> 1) + + setDefault(outputCol -> "lsh_output") + + /** + * Transform the Schema for LSH + * @param schema The schema of the input dataset without outputCol + * @return A derived schema with outputCol added + */ + final def transformLSHSchema(schema: StructType): StructType = { +val outputFields = schema.fields :+ + StructField($(outputCol), new VectorUDT, nullable = false) +StructType(outputFields) + } +} + +/** + * Model produced by [[LSH]]. + */ +abstract class LSHModel[KeyType, T <: LSHModel[KeyType, T]] private[ml] + extends Model[T] with LSHParams { + override def copy(extra: ParamMap): T = defaultCopy(extra) + /** + * :: DeveloperApi :: + * + * The hash function of LSH, mapping a predefined KeyType to a Vector + * @return The mapping of LSH function. + */ + protected[this] val hashFunction: KeyType => Vector + + /** + * :: DeveloperApi :: + * + * Calculate the distance between two different keys using the distance metric corresponding + * to the hashFunction + * @param x One of the point in the metric space + * @param y Another the point in the metric space + * @return The distance between x and y in double + */ + protected[ml] def keyDistance(x: KeyType, y: KeyType): Double + + /** + * :: DeveloperApi :: + * + * Calculate the distance between two different hash Vectors. By default, the distance is the + * minimum distance of two hash values in any dimension. 
+ * + * @param x One of the hash vector + * @param y Another hash vector + * @return The distance between hash vectors x and y in double + */ + protected[ml] def hashDistance(x: Vector, y: Vector): Double = { +(x.asBreeze - y.asBreeze).toArray.map(math.abs).min --- End diff -- For a pair of `DenseVector`, you can directly use its `values` member and do something like: x.values.zip(y.values).map(x => math.abs(x._1 - x._2)).min For a pair of `SparseVector`, you may not need to convert `(x.asBreeze - y.asBreeze)` back to an `Array`, because the resulting array should be sparse too. We can directly map on the Breeze vector, i.e., `(x.asBreeze - y.asBreeze).map(math.abs).min`.
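The dense-vector variant suggested in the review comment above can be sketched in plain Scala on raw arrays (the values here are made up for illustration):

```scala
// Sketch of the suggested DenseVector-style distance: pair up the two
// value arrays, take the absolute difference of each pair, and keep
// the minimum, i.e. the smallest per-dimension gap between the hashes.
val x = Array(1.0, 5.0, 9.0)
val y = Array(2.0, 7.0, 9.5)
val hashDist = x.zip(y).map { case (a, b) => math.abs(a - b) }.min
// differences are 1.0, 2.0 and 0.5, so hashDist is 0.5
```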
[GitHub] spark issue #15150: [SPARK-17595] [MLLib] Use a bounded priority queue to fi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15150 **[Test build #65632 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65632/consoleFull)** for PR 15150 at commit [`f7311a2`](https://github.com/apache/spark/commit/f7311a22d78b1875446e86aa53ad9f15892df7e2).
[GitHub] spark pull request #15046: [SPARK-17492] [SQL] Fix Reading Cataloged Data So...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/15046#discussion_r79533281 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/test/DataFrameReaderWriterSuite.scala --- @@ -293,6 +293,39 @@ class DataFrameReaderWriterSuite extends QueryTest with SharedSQLContext with Be Option(dir).map(spark.read.format("org.apache.spark.sql.test").load) } + test("read a data source that does not extend SchemaRelationProvider") { +val dfReader = spark.read + .option("from", "1") + .option("TO", "10") + .format("org.apache.spark.sql.sources.SimpleScanSource") + +// when users do not specify the schema +checkAnswer(dfReader.load(), spark.range(1, 11).toDF()) + +// when users specify the schema +val inputSchema = new StructType().add("s", IntegerType, nullable = false) +val e = intercept[AnalysisException] { dfReader.schema(inputSchema).load() } --- End diff -- was there no test for this case before?
[GitHub] spark pull request #15046: [SPARK-17492] [SQL] Fix Reading Cataloged Data So...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/15046#discussion_r79533310 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala --- @@ -327,8 +327,13 @@ case class DataSource( dataSource.createRelation(sparkSession.sqlContext, caseInsensitiveOptions) case (_: SchemaRelationProvider, None) => throw new AnalysisException(s"A schema needs to be specified when using $className.") - case (_: RelationProvider, Some(_)) => -throw new AnalysisException(s"$className does not allow user-specified schemas.") + case (dataSource: RelationProvider, Some(schema)) => +val baseRelation = + dataSource.createRelation(sparkSession.sqlContext, caseInsensitiveOptions) +if (baseRelation.schema != schema) { --- End diff -- cc @yhuai @liancheng to confirm, is it safe?
[GitHub] spark pull request #15046: [SPARK-17492] [SQL] Fix Reading Cataloged Data So...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/15046#discussion_r79533043 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/TableScanSuite.scala --- @@ -345,34 +345,72 @@ class TableScanSuite extends DataSourceTest with SharedSQLContext { (1 to 10).map(Row(_)).toSeq) } + test("create a temp table that does not have a path in the option") { +Seq("TEMPORARY VIEW", "TABLE").foreach { tableType => + val tableName = "relationProvierWithSchema" + withTable(tableName) { +sql( + s""" + |CREATE $tableType $tableName --- End diff -- what does this test?
[GitHub] spark pull request #15046: [SPARK-17492] [SQL] Fix Reading Cataloged Data So...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/15046#discussion_r79532870 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/TableScanSuite.scala --- @@ -345,34 +345,72 @@ class TableScanSuite extends DataSourceTest with SharedSQLContext { (1 to 10).map(Row(_)).toSeq) } + test("create a temp table that does not have a path in the option") { --- End diff -- `temp view`
[GitHub] spark pull request #15046: [SPARK-17492] [SQL] Fix Reading Cataloged Data So...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/15046#discussion_r79532807 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/InsertSuite.scala --- @@ -65,6 +65,26 @@ class InsertSuite extends DataSourceTest with SharedSQLContext { ) } + test("insert into a temp view that does not point to an insertable data source") { +import testImplicits._ +withTempView("t1", "t2") { + sql( +""" + |CREATE TEMPORARY TABLE t1 --- End diff -- let's use CREATE TEMPORARY VIEW
[GitHub] spark issue #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user viirya commented on the issue: https://github.com/apache/spark/pull/15148 @Yunni Thanks for working on this.
[GitHub] spark issue #13513: [SPARK-15698][SQL][Streaming] Add the ability to remove ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13513 **[Test build #65631 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65631/consoleFull)** for PR 13513 at commit [`84d3d27`](https://github.com/apache/spark/commit/84d3d27490556dc1de4e4bce3b6b19a75691f52e).
[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r79532298 --- Diff: mllib/src/main/scala/org/apache/spark/ml/lsh/LSH.scala --- @@ -0,0 +1,270 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.lsh + +import scala.util.Random + +import org.apache.spark.ml.{Estimator, Model} +import org.apache.spark.ml.linalg.{Vector, VectorUDT} +import org.apache.spark.ml.param.{IntParam, ParamMap, ParamValidators} +import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol} +import org.apache.spark.sql._ +import org.apache.spark.sql.expressions.UserDefinedFunction +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ + +/** + * Params for [[LSH]]. + */ +private[ml] trait LSHParams extends HasInputCol with HasOutputCol { + /** + * Param for output dimension. 
+ * + * @group param + */ + final val outputDim: IntParam = new IntParam(this, "outputDim", "output dimension", +ParamValidators.gt(0)) + + /** @group getParam */ + final def getOutputDim: Int = $(outputDim) + + setDefault(outputDim -> 1) + + setDefault(outputCol -> "lsh_output") + + /** + * Transform the Schema for LSH + * @param schema The schema of the input dataset without outputCol + * @return A derived schema with outputCol added + */ + final def transformLSHSchema(schema: StructType): StructType = { +val outputFields = schema.fields :+ + StructField($(outputCol), new VectorUDT, nullable = false) +StructType(outputFields) + } +} + +/** + * Model produced by [[LSH]]. + */ +abstract class LSHModel[KeyType, T <: LSHModel[KeyType, T]] private[ml] + extends Model[T] with LSHParams { + override def copy(extra: ParamMap): T = defaultCopy(extra) + /** + * :: DeveloperApi :: + * + * The hash function of LSH, mapping a predefined KeyType to a Vector + * @return The mapping of LSH function. + */ + protected[this] val hashFunction: KeyType => Vector + + /** + * :: DeveloperApi :: + * + * Calculate the distance between two different keys using the distance metric corresponding + * to the hashFunction + * @param x One of the point in the metric space + * @param y Another the point in the metric space + * @return The distance between x and y in double + */ + protected[ml] def keyDistance(x: KeyType, y: KeyType): Double + + /** + * :: DeveloperApi :: + * + * Calculate the distance between two different hash Vectors. By default, the distance is the + * minimum distance of two hash values in any dimension. + * + * @param x One of the hash vector + * @param y Another hash vector + * @return The distance between hash vectors x and y in double + */ + protected[ml] def hashDistance(x: Vector, y: Vector): Double = { +(x.asBreeze - y.asBreeze).toArray.map(math.abs).min + } + + /** + * Transforms the input dataset. 
+ */ + override def transform(dataset: Dataset[_]): DataFrame = { +transformSchema(dataset.schema, logging = true) +val transformUDF = udf(hashFunction, new VectorUDT) +dataset.withColumn($(outputCol), transformUDF(dataset($(inputCol + } + + /** + * :: DeveloperApi :: + * + * Check transform validity and derive the output schema from the input schema. + * + * Typical implementation should first conduct verification on schema change and parameter + * validity, including complex parameter interaction checks. + */ + override def transformSchema(schema: StructType): StructType = { +transformLSHSchema(schema) + } + + /** + * Given a large dataset and an item, approximately find at most k items which have the closest + * distance to the item. +
[GitHub] spark issue #15155: [SPARK-17477][SQL] SparkSQL cannot handle schema evoluti...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/15155 Yea. I meant that if we want to read old/new Parquet files without a user-given schema and with schema merging enabled, then we'd face SPARK-15516 first. This is why I thought that JIRA blocks this case.
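For context, the scenario described above (old and new Parquet files, no user-given schema, schema merging enabled) is the read path sketched below; the directory names are hypothetical:

```scala
// Sketch: reading a mix of old and new Parquet files with schema
// merging enabled and no user-given schema -- the scenario discussed
// above. Directory names are hypothetical.
import org.apache.spark.sql.SparkSession

object MergeSchemaSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("merge-schema-sketch")
      .getOrCreate()
    val df = spark.read
      .option("mergeSchema", "true") // ask Parquet to reconcile the file schemas
      .parquet("data/old", "data/new")
    df.printSchema() // the merged schema covers columns from both sets of files
    spark.stop()
  }
}
```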
[GitHub] spark issue #14784: [SPARK-17210][SPARKR] sparkr.zip is not distributed to e...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14784 Merged build finished. Test PASSed.
[GitHub] spark issue #15146: [SPARK-17590][SQL] Analyze CTE definitions at once and a...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15146 **[Test build #65630 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65630/consoleFull)** for PR 15146 at commit [`baf239b`](https://github.com/apache/spark/commit/baf239b69a1e82ef37845857f295fe4df1780b46).
[GitHub] spark issue #14784: [SPARK-17210][SPARKR] sparkr.zip is not distributed to e...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14784 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65626/ Test PASSed.
[GitHub] spark issue #14784: [SPARK-17210][SPARKR] sparkr.zip is not distributed to e...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14784 **[Test build #65626 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65626/consoleFull)** for PR 14784 at commit [`c91d02a`](https://github.com/apache/spark/commit/c91d02a95d8239db5d2d4db7a796a987705a449d). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #14634: [SPARK-17051][SQL] we should use hadoopConf in InsertInt...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14634 **[Test build #65629 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65629/consoleFull)** for PR 14634 at commit [`90fbe4e`](https://github.com/apache/spark/commit/90fbe4e7bc8e80d7601eb020d428055a1a44797a).
[GitHub] spark issue #15146: [SPARK-17590][SQL] Analyze CTE definitions at once and a...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/15146 I guess that if using the same analyzed plan increases the chance to reuse an exchange, it may improve performance. Anyway, that is not the purpose of this change. Because the analyzed subquery plan will be changed substantially during later optimization, we cannot guarantee this.
[GitHub] spark issue #13513: [SPARK-15698][SQL][Streaming] Add the ability to remove ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13513 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65628/ Test FAILed.
[GitHub] spark issue #13513: [SPARK-15698][SQL][Streaming] Add the ability to remove ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13513 **[Test build #65628 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65628/consoleFull)** for PR 13513 at commit [`bddbc7f`](https://github.com/apache/spark/commit/bddbc7f8e1563000ea4a9dcad07c92e34c24199f). * This patch **fails to build**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #13513: [SPARK-15698][SQL][Streaming] Add the ability to remove ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13513 Merged build finished. Test FAILed.
[GitHub] spark issue #15155: [SPARK-17477][SQL] SparkSQL cannot handle schema evoluti...
Github user wgtmac commented on the issue: https://github.com/apache/spark/pull/15155 @HyukjinKwon Yup, this PR is very similar to yours. Merging the Parquet schemas won't work, though. Think about this: the table contains two Parquet files, one with int and the other with long. The DataFrame schema uses long (mergeSchema will also produce this result). So when reading the Parquet file with int, we still run into this problem.
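The scenario above — a merged table schema of `long` while an older file physically stores `int` — can only be resolved by widening each value at read time, since `Int -> Long` is lossless. A toy sketch of that per-value upcast (this is an illustration of the idea, not Spark's Parquet reader; all names here are mine):

```scala
// Physical cell values as stored per file; the merged table schema says Long.
sealed trait Physical
final case class PhysInt(v: Int) extends Physical   // written by old files
final case class PhysLong(v: Long) extends Physical // written by new files

// Widening read: an Int cell is upcast to Long instead of failing the scan.
def readAsLong(cell: Physical): Long = cell match {
  case PhysInt(v)  => v.toLong // lossless widening
  case PhysLong(v) => v
}

// Cells drawn from a mix of old (int) and new (long) files.
val mixedFileCells: Seq[Physical] = Seq(PhysInt(1), PhysLong(2L), PhysInt(3))
val column: Seq[Long] = mixedFileCells.map(readAsLong)
```

The point of the sketch: the widening has to happen where the physical file type meets the logical table type, which is exactly the layer the PR discussion is about.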
[GitHub] spark issue #15146: [SPARK-17590][SQL] Analyze CTE definitions at once and a...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/15146 @hvanhovell This is an analyzer change that adds support for CTEs within CTE definitions. I don't expect a performance improvement.
[GitHub] spark issue #13513: [SPARK-15698][SQL][Streaming] Add the ability to remove ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13513 **[Test build #65628 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65628/consoleFull)** for PR 13513 at commit [`bddbc7f`](https://github.com/apache/spark/commit/bddbc7f8e1563000ea4a9dcad07c92e34c24199f).
[GitHub] spark issue #14803: [SPARK-17153][SQL] Should read partition data when readi...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/14803 @marmbrus > * I think that for all but text you have to include the partition columns in the schema if inference is turned off (which it is by default). For the text format, when inference is turned off but there is a user-provided schema, we will use that schema. In this case, I think the user should also include the partition columns in the schema, right?
[GitHub] spark issue #14803: [SPARK-17153][SQL] Should read partition data when readi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14803 **[Test build #65627 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65627/consoleFull)** for PR 14803 at commit [`541dfdc`](https://github.com/apache/spark/commit/541dfdc637b5373c384249a601d0a3e8486adb07).
[GitHub] spark pull request #15156: [SPARK-17160] Properly escape field names in code...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/15156
[GitHub] spark issue #15156: [SPARK-17160] Properly escape field names in code-genera...
Github user JoshRosen commented on the issue: https://github.com/apache/spark/pull/15156 I'm going to merge this to master and branch-2.0. Thanks!
[GitHub] spark pull request #13513: [SPARK-15698][SQL][Streaming] Add the ability to ...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/13513#discussion_r79530152

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSourceLog.scala ---

@@ -0,0 +1,132 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.streaming
+
+import java.util.{LinkedHashMap => JLinkedHashMap}
+import java.util.Map.Entry
+
+import scala.collection.mutable
+
+import org.json4s.NoTypeHints
+import org.json4s.jackson.Serialization
+
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.execution.streaming.FileStreamSource.FileEntry
+import org.apache.spark.sql.internal.SQLConf
+
+class FileStreamSourceLog(
+    metadataLogVersion: String,
+    sparkSession: SparkSession,
+    path: String)
+  extends CompactibleFileStreamLog[FileEntry](metadataLogVersion, sparkSession, path) {
+
+  import CompactibleFileStreamLog._
+
+  // Configurations about metadata compaction
+  protected override val compactInterval =
+    sparkSession.conf.get(SQLConf.FILE_SOURCE_LOG_COMPACT_INTERVAL)
+  require(compactInterval > 0,
+    s"Please set ${SQLConf.FILE_SOURCE_LOG_COMPACT_INTERVAL.key} (was $compactInterval) to a " +
+      s"positive value.")
+
+  protected override val fileCleanupDelayMs =
+    sparkSession.conf.get(SQLConf.FILE_SOURCE_LOG_CLEANUP_DELAY)
+
+  protected override val isDeletingExpiredLog =
+    sparkSession.conf.get(SQLConf.FILE_SOURCE_LOG_DELETION)
+
+  private implicit val formats = Serialization.formats(NoTypeHints)
+
+  // A fixed size log entry cache to cache the file entries belong to the compaction batch. It is
+  // used to avoid scanning the compacted log file to retrieve its own batch data.
+  private val cacheSize = compactInterval
+  private val fileEntryCache = new JLinkedHashMap[Long, Array[FileEntry]] {
+    override def removeEldestEntry(eldest: Entry[Long, Array[FileEntry]]): Boolean = {
+      size() > cacheSize
+    }
+  }
+
+  protected override def serializeData(data: FileEntry): String = {
+    Serialization.write(data)
+  }
+
+  protected override def deserializeData(encodedString: String): FileEntry = {
+    Serialization.read[FileEntry](encodedString)
+  }
+
+  def compactLogs(logs: Seq[FileEntry]): Seq[FileEntry] = {
+    logs
+  }
+
+  override def add(batchId: Long, logs: Array[FileEntry]): Boolean = {
+    if (super.add(batchId, logs) && isCompactionBatch(batchId, compactInterval)) {

--- End diff --

yes, you're right, I will fix it.
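The fixed-size `fileEntryCache` above relies on `java.util.LinkedHashMap`'s `removeEldestEntry` hook: once `size()` exceeds the cap, the map drops the oldest insertion automatically. A minimal standalone sketch of the same pattern (generic names are mine, not from the PR):

```scala
import java.util.{LinkedHashMap => JLinkedHashMap}
import java.util.Map.Entry

// Fixed-size insertion-order cache: LinkedHashMap calls removeEldestEntry
// after every put, and evicts the oldest entry whenever we return true.
class FixedSizeCache[K, V](cacheSize: Int) extends JLinkedHashMap[K, V] {
  override def removeEldestEntry(eldest: Entry[K, V]): Boolean =
    size() > cacheSize
}

val cache = new FixedSizeCache[Long, String](2)
cache.put(1L, "batch-1")
cache.put(2L, "batch-2")
cache.put(3L, "batch-3") // evicts key 1L, the eldest insertion
```

Because eviction happens inside `put`, the cache can never grow past its bound, which is why the PR can size it by `compactInterval` without any explicit cleanup code.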
[GitHub] spark issue #14639: [SPARK-17054][SPARKR] SparkR can not run in yarn-cluster...
Github user zjffdu commented on the issue: https://github.com/apache/spark/pull/14639 Closing this as it has been resolved elsewhere.
[GitHub] spark pull request #14639: [SPARK-17054][SPARKR] SparkR can not run in yarn-...
Github user zjffdu closed the pull request at: https://github.com/apache/spark/pull/14639
[GitHub] spark issue #14784: [SPARK-17210][SPARKR] sparkr.zip is not distributed to e...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14784 **[Test build #65626 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65626/consoleFull)** for PR 14784 at commit [`c91d02a`](https://github.com/apache/spark/commit/c91d02a95d8239db5d2d4db7a796a987705a449d).
[GitHub] spark issue #14803: [SPARK-17153][SQL] Should read partition data when readi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14803 **[Test build #65625 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65625/consoleFull)** for PR 14803 at commit [`5b101ab`](https://github.com/apache/spark/commit/5b101aba62efd34077495eb55159ec1b93d2c90e).
[GitHub] spark issue #15157: Revert "[SPARK-17549][SQL] Only collect table size stat ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15157 **[Test build #65624 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65624/consoleFull)** for PR 15157 at commit [`5b73205`](https://github.com/apache/spark/commit/5b732058ac911b6cb52a8639281681c3ee9d9dae).
[GitHub] spark issue #14834: [SPARK-17163][ML] Unified LogisticRegression interface
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14834 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65622/ Test PASSed.
[GitHub] spark issue #14834: [SPARK-17163][ML] Unified LogisticRegression interface
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14834 Merged build finished. Test PASSed.
[GitHub] spark issue #14834: [SPARK-17163][ML] Unified LogisticRegression interface
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14834 **[Test build #65622 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65622/consoleFull)** for PR 14834 at commit [`4dae595`](https://github.com/apache/spark/commit/4dae59569732ace5cb2cf583d6db315fb3eda596). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #15157: Revert "[SPARK-17549][SQL] Only collect table size stat ...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/15157 cc @vanzin
[GitHub] spark pull request #15157: Revert "[SPARK-17549][SQL] Only collect table siz...
GitHub user yhuai opened a pull request: https://github.com/apache/spark/pull/15157 Revert "[SPARK-17549][SQL] Only collect table size stat in driver for cached relation." This reverts commit 39e2bad6a866d27c3ca594d15e574a1da3ee84cc because of the problem mentioned at https://issues.apache.org/jira/browse/SPARK-17549?focusedCommentId=15505060&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15505060

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/yhuai/spark revert-SPARK-17549

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/15157.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #15157

commit 5b732058ac911b6cb52a8639281681c3ee9d9dae
Author: Yin Huai
Date: 2016-09-20T02:58:30Z

    Revert "[SPARK-17549][SQL] Only collect table size stat in driver for cached relation."

    This reverts commit 39e2bad6a866d27c3ca594d15e574a1da3ee84cc.
[GitHub] spark issue #14803: [SPARK-17153][SQL] Should read partition data when readi...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/14803 > * If the partition directories are not present when the stream starts then I believe this breaks. Yes. Schema inference only happens when starting the stream. > * I think that for all but text you have to include the partition columns in the schema if inference is turned off (which it is by default). I will add this to the programming guide.
[GitHub] spark issue #14634: [SPARK-17051][SQL] we should use hadoopConf in InsertInt...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/14634 This change looks good. Let's add a regression test.
[GitHub] spark pull request #15054: [SPARK-17502] [SQL] Fix Multiple Bugs in DDL Stat...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/15054#discussion_r79528505

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala ---

@@ -357,6 +346,21 @@ class SessionCatalog(
     tempTables.remove(formatTableName(name))
   }

+  /**
+   * Retrieve the metadata of an existing temporary view.
+   * If the temporary view does not exist, return None.
+   */
+  def getTempViewMetadataOption(name: String): Option[CatalogTable] = synchronized {

--- End diff --

Seems we always use it with the pattern `getTempViewMetadataOption.getOrElse(getTableMetadata)`, maybe we should just rename it to `getTempViewOrPermanentTableMetadata`?
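The lookup pattern under discussion — try the temporary view first, fall back to the permanent table — is just `Option.getOrElse`. A simplified sketch with a hypothetical in-memory catalog (the class and its `String` metadata are mine, not Spark's `SessionCatalog`):

```scala
// Hypothetical, heavily simplified catalog: metadata is just a String here.
class TinyCatalog {
  private val tempViews = scala.collection.mutable.Map[String, String]()
  private val tables = scala.collection.mutable.Map[String, String]()

  def createTempView(name: String, meta: String): Unit = tempViews(name) = meta
  def createTable(name: String, meta: String): Unit = tables(name) = meta

  def getTempViewMetadataOption(name: String): Option[String] = tempViews.get(name)
  def getTableMetadata(name: String): String = tables(name)

  // The renamed method the review proposes: temp view wins, else permanent table.
  def getTempViewOrPermanentTableMetadata(name: String): String =
    getTempViewMetadataOption(name).getOrElse(getTableMetadata(name))
}

val cat = new TinyCatalog
cat.createTable("t", "permanent")
cat.createTempView("v", "temp")
```

Folding the two-step pattern into one method, as the reviewer suggests, keeps every caller from re-implementing the fallback order.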
[GitHub] spark pull request #14803: [SPARK-17153][SQL] Should read partition data whe...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/14803#discussion_r79526950

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/streaming/FileStreamSourceSuite.scala ---

@@ -608,6 +608,34 @@ class FileStreamSourceSuite extends FileStreamSourceTest {
   // === other tests

+  test("read new files in partitioned table without globbing, should read partition data") {

--- End diff --

Added a test for it.
[GitHub] spark issue #14803: [SPARK-17153][SQL] Should read partition data when readi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14803 **[Test build #65623 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65623/consoleFull)** for PR 14803 at commit [`23ba9a2`](https://github.com/apache/spark/commit/23ba9a23ab835987ed326a9320cf8632a0783885).
[GitHub] spark issue #14634: [SPARK-17051][SQL] we should use hadoopConf in InsertInt...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14634 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65620/ Test PASSed.
[GitHub] spark issue #14634: [SPARK-17051][SQL] we should use hadoopConf in InsertInt...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14634 Merged build finished. Test PASSed.
[GitHub] spark issue #14634: [SPARK-17051][SQL] we should use hadoopConf in InsertInt...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14634 **[Test build #65620 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65620/consoleFull)** for PR 14634 at commit [`64268f3`](https://github.com/apache/spark/commit/64268f34191d9f5447a63f34e53c9663aac714e2). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #15090: [SPARK-17073] [SQL] generate column-level statist...
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15090#discussion_r79521372

```diff
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsColumnSuite.scala ---
@@ -0,0 +1,228 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive
+
+import java.sql.{Date, Timestamp}
+
+import org.apache.spark.sql.{AnalysisException, Row}
+import org.apache.spark.sql.catalyst.plans.logical.BasicColStats
+import org.apache.spark.sql.execution.command.AnalyzeColumnCommand
+import org.apache.spark.sql.types._
+
+class StatisticsColumnSuite extends StatisticsTest {
+
+  test("parse analyze column commands") {
+    val table = "table"
+    assertAnalyzeCommand(
+      s"ANALYZE TABLE $table COMPUTE STATISTICS FOR COLUMNS key, value",
+      classOf[AnalyzeColumnCommand])
+
+    val noColumnError = intercept[AnalysisException] {
+      sql(s"ANALYZE TABLE $table COMPUTE STATISTICS FOR COLUMNS")
+    }
+    assert(noColumnError.message == "Need to specify the columns to analyze. Usage: " +
+      "ANALYZE TABLE tbl COMPUTE STATISTICS FOR COLUMNS key, value")
+
+    withTable(table) {
+      sql(s"CREATE TABLE $table (key INT, value STRING)")
+      val invalidColError = intercept[AnalysisException] {
+        sql(s"ANALYZE TABLE $table COMPUTE STATISTICS FOR COLUMNS k")
+      }
+      assert(invalidColError.message == s"Invalid column name: k")
+
+      val duplicateColError = intercept[AnalysisException] {
+        sql(s"ANALYZE TABLE $table COMPUTE STATISTICS FOR COLUMNS key, value, key")
+      }
+      assert(duplicateColError.message == s"Duplicate column name: key")
+
+      withSQLConf("spark.sql.caseSensitive" -> "true") {
+        val invalidErr = intercept[AnalysisException] {
+          sql(s"ANALYZE TABLE $table COMPUTE STATISTICS FOR COLUMNS keY")
+        }
+        assert(invalidErr.message == s"Invalid column name: keY")
+      }
+
+      withSQLConf("spark.sql.caseSensitive" -> "false") {
+        val duplicateErr = intercept[AnalysisException] {
+          sql(s"ANALYZE TABLE $table COMPUTE STATISTICS FOR COLUMNS key, value, vaLue")
+        }
+        assert(duplicateErr.message == s"Duplicate column name: vaLue")
+      }
+    }
+  }
+
+  test("basic statistics for integral type columns") {
+    val rdd = sparkContext.parallelize(Seq("1", null, "2", "3", null)).map { i =>
+      if (i != null) Row(i.toByte, i.toShort, i.toInt, i.toLong) else Row(i, i, i, i)
--- End diff --
```

Cool, please add some salt to this when you fix (as I don't think mine is perfect anyway :)).
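For readers skimming the quoted test, the invalid/duplicate column checks it exercises can be sketched in plain Scala as below. This is an illustrative reimplementation, not Spark's actual `AnalyzeColumnCommand` logic; the `ColumnValidation` object, the `validateColumns` name, and its signature are all hypothetical.

```scala
// Hypothetical sketch of the column validation the quoted tests assert:
// unknown columns and (case-insensitively, when configured) duplicated
// columns are rejected with the messages the tests expect.
object ColumnValidation {
  def validateColumns(
      requested: Seq[String],
      tableCols: Seq[String],
      caseSensitive: Boolean): Either[String, Seq[String]] = {
    // Normalize names unless the resolver is case sensitive.
    def norm(s: String): String = if (caseSensitive) s else s.toLowerCase
    val known = tableCols.map(norm).toSet

    // First unknown column, if any.
    val invalid = requested.find(c => !known.contains(norm(c)))
    // Report the later occurrence of a duplicated name, matching the
    // "Duplicate column name: vaLue" style message in the tests above.
    val duplicate = requested.groupBy(norm).collectFirst {
      case (_, cs) if cs.size > 1 => cs.last
    }

    (invalid, duplicate) match {
      case (Some(c), _) => Left(s"Invalid column name: $c")
      case (_, Some(c)) => Left(s"Duplicate column name: $c")
      case _            => Right(requested)
    }
  }
}
```

Under these assumptions, `validateColumns(Seq("key", "value", "vaLue"), Seq("key", "value"), caseSensitive = false)` rejects `vaLue` as a duplicate, while with `caseSensitive = true` it would instead be rejected as an invalid column name.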
[GitHub] spark issue #15034: [SPARK-16240][ML] ML persistence backward compatibility ...
Github user JoshRosen commented on the issue:

    https://github.com/apache/spark/pull/15034

@jkbradley, it looks like this is legitimately failing MiMa (not sure why it passed on the first run...):

```
[error]  * the type hierarchy of object org.apache.spark.ml.clustering.LDA is different in current version. Missing types {org.apache.spark.ml.util.DefaultParamsReadable}
[error]    filter with: ProblemFilters.exclude[MissingTypesProblem]("org.apache.spark.ml.clustering.LDA$")
```
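For context, MiMa's suggested filter is normally added to Spark's binary-compatibility exclusion list in `project/MimaExcludes.scala`. A minimal sketch of such an entry follows; the surrounding list structure is abbreviated and assumed, while the filter string itself is taken verbatim from the MiMa output quoted above.

```scala
// Sketch of a MiMa exclusion entry (config fragment, not standalone code).
// In Spark these filters live in a Seq inside project/MimaExcludes.scala.
import com.typesafe.tools.mima.core._

ProblemFilters.exclude[MissingTypesProblem]("org.apache.spark.ml.clustering.LDA$")
```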
[GitHub] spark pull request #15090: [SPARK-17073] [SQL] generate column-level statist...
Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15090#discussion_r79520564

```diff
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsColumnSuite.scala ---
@@ -0,0 +1,228 @@
[quoted diff omitted here -- it is identical to the diff quoted in HyukjinKwon's comment earlier in this digest]
--- End diff --
```

@HyukjinKwon Seems better. Let me change the code based on this. Thanks.