[GitHub] spark issue #20126: [DO-NOT-MERGE] Investigate if changes in flume.py actual...

2017-12-31 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20126
  
Hm, I see. Will open a followup PR.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20126: [DO-NOT-MERGE] Investigate if changes in flume.py actual...

2017-12-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20126
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85560/
Test FAILed.


---




[GitHub] spark issue #20126: [DO-NOT-MERGE] Investigate if changes in flume.py actual...

2017-12-31 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20126
  
**[Test build #85560 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85560/testReport)**
 for PR 20126 at commit 
[`2a9dff4`](https://github.com/apache/spark/commit/2a9dff449c9dc4a18e9a5d7f042450760bb9af2d).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20126: [DO-NOT-MERGE] Investigate if changes in flume.py actual...

2017-12-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20126
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #20126: [DO-NOT-MERGE] Investigate if changes in flume.py actual...

2017-12-31 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20126
  
**[Test build #85560 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85560/testReport)**
 for PR 20126 at commit 
[`2a9dff4`](https://github.com/apache/spark/commit/2a9dff449c9dc4a18e9a5d7f042450760bb9af2d).


---




[GitHub] spark pull request #20069: [SPARK-22895] [SQL] Push down the deterministic p...

2017-12-31 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/20069#discussion_r159141412
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -851,7 +851,7 @@ object PushDownPredicate extends Rule[LogicalPlan] with PredicateHelper {
 
 case filter @ Filter(condition, union: Union) =>
   // Union could change the rows, so non-deterministic predicate can't be pushed down
-  val (pushDown, stayUp) = splitConjunctivePredicates(condition).span(_.deterministic)
+  val (pushDown, stayUp) = splitConjunctivePredicates(condition).partition(_.deterministic)
--- End diff --

What does "after the first non-deterministic" mean? Doesn't this simply 
partition the predicates into deterministic and non-deterministic ones? Does it 
take the "first" non-deterministic predicate into account?

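For reference, here is a minimal sketch of how `span` and `partition` differ on a Scala collection. The `Pred` case class and the sample predicates are made up for illustration and are not Catalyst types; only the `deterministic` flag mirrors the diff.

```scala
object SpanVsPartition {
  // Hypothetical stand-in for a predicate expression; only the flag matters here.
  final case class Pred(name: String, deterministic: Boolean)

  val conjuncts: Seq[Pred] = Seq(
    Pred("a > 1", deterministic = true),
    Pred("rand() < 0.5", deterministic = false),
    Pred("b = 2", deterministic = true)
  )

  // span: the longest deterministic prefix, then everything after it, in order.
  val (spanDown, spanUp) = conjuncts.span(_.deterministic)

  // partition: every deterministic predicate, regardless of its position.
  val (partDown, partUp) = conjuncts.partition(_.deterministic)

  def main(args: Array[String]): Unit = {
    println(spanDown.map(_.name)) // List(a > 1)
    println(spanUp.map(_.name))   // List(rand() < 0.5, b = 2)
    println(partDown.map(_.name)) // List(a > 1, b = 2)
    println(partUp.map(_.name))   // List(rand() < 0.5)
  }
}
```

With `span`, a deterministic predicate that follows a non-deterministic one ends up in the second group; with `partition`, it lands in the first group regardless of position.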

---




[GitHub] spark issue #20126: [DO-NOT-MERGE] Investigate if changes in flume.py actual...

2017-12-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20126
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #20126: [DO-NOT-MERGE] Investigate if changes in flume.py actual...

2017-12-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20126
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85559/
Test FAILed.


---




[GitHub] spark issue #20126: [DO-NOT-MERGE] Investigate if changes in flume.py actual...

2017-12-31 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20126
  
**[Test build #85559 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85559/testReport)**
 for PR 20126 at commit 
[`23cc79b`](https://github.com/apache/spark/commit/23cc79b0cecb33555e8f6374cc03eddccce86445).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20126: [DO-NOT-MERGE] Investigate if changes in flume.py actual...

2017-12-31 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20126
  
**[Test build #85559 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85559/testReport)**
 for PR 20126 at commit 
[`23cc79b`](https://github.com/apache/spark/commit/23cc79b0cecb33555e8f6374cc03eddccce86445).


---




[GitHub] spark issue #20125: [SPARK-17967][SQL] Support for array as an option in SQL...

2017-12-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20125
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20125: [SPARK-17967][SQL] Support for array as an option in SQL...

2017-12-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20125
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/8/
Test PASSed.


---




[GitHub] spark issue #20125: [SPARK-17967][SQL] Support for array as an option in SQL...

2017-12-31 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20125
  
**[Test build #8 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/8/testReport)**
 for PR 20125 at commit 
[`5cae64b`](https://github.com/apache/spark/commit/5cae64b0da57a3f45b54bcc39c18463d3945a934).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20125: [SPARK-17967][SQL] Support for array as an option in SQL...

2017-12-31 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20125
  
I actually think the points in 
https://github.com/apache/spark/pull/20125#issuecomment-354604768 are good, 
and I was hesitant about this myself. IMHO it's fine, but let me 
cc @hvanhovell and @rxin too, who reviewed my related PRs before.


---




[GitHub] spark issue #20125: [SPARK-17967][SQL] Support for array as an option in SQL...

2017-12-31 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20125
  
> Btw, is this any difference than using string? Like:

Nope, they will be the same, but I was thinking this is the simplest fix.


---




[GitHub] spark issue #20125: [SPARK-17967][SQL] Support for array as an option in SQL...

2017-12-31 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20125
  
Yup, I was thinking of this as a SparkSQL-only feature.

For more details, the original intention was to support multiple values for 
`nullValue`, but I realised such option support can be generalised. There have 
been several issues about this since CSV was a third-party library (I will find 
and give some links if requested). Also, there is one reference in R too:

```R
> d <- "col1,col2
+ 1,3
+ 2,4"
> df <- read.csv(text=d, na.strings=c("3", "2"))
> df
```
```
  col1 col2
1    1   NA
2   NA    4
```

For more context, the original proposal (Scala/SQL/Python/Java) at 
https://github.com/apache/spark/pull/16611 touched many files, and I was 
advised to make it smaller, which I liked.
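
As a rough illustration of the multiple-`nullValue` idea, here is a sketch that mirrors the R example above. `parseCell` and `parseRows` are hypothetical helpers, not Spark's CSV reader, and the parsing is deliberately naive (no quoting or escaping).

```scala
object MultiNullValues {
  // Hypothetical helper: a cell equal to any of `nullValues` is treated as missing.
  def parseCell(raw: String, nullValues: Set[String]): Option[String] =
    if (nullValues.contains(raw)) None else Some(raw)

  // Naive CSV handling for illustration only: skip the header, split on commas.
  def parseRows(csv: String, nullValues: Set[String]): Seq[Seq[Option[String]]] =
    csv.split("\n").toSeq.drop(1).map { line =>
      line.split(",").toSeq.map(cell => parseCell(cell, nullValues))
    }

  def main(args: Array[String]): Unit = {
    val data = "col1,col2\n1,3\n2,4"
    // Same null markers as na.strings=c("3", "2") in the R example.
    parseRows(data, Set("3", "2")).foreach(println)
  }
}
```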




---




[GitHub] spark issue #20125: [SPARK-17967][SQL] Support for array as an option in SQL...

2017-12-31 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/20125
  
Is this a special feature for SparkSQL only? It seems Hive doesn't have such 
support.


---




[GitHub] spark issue #20126: [DO-NOT-MERGE] Investigate if changes in flume.py actual...

2017-12-31 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20126
  
**[Test build #85558 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85558/testReport)**
 for PR 20126 at commit 
[`85639dd`](https://github.com/apache/spark/commit/85639dd220e8fcb0489febc0414b51d22c0e41a9).


---




[GitHub] spark issue #20126: [DO-NOT-MERGE] Investigate if changes in flume.py actual...

2017-12-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20126
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #20126: [DO-NOT-MERGE] Investigate if changes in flume.py actual...

2017-12-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20126
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85556/
Test FAILed.


---




[GitHub] spark issue #20126: [DO-NOT-MERGE] Investigate if changes in flume.py actual...

2017-12-31 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20126
  
**[Test build #85556 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85556/testReport)**
 for PR 20126 at commit 
[`ada1c4c`](https://github.com/apache/spark/commit/ada1c4c7e1c175ee821c0ac191fc1decc3701f68).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20127: [SPARK-22932] [SQL] Refactor AnalysisContext

2017-12-31 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20127
  
**[Test build #85557 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85557/testReport)**
 for PR 20127 at commit 
[`f158a95`](https://github.com/apache/spark/commit/f158a951b779e56e06d2c73234bac5c79055b2f5).


---




[GitHub] spark issue #20127: [SPARK-22932] [SQL] Refactor AnalysisContext

2017-12-31 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20127
  
cc @cloud-fan @jiangxb1987 


---




[GitHub] spark pull request #20127: [SPARK-22932] [SQL] Refactor AnalysisContext

2017-12-31 Thread gatorsmile
GitHub user gatorsmile opened a pull request:

https://github.com/apache/spark/pull/20127

[SPARK-22932] [SQL] Refactor AnalysisContext

## What changes were proposed in this pull request?
Add a `reset` function to ensure that the state in `AnalysisContext` is per-query.
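
A minimal sketch of per-query state with an explicit reset, assuming a thread-local holder. `PerQueryState`, `Context`, and the `depth` field are hypothetical names used only for illustration; this is not the code in the PR.

```scala
object PerQueryState {
  // Hypothetical, simplified stand-in for the state kept during analysis.
  final case class Context(depth: Int = 0)

  private val current = new ThreadLocal[Context] {
    override def initialValue(): Context = Context()
  }

  def get: Context = current.get()

  def set(context: Context): Unit = current.set(context)

  // Drop whatever the previous query left behind; the next `get`
  // re-creates the default value through `initialValue`.
  def reset(): Unit = current.remove()
}
```

Calling `reset()` at the start of each query would keep state from one query from leaking into the next one on the same thread.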

## How was this patch tested?
The existing test cases

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gatorsmile/spark refactorAnalysisContext

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20127.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20127


commit f158a951b779e56e06d2c73234bac5c79055b2f5
Author: gatorsmile 
Date:   2017-12-31T13:21:13Z

refactor




---




[GitHub] spark issue #20126: [DO-NOT-MERGE] Investigate if changes in flume.py actual...

2017-12-31 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20126
  
**[Test build #85556 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85556/testReport)**
 for PR 20126 at commit 
[`ada1c4c`](https://github.com/apache/spark/commit/ada1c4c7e1c175ee821c0ac191fc1decc3701f68).


---




[GitHub] spark pull request #20126: [DO-NOT-MERGE] Investigate if changes in flume.py...

2017-12-31 Thread HyukjinKwon
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/20126

[DO-NOT-MERGE] Investigate if changes in flume.py actually triggeres 
related tests

## What changes were proposed in this pull request?

Do not merge this.

It seems that changes in `flume.py` do not actually trigger the related tests. 
This is easy to check in the Jenkins environment.

## How was this patch tested?

N/A

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HyukjinKwon/spark investigate-streaming

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20126.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20126


commit ada1c4c7e1c175ee821c0ac191fc1decc3701f68
Author: hyukjinkwon 
Date:   2017-12-31T13:23:53Z

Investigate if changes in flume.py actually triggeres related tests




---




[GitHub] spark pull request #19991: [SPARK-22801][ML][PYSPARK] Allow FeatureHasher to...

2017-12-31 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19991


---




[GitHub] spark issue #19991: [SPARK-22801][ML][PYSPARK] Allow FeatureHasher to treat ...

2017-12-31 Thread MLnick
Github user MLnick commented on the issue:

https://github.com/apache/spark/pull/19991
  
Merged to master.


---




[GitHub] spark pull request #19715: [SPARK-22397][ML]add multiple columns support to ...

2017-12-31 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19715


---




[GitHub] spark issue #19715: [SPARK-22397][ML]add multiple columns support to Quantil...

2017-12-31 Thread MLnick
Github user MLnick commented on the issue:

https://github.com/apache/spark/pull/19715
  
Merged to master. If there are any further small comments or clean-ups, we 
can handle them during QA for 2.3.

Thanks @huaxingao and all others for the review!


---




[GitHub] spark issue #20114: [SPARK-22530][PYTHON][SQL] Adding Arrow support for Arra...

2017-12-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20114
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85554/
Test PASSed.


---




[GitHub] spark issue #20114: [SPARK-22530][PYTHON][SQL] Adding Arrow support for Arra...

2017-12-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20114
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20114: [SPARK-22530][PYTHON][SQL] Adding Arrow support for Arra...

2017-12-31 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20114
  
**[Test build #85554 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85554/testReport)**
 for PR 20114 at commit 
[`281ffdc`](https://github.com/apache/spark/commit/281ffdc9132829617af28dcb1668e2fa5eddc599).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20125: [SPARK-17967][SQL] Support for array as an option in SQL...

2017-12-31 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20125
  
**[Test build #8 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/8/testReport)**
 for PR 20125 at commit 
[`5cae64b`](https://github.com/apache/spark/commit/5cae64b0da57a3f45b54bcc39c18463d3945a934).


---




[GitHub] spark issue #20125: [SPARK-17967][SQL] Support for array as an option in SQL...

2017-12-31 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20125
  
cc @gatorsmile could you take a look please?


---




[GitHub] spark pull request #20125: [SPARK-17967][SQL] Support for array as an option...

2017-12-31 Thread HyukjinKwon
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/20125

[SPARK-17967][SQL] Support for array as an option in SQL parser

## What changes were proposed in this pull request?

This PR aims to add the ability to handle an array (JSON array) 
in the `tablePropertyValue` rule.

**SQL**

```sql
CREATE TEMPORARY TABLE tableA USING csv
OPTIONS (nullValue [2012, 1.1, 'null'], ...)
```

## How was this patch tested?

Manually tested and test cases added in `DDLParserSuite.scala`.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HyukjinKwon/spark SPARK-17967-sql

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20125.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20125


commit 5cae64b0da57a3f45b54bcc39c18463d3945a934
Author: hyukjinkwon 
Date:   2017-12-31T10:27:00Z

Support for array as an option in SQL




---




[GitHub] spark pull request #20076: [SPARK-21786][SQL] When acquiring 'compressionCod...

2017-12-31 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/20076#discussion_r159136922
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/CompressionCodecPrecedenceSuite.scala ---
@@ -0,0 +1,60 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
--- End diff --

Should we move this to `org.apache.spark.sql.execution.datasources.parquet`?


---




[GitHub] spark issue #20124: [WIP][SPARK-22126][ML] Fix model-specific optimization s...

2017-12-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20124
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #20124: [WIP][SPARK-22126][ML] Fix model-specific optimization s...

2017-12-31 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20124
  
**[Test build #85553 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85553/testReport)**
 for PR 20124 at commit 
[`53521ca`](https://github.com/apache/spark/commit/53521cac9d39bf9682d67d94d46adde357db1b43).
 * This patch **fails to generate documentation**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20124: [WIP][SPARK-22126][ML] Fix model-specific optimization s...

2017-12-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20124
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85553/
Test FAILed.


---




[GitHub] spark issue #20114: [SPARK-22530][PYTHON][SQL] Adding Arrow support for Arra...

2017-12-31 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20114
  
**[Test build #85554 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85554/testReport)**
 for PR 20114 at commit 
[`281ffdc`](https://github.com/apache/spark/commit/281ffdc9132829617af28dcb1668e2fa5eddc599).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20114: [SPARK-22530][PYTHON][SQL] Adding Arrow support for Arra...

2017-12-31 Thread BryanCutler
Github user BryanCutler commented on the issue:

https://github.com/apache/spark/pull/20114
  
retest this please


---




[GitHub] spark issue #20124: [WIP][SPARK-22126][ML] Fix model-specific optimization s...

2017-12-31 Thread BryanCutler
Github user BryanCutler commented on the issue:

https://github.com/apache/spark/pull/20124
  
This basically works by splitting the array of ParamMaps into two: one 
with the params that can be optimized by the estimator, and one with the params 
that can be parallelized over. These are then grouped together so that the 
estimator can fit a sequence of Models. This allows us to reuse the previous 
API for fitting multiple Models and still keep the parallelization logic pretty 
straightforward. Model-specific optimization support also works just as it did 
before any parallelism was introduced. I can explain in further detail or 
make a design document if needed.
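
To make the grouping idea concrete, here is a rough sketch using plain Scala maps. `ParamGrouping`, `groupByOptimized`, and the choice of `maxIter` as the estimator-optimized param are assumptions for illustration; the API actually added in this PR may look different.

```scala
object ParamGrouping {
  // Simplified stand-in for Spark ML's ParamMap: param name -> value.
  type ParamMap = Map[String, Any]

  // Group param maps that agree on every param the estimator cannot optimize itself.
  // Each group could be handed to the estimator as one sequence of models to fit,
  // while different groups are fit in parallel.
  def groupByOptimized(grid: Seq[ParamMap], optimized: Set[String]): Seq[Seq[ParamMap]] =
    grid
      .groupBy(pm => pm.filter { case (name, _) => !optimized.contains(name) })
      .values
      .toSeq

  // Hypothetical 2 x 2 grid; treating `maxIter` as estimator-optimized is an assumption.
  val grid: Seq[ParamMap] = for {
    reg  <- Seq(0.1, 0.01)
    iter <- Seq(5, 10)
  } yield Map[String, Any]("regParam" -> reg, "maxIter" -> iter)

  def main(args: Array[String]): Unit = {
    val groups = groupByOptimized(grid, optimized = Set("maxIter"))
    groups.foreach(group => println(group.mkString("; ")))
  }
}
```

In this sketch the four param maps fall into two groups, one per `regParam` value, and each group carries both `maxIter` settings.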

cc @MLnick @WeichenXu123 @jkbradley 


---




[GitHub] spark issue #20124: [WIP][SPARK-22126][ML] Fix model-specific optimization s...

2017-12-31 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20124
  
**[Test build #85553 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85553/testReport)**
 for PR 20124 at commit 
[`53521ca`](https://github.com/apache/spark/commit/53521cac9d39bf9682d67d94d46adde357db1b43).


---




[GitHub] spark pull request #20124: [WIP][SPARK-22126][ML] Fix model-specific optimiz...

2017-12-31 Thread BryanCutler
GitHub user BryanCutler opened a pull request:

https://github.com/apache/spark/pull/20124

[WIP][SPARK-22126][ML] Fix model-specific optimization support for ML 
tuning.

## What changes were proposed in this pull request?

Support model-specific optimizations for CrossValidator and 
TrainValidationSplit by grouping `ParamMap`s so that param groups can fit 
models in parallel, but still allow `Estimator`s to optimally fit a sequence of 
models themselves.  This PR adds a new API to `Estimator` that can be 
overridden to indicate optimized params, and additional functions in 
`ParamGridBuilder` to group `ParamMap` arrays that can then be used by the 
meta-algorithms.

## How was this patch tested?

WIP, need to add tests

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/BryanCutler/spark wip-model-specific-tuning-SPARK-22126

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20124.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20124


commit c4ff7ab016f440a6f1684f79fdfe677507fca279
Author: Bryan Cutler 
Date:   2017-12-01T00:24:55Z

added model specific optimization to parallel TVS

commit 47a40399250af2f777e53475b5dee812bf244788
Author: Bryan Cutler 
Date:   2017-12-01T17:52:10Z

remove unused import

commit 4d113386a2ae20bae0c4e54860386103c82ae627
Author: Bryan Cutler 
Date:   2017-12-14T19:01:06Z

moved splitting of param maps to ParamGridBuilder

commit 6599cbac79375686b78792ff7c50c85749e4a6cf
Author: Bryan Cutler 
Date:   2017-12-15T00:58:41Z

got param map split working

commit 47781a15cd4d2307a6268d86cd693394e227d842
Author: Bryan Cutler 
Date:   2017-12-15T07:03:54Z

added pipeline getOptimizedParams

commit 0a887bc656e9485d247add1f6de34c299da4c19d
Author: Bryan Cutler 
Date:   2017-12-15T07:47:56Z

moved param grouping to ParamGridBuilder.groupByParam

commit f7256e649fb6aa1e63baca5159e919fbde30dd24
Author: Bryan Cutler 
Date:   2017-12-18T05:56:57Z

remove unused import

commit 7a53f57403ef17753e13cb099ac4866edabc5778
Author: Bryan Cutler 
Date:   2017-12-31T07:18:46Z

fix CrossValidator to use grouped params

commit 994accd402d87639ed70d3cd594f883633a0d849
Author: Bryan Cutler 
Date:   2017-12-31T07:44:34Z

fixed style checks and added docs

commit 53521cac9d39bf9682d67d94d46adde357db1b43
Author: Bryan Cutler 
Date:   2017-12-31T07:46:27Z

added doc




---




[GitHub] spark issue #20114: [SPARK-22530][PYTHON][SQL] Adding Arrow support for Arra...

2017-12-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20114
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85552/
Test FAILed.


---




[GitHub] spark issue #20114: [SPARK-22530][PYTHON][SQL] Adding Arrow support for Arra...

2017-12-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20114
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #20114: [SPARK-22530][PYTHON][SQL] Adding Arrow support for Arra...

2017-12-31 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20114
  
**[Test build #85552 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85552/testReport)**
 for PR 20114 at commit 
[`281ffdc`](https://github.com/apache/spark/commit/281ffdc9132829617af28dcb1668e2fa5eddc599).
 * This patch **fails due to an unknown error code, -9**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #20072: [SPARK-22790][SQL] add a configurable factor to d...

2017-12-31 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/20072#discussion_r159135028
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -261,6 +261,17 @@ object SQLConf {
 .booleanConf
 .createWithDefault(false)
 
+  val HADOOPFSRELATION_SIZE_FACTOR = buildConf(
+"org.apache.spark.sql.execution.datasources.sizeFactor")
--- End diff --

Is this config for all data sources or only hadoopFS-related data sources?


---




[GitHub] spark pull request #20072: [SPARK-22790][SQL] add a configurable factor to d...

2017-12-31 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/20072#discussion_r159134987
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -261,6 +261,17 @@ object SQLConf {
 .booleanConf
 .createWithDefault(false)
 
+  val HADOOPFSRELATION_SIZE_FACTOR = buildConf(
--- End diff --

How about `DISK_TO_MEMORY_SIZE_FACTOR`? IMHO the current name doesn't 
describe the purpose clearly.


---




[GitHub] spark pull request #20072: [SPARK-22790][SQL] add a configurable factor to d...

2017-12-31 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/20072#discussion_r159135036
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFsRelation.scala ---
@@ -60,6 +60,8 @@ case class HadoopFsRelation(
 }
   }
 
+  private val hadoopFSSizeFactor = sqlContext.conf.hadoopFSSizeFactor
--- End diff --

Shall we move it into the method `sizeInBytes`, since it's only used there?


---




[GitHub] spark pull request #20072: [SPARK-22790][SQL] add a configurable factor to d...

2017-12-31 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/20072#discussion_r159135272
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFsRelation.scala ---
@@ -82,7 +84,15 @@ case class HadoopFsRelation(
 }
   }
 
-  override def sizeInBytes: Long = location.sizeInBytes
+  override def sizeInBytes: Long = {
+val size = location.sizeInBytes * hadoopFSSizeFactor
+if (size > Long.MaxValue) {
--- End diff --

I think this branch can be removed? `Long.MaxValue` is returned when 
converting a double value larger than `Long.MaxValue` to a long.
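
A small sketch of the conversion behaviour referred to here, assuming the product is a `Double` that is converted with `toLong`. `scaledSize` is a hypothetical helper that only mirrors the shape of the reviewed code; it is not the code in the PR.

```scala
object DoubleToLongSaturation {
  // Hypothetical helper: scale a byte count by a factor and convert back to Long.
  // Long * Double yields a Double, and toLong performs the JVM double-to-long conversion.
  def scaledSize(sizeInBytes: Long, factor: Double): Long =
    (sizeInBytes * factor).toLong

  def main(args: Array[String]): Unit = {
    println(scaledSize(100L, 1.5))                           // 150
    println(scaledSize(Long.MaxValue, 2.0) == Long.MaxValue) // true
  }
}
```

On the JVM this conversion saturates: values above `Long.MaxValue` become `Long.MaxValue`, values below `Long.MinValue` become `Long.MinValue`, and NaN becomes 0.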


---



