[GitHub] spark pull request #18029: [SPARK-20168] [DStream] Add changes to use kinesi...

2017-12-25 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/18029


---




[GitHub] spark issue #20081: [SPARK-22833][EXAMPLE] Improvement SparkHive Scala Examp...

2017-12-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20081
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85392/
Test PASSed.


---




[GitHub] spark issue #20081: [SPARK-22833][EXAMPLE] Improvement SparkHive Scala Examp...

2017-12-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20081
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20081: [SPARK-22833][EXAMPLE] Improvement SparkHive Scala Examp...

2017-12-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20081
  
**[Test build #85392 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85392/testReport)**
 for PR 20081 at commit 
[`10a80b2`](https://github.com/apache/spark/commit/10a80b272e898043e250c2b24a792c9474cf0d10).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #18029: [SPARK-20168] [DStream] Add changes to use kinesis fetch...

2017-12-25 Thread brkyvz
Github user brkyvz commented on the issue:

https://github.com/apache/spark/pull/18029
  
Merged to master. Thanks!


---




[GitHub] spark issue #20081: [SPARK-22833][EXAMPLE] Improvement SparkHive Scala Examp...

2017-12-25 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20081
  
FYI, there is a JIRA for a doc about `spark.sql.parquet.writeLegacyFormat`
- https://issues-test.apache.org/jira/plugins/servlet/mobile#issue/SPARK-20937


---




[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...

2017-12-25 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/19813
  
Which one is more common? A chain of arithmetic expressions, or a deeply 
nested expression? I don't see strong evidence in the discussion that supports 
statement output. The only possibility for now is reducing code size, and that 
too is for performance, not stability. On the contrary, isn't using local 
variables more stable? Don't forget we would need to introduce another 
mechanism to fix the problems with statement output, like the re-evaluation 
issue I pointed out above.

I'm not saying it is bad to support statement output, but for now the 
reason to support it is very vague.


---




[GitHub] spark pull request #19954: [SPARK-22757][Kubernetes] Enable use of remote de...

2017-12-25 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19954#discussion_r158675633
  
--- Diff: 
resource-managers/kubernetes/docker/src/main/dockerfiles/init-container/Dockerfile
 ---
@@ -0,0 +1,24 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+FROM spark-base
+
+# If this docker file is being used in the context of building your images 
from a Spark distribution, the docker build
+# command should be invoked from the top level directory of the Spark 
distribution. E.g.:
+# docker build -t spark-init:latest -f 
dockerfiles/init-container/Dockerfile .
--- End diff --

`kubernetes/dockerfiles/..` instead of `dockerfiles/..`

Btw, these are only nits, but it seems like the paths here in the 
`Dockerfile`s for the driver/executor are wrong; they should be 
`kubernetes/dockerfiles/driver/Dockerfile` and 
`kubernetes/dockerfiles/executor/Dockerfile` respectively?


---




[GitHub] spark issue #20081: [SPARK-22833][EXAMPLE] Improvement SparkHive Scala Examp...

2017-12-25 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/20081
  
> spark.sql.parquet.writeLegacyFormat - if you don't use this configuration, 
Hive external tables won't be able to access the Parquet data.

Well, that's really an undocumented feature... Can you submit a PR to 
update the description of `SQLConf.PARQUET_WRITE_LEGACY_FORMAT` and add a test?
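
As a hedged illustration of the behavior that doc/test would cover (the config key is real; the session, data, and path here are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("legacy-parquet-demo").getOrCreate()
import spark.implicits._

// Write Parquet in the legacy format that older Hive/Impala readers expect.
// Only the config key is from this thread; the data and path are made up.
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")
Seq((1, "a"), (2, "b")).toDF("id", "value")
  .write
  .mode("overwrite")
  .parquet("/tmp/legacy_parquet_demo")
```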

> repartition and coalesce are the most common way in industry to control the 
number of files under a directory when partitioning data.

Yea I know, but that's not accurate: it assumes each task outputs one 
file, which is not true if `spark.sql.files.maxRecordsPerFile` is set to a 
small number. Anyway, this is not a Hive feature; we should probably put it in 
the `SQL Programming Guide`.
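
A minimal sketch of the two mechanisms being contrasted, assuming an existing `spark` session and a DataFrame `df` (both illustrative):

```scala
// repartition(10) gives 10 tasks, which usually means 10 output files...
df.repartition(10)
  .write
  .mode("overwrite")
  .parquet("/tmp/out_by_partitions")

// ...but with maxRecordsPerFile set low, one task can roll over into several
// files, so the task count no longer determines the file count.
spark.conf.set("spark.sql.files.maxRecordsPerFile", "1000")
df.repartition(10)
  .write
  .mode("overwrite")
  .parquet("/tmp/out_capped_per_file")
```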


---




[GitHub] spark issue #20020: [SPARK-22834][SQL] Make insertion commands have real chi...

2017-12-25 Thread gengliangwang
Github user gengliangwang commented on the issue:

https://github.com/apache/spark/pull/20020
  
All these insertion commands come from `postHocResolutionRules`, while there 
are other batches after it. Skipping the batches after 
`postHocResolutionRules` would cause analysis errors, so I decided not to add 
`AnalysisBarrier`, for correctness and robustness.


---




[GitHub] spark issue #20020: [SPARK-22834][SQL] Make insertion commands have real chi...

2017-12-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20020
  
**[Test build #85396 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85396/testReport)**
 for PR 20020 at commit 
[`cd2bbf8`](https://github.com/apache/spark/commit/cd2bbf8434a6b142f89f427db8654aeef36cec11).


---




[GitHub] spark issue #20080: [SPARK-22870][CORE] Dynamic allocation should allow 0 id...

2017-12-25 Thread wangyum
Github user wangyum commented on the issue:

https://github.com/apache/spark/pull/20080
  
cc @srowen


---




[GitHub] spark issue #19929: [SPARK-22629][PYTHON] Add deterministic flag to pyspark ...

2017-12-25 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/19929
  
Could you change the JIRA number to 
https://issues.apache.org/jira/browse/SPARK-22901 ?


---




[GitHub] spark issue #20076: [SPARK-21786][SQL] When acquiring 'compressionCodecClass...

2017-12-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20076
  
**[Test build #85395 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85395/testReport)**
 for PR 20076 at commit 
[`9229e6f`](https://github.com/apache/spark/commit/9229e6f1fa8f9fe58d279c6ab14cb1d20068a277).


---




[GitHub] spark issue #20076: [SPARK-21786][SQL] When acquiring 'compressionCodecClass...

2017-12-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20076
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #20076: [SPARK-21786][SQL] When acquiring 'compressionCodecClass...

2017-12-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20076
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85394/
Test FAILed.


---




[GitHub] spark issue #20076: [SPARK-21786][SQL] When acquiring 'compressionCodecClass...

2017-12-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20076
  
**[Test build #85394 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85394/testReport)**
 for PR 20076 at commit 
[`e510b48`](https://github.com/apache/spark/commit/e510b486ab1cea2f2f4f855747c86cd8af73728c).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class CompressionCodecPrecedenceSuite extends SQLTestUtils with 
SharedSQLContext `


---




[GitHub] spark issue #20076: [SPARK-21786][SQL] When acquiring 'compressionCodecClass...

2017-12-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20076
  
**[Test build #85394 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85394/testReport)**
 for PR 20076 at commit 
[`e510b48`](https://github.com/apache/spark/commit/e510b486ab1cea2f2f4f855747c86cd8af73728c).


---




[GitHub] spark issue #20004: [Spark-22818][SQL] csv escape of quote escape

2017-12-25 Thread ep1804
Github user ep1804 commented on the issue:

https://github.com/apache/spark/pull/20004
  
Revisions applied:
- commented on the default values.
- applied charToEscapeQuoteEscaping using the Option type.
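
A sketch of how the new option might be used once merged (the option name comes from this PR's discussion; the session and path are illustrative):

```scala
// Read a CSV where '"' is the quote, '\' escapes quotes, and the character
// that escapes the quote-escape character is itself configurable.
val df = spark.read
  .option("quote", "\"")
  .option("escape", "\\")
  .option("charToEscapeQuoteEscaping", "\\") // the option added in this PR
  .csv("/tmp/input.csv")
```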


---




[GitHub] spark issue #20004: [Spark-22818][SQL] csv escape of quote escape

2017-12-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20004
  
**[Test build #85393 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85393/testReport)**
 for PR 20004 at commit 
[`c2f877d`](https://github.com/apache/spark/commit/c2f877d9d29668114b8672ec8481636a95c53987).


---




[GitHub] spark issue #11994: [SPARK-14151] Expose metrics Source and Sink interface

2017-12-25 Thread CodingCat
Github user CodingCat commented on the issue:

https://github.com/apache/spark/pull/11994
  
If I understand correctly, the only issue here is that we exposed 
Codahale's MetricRegistry in the Sink base 
class: https://github.com/apache/spark/pull/11994/files#diff-9ffc4de02d8a9b4961815f89557ca472R39

Fortunately, we only use this registry for registering reporters, so 
how about providing an abstract method for creating the reporter in the Sink 
class?
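
A hypothetical sketch of that suggestion (not the actual Spark API; the class shape and method names are illustrative):

```scala
import java.util.concurrent.TimeUnit

import com.codahale.metrics.{MetricRegistry, ScheduledReporter}

// Instead of exposing the MetricRegistry itself, the Sink base class could
// ask subclasses only for a reporter built from it.
abstract class Sink(registry: MetricRegistry) {
  // The single extension point: build a Codahale reporter for this sink.
  protected def createReporter(registry: MetricRegistry): ScheduledReporter

  private lazy val reporter: ScheduledReporter = createReporter(registry)

  def start(): Unit = reporter.start(10, TimeUnit.SECONDS)
  def stop(): Unit = reporter.stop()
  def report(): Unit = reporter.report()
}
```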


---




[GitHub] spark issue #20080: [SPARK-22870][CORE] Dynamic allocation should allow 0 id...

2017-12-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20080
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85390/
Test PASSed.


---




[GitHub] spark issue #20080: [SPARK-22870][CORE] Dynamic allocation should allow 0 id...

2017-12-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20080
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20080: [SPARK-22870][CORE] Dynamic allocation should allow 0 id...

2017-12-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20080
  
**[Test build #85390 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85390/testReport)**
 for PR 20080 at commit 
[`1dcec41`](https://github.com/apache/spark/commit/1dcec41a3c1e2c001b0f9fed92aa6f03b6c47f3a).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20081: [SPARK-22833][EXAMPLE] Improvement SparkHive Scala Examp...

2017-12-25 Thread chetkhatri
Github user chetkhatri commented on the issue:

https://github.com/apache/spark/pull/20081
  
@cloud-fan spark.sql.files.maxRecordsPerFile didn't work out when I was 
working with my 30 TB Spark Hive workload, whereas repartition and coalesce 
made sense.


---




[GitHub] spark issue #20081: [SPARK-22833][EXAMPLE] Improvement SparkHive Scala Examp...

2017-12-25 Thread chetkhatri
Github user chetkhatri commented on the issue:

https://github.com/apache/spark/pull/20081
  
@cloud-fan Thanks for the PR.
4. spark.sql.parquet.writeLegacyFormat - if you don't use this 
configuration, Hive external tables won't be able to access the Parquet data.
5. repartition and coalesce are the most common way in industry to control 
the number of files under a directory when partitioning data.
i.e., if the data volume is very large, every partition would have many 
small files, which may harm downstream query performance due to file I/O, 
bandwidth I/O, network I/O, and disk I/O.
Otherwise I am good with your approach.


---




[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...

2017-12-25 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/19813
  
I think all arithmetic, predicate and bitwise expressions can benefit from 
it, and they are very common expressions in SQL. More importantly, allowing 
expressions to output statements may have other benefits that we haven't 
discovered yet; I don't think we should sacrifice that just to support 
splitting code in whole-stage codegen, which is only for performance, not 
stability.

For now I think we can fix the 64KB compile error caused by the whole-stage 
codegen framework rather than by expressions. I remember @maropu has a PR to 
fix that, and I'd prefer to prioritize reviewing that PR.


---




[GitHub] spark issue #20081: [SPARK-22833][EXAMPLE] Improvement SparkHive Scala Examp...

2017-12-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20081
  
**[Test build #85392 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85392/testReport)**
 for PR 20081 at commit 
[`10a80b2`](https://github.com/apache/spark/commit/10a80b272e898043e250c2b24a792c9474cf0d10).


---




[GitHub] spark issue #20081: [SPARK-22833][EXAMPLE] Improvement SparkHive Scala Examp...

2017-12-25 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/20081
  
@chetkhatri  @srowen @gatorsmile 


---




[GitHub] spark pull request #20081: [SPARK-22833][EXAMPLE] Improvement SparkHive Scal...

2017-12-25 Thread cloud-fan
GitHub user cloud-fan opened a pull request:

https://github.com/apache/spark/pull/20081

[SPARK-22833][EXAMPLE] Improvement SparkHive Scala Examples

## What changes were proposed in this pull request?
Some improvements:
1. Point out that we are using both Spark SQL native syntax and HQL syntax in 
the example.
2. Avoid using the same table name as the temp view, to not confuse users.
3. Create the external Hive table from a directory that already has data, 
which is a more common use case.
4. Remove the usage of `spark.sql.parquet.writeLegacyFormat`. This config 
was introduced by https://github.com/apache/spark/pull/8566 and has nothing to 
do with Hive.
5. Remove the `repartition` and `coalesce` example. These two are not Hive 
specific; we should put them in a different example file. BTW, they can't 
accurately control the number of output files, since 
`spark.sql.files.maxRecordsPerFile` also controls it.

## How was this patch tested?

N/A

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/cloud-fan/spark minor

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20081.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20081


commit 10a80b272e898043e250c2b24a792c9474cf0d10
Author: Wenchen Fan 
Date:   2017-12-26T04:30:10Z

clean up




---




[GitHub] spark issue #20076: [SPARK-21786][SQL] When acquiring 'compressionCodecClass...

2017-12-25 Thread fjh100456
Github user fjh100456 commented on the issue:

https://github.com/apache/spark/pull/20076
  
Well, I'll revert the renaming. Any comments? @gatorsmile 



---




[GitHub] spark issue #20076: [SPARK-21786][SQL] When acquiring 'compressionCodecClass...

2017-12-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20076
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20076: [SPARK-21786][SQL] When acquiring 'compressionCodecClass...

2017-12-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20076
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85388/
Test PASSed.


---




[GitHub] spark issue #20076: [SPARK-21786][SQL] When acquiring 'compressionCodecClass...

2017-12-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20076
  
**[Test build #85388 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85388/testReport)**
 for PR 20076 at commit 
[`2ab2d29`](https://github.com/apache/spark/commit/2ab2d293a0548b66070e840372e589eb2949a0ff).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20076: [SPARK-21786][SQL] When acquiring 'compressionCodecClass...

2017-12-25 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20076
  
Sure, let's revert the rename then.


---




[GitHub] spark issue #20067: [SPARK-22894][SQL] DateTimeOperations should accept SQL ...

2017-12-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20067
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20067: [SPARK-22894][SQL] DateTimeOperations should accept SQL ...

2017-12-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20067
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85387/
Test PASSed.


---




[GitHub] spark issue #20067: [SPARK-22894][SQL] DateTimeOperations should accept SQL ...

2017-12-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20067
  
**[Test build #85387 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85387/testReport)**
 for PR 20067 at commit 
[`ae998ec`](https://github.com/apache/spark/commit/ae998ec2b5548b7028d741da4813473dde1ad81e).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...

2017-12-25 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/19813
  
This is only valid when, by coincidence, all the expressions involved can use 
a statement as output. As I looked at the codebase, I think only a few 
expressions can output statements, so this may not apply generally enough to 
reduce code size.


---




[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...

2017-12-25 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/19813
  
I did a search but couldn't find one in the current codebase. Still, I do 
think this is a valid idea; a simple example would be `a + b + ... + z`. If 
expressions can output statements, then we just generate code like
```
int result = a + b + ... + z;
boolean isNull = false;
```
instead of
```
int result1 = a + b;
boolean isNull1 = false;
int result2 = result1 + c;
boolean isNull2 = false;
...
```

This can apply to both whole-stage codegen and normal codegen; it reduces 
the code size dramatically and makes whole-stage codegen less likely to hit 
the 64KB compile error.

Another thing I'm working on: do not create global variables if 
`ctx.splitExpressions` doesn't actually split. That optimization should be 
much more useful when combined with this one.


---




[GitHub] spark issue #20076: [SPARK-21786][SQL] When acquiring 'compressionCodecClass...

2017-12-25 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/20076
  
Thanks for the PR. Why are we complicating the PR by doing the rename? Does 
this actually gain anything other than minor cosmetic changes? It makes the 
simple PR pretty long ...



---




[GitHub] spark pull request #20076: [SPARK-21786][SQL] When acquiring 'compressionCod...

2017-12-25 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20076#discussion_r158663731
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/CompressionCodecSuite.scala 
---
@@ -0,0 +1,61 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive
--- End diff --

Move it to sql/core.


---




[GitHub] spark pull request #20076: [SPARK-21786][SQL] When acquiring 'compressionCod...

2017-12-25 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20076#discussion_r158663721
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/CompressionCodecSuite.scala 
---
@@ -0,0 +1,61 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive
+
+import org.apache.parquet.hadoop.ParquetOutputFormat
+
+import org.apache.spark.sql.execution.datasources.parquet.ParquetOptions
+import org.apache.spark.sql.hive.test.TestHiveSingleton
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.test.SQLTestUtils
+
+class CompressionCodecSuite extends TestHiveSingleton with SQLTestUtils {
--- End diff --

This suite does not need `TestHiveSingleton`.
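
A sketch of where that would land (hypothetical here, though it matches the `CompressionCodecPrecedenceSuite` declaration that appears elsewhere in this digest):

```scala
// In sql/core, without TestHiveSingleton (package name is illustrative):
package org.apache.spark.sql.execution.datasources.parquet

import org.apache.spark.sql.test.{SharedSQLContext, SQLTestUtils}

class CompressionCodecSuite extends SQLTestUtils with SharedSQLContext {
  // tests unchanged
}
```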


---




[GitHub] spark issue #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator for OneH...

2017-12-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19527
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator for OneH...

2017-12-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19527
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85391/
Test PASSed.


---




[GitHub] spark issue #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator for OneH...

2017-12-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19527
  
**[Test build #85391 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85391/testReport)**
 for PR 19527 at commit 
[`587ad42`](https://github.com/apache/spark/commit/587ad427a6682e98e1fefe592ecf278c674767f3).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator for OneH...

2017-12-25 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/19527
  
Unit tests are reformatted too.


---




[GitHub] spark issue #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator for OneH...

2017-12-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19527
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85389/
Test PASSed.


---




[GitHub] spark issue #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator for OneH...

2017-12-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19527
  
**[Test build #85389 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85389/testReport)**
 for PR 19527 at commit 
[`144f07d`](https://github.com/apache/spark/commit/144f07d5e92bf5cbc10cb2dc990fc32f15405977).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator for OneH...

2017-12-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19527
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20080: [SPARK-22870][CORE] Dynamic allocation should allow 0 id...

2017-12-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20080
  
**[Test build #85390 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85390/testReport)**
 for PR 20080 at commit 
[`1dcec41`](https://github.com/apache/spark/commit/1dcec41a3c1e2c001b0f9fed92aa6f03b6c47f3a).


---




[GitHub] spark issue #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator for OneH...

2017-12-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19527
  
**[Test build #85391 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85391/testReport)**
 for PR 19527 at commit 
[`587ad42`](https://github.com/apache/spark/commit/587ad427a6682e98e1fefe592ecf278c674767f3).


---




[GitHub] spark pull request #20080: [SPARK-22870][CORE] Dynamic allocation should all...

2017-12-25 Thread wangyum
GitHub user wangyum opened a pull request:

https://github.com/apache/spark/pull/20080

[SPARK-22870][CORE] Dynamic allocation should allow 0 idle time

## What changes were proposed in this pull request?

This PR makes `0` a valid value for 
`spark.dynamicAllocation.executorIdleTimeout`. 
For details, see the JIRA description: 
https://issues.apache.org/jira/browse/SPARK-22870.
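
A hedged sketch of using the setting once `0` is accepted (the config keys are real; the app name is illustrative):

```scala
import org.apache.spark.SparkConf

// With an idle timeout of 0, executors are released as soon as they go idle.
val conf = new SparkConf()
  .setAppName("zero-idle-timeout-demo")
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true") // dynamic allocation needs this
  .set("spark.dynamicAllocation.executorIdleTimeout", "0")
```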

## How was this patch tested?

N/A

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/wangyum/spark SPARK-22870

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20080.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20080


commit 1dcec41a3c1e2c001b0f9fed92aa6f03b6c47f3a
Author: Yuming Wang 
Date:   2017-12-26T01:58:49Z

Dynamic allocation should allow 0 idle time




---




[GitHub] spark pull request #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator f...

2017-12-25 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19527#discussion_r158659399
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoderEstimator.scala 
---
@@ -0,0 +1,456 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.attribute._
+import org.apache.spark.ml.linalg.Vectors
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared.{HasHandleInvalid, HasInputCols, 
HasOutputCols}
+import org.apache.spark.ml.util._
+import org.apache.spark.sql.{DataFrame, Dataset}
+import org.apache.spark.sql.expressions.UserDefinedFunction
+import org.apache.spark.sql.functions.{col, lit, udf}
+import org.apache.spark.sql.types.{DoubleType, NumericType, StructField, 
StructType}
+
+/** Private trait for params and common methods for OneHotEncoderEstimator 
and OneHotEncoderModel */
+private[ml] trait OneHotEncoderBase extends Params with HasHandleInvalid
+with HasInputCols with HasOutputCols {
+
+  /**
+   * Param for how to handle invalid data.
+   * Options are 'keep' (invalid data produces a vector of zeros) or 
'error' (throw an error).
+   * Default: "error"
+   * @group param
+   */
+  @Since("2.3.0")
+  override val handleInvalid: Param[String] = new Param[String](this, 
"handleInvalid",
+"How to handle invalid data " +
+"Options are 'keep' (invalid data produces a vector of zeros) or error 
(throw an error).",
+
ParamValidators.inArray(OneHotEncoderEstimator.supportedHandleInvalids))
+
+  setDefault(handleInvalid, OneHotEncoderEstimator.ERROR_INVALID)
+
+  /**
+   * Whether to drop the last category in the encoded vector (default: 
true)
+   * @group param
+   */
+  @Since("2.3.0")
+  final val dropLast: BooleanParam =
+new BooleanParam(this, "dropLast", "whether to drop the last category")
+  setDefault(dropLast -> true)
+
+  /** @group getParam */
+  @Since("2.3.0")
+  def getDropLast: Boolean = $(dropLast)
+
+  protected def validateAndTransformSchema(schema: StructType): StructType 
= {
+val inputColNames = $(inputCols)
+val outputColNames = $(outputCols)
+val existingFields = schema.fields
+
+require(inputColNames.length == outputColNames.length,
+  s"The number of input columns ${inputColNames.length} must be the 
same as the number of " +
+s"output columns ${outputColNames.length}.")
+
+inputColNames.zip(outputColNames).map { case (inputColName, 
outputColName) =>
+  require(schema(inputColName).dataType.isInstanceOf[NumericType],
+s"Input column must be of type NumericType but got 
${schema(inputColName).dataType}")
+  require(!existingFields.exists(_.name == outputColName),
+s"Output column $outputColName already exists.")
+}
+
+// Prepares output columns with proper attributes by examining input 
columns.
+val inputFields = $(inputCols).map(schema(_))
+
+val outputFields = inputFields.zip(outputColNames).map { case 
(inputField, outputColName) =>
+  OneHotEncoderCommon.transformOutputColumnSchema(
+inputField, $(dropLast), outputColName)
+}
+StructType(schema.fields ++ outputFields)
+  }
+}
+
+/**
+ * A one-hot encoder that maps a column of category indices to a column of 
binary vectors, with
+ * at most a single one-value per row that indicates the input category 
index.
+ * For example with 5 categories, an input value of 2.0 would map to an 
output vector of
+ * `[0.0, 0.0, 1.0, 0.0]`.
+ * The last category is not included by default (configurable via 
`dropLast`),
+ * because it makes the vector entries sum up to one, and

[GitHub] spark issue #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator for OneH...

2017-12-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19527
  
**[Test build #85389 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85389/testReport)**
 for PR 19527 at commit 
[`144f07d`](https://github.com/apache/spark/commit/144f07d5e92bf5cbc10cb2dc990fc32f15405977).


---




[GitHub] spark pull request #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator f...

2017-12-25 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19527#discussion_r158659167
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoderEstimator.scala 
---
@@ -0,0 +1,479 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.attribute._
+import org.apache.spark.ml.linalg.Vectors
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared.{HasHandleInvalid, HasInputCols, 
HasOutputCols}
+import org.apache.spark.ml.util._
+import org.apache.spark.sql.{DataFrame, Dataset}
+import org.apache.spark.sql.expressions.UserDefinedFunction
+import org.apache.spark.sql.functions.{col, lit, udf}
+import org.apache.spark.sql.types.{DoubleType, NumericType, StructField, 
StructType}
+
+/** Private trait for params and common methods for OneHotEncoderEstimator 
and OneHotEncoderModel */
+private[ml] trait OneHotEncoderBase extends Params with HasHandleInvalid
+with HasInputCols with HasOutputCols {
+
+  /**
+   * Param for how to handle invalid data.
+   * Options are 'keep' (invalid data presented as an extra categorical 
feature) or
+   * 'error' (throw an error).
+   * Default: "error"
+   * @group param
+   */
+  @Since("2.3.0")
+  override val handleInvalid: Param[String] = new Param[String](this, 
"handleInvalid",
+"How to handle invalid data " +
+"Options are 'keep' (invalid data presented as an extra categorical 
feature) " +
+"or error (throw an error).",
+
ParamValidators.inArray(OneHotEncoderEstimator.supportedHandleInvalids))
+
+  setDefault(handleInvalid, OneHotEncoderEstimator.ERROR_INVALID)
+
+  /**
+   * Whether to drop the last category in the encoded vector (default: 
true)
+   * @group param
+   */
+  @Since("2.3.0")
+  final val dropLast: BooleanParam =
+new BooleanParam(this, "dropLast", "whether to drop the last category")
+  setDefault(dropLast -> true)
+
+  /** @group getParam */
+  @Since("2.3.0")
+  def getDropLast: Boolean = $(dropLast)
+
+  protected def validateAndTransformSchema(schema: StructType): StructType 
= {
+val inputColNames = $(inputCols)
+val outputColNames = $(outputCols)
+val existingFields = schema.fields
+
+require(inputColNames.length == outputColNames.length,
+  s"The number of input columns ${inputColNames.length} must be the 
same as the number of " +
+s"output columns ${outputColNames.length}.")
+
+inputColNames.zip(outputColNames).map { case (inputColName, 
outputColName) =>
+  require(schema(inputColName).dataType.isInstanceOf[NumericType],
+s"Input column must be of type NumericType but got 
${schema(inputColName).dataType}")
+  require(!existingFields.exists(_.name == outputColName),
+s"Output column $outputColName already exists.")
+}
+
+// Prepares output columns with proper attributes by examining input 
columns.
+val inputFields = $(inputCols).map(schema(_))
+val keepInvalid = $(handleInvalid) == 
OneHotEncoderEstimator.KEEP_INVALID
+
+val outputFields = inputFields.zip(outputColNames).map { case 
(inputField, outputColName) =>
+  OneHotEncoderCommon.transformOutputColumnSchema(
+inputField, $(dropLast), outputColName, keepInvalid)
+}
+StructType(schema.fields ++ outputFields)
+  }
+}
+
+/**
+ * A one-hot encoder that maps a column of category indices to a column of 
binary vectors, with
+ * at most a single one-value per row that indicates the input category 
index.
+ * For example with 5 categories, an input value of 2.0 would map to an 
output vector of
+ * `[0.0, 0.0, 1.0, 0.0

[GitHub] spark pull request #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator f...

2017-12-25 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19527#discussion_r158659174
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoderEstimator.scala 
---
@@ -0,0 +1,479 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.attribute._
+import org.apache.spark.ml.linalg.Vectors
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared.{HasHandleInvalid, HasInputCols, 
HasOutputCols}
+import org.apache.spark.ml.util._
+import org.apache.spark.sql.{DataFrame, Dataset}
+import org.apache.spark.sql.expressions.UserDefinedFunction
+import org.apache.spark.sql.functions.{col, lit, udf}
+import org.apache.spark.sql.types.{DoubleType, NumericType, StructField, 
StructType}
+
+/** Private trait for params and common methods for OneHotEncoderEstimator 
and OneHotEncoderModel */
+private[ml] trait OneHotEncoderBase extends Params with HasHandleInvalid
+with HasInputCols with HasOutputCols {
+
+  /**
+   * Param for how to handle invalid data.
+   * Options are 'keep' (invalid data presented as an extra categorical 
feature) or
+   * 'error' (throw an error).
+   * Default: "error"
+   * @group param
+   */
+  @Since("2.3.0")
+  override val handleInvalid: Param[String] = new Param[String](this, 
"handleInvalid",
+"How to handle invalid data " +
+"Options are 'keep' (invalid data presented as an extra categorical 
feature) " +
+"or error (throw an error).",
+
ParamValidators.inArray(OneHotEncoderEstimator.supportedHandleInvalids))
+
+  setDefault(handleInvalid, OneHotEncoderEstimator.ERROR_INVALID)
+
+  /**
+   * Whether to drop the last category in the encoded vector (default: 
true)
+   * @group param
+   */
+  @Since("2.3.0")
+  final val dropLast: BooleanParam =
+new BooleanParam(this, "dropLast", "whether to drop the last category")
+  setDefault(dropLast -> true)
+
+  /** @group getParam */
+  @Since("2.3.0")
+  def getDropLast: Boolean = $(dropLast)
+
+  protected def validateAndTransformSchema(schema: StructType): StructType 
= {
+val inputColNames = $(inputCols)
+val outputColNames = $(outputCols)
+val existingFields = schema.fields
+
+require(inputColNames.length == outputColNames.length,
+  s"The number of input columns ${inputColNames.length} must be the 
same as the number of " +
+s"output columns ${outputColNames.length}.")
+
+inputColNames.zip(outputColNames).map { case (inputColName, 
outputColName) =>
+  require(schema(inputColName).dataType.isInstanceOf[NumericType],
+s"Input column must be of type NumericType but got 
${schema(inputColName).dataType}")
+  require(!existingFields.exists(_.name == outputColName),
+s"Output column $outputColName already exists.")
+}
+
+// Prepares output columns with proper attributes by examining input 
columns.
+val inputFields = $(inputCols).map(schema(_))
+val keepInvalid = $(handleInvalid) == 
OneHotEncoderEstimator.KEEP_INVALID
+
+val outputFields = inputFields.zip(outputColNames).map { case 
(inputField, outputColName) =>
+  OneHotEncoderCommon.transformOutputColumnSchema(
+inputField, $(dropLast), outputColName, keepInvalid)
+}
+StructType(schema.fields ++ outputFields)
+  }
+}
+
+/**
+ * A one-hot encoder that maps a column of category indices to a column of 
binary vectors, with
+ * at most a single one-value per row that indicates the input category 
index.
+ * For example with 5 categories, an input value of 2.0 would map to an 
output vector of
+ * `[0.0, 0.0, 1.0, 0.0

[GitHub] spark issue #20076: [SPARK-21786][SQL] When acquiring 'compressionCodecClass...

2017-12-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20076
  
**[Test build #85388 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85388/testReport)**
 for PR 20076 at commit 
[`2ab2d29`](https://github.com/apache/spark/commit/2ab2d293a0548b66070e840372e589eb2949a0ff).


---




[GitHub] spark issue #20076: [SPARK-21786][SQL] When acquiring 'compressionCodecClass...

2017-12-25 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/20076
  
Retest this please


---




[GitHub] spark issue #20067: [SPARK-22894][SQL] DateTimeOperations should accept SQL ...

2017-12-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20067
  
**[Test build #85387 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85387/testReport)**
 for PR 20067 at commit 
[`ae998ec`](https://github.com/apache/spark/commit/ae998ec2b5548b7028d741da4813473dde1ad81e).


---




[GitHub] spark issue #20079: [SPARK-22893][SQL][HOTFIX] Fix a error message of Versio...

2017-12-25 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/20079
  
Thank you, @gatorsmile and @wangyum 


---




[GitHub] spark issue #20067: [SPARK-22894][SQL] DateTimeOperations should accept SQL ...

2017-12-25 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20067
  
retest this please


---




[GitHub] spark pull request #20067: [SPARK-22894][SQL] DateTimeOperations should acce...

2017-12-25 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20067#discussion_r158657480
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala 
---
@@ -2760,6 +2760,17 @@ class SQLQuerySuite extends QueryTest with 
SharedSQLContext {
 }
   }
 
+  test("SPARK-22894: DateTimeOperations should accept SQL like string 
type") {
+val date = "2017-12-24"
+val str = sql(s"SELECT CAST('$date' as STRING) + interval 2 months 2 
seconds")
--- End diff --

I saw the original PR: 
https://github.com/apache/spark/pull/7754/files#r35821191

Maybe the SQL API should support it, since we do support it in the DataFrame 
APIs. 
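
For reference, a sketch of the query under discussion as it would run after this change (the SQL is taken from the quoted test; the printed result is illustrative):

```scala
// String + interval arithmetic in the SQL API, per SPARK-22894:
spark.sql("SELECT CAST('2017-12-24' AS STRING) + interval 2 months 2 seconds")
  .show()
// illustrative output: 2018-02-24 00:00:02
```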


---




[GitHub] spark issue #20023: [SPARK-22036][SQL] Decimal multiplication with high prec...

2017-12-25 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20023
  
The problem found by this PR is just one of the cases where Spark SQL returns 
NULL when it is unable to process a value. Below is another example, and I 
believe we can find more.

```SQL
SELECT CAST('a' AS TIMESTAMP)
```

Before deciding how to fix these issues (one by one, or as a whole), we 
need to do more investigation and identify all of them. We also need to 
clearly document the current behaviors so that our users know what result to 
expect. 

Yeah! Please go ahead and create a new PR for adding more tests. Thanks!
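
A hypothetical sketch of the kind of test being requested (`checkAnswer` and `sql` are the existing test helpers; the test name and placement are illustrative):

```scala
// e.g. in SQLQuerySuite: pin down the current NULL-on-failure behavior.
test("casts that cannot be processed currently return NULL") {
  checkAnswer(sql("SELECT CAST('a' AS TIMESTAMP)"), Row(null))
}
```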


---




[GitHub] spark pull request #20079: [SPARK-22893][SQL][HOTFIX] Fix a error message of...

2017-12-25 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/20079


---




[GitHub] spark issue #20079: [SPARK-22893][SQL][HOTFIX] Fix a error message of Versio...

2017-12-25 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20079
  
Thanks! Merged to master.


---




[GitHub] spark issue #20079: [SPARK-22893][SQL][HOTFIX] Fix a error message of Versio...

2017-12-25 Thread wangyum
Github user wangyum commented on the issue:

https://github.com/apache/spark/pull/20079
  
LGTM, thanks @dongjoon-hyun 


---




[GitHub] spark issue #20059: [SPARK-22648][K8s] Add documentation covering init conta...

2017-12-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20059
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85385/
Test PASSed.


---




[GitHub] spark issue #20059: [SPARK-22648][K8s] Add documentation covering init conta...

2017-12-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20059
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20059: [SPARK-22648][K8s] Add documentation covering init conta...

2017-12-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20059
  
**[Test build #85385 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85385/testReport)**
 for PR 20059 at commit 
[`818abaf`](https://github.com/apache/spark/commit/818abaf46d8cb4d92f9940e2b59ad6cf27e5da44).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19954: [SPARK-22757][Kubernetes] Enable use of remote dependenc...

2017-12-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19954
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85384/
Test PASSed.


---




[GitHub] spark issue #19954: [SPARK-22757][Kubernetes] Enable use of remote dependenc...

2017-12-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19954
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19954: [SPARK-22757][Kubernetes] Enable use of remote dependenc...

2017-12-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19954
  
**[Test build #85384 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85384/testReport)**
 for PR 19954 at commit 
[`c51bc56`](https://github.com/apache/spark/commit/c51bc560bb2ae0d5ea8d914e84d7485d333f497e).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20079: [SPARK-22893][SQL][HOTFIX] Fix an error message of Versio...

2017-12-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20079
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20079: [SPARK-22893][SQL][HOTFIX] Fix an error message of Versio...

2017-12-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20079
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85386/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20079: [SPARK-22893][SQL][HOTFIX] Fix an error message of Versio...

2017-12-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20079
  
**[Test build #85386 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85386/testReport)**
 for PR 20079 at commit 
[`cb2868d`](https://github.com/apache/spark/commit/cb2868d9de9c2d1a89bcb410a314bbc29f1003f1).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...

2017-12-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20072
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...

2017-12-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20072
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85383/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...

2017-12-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20072
  
**[Test build #85383 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85383/testReport)**
 for PR 20072 at commit 
[`ec275a8`](https://github.com/apache/spark/commit/ec275a841a7bb4c23b277f915debeed54e6cf7ea).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19929: [SPARK-22629][PYTHON] Add deterministic flag to pyspark ...

2017-12-25 Thread mgaido91
Github user mgaido91 commented on the issue:

https://github.com/apache/spark/pull/19929
  
@gatorsmile, yes, the reason the seed doesn't work lies in the way Python 
UDFs are executed: a new Python process is created for each partition to 
evaluate the Python UDF. Thus the seed is set only on the driver, not in the 
processes where the UDF is actually executed. This can easily be confirmed 
by the following:
```
>>> from pyspark.sql.functions import udf
>>> import os
>>> pid_udf = udf(lambda: str(os.getpid()))
>>> spark.range(2).select(pid_udf()).show()
+----------+
|<lambda>()|
+----------+
|      4132|
|      4130|
+----------+
>>> os.getpid()
4070
```
Therefore there is no easy way to set the seed: if I set it inside the UDF, 
the UDF becomes deterministic (a quick sketch of this pitfall follows). Thus 
I think the current test is the best option.
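
As a minimal sketch of that last point (illustrative only, not code from 
this PR): a seed set inside the UDF body is re-applied on every invocation 
in the worker process, so the output collapses to a constant:
```
# Illustrative only: re-seeding inside the UDF body runs on every
# invocation in the worker process, so the "random" value is constant.
import random
from pyspark.sql.functions import udf

def seeded_rand():
    random.seed(42)              # re-applied per call, per worker
    return str(random.random())  # same value on every call

rand_udf = udf(seeded_rand)
# spark.range(3).select(rand_udf()).show() would print one repeated value
```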


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20079: [SPARK-22893][SQL][HOTFIX] Fix an error message of Versio...

2017-12-25 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/20079
  
Thank you, @gatorsmile !


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20079: [SPARK-22893][SQL][HOTFIX] Fix an error message of Versio...

2017-12-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20079
  
**[Test build #85386 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85386/testReport)**
 for PR 20079 at commit 
[`cb2868d`](https://github.com/apache/spark/commit/cb2868d9de9c2d1a89bcb410a314bbc29f1003f1).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20079: [SPARK-22893][SQL][HOTFIX] Fix an error message of Versio...

2017-12-25 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/20079
  
cc @gatorsmile 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20064: [SPARK-22893][SQL] Unified the data type mismatch messag...

2017-12-25 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/20064
  
Hi, @gatorsmile and @wangyum.
This PR seems to break Jenkins tests. Please see my hotfix.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20079: [SPARK-22893][SQL][HOTFIX] Fix an error message of...

2017-12-25 Thread dongjoon-hyun
GitHub user dongjoon-hyun opened a pull request:

https://github.com/apache/spark/pull/20079

[SPARK-22893][SQL][HOTFIX] Fix an error message of VersionsSuite

## What changes were proposed in this pull request?

https://github.com/apache/spark/pull/20064 breaks Jenkins tests because it 
missed updating one error message for Hive 0.12 and Hive 0.13. This PR fixes 
that.
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/3924/
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/3977/
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/4226/
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.6/4260/

## How was this patch tested?

Passes the Jenkins tests without failure.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dongjoon-hyun/spark SPARK-22893

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20079.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20079


commit cb2868d9de9c2d1a89bcb410a314bbc29f1003f1
Author: Dongjoon Hyun 
Date:   2017-12-25T19:08:14Z

[SPARK-22893][SQL][HOTFIX] Fix an error message of VersionsSuite




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20023: [SPARK-22036][SQL] Decimal multiplication with high prec...

2017-12-25 Thread mgaido91
Github user mgaido91 commented on the issue:

https://github.com/apache/spark/pull/20023
  
Thanks @gatorsmile. Should I then create a follow-up PR for #20008 to cover 
cases 2 and 3 before moving forward with this PR, or can we proceed with 
this PR and the test cases it adds?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20059: [SPARK-22648][K8s] Add documentation covering init conta...

2017-12-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20059
  
**[Test build #85385 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85385/testReport)**
 for PR 20059 at commit 
[`818abaf`](https://github.com/apache/spark/commit/818abaf46d8cb4d92f9940e2b59ad6cf27e5da44).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19954: [SPARK-22757][Kubernetes] Enable use of remote dependenc...

2017-12-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19954
  
**[Test build #85384 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85384/testReport)**
 for PR 19954 at commit 
[`c51bc56`](https://github.com/apache/spark/commit/c51bc560bb2ae0d5ea8d914e84d7485d333f497e).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19954: [SPARK-22757][Kubernetes] Enable use of remote de...

2017-12-25 Thread liyinan926
Github user liyinan926 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19954#discussion_r158651951
  
--- Diff: 
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterManager.scala
 ---
@@ -45,6 +45,59 @@ private[spark] class KubernetesClusterManager extends 
ExternalClusterManager wit
   masterURL: String,
   scheduler: TaskScheduler): SchedulerBackend = {
 val sparkConf = sc.getConf
+val initContainerConfigMap = 
sparkConf.get(INIT_CONTAINER_CONFIG_MAP_NAME)
+val initContainerConfigMapKey = 
sparkConf.get(INIT_CONTAINER_CONFIG_MAP_KEY_CONF)
+
+if (initContainerConfigMap.isEmpty) {
+  logWarning("The executor's init-container config map was not 
specified. Executors will " +
--- End diff --

Done.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19954: [SPARK-22757][Kubernetes] Enable use of remote de...

2017-12-25 Thread liyinan926
Github user liyinan926 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19954#discussion_r158651945
  
--- Diff: 
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/rest/k8s/SparkPodInitContainer.scala
 ---
@@ -0,0 +1,116 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.deploy.rest.k8s
+
+import java.io.File
+import java.util.concurrent.TimeUnit
+
+import scala.concurrent.{ExecutionContext, Future}
+
+import org.apache.spark.{SecurityManager => SparkSecurityManager, 
SparkConf}
+import org.apache.spark.deploy.SparkHadoopUtil
+import org.apache.spark.deploy.k8s.Config._
+import org.apache.spark.internal.Logging
+import org.apache.spark.util.{ThreadUtils, Utils}
+
+/**
+ * Process that fetches files from a resource staging server and/or 
arbitrary remote locations.
+ *
+ * The init-container can handle fetching files from any of those sources, 
but not all of the
+ * sources need to be specified. This allows for composing multiple 
instances of this container
+ * with different configurations for different download sources, or using 
the same container to
+ * download everything at once.
+ */
+private[spark] class SparkPodInitContainer(
+sparkConf: SparkConf,
+fileFetcher: FileFetcher) extends Logging {
+
+  private val maxThreadPoolSize = 
sparkConf.get(INIT_CONTAINER_MAX_THREAD_POOL_SIZE)
+  private implicit val downloadExecutor = 
ExecutionContext.fromExecutorService(
+ThreadUtils.newDaemonCachedThreadPool("download-executor", 
maxThreadPoolSize))
+
+  private val jarsDownloadDir = new 
File(sparkConf.get(JARS_DOWNLOAD_LOCATION))
+  private val filesDownloadDir = new 
File(sparkConf.get(FILES_DOWNLOAD_LOCATION))
+
+  private val remoteJars = sparkConf.get(INIT_CONTAINER_REMOTE_JARS)
+  private val remoteFiles = sparkConf.get(INIT_CONTAINER_REMOTE_FILES)
+
+  private val downloadTimeoutMinutes = 
sparkConf.get(INIT_CONTAINER_MOUNT_TIMEOUT)
+
+  def run(): Unit = {
+logInfo(s"Downloading remote jars: $remoteJars")
+downloadFiles(
+  remoteJars,
+  jarsDownloadDir,
+  s"Remote jars download directory specified at $jarsDownloadDir does 
not exist " +
+"or is not a directory.")
+
+logInfo(s"Downloading remote files: $remoteFiles")
+downloadFiles(
+  remoteFiles,
+  filesDownloadDir,
+  s"Remote files download directory specified at $filesDownloadDir 
does not exist " +
+"or is not a directory.")
+
+downloadExecutor.shutdown()
+downloadExecutor.awaitTermination(downloadTimeoutMinutes, 
TimeUnit.MINUTES)
+  }
+
+  private def downloadFiles(
+  filesCommaSeparated: Option[String],
+  downloadDir: File,
+  errMessageOnDestinationNotADirectory: String): Unit = {
--- End diff --

Done.
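
For readers skimming the quoted class: a minimal Python sketch of the same 
pattern it implements, a bounded worker pool that fetches remote files plus 
an overall timeout on the whole batch. All names below are illustrative; 
none of them are Spark APIs.
```
# Illustrative analogy of SparkPodInitContainer's download pattern:
# a bounded thread pool fetches all remote files concurrently, then
# the caller waits with an overall timeout.
import os
import urllib.request
from concurrent.futures import ThreadPoolExecutor, wait

def fetch(url, dest_dir):
    dest = os.path.join(dest_dir, os.path.basename(url))
    urllib.request.urlretrieve(url, dest)  # one file per pool task

def download_all(urls, dest_dir, max_workers=5, timeout_minutes=5):
    os.makedirs(dest_dir, exist_ok=True)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(fetch, u, dest_dir) for u in urls]
        done, not_done = wait(futures, timeout=timeout_minutes * 60)
        for f in not_done:
            f.cancel()  # best effort, like awaitTermination expiring
        if not_done:
            raise TimeoutError(f"{len(not_done)} downloads did not finish")
```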


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19954: [SPARK-22757][Kubernetes] Enable use of remote de...

2017-12-25 Thread liyinan926
Github user liyinan926 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19954#discussion_r158651953
  
--- Diff: 
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterManager.scala
 ---
@@ -45,6 +45,59 @@ private[spark] class KubernetesClusterManager extends 
ExternalClusterManager wit
   masterURL: String,
   scheduler: TaskScheduler): SchedulerBackend = {
 val sparkConf = sc.getConf
+val initContainerConfigMap = 
sparkConf.get(INIT_CONTAINER_CONFIG_MAP_NAME)
+val initContainerConfigMapKey = 
sparkConf.get(INIT_CONTAINER_CONFIG_MAP_KEY_CONF)
+
+if (initContainerConfigMap.isEmpty) {
+  logWarning("The executor's init-container config map was not 
specified. Executors will " +
+"therefore not attempt to fetch remote or submitted dependencies.")
+}
+
+if (initContainerConfigMapKey.isEmpty) {
+  logWarning("The executor's init-container config map key was not 
specified. Executors will " +
--- End diff --

Done.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19683: [SPARK-21657][SQL] optimize explode quadratic memory con...

2017-12-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19683
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85382/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19954: [SPARK-22757][Kubernetes] Enable use of remote de...

2017-12-25 Thread liyinan926
Github user liyinan926 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19954#discussion_r158651933
  
--- Diff: 
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Config.scala
 ---
@@ -132,30 +131,84 @@ private[spark] object Config extends Logging {
 
   val JARS_DOWNLOAD_LOCATION =
 ConfigBuilder("spark.kubernetes.mountDependencies.jarsDownloadDir")
-  .doc("Location to download jars to in the driver and executors. When 
using" +
-" spark-submit, this directory must be empty and will be mounted 
as an empty directory" +
-" volume on the driver and executor pod.")
+  .doc("Location to download jars to in the driver and executors. When 
using " +
+"spark-submit, this directory must be empty and will be mounted as 
an empty directory " +
+"volume on the driver and executor pod.")
   .stringConf
   .createWithDefault("/var/spark-data/spark-jars")
 
   val FILES_DOWNLOAD_LOCATION =
 ConfigBuilder("spark.kubernetes.mountDependencies.filesDownloadDir")
-  .doc("Location to download files to in the driver and executors. 
When using" +
-" spark-submit, this directory must be empty and will be mounted 
as an empty directory" +
-" volume on the driver and executor pods.")
+  .doc("Location to download files to in the driver and executors. 
When using " +
+"spark-submit, this directory must be empty and will be mounted as 
an empty directory " +
+"volume on the driver and executor pods.")
   .stringConf
   .createWithDefault("/var/spark-data/spark-files")
 
+  val INIT_CONTAINER_IMAGE =
+ConfigBuilder("spark.kubernetes.initContainer.image")
+  .doc("Image for the driver and executor's init-container for 
downloading dependencies.")
+  .stringConf
+  .createOptional
+
+  val INIT_CONTAINER_MOUNT_TIMEOUT =
+ConfigBuilder("spark.kubernetes.mountDependencies.timeout")
+  .doc("Timeout before aborting the attempt to download and unpack 
dependencies from remote " +
+"locations into the driver and executor pods.")
+  .timeConf(TimeUnit.MINUTES)
--- End diff --

Done.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19683: [SPARK-21657][SQL] optimize explode quadratic memory con...

2017-12-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19683
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19954: [SPARK-22757][Kubernetes] Enable use of remote de...

2017-12-25 Thread liyinan926
Github user liyinan926 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19954#discussion_r158651930
  
--- Diff: 
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/DriverConfigOrchestrator.scala
 ---
@@ -98,28 +109,62 @@ private[spark] class 
DriverConfigurationStepsOrchestrator(
   None
 }
 
-val sparkJars = submissionSparkConf.getOption("spark.jars")
+val sparkJars = sparkConf.getOption("spark.jars")
   .map(_.split(","))
   .getOrElse(Array.empty[String]) ++
   additionalMainAppJar.toSeq
-val sparkFiles = submissionSparkConf.getOption("spark.files")
+val sparkFiles = sparkConf.getOption("spark.files")
   .map(_.split(","))
   .getOrElse(Array.empty[String])
 
-val maybeDependencyResolutionStep = if (sparkJars.nonEmpty || 
sparkFiles.nonEmpty) {
-  Some(new DependencyResolutionStep(
+val dependencyResolutionStep = if (sparkJars.nonEmpty || 
sparkFiles.nonEmpty) {
+  Seq(new DependencyResolutionStep(
 sparkJars,
 sparkFiles,
 jarsDownloadPath,
 filesDownloadPath))
 } else {
-  None
+  Nil
+}
+
+val initContainerBootstrapStep = if 
(areAnyFilesNonContainerLocal(sparkJars ++ sparkFiles)) {
+  val orchestrator = new InitContainerConfigOrchestrator(
+sparkJars,
+sparkFiles,
+jarsDownloadPath,
+filesDownloadPath,
+imagePullPolicy,
+initContainerConfigMapName,
+INIT_CONTAINER_PROPERTIES_FILE_NAME,
+sparkConf)
+  val bootstrapStep = new DriverInitContainerBootstrapStep(
+orchestrator.getAllConfigurationSteps,
+initContainerConfigMapName,
+INIT_CONTAINER_PROPERTIES_FILE_NAME)
+
+  Seq(bootstrapStep)
+} else {
+  Nil
+}
+
+val mountSecretsStep = if (secretNamesToMountPaths.nonEmpty) {
+  Seq(new DriverMountSecretsStep(new 
MountSecretsBootstrap(secretNamesToMountPaths)))
+} else {
+  Nil
 }
 
 Seq(
   initialSubmissionStep,
-  driverAddressStep,
+  serviceBootstrapStep,
   kubernetesCredentialsStep) ++
-  maybeDependencyResolutionStep.toSeq
+  dependencyResolutionStep ++
+  initContainerBootstrapStep ++
+  mountSecretsStep
+  }
+
+  private def areAnyFilesNonContainerLocal(files: Seq[String]): Boolean = {
--- End diff --

Done.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19954: [SPARK-22757][Kubernetes] Enable use of remote de...

2017-12-25 Thread liyinan926
Github user liyinan926 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19954#discussion_r158651915
  
--- Diff: 
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Config.scala
 ---
@@ -132,30 +131,84 @@ private[spark] object Config extends Logging {
 
   val JARS_DOWNLOAD_LOCATION =
 ConfigBuilder("spark.kubernetes.mountDependencies.jarsDownloadDir")
-  .doc("Location to download jars to in the driver and executors. When 
using" +
-" spark-submit, this directory must be empty and will be mounted 
as an empty directory" +
-" volume on the driver and executor pod.")
+  .doc("Location to download jars to in the driver and executors. When 
using " +
+"spark-submit, this directory must be empty and will be mounted as 
an empty directory " +
+"volume on the driver and executor pod.")
   .stringConf
   .createWithDefault("/var/spark-data/spark-jars")
 
   val FILES_DOWNLOAD_LOCATION =
 ConfigBuilder("spark.kubernetes.mountDependencies.filesDownloadDir")
-  .doc("Location to download files to in the driver and executors. 
When using" +
-" spark-submit, this directory must be empty and will be mounted 
as an empty directory" +
-" volume on the driver and executor pods.")
+  .doc("Location to download files to in the driver and executors. 
When using " +
+"spark-submit, this directory must be empty and will be mounted as 
an empty directory " +
+"volume on the driver and executor pods.")
   .stringConf
   .createWithDefault("/var/spark-data/spark-files")
 
+  val INIT_CONTAINER_IMAGE =
+ConfigBuilder("spark.kubernetes.initContainer.image")
+  .doc("Image for the driver and executor's init-container for 
downloading dependencies.")
+  .stringConf
+  .createOptional
+
+  val INIT_CONTAINER_MOUNT_TIMEOUT =
+ConfigBuilder("spark.kubernetes.mountDependencies.timeout")
--- End diff --

Please see the response regarding 
`spark.kubernetes.mountDependencies.maxSimultaneousDownloads`.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19683: [SPARK-21657][SQL] optimize explode quadratic memory con...

2017-12-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19683
  
**[Test build #85382 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85382/testReport)**
 for PR 19683 at commit 
[`6caa0d5`](https://github.com/apache/spark/commit/6caa0d5f75336d954808109eddd207c56262ad04).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19954: [SPARK-22757][Kubernetes] Enable use of remote de...

2017-12-25 Thread liyinan926
Github user liyinan926 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19954#discussion_r158651909
  
--- Diff: 
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Config.scala
 ---
@@ -132,30 +131,84 @@ private[spark] object Config extends Logging {
 
   val JARS_DOWNLOAD_LOCATION =
 ConfigBuilder("spark.kubernetes.mountDependencies.jarsDownloadDir")
-  .doc("Location to download jars to in the driver and executors. When 
using" +
-" spark-submit, this directory must be empty and will be mounted 
as an empty directory" +
-" volume on the driver and executor pod.")
+  .doc("Location to download jars to in the driver and executors. When 
using " +
+"spark-submit, this directory must be empty and will be mounted as 
an empty directory " +
+"volume on the driver and executor pod.")
   .stringConf
   .createWithDefault("/var/spark-data/spark-jars")
 
   val FILES_DOWNLOAD_LOCATION =
 ConfigBuilder("spark.kubernetes.mountDependencies.filesDownloadDir")
-  .doc("Location to download files to in the driver and executors. 
When using" +
-" spark-submit, this directory must be empty and will be mounted 
as an empty directory" +
-" volume on the driver and executor pods.")
+  .doc("Location to download files to in the driver and executors. 
When using " +
+"spark-submit, this directory must be empty and will be mounted as 
an empty directory " +
+"volume on the driver and executor pods.")
   .stringConf
   .createWithDefault("/var/spark-data/spark-files")
 
+  val INIT_CONTAINER_IMAGE =
+ConfigBuilder("spark.kubernetes.initContainer.image")
+  .doc("Image for the driver and executor's init-container for 
downloading dependencies.")
+  .stringConf
+  .createOptional
+
+  val INIT_CONTAINER_MOUNT_TIMEOUT =
+ConfigBuilder("spark.kubernetes.mountDependencies.timeout")
+  .doc("Timeout before aborting the attempt to download and unpack 
dependencies from remote " +
+"locations into the driver and executor pods.")
+  .timeConf(TimeUnit.MINUTES)
+  .createWithDefault(5)
+
+  val INIT_CONTAINER_MAX_THREAD_POOL_SIZE =
+
ConfigBuilder("spark.kubernetes.mountDependencies.maxSimultaneousDownloads")
--- End diff --

I think the current name is already pretty long. Adding `initContainer` 
makes it even longer without much added value. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19954: [SPARK-22757][Kubernetes] Enable use of remote de...

2017-12-25 Thread jiangxb1987
Github user jiangxb1987 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19954#discussion_r158651386
  
--- Diff: 
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/rest/k8s/SparkPodInitContainer.scala
 ---
@@ -0,0 +1,116 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.deploy.rest.k8s
+
+import java.io.File
+import java.util.concurrent.TimeUnit
+
+import scala.concurrent.{ExecutionContext, Future}
+
+import org.apache.spark.{SecurityManager => SparkSecurityManager, 
SparkConf}
+import org.apache.spark.deploy.SparkHadoopUtil
+import org.apache.spark.deploy.k8s.Config._
+import org.apache.spark.internal.Logging
+import org.apache.spark.util.{ThreadUtils, Utils}
+
+/**
+ * Process that fetches files from a resource staging server and/or 
arbitrary remote locations.
+ *
+ * The init-container can handle fetching files from any of those sources, 
but not all of the
+ * sources need to be specified. This allows for composing multiple 
instances of this container
+ * with different configurations for different download sources, or using 
the same container to
+ * download everything at once.
+ */
+private[spark] class SparkPodInitContainer(
+sparkConf: SparkConf,
+fileFetcher: FileFetcher) extends Logging {
+
+  private val maxThreadPoolSize = 
sparkConf.get(INIT_CONTAINER_MAX_THREAD_POOL_SIZE)
+  private implicit val downloadExecutor = 
ExecutionContext.fromExecutorService(
+ThreadUtils.newDaemonCachedThreadPool("download-executor", 
maxThreadPoolSize))
+
+  private val jarsDownloadDir = new 
File(sparkConf.get(JARS_DOWNLOAD_LOCATION))
+  private val filesDownloadDir = new 
File(sparkConf.get(FILES_DOWNLOAD_LOCATION))
+
+  private val remoteJars = sparkConf.get(INIT_CONTAINER_REMOTE_JARS)
+  private val remoteFiles = sparkConf.get(INIT_CONTAINER_REMOTE_FILES)
+
+  private val downloadTimeoutMinutes = 
sparkConf.get(INIT_CONTAINER_MOUNT_TIMEOUT)
+
+  def run(): Unit = {
+logInfo(s"Downloading remote jars: $remoteJars")
+downloadFiles(
+  remoteJars,
+  jarsDownloadDir,
+  s"Remote jars download directory specified at $jarsDownloadDir does 
not exist " +
+"or is not a directory.")
+
+logInfo(s"Downloading remote files: $remoteFiles")
+downloadFiles(
+  remoteFiles,
+  filesDownloadDir,
+  s"Remote files download directory specified at $filesDownloadDir 
does not exist " +
+"or is not a directory.")
+
+downloadExecutor.shutdown()
+downloadExecutor.awaitTermination(downloadTimeoutMinutes, 
TimeUnit.MINUTES)
+  }
+
+  private def downloadFiles(
+  filesCommaSeparated: Option[String],
+  downloadDir: File,
+  errMessageOnDestinationNotADirectory: String): Unit = {
--- End diff --

nit: `errMessageOnDestinationNotADirectory` -> `errMessage`?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19954: [SPARK-22757][Kubernetes] Enable use of remote de...

2017-12-25 Thread jiangxb1987
Github user jiangxb1987 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19954#discussion_r158651523
  
--- Diff: 
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterManager.scala
 ---
@@ -45,6 +45,59 @@ private[spark] class KubernetesClusterManager extends 
ExternalClusterManager wit
   masterURL: String,
   scheduler: TaskScheduler): SchedulerBackend = {
 val sparkConf = sc.getConf
+val initContainerConfigMap = 
sparkConf.get(INIT_CONTAINER_CONFIG_MAP_NAME)
+val initContainerConfigMapKey = 
sparkConf.get(INIT_CONTAINER_CONFIG_MAP_KEY_CONF)
+
+if (initContainerConfigMap.isEmpty) {
+  logWarning("The executor's init-container config map was not 
specified. Executors will " +
+"therefore not attempt to fetch remote or submitted dependencies.")
+}
+
+if (initContainerConfigMapKey.isEmpty) {
+  logWarning("The executor's init-container config map key was not 
specified. Executors will " +
--- End diff --

nit: `was not` -> `is not`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


