[GitHub] spark pull request #18029: [SPARK-20168] [DStream] Add changes to use kinesi...

2017-12-24 Thread brkyvz
Github user brkyvz commented on a diff in the pull request:

https://github.com/apache/spark/pull/18029#discussion_r158627994
  
--- Diff: external/kinesis-asl/src/test/java/org/apache/spark/streaming/kinesis/JavaKinesisInputDStreamBuilderSuite.java ---
@@ -45,18 +44,90 @@ public void testJavaKinesisDStreamBuilder() {
   .streamName(streamName)
   .endpointUrl(endpointUrl)
   .regionName(region)
-  .initialPositionInStream(initialPosition)
+  .initialPosition(initialPosition)
   .checkpointAppName(appName)
   .checkpointInterval(checkpointInterval)
   .storageLevel(storageLevel)
   .build();
 assert(kinesisDStream.streamName() == streamName);
 assert(kinesisDStream.endpointUrl() == endpointUrl);
 assert(kinesisDStream.regionName() == region);
-assert(kinesisDStream.initialPositionInStream() == initialPosition);
+assert(kinesisDStream.initialPosition().getPosition() == initialPosition.getPosition());
+assert(kinesisDStream.checkpointAppName() == appName);
+assert(kinesisDStream.checkpointInterval() == checkpointInterval);
+assert(kinesisDStream._storageLevel() == storageLevel);
+ssc.stop();
+  }
+
+  /**
+   * Test to ensure that the old API for InitialPositionInStream
+   * is supported in KinesisDStream.Builder.
+   * This test would be removed when we deprecate the KinesisUtils.
+   */
+  @Test
+  public void testJavaKinesisDStreamBuilderOldApi() {
+String streamName = "a-very-nice-stream-name";
+String endpointUrl = "https://kinesis.us-west-2.amazonaws.com";
+String region = "us-west-2";
+String appName = "a-very-nice-kinesis-app";
+Duration checkpointInterval = Seconds.apply(30);
+StorageLevel storageLevel = StorageLevel.MEMORY_ONLY();
+
+KinesisInputDStream kinesisDStream = KinesisInputDStream.builder()
+.streamingContext(ssc)
+.streamName(streamName)
+.endpointUrl(endpointUrl)
+.regionName(region)
+.initialPositionInStream(InitialPositionInStream.LATEST)
+.checkpointAppName(appName)
+.checkpointInterval(checkpointInterval)
+.storageLevel(storageLevel)
+.build();
+assert(kinesisDStream.streamName() == streamName);
+assert(kinesisDStream.endpointUrl() == endpointUrl);
+assert(kinesisDStream.regionName() == region);
+assert(kinesisDStream.initialPosition().getPosition() == InitialPositionInStream.LATEST);
 assert(kinesisDStream.checkpointAppName() == appName);
 assert(kinesisDStream.checkpointInterval() == checkpointInterval);
 assert(kinesisDStream._storageLevel() == storageLevel);
 ssc.stop();
   }
+
+  /**
+   * Test to ensure that the old API for InitialPositionInStream
+   * is supported in KinesisDStream.Builder.
+   * Test old API doesn't support the InitialPositionInStream.AT_TIMESTAMP.
+   * This test would be removed when we deprecate the KinesisUtils.
+   */
+  @Test
+  public void testJavaKinesisDStreamBuilderOldApiAtTimestamp() {
--- End diff --

This test could be moved to become a Scala test instead, using 
```scala
intercept[UnsupportedOperationException] {
  ...
}
```
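
For example, a minimal Scala sketch of what that test could look like (illustrative only, mirroring the Java test's values; not the merged code):

```scala
test("old initialPositionInStream() API rejects AT_TIMESTAMP") {
  intercept[UnsupportedOperationException] {
    KinesisInputDStream.builder()
      .streamingContext(ssc)
      .streamName("a-very-nice-stream-name")
      .endpointUrl("https://kinesis.us-west-2.amazonaws.com")
      .regionName("us-west-2")
      // The old enum-based API has no way to carry a timestamp, so this throws.
      .initialPositionInStream(InitialPositionInStream.AT_TIMESTAMP)
      .checkpointAppName("a-very-nice-kinesis-app")
      .build()
  }
}
```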


---




[GitHub] spark pull request #18029: [SPARK-20168] [DStream] Add changes to use kinesi...

2017-12-24 Thread brkyvz
Github user brkyvz commented on a diff in the pull request:

https://github.com/apache/spark/pull/18029#discussion_r158627640
  
--- Diff: external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisReceiver.scala ---
@@ -56,12 +57,13 @@ import org.apache.spark.util.Utils
  * @param endpointUrl  Url of Kinesis service (e.g., https://kinesis.us-east-1.amazonaws.com)
  * @param regionName  Region name used by the Kinesis Client Library for
  *DynamoDB (lease coordination and checkpointing) and CloudWatch (metrics)
- * @param initialPositionInStream  In the absence of Kinesis checkpoint info, this is the
+ * @param initialPosition  Instance of [[KinesisInitialPosition]]
+ * In the absence of Kinesis checkpoint info, this is the
--- End diff --

nit: indentation


---




[GitHub] spark pull request #18029: [SPARK-20168] [DStream] Add changes to use kinesi...

2017-12-24 Thread brkyvz
Github user brkyvz commented on a diff in the pull request:

https://github.com/apache/spark/pull/18029#discussion_r158627941
  
--- Diff: external/kinesis-asl/src/test/java/org/apache/spark/streaming/kinesis/JavaKinesisInputDStreamBuilderSuite.java ---
@@ -45,18 +44,90 @@ public void testJavaKinesisDStreamBuilder() {
   .streamName(streamName)
   .endpointUrl(endpointUrl)
   .regionName(region)
-  .initialPositionInStream(initialPosition)
+  .initialPosition(initialPosition)
   .checkpointAppName(appName)
   .checkpointInterval(checkpointInterval)
   .storageLevel(storageLevel)
   .build();
 assert(kinesisDStream.streamName() == streamName);
 assert(kinesisDStream.endpointUrl() == endpointUrl);
 assert(kinesisDStream.regionName() == region);
-assert(kinesisDStream.initialPositionInStream() == initialPosition);
+assert(kinesisDStream.initialPosition().getPosition() == initialPosition.getPosition());
+assert(kinesisDStream.checkpointAppName() == appName);
+assert(kinesisDStream.checkpointInterval() == checkpointInterval);
+assert(kinesisDStream._storageLevel() == storageLevel);
+ssc.stop();
+  }
+
+  /**
+   * Test to ensure that the old API for InitialPositionInStream
+   * is supported in KinesisDStream.Builder.
+   * This test would be removed when we deprecate the KinesisUtils.
+   */
+  @Test
+  public void testJavaKinesisDStreamBuilderOldApi() {
+String streamName = "a-very-nice-stream-name";
+String endpointUrl = "https://kinesis.us-west-2.amazonaws.com";
+String region = "us-west-2";
+String appName = "a-very-nice-kinesis-app";
+Duration checkpointInterval = Seconds.apply(30);
+StorageLevel storageLevel = StorageLevel.MEMORY_ONLY();
+
+KinesisInputDStream kinesisDStream = KinesisInputDStream.builder()
+.streamingContext(ssc)
--- End diff --

nit: indentation should be 2 spaces.


---




[GitHub] spark pull request #18029: [SPARK-20168] [DStream] Add changes to use kinesi...

2017-12-24 Thread brkyvz
Github user brkyvz commented on a diff in the pull request:

https://github.com/apache/spark/pull/18029#discussion_r158627744
  
--- Diff: external/kinesis-asl/src/main/java/org/apache/spark/streaming/kinesis/KinesisInitialPositions.java ---
@@ -0,0 +1,91 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.streaming.kinesis;
+
+import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream;
+
+import java.io.Serializable;
+import java.util.Date;
+
+/**
+ * A java wrapper for exposing [[InitialPositionInStream]]
+ * to the corresponding Kinesis readers.
+ */
+interface KinesisInitialPosition {
+InitialPositionInStream getPosition();
+}
+
+public class KinesisInitialPositions {
+public static class Latest implements KinesisInitialPosition, Serializable {
+public Latest() {}
+
+@Override
+public InitialPositionInStream getPosition() {
+return InitialPositionInStream.LATEST;
+}
+}
+
+public static class TrimHorizon implements KinesisInitialPosition, Serializable {
+public TrimHorizon() {}
+
+@Override
+public InitialPositionInStream getPosition() {
+return InitialPositionInStream.TRIM_HORIZON;
+}
+}
+
+public static class AtTimestamp implements KinesisInitialPosition, Serializable {
+private Date timestamp;
+
+public AtTimestamp(Date timestamp) {
+this.timestamp = timestamp;
+}
+
+@Override
+public InitialPositionInStream getPosition() {
+return InitialPositionInStream.AT_TIMESTAMP;
+}
+
+public Date getTimestamp() {
+return timestamp;
+}
+}
+
+
+/**
+ * Returns instance of [[KinesisInitialPosition]] based on the passed [[InitialPositionInStream]].
+ * This method is used in KinesisUtils for translating the InitialPositionInStream
+ * to InitialPosition. This function would be removed when we deprecate the KinesisUtils.
+ *
+ * @return [[InitialPosition]]
+ */
+public static KinesisInitialPosition fromKinesisInitialPosition(
+InitialPositionInStream initialPositionInStream) throws UnsupportedOperationException {
+if (initialPositionInStream == InitialPositionInStream.LATEST) {
+return new Latest();
+} else if (initialPositionInStream == InitialPositionInStream.TRIM_HORIZON) {
+return new TrimHorizon();
+} else {
+// InitialPositionInStream.AT_TIMESTAMP is not supported.
+// Use InitialPosition.atTimestamp(timestamp) instead.
+throw new UnsupportedOperationException(
+"Only InitialPositionInStream.LATEST and InitialPositionInStream.TRIM_HORIZON " +
+"supported in initialPositionInStream(). Please use the initialPosition() from " +
+"builder API in KinesisInputDStream for using InitialPositionInStream.AT_TIMESTAMP");
+}
+}
+}
--- End diff --

nit: new line
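
For context, a short sketch of how the quoted helper fits the old and new APIs (illustrative only; `ssc` stands for an existing `StreamingContext`, and this is not code from the PR):

```scala
// Old API path: KinesisUtils translates the KCL enum via the helper above.
val latest = KinesisInitialPositions.fromKinesisInitialPosition(InitialPositionInStream.LATEST)

// AT_TIMESTAMP has no enum-only equivalent, so the translation above throws
// UnsupportedOperationException; the new API takes the position object directly:
val stream = KinesisInputDStream.builder()
  .streamingContext(ssc)
  .streamName("a-very-nice-stream-name")
  .endpointUrl("https://kinesis.us-west-2.amazonaws.com")
  .regionName("us-west-2")
  .initialPosition(new KinesisInitialPositions.AtTimestamp(new java.util.Date()))
  .checkpointAppName("a-very-nice-kinesis-app")
  .build()
```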


---




[GitHub] spark issue #20076: [SPARK-21786][SQL] When acquiring 'compressionCodecClass...

2017-12-24 Thread fjh100456
Github user fjh100456 commented on the issue:

https://github.com/apache/spark/pull/20076
  
cc @gatorsmile 
No ORC configuration is mentioned in "sql-programming-guide.md", so I did not add the precedence description for `spark.sql.orc.compression.codec`.


---




[GitHub] spark issue #20076: [SPARK-21786][SQL] When acquiring 'compressionCodecClass...

2017-12-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20076
  
Can one of the admins verify this patch?


---




[GitHub] spark pull request #20076: [SPARK-21786][SQL] When acquiring 'compressionCod...

2017-12-24 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20076#discussion_r158627363
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetOptions.scala ---
@@ -42,8 +43,15 @@ private[parquet] class ParquetOptions(
* Acceptable values are defined in [[shortParquetCompressionCodecNames]].
*/
   val compressionCodecClassName: String = {
--- End diff --

Can we change `compressionCodecClassName` to `compressionCodec` instead?


---




[GitHub] spark pull request #20076: [SPARK-21786][SQL] When acquiring 'compressionCod...

2017-12-24 Thread fjh100456
GitHub user fjh100456 opened a pull request:

https://github.com/apache/spark/pull/20076

[SPARK-21786][SQL] When acquiring 'compressionCodecClassName' in 
'ParquetOptions', `parquet.compression` needs to be considered.

[SPARK-21786][SQL] When acquiring 'compressionCodecClassName' in 
'ParquetOptions', `parquet.compression` needs to be considered.

## What changes were proposed in this pull request?
1. Also acquire 'compressionCodecClassName' from `parquet.compression`; the precedence order is `compression`, `parquet.compression`, `spark.sql.parquet.compression.codec`, just like what we do in `OrcOptions`.

2. Change `spark.sql.parquet.compression.codec` to support "none". In `ParquetOptions` we already treat "none" as equivalent to "uncompressed", but the configuration does not allow it to be set to "none".

3. Rename `compressionCode` to `compressionCodecClassName`.

## How was this patch tested?
Added a test.
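
A minimal sketch of the proposed lookup order (illustrative only; `resolveCodec` and its parameters are hypothetical, not the actual `ParquetOptions` code):

```scala
// The per-write `compression` option wins, then the Hadoop `parquet.compression`
// setting, then the session-wide SQL default.
def resolveCodec(options: Map[String, String], sessionDefault: String): String = {
  options.get("compression")
    .orElse(options.get("parquet.compression"))
    .getOrElse(sessionDefault)
    .toLowerCase
}

resolveCodec(Map("parquet.compression" -> "gzip"), "snappy")  // "gzip"
resolveCodec(Map.empty, "snappy")                             // "snappy"
```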


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/fjh100456/spark ParquetOptionIssue

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20076.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20076


commit 9bbfe6ef4b5a418373c2250ad676233fb05df7f7
Author: fjh100456 
Date:   2017-12-25T02:29:53Z

[SPARK-21786][SQL] When acquiring 'compressionCodecClassName' in 
'ParquetOptions', `parquet.compression` needs to be considered.

## What changes were proposed in this pull request?
1. Also acquire 'compressionCodecClassName' from `parquet.compression`; the precedence order is `compression`, `parquet.compression`, `spark.sql.parquet.compression.codec`, just like what we do in `OrcOptions`.
2. Change `spark.sql.parquet.compression.codec` to support "none". In `ParquetOptions` we already treat "none" as equivalent to "uncompressed", but the configuration does not allow it to be set to "none".

## How was this patch tested?
Manual test.

commit 48cf108ed5c3298eb860d9735b439ac89d65765e
Author: fjh100456 
Date:   2017-12-25T02:30:24Z

[SPARK-21786][SQL] When acquiring 'compressionCodecClassName' in 
'ParquetOptions', `parquet.compression` needs to be considered.

## What changes were proposed in this pull request?
1. Also acquire 'compressionCodecClassName' from `parquet.compression`; the precedence order is `compression`, `parquet.compression`, `spark.sql.parquet.compression.codec`, just like what we do in `OrcOptions`.
2. Change `spark.sql.parquet.compression.codec` to support "none". In `ParquetOptions` we already treat "none" as equivalent to "uncompressed", but the configuration does not allow it to be set to "none".

## How was this patch tested?
Manual test.

commit 5dbd3edf9e086433d3d3fe9c0ead887d799c61d3
Author: fjh100456 
Date:   2017-12-25T02:34:29Z

[SPARK-21786][SQL] When acquiring 'compressionCodecClassName' in 'ParquetOptions', `parquet.compression` needs to be considered.

## What changes were proposed in this pull request?
1. Also acquire 'compressionCodecClassName' from `parquet.compression`; the precedence order is `compression`, `parquet.compression`, `spark.sql.parquet.compression.codec`, just like what we do in `OrcOptions`.
2. Change `spark.sql.parquet.compression.codec` to support "none". In `ParquetOptions` we already treat "none" as equivalent to "uncompressed", but the configuration does not allow it to be set to "none".

## How was this patch tested?
Manual test.

commit 5124f1b560e942c0dc23af31336317a4b995dd8f
Author: fjh100456 
Date:   2017-12-25T07:06:26Z

[SPARK-21786][SQL] When acquiring 'compressionCodecClassName' in 'ParquetOptions', `parquet.compression` needs to be considered.

## What changes were proposed in this pull request?
1. Also acquire 'compressionCodecClassName' from `parquet.compression`; the precedence order is `compression`, `parquet.compression`, `spark.sql.parquet.compression.codec`, just like what we do in `OrcOptions`.
2. Change `spark.sql.parquet.compression.codec` to support "none". In `ParquetOptions` we already treat "none" as equivalent to "uncompressed", but the configuration does not allow it to be set to "none".
3. Rename `compressionCode` to `compressionCodecClassName`.

## How was this patch tested?
Manual test.




---




[GitHub] spark issue #19498: [SPARK-17756][PYTHON][STREAMING] Workaround to avoid ret...

2017-12-24 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19498
  
Will check and see if I can make a minimal fix soon.


---




[GitHub] spark issue #20075: [SPARK-21208][R] Adds setLocalProperty in R

2017-12-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20075
  
**[Test build #85369 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85369/testReport)**
 for PR 20075 at commit 
[`c9c7aee`](https://github.com/apache/spark/commit/c9c7aeed2894dc3e06c75b1feeb3d952cb3f256e).


---




[GitHub] spark issue #20075: [SPARK-21208][R] Adds setLocalProperty in R

2017-12-24 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20075
  
cc @felixcheung, could you take a look please?


---




[GitHub] spark pull request #20075: [SPARK-21208][R] Adds setLocalProperty in R

2017-12-24 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20075#discussion_r158626493
  
--- Diff: R/pkg/R/sparkR.R ---
@@ -564,6 +564,23 @@ setJobDescription <- function(value) {
   invisible(callJMethod(sc, "setJobDescription", value))
 }
 
+#' Set a local property that affects jobs submitted from this thread, such as the
+#' Spark fair scheduler pool.
--- End diff --

I tried to make a better description for a while but just ended up copying Python's: https://github.com/apache/spark/blob/209b9361ac8a4410ff797cff1115e1888e2f7e66/python/pyspark/context.py#L934-L939


---




[GitHub] spark pull request #20075: [SPARK-21208][R] Adds setLocalProperty in R

2017-12-24 Thread HyukjinKwon
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/20075

[SPARK-21208][R] Adds setLocalProperty in R

## What changes were proposed in this pull request?

This PR adds `setLocalProperty` in R.

```R
> df <- createDataFrame(iris)
> setLocalProperty("spark.job.description", "Hello world!")
> count(df)
> setLocalProperty("spark.job.description", "Hi !!")
> count(df)
```

Screenshot: https://user-images.githubusercontent.com/6477701/34335213-60655a7c-e990-11e7-88aa-12debe311627.png


## How was this patch tested?

Manually tested and a test in `R/pkg/tests/fulltests/test_context.R`.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HyukjinKwon/spark SPARK-21208

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20075.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20075


commit c9c7aeed2894dc3e06c75b1feeb3d952cb3f256e
Author: hyukjinkwon 
Date:   2017-12-25T07:19:57Z

Adds setLocalProperty in R




---




[GitHub] spark issue #20043: [SPARK-22856][SQL] Add wrappers for codegen output and n...

2017-12-24 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/20043
  
`StatementValue` means an output like `a + 1`. It is a Java statement which doesn't rely on a local variable to hold the result.
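
A rough sketch of the distinction (the class shapes below are assumptions for illustration, not the PR's actual definitions):

```scala
// Wrappers around a piece of generated Java code that produces a value.
sealed trait ExprValue { def code: String }

// A named local variable in the generated code, e.g. "value1".
case class VariableValue(name: String) extends ExprValue { def code: String = name }

// An inline statement such as "a + 1", evaluated where it is used,
// with no local variable holding the result.
case class StatementValue(stmt: String) extends ExprValue { def code: String = stmt }
```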


---




[GitHub] spark pull request #19904: [SPARK-22707][ML] Optimize CrossValidator memory ...

2017-12-24 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19904


---




[GitHub] spark issue #20074: [SPARK-22874][PYSPARK][SQL][FOLLOW-UP] Modify error mess...

2017-12-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20074
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...

2017-12-24 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/19904
  
LGTM
Sorry for the delay & thanks for the PR!
Merging with master



---




[GitHub] spark issue #20074: [SPARK-22874][PYSPARK][SQL][FOLLOW-UP] Modify error mess...

2017-12-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20074
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85367/
Test PASSed.


---




[GitHub] spark issue #20074: [SPARK-22874][PYSPARK][SQL][FOLLOW-UP] Modify error mess...

2017-12-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20074
  
**[Test build #85367 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85367/testReport)**
 for PR 20074 at commit 
[`ec1c92f`](https://github.com/apache/spark/commit/ec1c92f3d5e2cbcf77405768a35bd781cc1f93c0).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20068: [SPARK-17916][SQL] Fix empty string being parsed as null...

2017-12-24 Thread aa8y
Github user aa8y commented on the issue:

https://github.com/apache/spark/pull/20068
  
@HyukjinKwon I made code changes based on your suggestions. I also changed 
the tests to use the data mentioned in the ticket. However, you're right, the 
tests no longer pass. But that is because the Univocity `CsvParser`, when it 
encounters an empty string while parsing the data, replaces it with the 
`nullValue` we set (see 
[setNullValue()](http://docs.univocity.com/parsers/2.5.9/com/univocity/parsers/common/CommonSettings.html#setNullValue(java.lang.String))).
 And the `emptyValue` is only effective when the _empty string_ being read has 
quotes around it (see 
[setEmptyValue()](http://docs.univocity.com/parsers/2.5.9/com/univocity/parsers/csv/CsvParserSettings.html#setEmptyValue(java.lang.String))).
 So I believe, at this point, the issue needs to be fixed in the underlying 
library being used.
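
A small sketch of that Univocity behavior (settings API per the linked javadoc; values illustrative):

```scala
import com.univocity.parsers.csv.{CsvParser, CsvParserSettings}

val settings = new CsvParserSettings()
settings.setNullValue("\\N")  // substituted when the parser reads an unquoted empty field
settings.setEmptyValue("")    // effective only for quoted empty strings, i.e. ""

val parser = new CsvParser(settings)
// Unquoted empty field -> "\N" (the nullValue); quoted "" -> "" (the emptyValue).
parser.parseLine("bar,,\"\"")
```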


---




[GitHub] spark issue #20068: [SPARK-17916][SQL] Fix empty string being parsed as null...

2017-12-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20068
  
**[Test build #85368 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85368/testReport)**
 for PR 20068 at commit 
[`156d755`](https://github.com/apache/spark/commit/156d755d5a734a00c4c69dfc3565364f3843fca1).


---




[GitHub] spark issue #20074: [SPARK-22874][PYSPARK][SQL][FOLLOW-UP] Modify error mess...

2017-12-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20074
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85366/
Test PASSed.


---




[GitHub] spark issue #20074: [SPARK-22874][PYSPARK][SQL][FOLLOW-UP] Modify error mess...

2017-12-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20074
  
**[Test build #85366 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85366/testReport)**
 for PR 20074 at commit 
[`07873d7`](https://github.com/apache/spark/commit/07873d704ce2a14add3e476429a785c793e47aa8).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20074: [SPARK-22874][PYSPARK][SQL][FOLLOW-UP] Modify error mess...

2017-12-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20074
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20074: [SPARK-22874][PYSPARK][SQL][FOLLOW-UP] Modify error mess...

2017-12-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20074
  
**[Test build #85367 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85367/testReport)**
 for PR 20074 at commit 
[`ec1c92f`](https://github.com/apache/spark/commit/ec1c92f3d5e2cbcf77405768a35bd781cc1f93c0).


---




[GitHub] spark issue #18754: [SPARK-21552][SQL] Add DecimalType support to ArrowWrite...

2017-12-24 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/18754
  
Just took a quick look and looks fine to me too.


---




[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...

2017-12-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20072
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85363/
Test PASSed.


---




[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...

2017-12-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20072
  
Merged build finished. Test PASSed.


---




[GitHub] spark pull request #20074: [SPARK-22874][PYSPARK][SQL][FOLLOW-UP] Modify err...

2017-12-24 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/20074#discussion_r158622968
  
--- Diff: python/pyspark/sql/utils.py ---
@@ -118,7 +118,8 @@ def require_minimum_pandas_version():
 from distutils.version import LooseVersion
 import pandas
 if LooseVersion(pandas.__version__) < LooseVersion('0.19.2'):
-raise ImportError("Pandas >= 0.19.2 must be installed on calling 
Python process")
+raise ImportError("Pandas >= 0.19.2 must be installed on calling 
Python process: %s"
--- End diff --

Sounds good. I'll update them. Thanks!


---




[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...

2017-12-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20072
  
**[Test build #85363 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85363/testReport)**
 for PR 20072 at commit 
[`e6065c7`](https://github.com/apache/spark/commit/e6065c75015b8a2c0eff9f3c6e1ebfe148b28e65).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #20074: [SPARK-22874][PYSPARK][SQL][FOLLOW-UP] Modify err...

2017-12-24 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20074#discussion_r158622598
  
--- Diff: python/pyspark/sql/utils.py ---
@@ -118,7 +118,8 @@ def require_minimum_pandas_version():
 from distutils.version import LooseVersion
 import pandas
 if LooseVersion(pandas.__version__) < LooseVersion('0.19.2'):
-raise ImportError("Pandas >= 0.19.2 must be installed on calling 
Python process")
+raise ImportError("Pandas >= 0.19.2 must be installed on calling 
Python process: %s"
--- End diff --

Maybe we could make it a bit better, like `... on calling Python process; however, your version was %s.`.


---




[GitHub] spark issue #20074: [SPARK-22874][PYSPARK][SQL][FOLLOW-UP] Modify error mess...

2017-12-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20074
  
**[Test build #85366 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85366/testReport)**
 for PR 20074 at commit 
[`07873d7`](https://github.com/apache/spark/commit/07873d704ce2a14add3e476429a785c793e47aa8).


---




[GitHub] spark pull request #20074: [SPARK-22874][PYSPARK][SQL][FOLLOW-UP] Modify err...

2017-12-24 Thread ueshin
GitHub user ueshin opened a pull request:

https://github.com/apache/spark/pull/20074

[SPARK-22874][PYSPARK][SQL][FOLLOW-UP] Modify error messages to show actual 
versions.

## What changes were proposed in this pull request?

This is a follow-up PR of #20054, modifying error messages for both pandas and pyarrow to show actual versions.

## How was this patch tested?

Existing tests.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ueshin/apache-spark issues/SPARK-22874_fup1

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20074.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20074


commit 07873d704ce2a14add3e476429a785c793e47aa8
Author: Takuya UESHIN 
Date:   2017-12-25T06:02:38Z

Modify error messages to show actual versions.




---




[GitHub] spark issue #20043: [SPARK-22856][SQL] Add wrappers for codegen output and n...

2017-12-24 Thread gczsjdy
Github user gczsjdy commented on the issue:

https://github.com/apache/spark/pull/20043
  
@viirya Thanks very much. Actually, do local variables correspond to `VariableValue` and `StatementValue`? IIUC `VariableValue` is a value that depends on something else, but what is `StatementValue`? Maybe we can add more comments near the `xxValue` definitions.


---




[GitHub] spark issue #20073: [SPARK-22843][R] Adds localCheckpoint in R

2017-12-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20073
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85364/
Test PASSed.


---




[GitHub] spark issue #20073: [SPARK-22843][R] Adds localCheckpoint in R

2017-12-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20073
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20073: [SPARK-22843][R] Adds localCheckpoint in R

2017-12-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20073
  
**[Test build #85364 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85364/testReport)**
 for PR 20073 at commit 
[`ec27a80`](https://github.com/apache/spark/commit/ec27a80be663ab29ae563a3a9ca8f2a0d32436d5).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #18754: [SPARK-21552][SQL] Add DecimalType support to ArrowWrite...

2017-12-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18754
  
**[Test build #85365 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85365/testReport)**
 for PR 18754 at commit 
[`025f298`](https://github.com/apache/spark/commit/025f2987c54fb3c9c7de36c525a8caaa72d1f3ee).


---




[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...

2017-12-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19904
  
**[Test build #4024 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4024/testReport)**
 for PR 19904 at commit 
[`cad2104`](https://github.com/apache/spark/commit/cad210439b7a0bc3eb870f1d68fd96fbd0763aa8).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20073: [SPARK-22843][R] Adds localCheckpoint in R

2017-12-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20073
  
**[Test build #85364 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85364/testReport)**
 for PR 20073 at commit 
[`ec27a80`](https://github.com/apache/spark/commit/ec27a80be663ab29ae563a3a9ca8f2a0d32436d5).


---




[GitHub] spark issue #20073: [SPARK-22843][R] Adds localCheckpoint in R

2017-12-24 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20073
  
cc @felixcheung, could you check if I understood your intention correctly?


---




[GitHub] spark pull request #20073: [SPARK-22843][R] Adds localCheckpoint in R

2017-12-24 Thread HyukjinKwon
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/20073

[SPARK-22843][R] Adds localCheckpoint in R

## What changes were proposed in this pull request?

This PR proposes to add `localCheckpoint(..)` in R API.

```r
df <- localCheckpoint(createDataFrame(iris))
```

## How was this patch tested?

Unit tests added in `R/pkg/tests/fulltests/test_sparkSQL.R`

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HyukjinKwon/spark SPARK-22843

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20073.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20073


commit ec27a80be663ab29ae563a3a9ca8f2a0d32436d5
Author: hyukjinkwon 
Date:   2017-12-25T04:58:15Z

Adds localCheckpoint in R




---




[GitHub] spark pull request #18754: [SPARK-21552][SQL] Add DecimalType support to Arr...

2017-12-24 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/18754#discussion_r158620106
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowWriter.scala ---
@@ -214,6 +216,22 @@ private[arrow] class DoubleWriter(val valueVector: Float8Vector) extends ArrowFi
   }
 }
 
+private[arrow] class DecimalWriter(
+val valueVector: DecimalVector,
+precision: Int,
+scale: Int) extends ArrowFieldWriter {
+
+  override def setNull(): Unit = {
+valueVector.setNull(count)
+  }
+
+  override def setValue(input: SpecializedGetters, ordinal: Int): Unit = {
+val decimal = input.getDecimal(ordinal, precision, scale)
+decimal.changePrecision(precision, scale)
--- End diff --

Unfortunately, it depends on the implementation of `getDecimal` for now.
Btw, I guess we need to check the return value of `changePrecision()` and 
set `null` if the value is `false`, which means overflow.
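
A hedged sketch of that suggestion (assuming `changePrecision` returns `false` on overflow, and that `DecimalVector.setSafe` with `Decimal.toJavaBigDecimal` is the write path; this is not the merged code):

```scala
override def setValue(input: SpecializedGetters, ordinal: Int): Unit = {
  val decimal = input.getDecimal(ordinal, precision, scale)
  if (decimal.changePrecision(precision, scale)) {
    // The value fits the target precision/scale: write it out.
    valueVector.setSafe(count, decimal.toJavaBigDecimal)
  } else {
    // Overflow: record a null instead of a wrong value.
    setNull()
  }
}
```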


---




[GitHub] spark pull request #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator f...

2017-12-24 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19527#discussion_r158619213
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoderEstimator.scala ---
@@ -0,0 +1,479 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.attribute._
+import org.apache.spark.ml.linalg.Vectors
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared.{HasHandleInvalid, HasInputCols, HasOutputCols}
+import org.apache.spark.ml.util._
+import org.apache.spark.sql.{DataFrame, Dataset}
+import org.apache.spark.sql.expressions.UserDefinedFunction
+import org.apache.spark.sql.functions.{col, lit, udf}
+import org.apache.spark.sql.types.{DoubleType, NumericType, StructField, StructType}
+
+/** Private trait for params and common methods for OneHotEncoderEstimator and OneHotEncoderModel */
+private[ml] trait OneHotEncoderBase extends Params with HasHandleInvalid
+with HasInputCols with HasOutputCols {
+
+  /**
+   * Param for how to handle invalid data.
+   * Options are 'keep' (invalid data presented as an extra categorical feature) or
+   * 'error' (throw an error).
+   * Default: "error"
+   * @group param
+   */
+  @Since("2.3.0")
+  override val handleInvalid: Param[String] = new Param[String](this, "handleInvalid",
+"How to handle invalid data " +
+"Options are 'keep' (invalid data presented as an extra categorical feature) " +
+"or error (throw an error).",
+ParamValidators.inArray(OneHotEncoderEstimator.supportedHandleInvalids))
+
+  setDefault(handleInvalid, OneHotEncoderEstimator.ERROR_INVALID)
+
+  /**
+   * Whether to drop the last category in the encoded vector (default: true)
+   * @group param
+   */
+  @Since("2.3.0")
+  final val dropLast: BooleanParam =
+new BooleanParam(this, "dropLast", "whether to drop the last category")
+  setDefault(dropLast -> true)
+
+  /** @group getParam */
+  @Since("2.3.0")
+  def getDropLast: Boolean = $(dropLast)
+
+  protected def validateAndTransformSchema(schema: StructType): StructType = {
+val inputColNames = $(inputCols)
+val outputColNames = $(outputCols)
+val existingFields = schema.fields
+
+require(inputColNames.length == outputColNames.length,
+  s"The number of input columns ${inputColNames.length} must be the same as the number of " +
+s"output columns ${outputColNames.length}.")
+
+inputColNames.zip(outputColNames).map { case (inputColName, outputColName) =>
+  require(schema(inputColName).dataType.isInstanceOf[NumericType],
+s"Input column must be of type NumericType but got ${schema(inputColName).dataType}")
+  require(!existingFields.exists(_.name == outputColName),
+s"Output column $outputColName already exists.")
+}
+
+// Prepares output columns with proper attributes by examining input columns.
+val inputFields = $(inputCols).map(schema(_))
+val keepInvalid = $(handleInvalid) == OneHotEncoderEstimator.KEEP_INVALID
+
+val outputFields = inputFields.zip(outputColNames).map { case (inputField, outputColName) =>
+  OneHotEncoderCommon.transformOutputColumnSchema(
+inputField, $(dropLast), outputColName, keepInvalid)
+}
+StructType(schema.fields ++ outputFields)
+  }
+}
+
+/**
+ * A one-hot encoder that maps a column of category indices to a column of binary vectors, with
+ * at most a single one-value per row that indicates the input category index.
+ * For example with 5 categories, an input value of 2.0 would map to an output vector of
+ * `[0.0, 0.0, 1.0, 0.0]`

[GitHub] spark issue #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator for OneH...

2017-12-24 Thread MLnick
Github user MLnick commented on the issue:

https://github.com/apache/spark/pull/19527
  
Agree on keeping the new OneHotEncoderEstimator as an alias for 3.0

On Fri, 1 Dec 2017 at 23:29, jkbradley  wrote:

> *@jkbradley* commented on this pull request.
> --
>
> In mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoder.scala:
>
> > @@ -41,8 +41,12 @@ import org.apache.spark.sql.types.{DoubleType, NumericType, StructType}
>   * The output vectors are sparse.
>   *
>   * @see `StringIndexer` for converting categorical values into category indices
> + * @deprecated `OneHotEncoderEstimator` will be renamed `OneHotEncoder` and this `OneHotEncoder`
>
> Note for the future: For 3.0, it'd be nice to do what you're describing
> here but also leave OneHotEncoderEstimator as a deprecated alias. That way,
> user code won't break but will have deprecation warnings when upgrading to
> 3.0.
>
> —
> You are receiving this because you were mentioned.
> Reply to this email directly, view it on GitHub, or mute the thread.



---




[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...

2017-12-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19904
  
**[Test build #4024 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4024/testReport)**
 for PR 19904 at commit 
[`cad2104`](https://github.com/apache/spark/commit/cad210439b7a0bc3eb870f1d68fd96fbd0763aa8).


---




[GitHub] spark pull request #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator f...

2017-12-24 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/19527#discussion_r158618991
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoderEstimator.scala ---
@@ -0,0 +1,479 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.attribute._
+import org.apache.spark.ml.linalg.Vectors
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared.{HasHandleInvalid, HasInputCols, HasOutputCols}
+import org.apache.spark.ml.util._
+import org.apache.spark.sql.{DataFrame, Dataset}
+import org.apache.spark.sql.expressions.UserDefinedFunction
+import org.apache.spark.sql.functions.{col, lit, udf}
+import org.apache.spark.sql.types.{DoubleType, NumericType, StructField, StructType}
+
+/** Private trait for params and common methods for OneHotEncoderEstimator and OneHotEncoderModel */
+private[ml] trait OneHotEncoderBase extends Params with HasHandleInvalid
+with HasInputCols with HasOutputCols {
+
+  /**
+   * Param for how to handle invalid data.
+   * Options are 'keep' (invalid data presented as an extra categorical feature) or
+   * 'error' (throw an error).
+   * Default: "error"
+   * @group param
+   */
+  @Since("2.3.0")
+  override val handleInvalid: Param[String] = new Param[String](this, "handleInvalid",
+"How to handle invalid data " +
+"Options are 'keep' (invalid data presented as an extra categorical feature) " +
+"or error (throw an error).",
+ParamValidators.inArray(OneHotEncoderEstimator.supportedHandleInvalids))
+
+  setDefault(handleInvalid, OneHotEncoderEstimator.ERROR_INVALID)
+
+  /**
+   * Whether to drop the last category in the encoded vector (default: true)
+   * @group param
+   */
+  @Since("2.3.0")
+  final val dropLast: BooleanParam =
+new BooleanParam(this, "dropLast", "whether to drop the last category")
+  setDefault(dropLast -> true)
+
+  /** @group getParam */
+  @Since("2.3.0")
+  def getDropLast: Boolean = $(dropLast)
+
+  protected def validateAndTransformSchema(schema: StructType): StructType = {
+val inputColNames = $(inputCols)
+val outputColNames = $(outputCols)
+val existingFields = schema.fields
+
+require(inputColNames.length == outputColNames.length,
+  s"The number of input columns ${inputColNames.length} must be the same as the number of " +
+s"output columns ${outputColNames.length}.")
+
+inputColNames.zip(outputColNames).map { case (inputColName, outputColName) =>
+  require(schema(inputColName).dataType.isInstanceOf[NumericType],
+s"Input column must be of type NumericType but got ${schema(inputColName).dataType}")
+  require(!existingFields.exists(_.name == outputColName),
+s"Output column $outputColName already exists.")
+}
+
+// Prepares output columns with proper attributes by examining input columns.
+val inputFields = $(inputCols).map(schema(_))
+val keepInvalid = $(handleInvalid) == OneHotEncoderEstimator.KEEP_INVALID
+
+val outputFields = inputFields.zip(outputColNames).map { case (inputField, outputColName) =>
+  OneHotEncoderCommon.transformOutputColumnSchema(
+inputField, $(dropLast), outputColName, keepInvalid)
+}
+StructType(schema.fields ++ outputFields)
+  }
+}
+
+/**
+ * A one-hot encoder that maps a column of category indices to a column of binary vectors, with
+ * at most a single one-value per row that indicates the input category index.
+ * For example with 5 categories, an input value of 2.0 would map to an output vector of
+ * `[0.0, 0.0, 1.0, 0.0]`

[GitHub] spark issue #20023: [SPARK-22036][SQL] Decimal multiplication with high prec...

2017-12-24 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20023
  
Following ANSI SQL compliance sounds good to me. However, many details are vendor-specific. That means the query results can still vary even if we are 100% ANSI SQL compliant.

To avoid frequently introducing behavior-breaking changes, we can also introduce a new mode `strict` for `spark.sql.typeCoercion.mode`. (Hive is also not 100% ANSI SQL compliant.) Instead of inventing a completely new behavior, we can try to follow one of the mainstream open-source databases, for example Postgres.

Before introducing the new mode, we first need to understand the differences between Spark SQL and the others. That is why we need to write the test cases first; then we can run them against different systems. This PR clearly shows the current test cases do not cover scenarios 2 and 3.


---




[GitHub] spark issue #19643: [SPARK-11421][CORE][PYTHON][R] Added ability for addJar ...

2017-12-24 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19643
  
Let me leave this closed for now and reopen it when I am ready to proceed.


---




[GitHub] spark pull request #19643: [SPARK-11421][CORE][PYTHON][R] Added ability for ...

2017-12-24 Thread HyukjinKwon
Github user HyukjinKwon closed the pull request at:

https://github.com/apache/spark/pull/19643


---




[GitHub] spark pull request #20068: [SPARK-17916][SQL] Fix empty string being parsed ...

2017-12-24 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20068#discussion_r158616812
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
@@ -1248,4 +1248,49 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
   Row("0,2013-111-11 12:13:14") :: Row(null) :: Nil
 )
   }
+
+  test("SPARK-17916: An empty string should not be coerced to null when nullValue is passed.") {
+val sparkSession = spark
+
+val elems = Seq(("bar"), (""), (null: String))
+
+// Checks for new behavior where an empty string is not coerced to null.
+withTempDir { dir =>
+  val outDir = new File(dir, "out").getCanonicalPath
+  val nullValue = "\\N"
+
+  import sparkSession.implicits._
--- End diff --

I think we don't need this (by `import testImplicits._` above).


---




[GitHub] spark pull request #20068: [SPARK-17916][SQL] Fix empty string being parsed ...

2017-12-24 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20068#discussion_r158616834
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
@@ -1248,4 +1248,49 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
   Row("0,2013-111-11 12:13:14") :: Row(null) :: Nil
 )
   }
+
+  test("SPARK-17916: An empty string should not be coerced to null when nullValue is passed.") {
+val sparkSession = spark
+
+val elems = Seq(("bar"), (""), (null: String))
+
+// Checks for new behavior where an empty string is not coerced to null.
+withTempDir { dir =>
+  val outDir = new File(dir, "out").getCanonicalPath
+  val nullValue = "\\N"
+
+  import sparkSession.implicits._
+  val dsIn = spark.createDataset(elems)
--- End diff --

`Seq(("bar"), (""), (null: String)).toDS`?


---




[GitHub] spark pull request #20068: [SPARK-17916][SQL] Fix empty string being parsed ...

2017-12-24 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20068#discussion_r158616740
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala ---
@@ -152,7 +152,7 @@ class CSVOptions(
 writerSettings.setIgnoreLeadingWhitespaces(ignoreLeadingWhiteSpaceFlagInWrite)
 writerSettings.setIgnoreTrailingWhitespaces(ignoreTrailingWhiteSpaceFlagInWrite)
 writerSettings.setNullValue(nullValue)
-writerSettings.setEmptyValue(nullValue)
+writerSettings.setEmptyValue("")
--- End diff --

Could we leave some comments here and update the PR description too?


---




[GitHub] spark pull request #20068: [SPARK-17916][SQL] Fix empty string being parsed ...

2017-12-24 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20068#discussion_r158616591
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
@@ -1248,4 +1248,49 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
   Row("0,2013-111-11 12:13:14") :: Row(null) :: Nil
 )
   }
+
+  test("SPARK-17916: An empty string should not be coerced to null when nullValue is passed.") {
+val sparkSession = spark
--- End diff --

I think we can just use `spark`.


---




[GitHub] spark pull request #20068: [SPARK-17916][SQL] Fix empty string being parsed ...

2017-12-24 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20068#discussion_r158617095
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
@@ -1248,4 +1248,49 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
   Row("0,2013-111-11 12:13:14") :: Row(null) :: Nil
 )
   }
+
+  test("SPARK-17916: An empty string should not be coerced to null when nullValue is passed.") {
+val sparkSession = spark
+
+val elems = Seq(("bar"), (""), (null: String))
+
+// Checks for new behavior where an empty string is not coerced to null.
+withTempDir { dir =>
+  val outDir = new File(dir, "out").getCanonicalPath
+  val nullValue = "\\N"
+
+  import sparkSession.implicits._
+  val dsIn = spark.createDataset(elems)
+  dsIn.write
+.option("nullValue", nullValue)
+.csv(outDir)
+  val dsOut = spark.read
+.option("nullValue", nullValue)
+.schema(dsIn.schema)
+.csv(outDir)
+.as[(String)]
+  val computed = dsOut.collect.toSeq
+  val expected = Seq(("bar"), (null: String))
--- End diff --

I don't think this is quite the expected output? Could we use the examples 
provided in the JIRA rather than single row ones?
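
For reference, a sketch of what the round trip should arguably produce under the new behavior (our reading of the fix, not the patch's test):
```scala
// written with nullValue = "\\N": "bar" -> bar, "" -> "" (no longer \N), null -> \N
// read back with the same option, all three values survive:
val expected = Seq(("bar"), (""), (null: String))
```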


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20068: [SPARK-17916][SQL] Fix empty string being parsed ...

2017-12-24 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20068#discussion_r158616912
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
 ---
@@ -1248,4 +1248,49 @@ class CSVSuite extends QueryTest with 
SharedSQLContext with SQLTestUtils {
   Row("0,2013-111-11 12:13:14") :: Row(null) :: Nil
 )
   }
+
+  test("SPARK-17916: An empty string should not be coerced to null when 
nullValue is passed.") {
+val sparkSession = spark
+
+val elems = Seq(("bar"), (""), (null: String))
+
+// Checks for new behavior where an empty string is not coerced to 
null.
+withTempDir { dir =>
+  val outDir = new File(dir, "out").getCanonicalPath
+  val nullValue = "\\N"
+
+  import sparkSession.implicits._
+  val dsIn = spark.createDataset(elems)
+  dsIn.write
+.option("nullValue", nullValue)
+.csv(outDir)
+  val dsOut = spark.read
+.option("nullValue", nullValue)
+.schema(dsIn.schema)
+.csv(outDir)
+.as[(String)]
+  val computed = dsOut.collect.toSeq
+  val expected = Seq(("bar"), (null: String))
+
+  assert(computed.size === 2)
+  assert(computed.sameElements(expected))
--- End diff --

We can use `checkAnswer(..: DataFrame, .. : DataFrame)`
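
A sketch of that suggestion (assuming the fixed round trip preserves all three rows, so input and output match):
```scala
// compare as DataFrames instead of collecting and using sameElements
checkAnswer(dsOut.toDF(), dsIn.toDF())
```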


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20068: [SPARK-17916][SQL] Fix empty string being parsed ...

2017-12-24 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20068#discussion_r158616872
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
 ---
@@ -1248,4 +1248,49 @@ class CSVSuite extends QueryTest with 
SharedSQLContext with SQLTestUtils {
   Row("0,2013-111-11 12:13:14") :: Row(null) :: Nil
 )
   }
+
+  test("SPARK-17916: An empty string should not be coerced to null when 
nullValue is passed.") {
+val sparkSession = spark
+
+val elems = Seq(("bar"), (""), (null: String))
+
+// Checks for new behavior where an empty string is not coerced to 
null.
+withTempDir { dir =>
--- End diff --

We could do this `withTempPath { path =>`
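
i.e., something along these lines (a sketch; `withTempPath` hands the block a `java.io.File` that does not yet exist and cleans it up afterwards, so the extra `new File(dir, "out")` step goes away):
```scala
withTempPath { path =>
  val out = path.getCanonicalPath
  dsIn.write.option("nullValue", nullValue).csv(out)
  val dsOut = spark.read
    .option("nullValue", nullValue)
    .schema(dsIn.schema)
    .csv(out)
    .as[String]
  checkAnswer(dsOut.toDF(), dsIn.toDF())
}
```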


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...

2017-12-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20072
  
**[Test build #85363 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85363/testReport)**
 for PR 20072 at commit 
[`e6065c7`](https://github.com/apache/spark/commit/e6065c75015b8a2c0eff9f3c6e1ebfe148b28e65).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20072: [SPARK-22790][SQL] add a configurable factor to d...

2017-12-24 Thread CodingCat
GitHub user CodingCat opened a pull request:

https://github.com/apache/spark/pull/20072

[SPARK-22790][SQL] add a configurable factor to describe HadoopFsRelation's 
size

## What changes were proposed in this pull request?

as per discussion in 
https://github.com/apache/spark/pull/19864#discussion_r156847927

the current HadoopFsRelation size estimate is based purely on the underlying file size, which is not accurate and makes the execution vulnerable to errors like OOM

Users can enable CBO with the functionalities in 
https://github.com/apache/spark/pull/19864 to avoid this issue

This JIRA proposes to add a configurable factor to the sizeInBytes method in the HadoopFsRelation class, so that users can mitigate this problem without CBO (see the sketch below)
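
A rough sketch of the idea; the config name and wiring here are illustrative guesses, not the actual patch:

```scala
// Hypothetical: scale the raw file size by a user-supplied factor so that the
// optimizer's size estimate is more conservative for compressed/columnar data.
def estimatedSizeInBytes(rawFileSize: Long, conf: Map[String, String]): Long = {
  val factor = conf.getOrElse("spark.sql.sources.fileSizeFactor", "1.0").toDouble
  (rawFileSize * factor).toLong
}
```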

## How was this patch tested?

Existing tests


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/CodingCat/spark SPARK-22790

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20072.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20072


commit b02d857f20f594d87c8c48991bfbbe95a71b364a
Author: CodingCat 
Date:   2016-03-07T14:37:37Z

improve the doc for "spark.memory.offHeap.size"

commit e09d60f3dd212dc0ce9687b112970c7cf1e4c83b
Author: CodingCat 
Date:   2016-03-07T14:37:37Z

improve the doc for "spark.memory.offHeap.size"

commit 2ebc6caab7c540b43a50b7e0f27b8f4c278e5611
Author: CodingCat 
Date:   2016-03-07T19:00:16Z

fix

commit 9b87ba8830d102c2568e338787e8b49b284dd8b1
Author: CodingCat 
Date:   2016-03-07T19:00:16Z

fix

commit e6065c75015b8a2c0eff9f3c6e1ebfe148b28e65
Author: CodingCat 
Date:   2017-12-25T03:21:02Z

add a configurable factor to describe HadoopFsRelation's size




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...

2017-12-24 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/19813
  
IMHO, in general, the output `ev.value` would be declared as a local variable by the parent, as
```scala
s"""${ctx.javaType(dataType)} ${ev.value} = ${ctx.defaultValue(dataType)};"""
```

Such cases cannot have an expression in `ev.value`.
As @viirya pointed out, I imagine there are a few scenarios. Would it be possible to show an example, and the place in the source code where an expression is used as the output, in order to correctly understand the issue?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20067: [SPARK-22894][SQL] DateTimeOperations should accept SQL ...

2017-12-24 Thread gczsjdy
Github user gczsjdy commented on the issue:

https://github.com/apache/spark/pull/20067
  
LGTM


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator f...

2017-12-24 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19527#discussion_r158615298
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoderEstimator.scala 
---
@@ -0,0 +1,479 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.attribute._
+import org.apache.spark.ml.linalg.Vectors
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared.{HasHandleInvalid, HasInputCols, 
HasOutputCols}
+import org.apache.spark.ml.util._
+import org.apache.spark.sql.{DataFrame, Dataset}
+import org.apache.spark.sql.expressions.UserDefinedFunction
+import org.apache.spark.sql.functions.{col, lit, udf}
+import org.apache.spark.sql.types.{DoubleType, NumericType, StructField, 
StructType}
+
+/** Private trait for params and common methods for OneHotEncoderEstimator 
and OneHotEncoderModel */
+private[ml] trait OneHotEncoderBase extends Params with HasHandleInvalid
+with HasInputCols with HasOutputCols {
+
+  /**
+   * Param for how to handle invalid data.
+   * Options are 'keep' (invalid data presented as an extra categorical 
feature) or
+   * 'error' (throw an error).
+   * Default: "error"
+   * @group param
+   */
+  @Since("2.3.0")
+  override val handleInvalid: Param[String] = new Param[String](this, 
"handleInvalid",
+"How to handle invalid data " +
+"Options are 'keep' (invalid data presented as an extra categorical 
feature) " +
+"or error (throw an error).",
+
ParamValidators.inArray(OneHotEncoderEstimator.supportedHandleInvalids))
+
+  setDefault(handleInvalid, OneHotEncoderEstimator.ERROR_INVALID)
+
+  /**
+   * Whether to drop the last category in the encoded vector (default: 
true)
+   * @group param
+   */
+  @Since("2.3.0")
+  final val dropLast: BooleanParam =
+new BooleanParam(this, "dropLast", "whether to drop the last category")
+  setDefault(dropLast -> true)
+
+  /** @group getParam */
+  @Since("2.3.0")
+  def getDropLast: Boolean = $(dropLast)
+
+  protected def validateAndTransformSchema(schema: StructType): StructType 
= {
+val inputColNames = $(inputCols)
+val outputColNames = $(outputCols)
+val existingFields = schema.fields
+
+require(inputColNames.length == outputColNames.length,
+  s"The number of input columns ${inputColNames.length} must be the 
same as the number of " +
+s"output columns ${outputColNames.length}.")
+
+inputColNames.zip(outputColNames).map { case (inputColName, 
outputColName) =>
+  require(schema(inputColName).dataType.isInstanceOf[NumericType],
+s"Input column must be of type NumericType but got 
${schema(inputColName).dataType}")
+  require(!existingFields.exists(_.name == outputColName),
+s"Output column $outputColName already exists.")
+}
+
+// Prepares output columns with proper attributes by examining input 
columns.
+val inputFields = $(inputCols).map(schema(_))
+val keepInvalid = $(handleInvalid) == 
OneHotEncoderEstimator.KEEP_INVALID
+
+val outputFields = inputFields.zip(outputColNames).map { case 
(inputField, outputColName) =>
+  OneHotEncoderCommon.transformOutputColumnSchema(
+inputField, $(dropLast), outputColName, keepInvalid)
+}
+StructType(schema.fields ++ outputFields)
+  }
+}
+
+/**
+ * A one-hot encoder that maps a column of category indices to a column of 
binary vectors, with
+ * at most a single one-value per row that indicates the input category 
index.
+ * For example with 5 categories, an input value of 2.0 would map to an 
output vector of
+ * `[0.0, 0.0, 1.0, 0.0

[GitHub] spark pull request #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator f...

2017-12-24 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19527#discussion_r158615273
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoderEstimator.scala 
---
@@ -0,0 +1,479 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.attribute._
+import org.apache.spark.ml.linalg.Vectors
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared.{HasHandleInvalid, HasInputCols, 
HasOutputCols}
+import org.apache.spark.ml.util._
+import org.apache.spark.sql.{DataFrame, Dataset}
+import org.apache.spark.sql.expressions.UserDefinedFunction
+import org.apache.spark.sql.functions.{col, lit, udf}
+import org.apache.spark.sql.types.{DoubleType, NumericType, StructField, 
StructType}
+
+/** Private trait for params and common methods for OneHotEncoderEstimator 
and OneHotEncoderModel */
+private[ml] trait OneHotEncoderBase extends Params with HasHandleInvalid
+with HasInputCols with HasOutputCols {
+
+  /**
+   * Param for how to handle invalid data.
+   * Options are 'keep' (invalid data presented as an extra categorical 
feature) or
+   * 'error' (throw an error).
+   * Default: "error"
+   * @group param
+   */
+  @Since("2.3.0")
+  override val handleInvalid: Param[String] = new Param[String](this, 
"handleInvalid",
+"How to handle invalid data " +
+"Options are 'keep' (invalid data presented as an extra categorical 
feature) " +
+"or error (throw an error).",
+
ParamValidators.inArray(OneHotEncoderEstimator.supportedHandleInvalids))
+
+  setDefault(handleInvalid, OneHotEncoderEstimator.ERROR_INVALID)
+
+  /**
+   * Whether to drop the last category in the encoded vector (default: 
true)
+   * @group param
+   */
+  @Since("2.3.0")
+  final val dropLast: BooleanParam =
+new BooleanParam(this, "dropLast", "whether to drop the last category")
+  setDefault(dropLast -> true)
+
+  /** @group getParam */
+  @Since("2.3.0")
+  def getDropLast: Boolean = $(dropLast)
+
+  protected def validateAndTransformSchema(schema: StructType): StructType 
= {
+val inputColNames = $(inputCols)
+val outputColNames = $(outputCols)
+val existingFields = schema.fields
+
+require(inputColNames.length == outputColNames.length,
+  s"The number of input columns ${inputColNames.length} must be the 
same as the number of " +
+s"output columns ${outputColNames.length}.")
+
+inputColNames.zip(outputColNames).map { case (inputColName, 
outputColName) =>
+  require(schema(inputColName).dataType.isInstanceOf[NumericType],
+s"Input column must be of type NumericType but got 
${schema(inputColName).dataType}")
+  require(!existingFields.exists(_.name == outputColName),
+s"Output column $outputColName already exists.")
+}
+
+// Prepares output columns with proper attributes by examining input 
columns.
+val inputFields = $(inputCols).map(schema(_))
+val keepInvalid = $(handleInvalid) == 
OneHotEncoderEstimator.KEEP_INVALID
+
+val outputFields = inputFields.zip(outputColNames).map { case 
(inputField, outputColName) =>
+  OneHotEncoderCommon.transformOutputColumnSchema(
+inputField, $(dropLast), outputColName, keepInvalid)
+}
+StructType(schema.fields ++ outputFields)
+  }
+}
+
+/**
+ * A one-hot encoder that maps a column of category indices to a column of 
binary vectors, with
+ * at most a single one-value per row that indicates the input category 
index.
+ * For example with 5 categories, an input value of 2.0 would map to an 
output vector of
+ * `[0.0, 0.0, 1.0, 0.0

[GitHub] spark issue #19977: [SPARK-22771][SQL] Concatenate binary inputs into a bina...

2017-12-24 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/19977
  
@gatorsmile ping


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19675: [SPARK-14540][BUILD] Support Scala 2.12 closures and Jav...

2017-12-24 Thread jvican
Github user jvican commented on the issue:

https://github.com/apache/spark/pull/19675
  
Is this issue partially or fully fixed? I could try to get it over the finish line for a nice start of the year.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator f...

2017-12-24 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/19527#discussion_r158610281
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoderEstimator.scala 
---
@@ -0,0 +1,479 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.attribute._
+import org.apache.spark.ml.linalg.Vectors
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared.{HasHandleInvalid, HasInputCols, 
HasOutputCols}
+import org.apache.spark.ml.util._
+import org.apache.spark.sql.{DataFrame, Dataset}
+import org.apache.spark.sql.expressions.UserDefinedFunction
+import org.apache.spark.sql.functions.{col, lit, udf}
+import org.apache.spark.sql.types.{DoubleType, NumericType, StructField, 
StructType}
+
+/** Private trait for params and common methods for OneHotEncoderEstimator 
and OneHotEncoderModel */
+private[ml] trait OneHotEncoderBase extends Params with HasHandleInvalid
+with HasInputCols with HasOutputCols {
+
+  /**
+   * Param for how to handle invalid data.
+   * Options are 'keep' (invalid data presented as an extra categorical 
feature) or
+   * 'error' (throw an error).
+   * Default: "error"
+   * @group param
+   */
+  @Since("2.3.0")
+  override val handleInvalid: Param[String] = new Param[String](this, 
"handleInvalid",
+"How to handle invalid data " +
+"Options are 'keep' (invalid data presented as an extra categorical 
feature) " +
+"or error (throw an error).",
+
ParamValidators.inArray(OneHotEncoderEstimator.supportedHandleInvalids))
+
+  setDefault(handleInvalid, OneHotEncoderEstimator.ERROR_INVALID)
+
+  /**
+   * Whether to drop the last category in the encoded vector (default: 
true)
+   * @group param
+   */
+  @Since("2.3.0")
+  final val dropLast: BooleanParam =
+new BooleanParam(this, "dropLast", "whether to drop the last category")
+  setDefault(dropLast -> true)
+
+  /** @group getParam */
+  @Since("2.3.0")
+  def getDropLast: Boolean = $(dropLast)
+
+  protected def validateAndTransformSchema(schema: StructType): StructType 
= {
+val inputColNames = $(inputCols)
+val outputColNames = $(outputCols)
+val existingFields = schema.fields
+
+require(inputColNames.length == outputColNames.length,
+  s"The number of input columns ${inputColNames.length} must be the 
same as the number of " +
+s"output columns ${outputColNames.length}.")
+
+inputColNames.zip(outputColNames).map { case (inputColName, 
outputColName) =>
+  require(schema(inputColName).dataType.isInstanceOf[NumericType],
+s"Input column must be of type NumericType but got 
${schema(inputColName).dataType}")
+  require(!existingFields.exists(_.name == outputColName),
+s"Output column $outputColName already exists.")
+}
+
+// Prepares output columns with proper attributes by examining input 
columns.
+val inputFields = $(inputCols).map(schema(_))
+val keepInvalid = $(handleInvalid) == 
OneHotEncoderEstimator.KEEP_INVALID
+
+val outputFields = inputFields.zip(outputColNames).map { case 
(inputField, outputColName) =>
+  OneHotEncoderCommon.transformOutputColumnSchema(
+inputField, $(dropLast), outputColName, keepInvalid)
+}
+StructType(schema.fields ++ outputFields)
+  }
+}
+
+/**
+ * A one-hot encoder that maps a column of category indices to a column of 
binary vectors, with
+ * at most a single one-value per row that indicates the input category 
index.
+ * For example with 5 categories, an input value of 2.0 would map to an 
output vector of
+ * `[0.0, 0.0, 1.0, 

[GitHub] spark issue #19683: [SPARK-21657][SQL] optimize explode quadratic memory con...

2017-12-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19683
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19683: [SPARK-21657][SQL] optimize explode quadratic memory con...

2017-12-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19683
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85362/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19683: [SPARK-21657][SQL] optimize explode quadratic memory con...

2017-12-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19683
  
**[Test build #85362 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85362/testReport)**
 for PR 19683 at commit 
[`93816b6`](https://github.com/apache/spark/commit/93816b68db8472b5285a18fae48741b190af34ca).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19683: [SPARK-21657][SQL] optimize explode quadratic memory con...

2017-12-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19683
  
**[Test build #85362 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85362/testReport)**
 for PR 19683 at commit 
[`93816b6`](https://github.com/apache/spark/commit/93816b68db8472b5285a18fae48741b190af34ca).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19683: [SPARK-21657][SQL] optimize explode quadratic memory con...

2017-12-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19683
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19683: [SPARK-21657][SQL] optimize explode quadratic memory con...

2017-12-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19683
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85361/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19683: [SPARK-21657][SQL] optimize explode quadratic memory con...

2017-12-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19683
  
**[Test build #85361 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85361/testReport)**
 for PR 19683 at commit 
[`42aa32d`](https://github.com/apache/spark/commit/42aa32d0057aaef8e598d780d460f73c7634cdf2).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19683: [SPARK-21657][SQL] optimize explode quadratic memory con...

2017-12-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19683
  
**[Test build #85361 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85361/testReport)**
 for PR 19683 at commit 
[`42aa32d`](https://github.com/apache/spark/commit/42aa32d0057aaef8e598d780d460f73c7634cdf2).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20002: [SPARK-22465][Core] Add a safety-check to RDD def...

2017-12-24 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/20002


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20002: [SPARK-22465][Core] Add a safety-check to RDD defaultPar...

2017-12-24 Thread mridulm
Github user mridulm commented on the issue:

https://github.com/apache/spark/pull/20002
  
Merged, thanks for fixing this @sujithjay !


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20068: [SPARK-17916][SQL] Fix empty string being parsed ...

2017-12-24 Thread aa8y
Github user aa8y commented on a diff in the pull request:

https://github.com/apache/spark/pull/20068#discussion_r158606107
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala
 ---
@@ -152,7 +152,7 @@ class CSVOptions(
 
writerSettings.setIgnoreLeadingWhitespaces(ignoreLeadingWhiteSpaceFlagInWrite)
 
writerSettings.setIgnoreTrailingWhitespaces(ignoreTrailingWhiteSpaceFlagInWrite)
 writerSettings.setNullValue(nullValue)
-writerSettings.setEmptyValue(nullValue)
+writerSettings.setEmptyValue("")
--- End diff --

I disagree. I don't think the previous behavior should be exposed as an option, because the previous behavior was a bug: it _always_ coerced empty values to `null`s. If `nullValue` was not set, it defaulted to `""`, which coerced `""` to `null`; the empty value being set to `""` had no effect in that case. If `nullValue` was set to something else, say `\N`, then the empty value was also set to `\N`, which resulted in parsing both `\N` and `""` to `null`, since `""` was no longer considered an empty value and coercing `""` to `null` is the Univocity parser's default.

Setting the empty value explicitly to the `""` literal ensures that an empty string is always parsed as an empty string, unless `nullValue` is unset or set to `""`, which is what people would do if they want `""` to be parsed as `null`, i.e. the old behavior.
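
To make this concrete, a sketch of the intended read-side behavior after the fix (our summary, not part of the patch; the path is illustrative):
```scala
// input CSV row: bar,,\N   read with nullValue = "\\N"
//   before the fix: "" -> null and \N -> null
//   after the fix:  "" -> ""   and \N -> null
val df = spark.read
  .option("nullValue", "\\N")
  .csv("/path/to/data.csv")
```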


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20002: [SPARK-22465][Core][WIP] Add a safety-check to RDD defau...

2017-12-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20002
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20002: [SPARK-22465][Core][WIP] Add a safety-check to RDD defau...

2017-12-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20002
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85360/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20002: [SPARK-22465][Core][WIP] Add a safety-check to RDD defau...

2017-12-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20002
  
**[Test build #85360 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85360/testReport)**
 for PR 20002 at commit 
[`3b08951`](https://github.com/apache/spark/commit/3b089518e66bc4facf7bc07db1d12663dd567393).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19929: [SPARK-22629][PYTHON] Add deterministic flag to pyspark ...

2017-12-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19929
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85359/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19929: [SPARK-22629][PYTHON] Add deterministic flag to pyspark ...

2017-12-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19929
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19929: [SPARK-22629][PYTHON] Add deterministic flag to pyspark ...

2017-12-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19929
  
**[Test build #85359 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85359/testReport)**
 for PR 19929 at commit 
[`47801c7`](https://github.com/apache/spark/commit/47801c7dc532aa9a19d59cdef1fe021c61a0b2c8).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20002: [SPARK-22465][Core][WIP] Add a safety-check to RDD defau...

2017-12-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20002
  
**[Test build #85360 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85360/testReport)**
 for PR 20002 at commit 
[`3b08951`](https://github.com/apache/spark/commit/3b089518e66bc4facf7bc07db1d12663dd567393).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20002: [SPARK-22465][Core][WIP] Add a safety-check to RDD defau...

2017-12-24 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20002
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20002: [SPARK-22465][Core][WIP] Add a safety-check to RDD defau...

2017-12-24 Thread sujithjay
Github user sujithjay commented on the issue:

https://github.com/apache/spark/pull/20002
  
The failed unit test (in HistoryServerSuite.scala) seems unrelated to this 
PR.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19929: [SPARK-22629][PYTHON] Add deterministic flag to pyspark ...

2017-12-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19929
  
**[Test build #85359 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85359/testReport)**
 for PR 19929 at commit 
[`47801c7`](https://github.com/apache/spark/commit/47801c7dc532aa9a19d59cdef1fe021c61a0b2c8).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19929: [SPARK-22629][PYTHON] Add deterministic flag to pyspark ...

2017-12-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19929
  
**[Test build #85358 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85358/testReport)**
 for PR 19929 at commit 
[`a40ba73`](https://github.com/apache/spark/commit/a40ba7384db1030b6facb14b741349da09562d1f).
 * This patch **fails to build**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19929: [SPARK-22629][PYTHON] Add deterministic flag to pyspark ...

2017-12-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19929
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19929: [SPARK-22629][PYTHON] Add deterministic flag to pyspark ...

2017-12-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19929
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85358/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20071: SPARK-22896 Improvement in String interpolation | Graphx

2017-12-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20071
  
Can one of the admins verify this patch?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20071: SPARK-22896 Improvement in String interpolation |...

2017-12-24 Thread chetkhatri
GitHub user chetkhatri opened a pull request:

https://github.com/apache/spark/pull/20071

SPARK-22896 Improvement in String interpolation | Graphx

## What changes were proposed in this pull request?
* String interpolation corrected to idiomatic Scala style.
## How was this patch tested?
* Manually tested
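
The change is presumably of this shape (illustrative, not taken from the diff):

```scala
val vertexId = 42L
// before: Java-style concatenation
val before = "Vertex " + vertexId + " not found"
// after: Scala string interpolation
val after = s"Vertex $vertexId not found"
```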

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/chetkhatri/spark graphx-contrib

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20071.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20071


commit 9916fd1f67234b1fa5608231181bdf3b08718981
Author: chetkhatri 
Date:   2017-12-24T08:33:49Z

SPARK-22896 Improvement in String interpolation

commit 162ac276cfb5aa3215e3e0bdf2723f3e7aacf7d5
Author: chetkhatri 
Date:   2017-12-24T08:37:25Z

SPARK-22896 Improvement in String interpolation - fixed typo

commit aa2de00b62f920c8691c81a085402533f76c036d
Author: chetkhatri 
Date:   2017-12-24T08:54:43Z

Merge branch 'master' of https://github.com/apache/spark into 
mllib-chetan-contrib

commit 8186a34178b108f71bea4f7b21080a2b527b445e
Author: chetkhatri 
Date:   2017-12-24T11:25:56Z

SPARK-22896 Improvement in String interpolation




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19929: [SPARK-22629][PYTHON] Add deterministic flag to pyspark ...

2017-12-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19929
  
**[Test build #85358 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85358/testReport)**
 for PR 19929 at commit 
[`a40ba73`](https://github.com/apache/spark/commit/a40ba7384db1030b6facb14b741349da09562d1f).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20069: [SPARK-22895] [SQL] Push down the deterministic predicat...

2017-12-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20069
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85356/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20069: [SPARK-22895] [SQL] Push down the deterministic predicat...

2017-12-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20069
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20069: [SPARK-22895] [SQL] Push down the deterministic predicat...

2017-12-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20069
  
**[Test build #85356 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85356/testReport)**
 for PR 20069 at commit 
[`ad6607c`](https://github.com/apache/spark/commit/ad6607c642ffac811f0fa84d9256524676c9c75e).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20002: [SPARK-22465][Core][WIP] Add a safety-check to RDD defau...

2017-12-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20002
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20002: [SPARK-22465][Core][WIP] Add a safety-check to RDD defau...

2017-12-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20002
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85357/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20002: [SPARK-22465][Core][WIP] Add a safety-check to RDD defau...

2017-12-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20002
  
**[Test build #85357 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85357/testReport)**
 for PR 20002 at commit 
[`3b08951`](https://github.com/apache/spark/commit/3b089518e66bc4facf7bc07db1d12663dd567393).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20023: [SPARK-22036][SQL] Decimal multiplication with high prec...

2017-12-24 Thread mgaido91
Github user mgaido91 commented on the issue:

https://github.com/apache/spark/pull/20023
  
Thanks for your analysis @gatorsmile. Actually, the rule you specified for Oracle is what it uses when casting, rather than when doing arithmetic operations. Yes, DB2 has rather different rules for defining the output type of operations. Anyway, we can get a behavior practically identical to DB2's by changing the value of `MINIMUM_ADJUSTED_SCALE` to 31. Therefore, instead of using the configuration you pointed out, I'd propose a configuration for the `MINIMUM_ADJUSTED_SCALE`: by changing it we can get both the behavior of Hive and SQLServer and the one of DB2. What do you think?

The reason I am suggesting this is that my first concern is not Hive compliance, but SQL standard compliance. Indeed, as you can see from the summary, on point 1 there is no uniform behavior (which is fine per the SQL standard, since it leaves this open). But on point 2 we are the only ones who are not compliant with the SQL standard, and having this behavior by default doesn't seem like the right thing to do IMHO. On point 3, only we and Hive are not compliant, so I think that should be changed too. But in that case, we can't use the same flag, because it would be inconsistent. What do you think?

I can understand and agree that losing precision looks scary. But to me, returning `NULL` is even scarier, if possible: indeed, `NULL` is what should be returned when either of the two operands is `NULL`, so queries written for other DBs that rely on this might silently return very bad results. For instance, consider a report where we join a prices table and a sold_product table per country. In this use case, we can assume that if the result is `NULL`, there was no sold product in that country, and so we coalesce the output of the multiplication to 0. This would work well on any DB but Spark. With my proposal of tuning the `MINIMUM_ADJUSTED_SCALE`, each customer can decide (query by query) how much precision loss they can tolerate. And if we agree to change the point 3 behavior to the SQL standard, then when it is not possible to meet their requirements we throw an exception, giving them the choice about what to do: allow more precision loss, change their input data type, etc. This is the safer way IMHO.
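
To illustrate the `NULL` concern with a concrete case (a sketch; the result type below follows the current p1+p2+1 / s1+s2 rule capped at 38):
```scala
// 123.45 * 678.90 = 83810.2050, but the result type DECIMAL(38,36) leaves only
// two digits before the decimal point, so Spark currently returns NULL where
// most other databases would round the result instead.
spark.sql(
  "SELECT CAST(123.45 AS DECIMAL(38,18)) * CAST(678.90 AS DECIMAL(38,18)) AS p"
).show()
```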

I would be happy to help improve the test cases. May I just kindly ask how you meant to do that? What would you like to see tested more? Would you like me to add more test cases within the scope of this PR, or to open a new one for that?

Thank you for taking the time to read my long messages. I just want us to make the best choice, and to give you all the elements I have so we can decide for the best together.
Thank you.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


