[GitHub] spark issue #20904: [SPARK-23751][ML][PySpark] Kolmogorov-Smirnoff test Pyth...

2018-04-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20904
  
Merged build finished. Test PASSed.





[GitHub] spark issue #20904: [SPARK-23751][ML][PySpark] Kolmogorov-Smirnoff test Pyth...

2018-04-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20904
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89041/
Test PASSed.





[GitHub] spark issue #20904: [SPARK-23751][ML][PySpark] Kolmogorov-Smirnoff test Pyth...

2018-04-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20904
  
**[Test build #89041 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89041/testReport)**
 for PR 20904 at commit 
[`e999d60`](https://github.com/apache/spark/commit/e999d6079097eed4964b1d112e96f2e8bd115f10).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #20923: [SPARK-23807][BUILD][WIP] Add Hadoop 3 profile with rele...

2018-04-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20923
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89038/
Test PASSed.





[GitHub] spark issue #20923: [SPARK-23807][BUILD][WIP] Add Hadoop 3 profile with rele...

2018-04-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20923
  
Merged build finished. Test PASSed.





[GitHub] spark issue #20923: [SPARK-23807][BUILD][WIP] Add Hadoop 3 profile with rele...

2018-04-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20923
  
**[Test build #89038 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89038/testReport)**
 for PR 20923 at commit 
[`7c93d98`](https://github.com/apache/spark/commit/7c93d98aae8d74e0f0606cb03e68b0ac94bde177).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #20925: [SPARK-22941][core] Do not exit JVM when submit f...

2018-04-08 Thread squito
Github user squito commented on a diff in the pull request:

https://github.com/apache/spark/pull/20925#discussion_r179987224
  
--- Diff: 
launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java 
---
@@ -154,9 +165,17 @@
 
   List<String> buildSparkSubmitArgs() {
 List<String> args = new ArrayList<>();
-SparkSubmitOptionParser parser = new SparkSubmitOptionParser();
+OptionParser parser = new OptionParser(false);
--- End diff --

What's the reason for allowing unknown args here? Is it so an old launcher 
can start a newer Spark, which may accept more args? A comment would be helpful.





[GitHub] spark pull request #20925: [SPARK-22941][core] Do not exit JVM when submit f...

2018-04-08 Thread squito
Github user squito commented on a diff in the pull request:

https://github.com/apache/spark/pull/20925#discussion_r179987409
  
--- Diff: 
launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java 
---
@@ -154,9 +165,17 @@
 
   List<String> buildSparkSubmitArgs() {
 List<String> args = new ArrayList<>();
-SparkSubmitOptionParser parser = new SparkSubmitOptionParser();
+OptionParser parser = new OptionParser(false);
+boolean isStartingApp = isAppResourceReq;
+
+// If the user args array is not empty, we need to parse it to detect 
exactly what
+// the user is trying to run, so that checks below are correct.
+if (!userArgs.isEmpty()) {
+  parser.parse(userArgs);
+  isStartingApp = parser.isAppResourceReq;
--- End diff --

I don't really care whether the name is `isStartingApp` or 
`isAppResourceReq`, but it seems the name should be the same here and in OptionParser, 
unless there is some difference I'm missing.





[GitHub] spark pull request #20925: [SPARK-22941][core] Do not exit JVM when submit f...

2018-04-08 Thread squito
Github user squito commented on a diff in the pull request:

https://github.com/apache/spark/pull/20925#discussion_r179952361
  
--- Diff: 
core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala ---
@@ -775,17 +781,17 @@ class SparkSubmitSuite
   }
 
   test("SPARK_CONF_DIR overrides spark-defaults.conf") {
-forConfDir(Map("spark.executor.memory" -> "2.3g")) { path =>
+forConfDir(Map("spark.executor.memory" -> "3g")) { path =>
--- End diff --

Why this change? Do you no longer support fractional values?





[GitHub] spark issue #20981: [SPARK-23873][SQL] Use accessors in interpreted LambdaVa...

2018-04-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20981
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2083/
Test PASSed.





[GitHub] spark issue #20981: [SPARK-23873][SQL] Use accessors in interpreted LambdaVa...

2018-04-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20981
  
Merged build finished. Test PASSed.





[GitHub] spark issue #20904: [SPARK-23751][ML][PySpark] Kolmogorov-Smirnoff test Pyth...

2018-04-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20904
  
Merged build finished. Test PASSed.





[GitHub] spark issue #20904: [SPARK-23751][ML][PySpark] Kolmogorov-Smirnoff test Pyth...

2018-04-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20904
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2082/
Test PASSed.





[GitHub] spark issue #20981: [SPARK-23873][SQL] Use accessors in interpreted LambdaVa...

2018-04-08 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/20981
  
@hvanhovell Addressed your comment. Thanks.





[GitHub] spark issue #20981: [SPARK-23873][SQL] Use accessors in interpreted LambdaVa...

2018-04-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20981
  
**[Test build #89042 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89042/testReport)**
 for PR 20981 at commit 
[`2eb2bf1`](https://github.com/apache/spark/commit/2eb2bf1853a0ba4de8f4a3adfe8407d04a075b22).





[GitHub] spark issue #20981: [SPARK-23873][SQL] Use accessors in interpreted LambdaVa...

2018-04-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20981
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2081/
Test PASSed.





[GitHub] spark issue #20981: [SPARK-23873][SQL] Use accessors in interpreted LambdaVa...

2018-04-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20981
  
Merged build finished. Test PASSed.





[GitHub] spark issue #20981: [SPARK-23873][SQL] Use accessors in interpreted LambdaVa...

2018-04-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20981
  
**[Test build #89040 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89040/testReport)**
 for PR 20981 at commit 
[`a8cdbe8`](https://github.com/apache/spark/commit/a8cdbe8baf2d508fb2583862042f1213cf0eae7b).





[GitHub] spark issue #20904: [SPARK-23751][ML][PySpark] Kolmogorov-Smirnoff test Pyth...

2018-04-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20904
  
**[Test build #89041 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89041/testReport)**
 for PR 20904 at commit 
[`e999d60`](https://github.com/apache/spark/commit/e999d6079097eed4964b1d112e96f2e8bd115f10).





[GitHub] spark issue #20904: [SPARK-23751][ML][PySpark] Kolmogorov-Smirnoff test Pyth...

2018-04-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20904
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2080/
Test PASSed.





[GitHub] spark issue #20904: [SPARK-23751][ML][PySpark] Kolmogorov-Smirnoff test Pyth...

2018-04-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20904
  
Build finished. Test PASSed.





[GitHub] spark issue #20904: [SPARK-23751][ML][PySpark] Kolmogorov-Smirnoff test Pyth...

2018-04-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20904
  
**[Test build #89039 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89039/testReport)**
 for PR 20904 at commit 
[`49a7ddb`](https://github.com/apache/spark/commit/49a7ddb45cb9a0035e3faed5906ecd37890333e1).





[GitHub] spark issue #20319: [SPARK-22884][ML][TESTS] ML test for StructuredStreaming...

2018-04-08 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/20319
  
@smurakozi Thanks for the PR! Could you resolve the conflicts first? Then I 
will make a review. If you're busy, I can also take it over.





[GitHub] spark pull request #20940: [SPARK-23429][CORE] Add executor memory metrics t...

2018-04-08 Thread edwinalu
Github user edwinalu commented on a diff in the pull request:

https://github.com/apache/spark/pull/20940#discussion_r179978448
  
--- Diff: core/src/main/scala/org/apache/spark/executor/Executor.scala ---
@@ -772,6 +772,12 @@ private[spark] class Executor(
 val accumUpdates = new ArrayBuffer[(Long, Seq[AccumulatorV2[_, _]])]()
 val curGCTime = computeTotalGcTime()
 
+// get executor level memory metrics
+val executorUpdates = new ExecutorMetrics(System.currentTimeMillis(),
+  ManagementFactory.getMemoryMXBean.getHeapMemoryUsage().getUsed(),
--- End diff --

It would be useful to have more information about offheap memory usage. 
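
As a rough illustration of the kind of off-heap information that could be collected, here is a minimal, self-contained Scala sketch that reads heap, non-heap, and direct/mapped buffer-pool usage from the standard JMX beans. The metric names and the idea of reporting them this way are assumptions for illustration, not the PR's actual change:

```scala
import java.lang.management.{BufferPoolMXBean, ManagementFactory}
import scala.collection.JavaConverters._

object MemoryMetricsSketch {
  /** Collect heap, non-heap (metaspace, code cache, ...) and buffer-pool usage in bytes. */
  def collect(): Map[String, Long] = {
    val memoryBean = ManagementFactory.getMemoryMXBean
    // The "direct" pool covers off-heap ByteBuffers; "mapped" covers memory-mapped files.
    val bufferPools = ManagementFactory
      .getPlatformMXBeans(classOf[BufferPoolMXBean]).asScala
      .map(pool => s"${pool.getName}PoolUsed" -> pool.getMemoryUsed)
      .toMap

    Map(
      "jvmHeapUsed" -> memoryBean.getHeapMemoryUsage.getUsed,
      "jvmNonHeapUsed" -> memoryBean.getNonHeapMemoryUsage.getUsed
    ) ++ bufferPools
  }
}
```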





[GitHub] spark pull request #20940: [SPARK-23429][CORE] Add executor memory metrics t...

2018-04-08 Thread edwinalu
Github user edwinalu commented on a diff in the pull request:

https://github.com/apache/spark/pull/20940#discussion_r179978186
  
--- Diff: 
core/src/test/scala/org/apache/spark/scheduler/EventLoggingListenerSuite.scala 
---
@@ -251,6 +260,163 @@ class EventLoggingListenerSuite extends SparkFunSuite 
with LocalSparkContext wit
 }
   }
 
+  /**
+   * Test executor metrics update logging functionality. This checks that a
+   * SparkListenerExecutorMetricsUpdate event is added to the Spark history
+   * log if one of the executor metrics is larger than any previously
+   * recorded value for the metric, per executor per stage. The task 
metrics
+   * should not be added.
+   */
+  private def testExecutorMetricsUpdateEventLogging() {
+val conf = getLoggingConf(testDirPath, None)
+val logName = "executorMetricsUpdated-test"
+val eventLogger = new EventLoggingListener(logName, None, 
testDirPath.toUri(), conf)
+val listenerBus = new LiveListenerBus(conf)
+
+// list of events and if they should be logged
+val events = Array(
+  (SparkListenerApplicationStart("executionMetrics", None,
+1L, "update", None), true),
+  (createExecutorAddedEvent(1), true),
+  (createExecutorAddedEvent(2), true),
+  (createStageSubmittedEvent(0), true),
+  (createExecutorMetricsUpdateEvent(1, 10L, 5000L, 50L, 0L, 0L, 0L), 
true), // new stage
+  (createExecutorMetricsUpdateEvent(2, 10L, 3500L, 20L, 0L, 0L, 0L), 
true), // new stage
+  (createExecutorMetricsUpdateEvent(1, 15L, 4000L, 50L, 0L, 0L, 0L), 
false),
+  (createExecutorMetricsUpdateEvent(2, 15L, 3500L, 10L, 0L, 20L, 0L), 
true), // onheap storage
+  (createExecutorMetricsUpdateEvent(1, 20L, 6000L, 50L, 0L, 30L, 0L), 
true), // JVM used
+  (createExecutorMetricsUpdateEvent(2, 20L, 3500L, 15L, 0L, 20L, 0L), 
true), // onheap unified
+  (createStageSubmittedEvent(1), true),
+  (createExecutorMetricsUpdateEvent(1, 25L, 3000L, 15L, 0L, 0L, 0L), 
true), // new stage
+  (createExecutorMetricsUpdateEvent(2, 25L, 6000L, 50L, 0L, 0L, 0L), 
true), // new stage
+  (createStageCompletedEvent(0), true),
+  (createExecutorMetricsUpdateEvent(1, 30L, 3000L, 20L, 0L, 0L, 0L), 
true), // onheap execution
+  (createExecutorMetricsUpdateEvent(2, 30L, 5500L, 20L, 0L, 0L, 0L), 
false),
+  (createExecutorMetricsUpdateEvent(1, 35L, 3000L, 5L, 25L, 0L, 0L), 
true), // offheap execution
+  (createExecutorMetricsUpdateEvent(2, 35L, 5500L, 25L, 0L, 0L, 30L), 
true), // offheap storage
+  (createExecutorMetricsUpdateEvent(1, 40L, 3000L, 8L, 20L, 0L, 0L), 
false),
+  (createExecutorMetricsUpdateEvent(2, 40L, 5500L, 25L, 0L, 0L, 30L), 
false),
+  (createStageCompletedEvent(1), true),
+  (SparkListenerApplicationEnd(1000L), true))
+
+// play the events for the event logger
+eventLogger.start()
+listenerBus.start(Mockito.mock(classOf[SparkContext]), 
Mockito.mock(classOf[MetricsSystem]))
+listenerBus.addToEventLogQueue(eventLogger)
+for ((event, included) <- events) {
+  listenerBus.post(event)
+}
+listenerBus.stop()
+eventLogger.stop()
+
+// Verify the log file contains the expected events
+val logData = EventLoggingListener.openEventLog(new 
Path(eventLogger.logPath), fileSystem)
+try {
+  val lines = readLines(logData)
+  val logStart = SparkListenerLogStart(SPARK_VERSION)
+  assert(lines.size === 19)
+  assert(lines(0).contains("SparkListenerLogStart"))
+  assert(lines(1).contains("SparkListenerApplicationStart"))
+  assert(JsonProtocol.sparkEventFromJson(parse(lines(0))) === logStart)
+  var i = 1
+  for ((event, included) <- events) {
+if (included) {
+  checkEvent(lines(i), event)
+  i += 1
+}
+  }
+} finally {
+  logData.close()
+}
+  }
+
+  /** Create a stage submitted event for the specified stage Id. */
+  private def createStageSubmittedEvent(stageId: Int) =
+SparkListenerStageSubmitted(new StageInfo(stageId, 0, 
stageId.toString, 0,
+  Seq.empty, Seq.empty, "details"))
+
+  /** Create a stage completed event for the specified stage Id. */
+  private def createStageCompletedEvent(stageId: Int) =
+SparkListenerStageCompleted(new StageInfo(stageId, 0, 
stageId.toString, 0,
+  Seq.empty, Seq.empty, "details"))
+
+  /** Create an executor added event for the specified executor Id. */
+  private def createExecutorAddedEvent(executorId: Int) =
+SparkListenerExecutorAdded(0L, executorId.toString, new 
ExecutorInfo("host1", 1, Map.empty))
+
+  /** Create an executor metrics update 

[GitHub] spark pull request #20940: [SPARK-23429][CORE] Add executor memory metrics t...

2018-04-08 Thread edwinalu
Github user edwinalu commented on a diff in the pull request:

https://github.com/apache/spark/pull/20940#discussion_r179978222
  
--- Diff: 
core/src/test/scala/org/apache/spark/scheduler/EventLoggingListenerSuite.scala 
---
@@ -251,6 +260,163 @@ class EventLoggingListenerSuite extends SparkFunSuite 
with LocalSparkContext wit
 }
   }
 
+  /**
+   * Test executor metrics update logging functionality. This checks that a
+   * SparkListenerExecutorMetricsUpdate event is added to the Spark history
+   * log if one of the executor metrics is larger than any previously
+   * recorded value for the metric, per executor per stage. The task 
metrics
+   * should not be added.
+   */
+  private def testExecutorMetricsUpdateEventLogging() {
+val conf = getLoggingConf(testDirPath, None)
+val logName = "executorMetricsUpdated-test"
+val eventLogger = new EventLoggingListener(logName, None, 
testDirPath.toUri(), conf)
+val listenerBus = new LiveListenerBus(conf)
+
+// list of events and if they should be logged
+val events = Array(
+  (SparkListenerApplicationStart("executionMetrics", None,
+1L, "update", None), true),
+  (createExecutorAddedEvent(1), true),
+  (createExecutorAddedEvent(2), true),
+  (createStageSubmittedEvent(0), true),
+  (createExecutorMetricsUpdateEvent(1, 10L, 5000L, 50L, 0L, 0L, 0L), 
true), // new stage
+  (createExecutorMetricsUpdateEvent(2, 10L, 3500L, 20L, 0L, 0L, 0L), 
true), // new stage
+  (createExecutorMetricsUpdateEvent(1, 15L, 4000L, 50L, 0L, 0L, 0L), 
false),
+  (createExecutorMetricsUpdateEvent(2, 15L, 3500L, 10L, 0L, 20L, 0L), 
true), // onheap storage
+  (createExecutorMetricsUpdateEvent(1, 20L, 6000L, 50L, 0L, 30L, 0L), 
true), // JVM used
+  (createExecutorMetricsUpdateEvent(2, 20L, 3500L, 15L, 0L, 20L, 0L), 
true), // onheap unified
+  (createStageSubmittedEvent(1), true),
+  (createExecutorMetricsUpdateEvent(1, 25L, 3000L, 15L, 0L, 0L, 0L), 
true), // new stage
+  (createExecutorMetricsUpdateEvent(2, 25L, 6000L, 50L, 0L, 0L, 0L), 
true), // new stage
+  (createStageCompletedEvent(0), true),
+  (createExecutorMetricsUpdateEvent(1, 30L, 3000L, 20L, 0L, 0L, 0L), 
true), // onheap execution
+  (createExecutorMetricsUpdateEvent(2, 30L, 5500L, 20L, 0L, 0L, 0L), 
false),
+  (createExecutorMetricsUpdateEvent(1, 35L, 3000L, 5L, 25L, 0L, 0L), 
true), // offheap execution
+  (createExecutorMetricsUpdateEvent(2, 35L, 5500L, 25L, 0L, 0L, 30L), 
true), // offheap storage
+  (createExecutorMetricsUpdateEvent(1, 40L, 3000L, 8L, 20L, 0L, 0L), 
false),
+  (createExecutorMetricsUpdateEvent(2, 40L, 5500L, 25L, 0L, 0L, 30L), 
false),
+  (createStageCompletedEvent(1), true),
+  (SparkListenerApplicationEnd(1000L), true))
+
+// play the events for the event logger
+eventLogger.start()
+listenerBus.start(Mockito.mock(classOf[SparkContext]), 
Mockito.mock(classOf[MetricsSystem]))
+listenerBus.addToEventLogQueue(eventLogger)
+for ((event, included) <- events) {
+  listenerBus.post(event)
+}
+listenerBus.stop()
+eventLogger.stop()
+
+// Verify the log file contains the expected events
+val logData = EventLoggingListener.openEventLog(new 
Path(eventLogger.logPath), fileSystem)
+try {
+  val lines = readLines(logData)
+  val logStart = SparkListenerLogStart(SPARK_VERSION)
+  assert(lines.size === 19)
+  assert(lines(0).contains("SparkListenerLogStart"))
+  assert(lines(1).contains("SparkListenerApplicationStart"))
+  assert(JsonProtocol.sparkEventFromJson(parse(lines(0))) === logStart)
+  var i = 1
+  for ((event, included) <- events) {
+if (included) {
+  checkEvent(lines(i), event)
+  i += 1
+}
+  }
+} finally {
+  logData.close()
+}
+  }
+
+  /** Create a stage submitted event for the specified stage Id. */
+  private def createStageSubmittedEvent(stageId: Int) =
+SparkListenerStageSubmitted(new StageInfo(stageId, 0, 
stageId.toString, 0,
+  Seq.empty, Seq.empty, "details"))
+
+  /** Create a stage completed event for the specified stage Id. */
+  private def createStageCompletedEvent(stageId: Int) =
+SparkListenerStageCompleted(new StageInfo(stageId, 0, 
stageId.toString, 0,
+  Seq.empty, Seq.empty, "details"))
+
+  /** Create an executor added event for the specified executor Id. */
+  private def createExecutorAddedEvent(executorId: Int) =
+SparkListenerExecutorAdded(0L, executorId.toString, new 
ExecutorInfo("host1", 1, Map.empty))
+
+  /** Create an executor metrics update 

[GitHub] spark pull request #20940: [SPARK-23429][CORE] Add executor memory metrics t...

2018-04-08 Thread edwinalu
Github user edwinalu commented on a diff in the pull request:

https://github.com/apache/spark/pull/20940#discussion_r179978192
  
--- Diff: 
core/src/test/scala/org/apache/spark/scheduler/EventLoggingListenerSuite.scala 
---
@@ -251,6 +260,163 @@ class EventLoggingListenerSuite extends SparkFunSuite 
with LocalSparkContext wit
 }
   }
 
+  /**
+   * Test executor metrics update logging functionality. This checks that a
+   * SparkListenerExecutorMetricsUpdate event is added to the Spark history
+   * log if one of the executor metrics is larger than any previously
+   * recorded value for the metric, per executor per stage. The task 
metrics
+   * should not be added.
+   */
+  private def testExecutorMetricsUpdateEventLogging() {
+val conf = getLoggingConf(testDirPath, None)
+val logName = "executorMetricsUpdated-test"
+val eventLogger = new EventLoggingListener(logName, None, 
testDirPath.toUri(), conf)
+val listenerBus = new LiveListenerBus(conf)
+
+// list of events and if they should be logged
+val events = Array(
+  (SparkListenerApplicationStart("executionMetrics", None,
+1L, "update", None), true),
+  (createExecutorAddedEvent(1), true),
+  (createExecutorAddedEvent(2), true),
+  (createStageSubmittedEvent(0), true),
+  (createExecutorMetricsUpdateEvent(1, 10L, 5000L, 50L, 0L, 0L, 0L), 
true), // new stage
+  (createExecutorMetricsUpdateEvent(2, 10L, 3500L, 20L, 0L, 0L, 0L), 
true), // new stage
+  (createExecutorMetricsUpdateEvent(1, 15L, 4000L, 50L, 0L, 0L, 0L), 
false),
+  (createExecutorMetricsUpdateEvent(2, 15L, 3500L, 10L, 0L, 20L, 0L), 
true), // onheap storage
+  (createExecutorMetricsUpdateEvent(1, 20L, 6000L, 50L, 0L, 30L, 0L), 
true), // JVM used
+  (createExecutorMetricsUpdateEvent(2, 20L, 3500L, 15L, 0L, 20L, 0L), 
true), // onheap unified
+  (createStageSubmittedEvent(1), true),
+  (createExecutorMetricsUpdateEvent(1, 25L, 3000L, 15L, 0L, 0L, 0L), 
true), // new stage
+  (createExecutorMetricsUpdateEvent(2, 25L, 6000L, 50L, 0L, 0L, 0L), 
true), // new stage
+  (createStageCompletedEvent(0), true),
+  (createExecutorMetricsUpdateEvent(1, 30L, 3000L, 20L, 0L, 0L, 0L), 
true), // onheap execution
+  (createExecutorMetricsUpdateEvent(2, 30L, 5500L, 20L, 0L, 0L, 0L), 
false),
+  (createExecutorMetricsUpdateEvent(1, 35L, 3000L, 5L, 25L, 0L, 0L), 
true), // offheap execution
+  (createExecutorMetricsUpdateEvent(2, 35L, 5500L, 25L, 0L, 0L, 30L), 
true), // offheap storage
+  (createExecutorMetricsUpdateEvent(1, 40L, 3000L, 8L, 20L, 0L, 0L), 
false),
+  (createExecutorMetricsUpdateEvent(2, 40L, 5500L, 25L, 0L, 0L, 30L), 
false),
+  (createStageCompletedEvent(1), true),
+  (SparkListenerApplicationEnd(1000L), true))
+
+// play the events for the event logger
+eventLogger.start()
+listenerBus.start(Mockito.mock(classOf[SparkContext]), 
Mockito.mock(classOf[MetricsSystem]))
+listenerBus.addToEventLogQueue(eventLogger)
+for ((event, included) <- events) {
+  listenerBus.post(event)
+}
+listenerBus.stop()
+eventLogger.stop()
+
+// Verify the log file contains the expected events
+val logData = EventLoggingListener.openEventLog(new 
Path(eventLogger.logPath), fileSystem)
+try {
+  val lines = readLines(logData)
+  val logStart = SparkListenerLogStart(SPARK_VERSION)
+  assert(lines.size === 19)
+  assert(lines(0).contains("SparkListenerLogStart"))
+  assert(lines(1).contains("SparkListenerApplicationStart"))
+  assert(JsonProtocol.sparkEventFromJson(parse(lines(0))) === logStart)
+  var i = 1
+  for ((event, included) <- events) {
+if (included) {
+  checkEvent(lines(i), event)
+  i += 1
+}
+  }
+} finally {
+  logData.close()
+}
+  }
+
+  /** Create a stage submitted event for the specified stage Id. */
+  private def createStageSubmittedEvent(stageId: Int) =
--- End diff --

Done.





[GitHub] spark pull request #20940: [SPARK-23429][CORE] Add executor memory metrics t...

2018-04-08 Thread edwinalu
Github user edwinalu commented on a diff in the pull request:

https://github.com/apache/spark/pull/20940#discussion_r179978182
  
--- Diff: core/src/main/scala/org/apache/spark/status/LiveEntity.scala ---
@@ -268,6 +268,9 @@ private class LiveExecutor(val executorId: String, 
_addTime: Long) extends LiveE
 
   def hasMemoryInfo: Boolean = totalOnHeap >= 0L
 
+  // peak values for executor level metrics
+  var peakExecutorMetrics = new PeakExecutorMetrics
--- End diff --

Yes, thanks for catching this, it should be `val`.

This is more properly part of SPARK-23431, and I can remove it for now.





[GitHub] spark pull request #20940: [SPARK-23429][CORE] Add executor memory metrics t...

2018-04-08 Thread edwinalu
Github user edwinalu commented on a diff in the pull request:

https://github.com/apache/spark/pull/20940#discussion_r179978102
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala ---
@@ -234,8 +244,22 @@ private[spark] class EventLoggingListener(
 }
   }
 
-  // No-op because logging every update would be overkill
-  override def onExecutorMetricsUpdate(event: 
SparkListenerExecutorMetricsUpdate): Unit = { }
+  /**
+   * Log if there is a new peak value for one of the memory metrics for 
the given executor.
+   * Metrics are cleared out when a new stage is started in 
onStageSubmitted, so this will
+   * log new peak memory metric values per executor per stage.
+   */
+  override def onExecutorMetricsUpdate(event: 
SparkListenerExecutorMetricsUpdate): Unit = {
--- End diff --

For a longer-running stage, once it ramps up, hopefully there wouldn't be a 
lot of new peak values. Looking at a subset of our applications, the extra 
logging overhead has mostly been between 0.25% and 1%, but it can be as high as 8%. 

By logging each peak value at the time it occurs (and reinitializing when 
a stage starts), it's possible to tell which stages are active at the time, and 
it would potentially be possible to graph these changes on a timeline; this 
information wouldn't be available if the metrics were only logged at stage end, 
since the times would be lost.

Logging at stage end would limit the amount of extra logging. If we add 
more metrics (such as for offheap memory), then there could be more new peaks and more 
extra logging with the current approach. Excess logging is a concern, and I can 
move to stage end if the overhead is too much.
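
To make the trade-off concrete, here is a small, self-contained sketch of the "log only new peak values, reset on stage submission" idea described above. The field names and the simplified two-metric snapshot are assumptions for illustration, not the PR's actual code:

```scala
import scala.collection.mutable

// Simplified stand-in for the executor-level metrics carried by an update event.
case class MetricsSnapshot(jvmUsedMemory: Long, onHeapExecutionMemory: Long)

class PeakTracker {
  // Highest values seen so far, per executor, since the last stage submission.
  private val peaks = mutable.Map.empty[String, MetricsSnapshot]

  /** Called for each metrics update; returns true only when some metric exceeds its previous peak. */
  def shouldLog(executorId: String, m: MetricsSnapshot): Boolean = {
    val old = peaks.getOrElse(executorId, MetricsSnapshot(0L, 0L))
    val merged = MetricsSnapshot(
      math.max(old.jvmUsedMemory, m.jvmUsedMemory),
      math.max(old.onHeapExecutionMemory, m.onHeapExecutionMemory))
    peaks(executorId) = merged
    merged != old
  }

  /** Reset the peaks when a new stage is submitted, so peaks are tracked per executor per stage. */
  def onStageSubmitted(): Unit = peaks.clear()
}
```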





[GitHub] spark pull request #21005: [SPARK-23898][SQL] Simplify add & subtract code g...

2018-04-08 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/21005#discussion_r179971588
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala
 ---
@@ -117,16 +117,22 @@ abstract class BinaryArithmetic extends 
BinaryOperator with NullIntolerant {
 
   override def dataType: DataType = left.dataType
 
-  override lazy val resolved = childrenResolved && 
checkInputDataTypes().isSuccess
+  override lazy val resolved: Boolean = childrenResolved && 
checkInputDataTypes().isSuccess
 
   /** Name of the function for this expression on a [[Decimal]] type. */
   def decimalMethod: String =
 sys.error("BinaryArithmetics must override either decimalMethod or 
genCode")
 
+  /** Name of the function for this expression on a [[CalendarInterval]] 
type. */
+  def calendarIntervalMethod: String =
+sys.error("BinaryArithmetics must override either 
calendarIntervalMethod or genCode")
+
   override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = 
dataType match {
-case dt: DecimalType =>
+case _: DecimalType =>
   defineCodeGen(ctx, ev, (eval1, eval2) => 
s"$eval1.$decimalMethod($eval2)")
 // byte and short are casted into int when add, minus, times or divide
+case CalendarIntervalType =>
+  defineCodeGen(ctx, ev, (eval1, eval2) => 
s"$eval1.$calendarIntervalMethod($eval2)")
--- End diff --

How about moving these interval arithmetics to a new trait like 
`IntervalArithmetics`?

BTW, we currently support `add` and `subtract` for intervals, but it seems 
`multiply` and `divide` are also supported in the standard? 
http://download.mimer.com/pub/developer/docs/html_110/Mimer_SQL_Engine_DocSet/Syntax_Rules5.html#wp1113535
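
To make the suggestion concrete, here is a toy, self-contained sketch of the codegen pattern from the diff above: each operator contributes only the name of the CalendarInterval method to call ("add" or "subtract"), and a shared helper builds the `eval1.method(eval2)` Java fragment. The trait, object, and helper names here are placeholders, not Spark's actual classes:

```scala
object IntervalCodegenSketch {
  // Stand-in for BinaryArithmetic.defineCodeGen: combine two evaluated operands into a Java expression.
  def defineCodeGen(eval1: String, eval2: String)(f: (String, String) => String): String =
    f(eval1, eval2)

  trait IntervalOp { def calendarIntervalMethod: String }
  object AddOp extends IntervalOp { val calendarIntervalMethod = "add" }
  object SubtractOp extends IntervalOp { val calendarIntervalMethod = "subtract" }

  def genCode(op: IntervalOp, eval1: String, eval2: String): String =
    defineCodeGen(eval1, eval2)((e1, e2) => s"$e1.${op.calendarIntervalMethod}($e2)")

  def main(args: Array[String]): Unit = {
    println(genCode(AddOp, "value1", "value2"))       // value1.add(value2)
    println(genCode(SubtractOp, "value1", "value2"))  // value1.subtract(value2)
  }
}
```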





[GitHub] spark issue #20923: [SPARK-23807][BUILD][WIP] Add Hadoop 3 profile with rele...

2018-04-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20923
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2079/
Test PASSed.





[GitHub] spark pull request #20938: [SPARK-23821][SQL] Collection function: flatten

2018-04-08 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20938#discussion_r179970273
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
 ---
@@ -287,3 +289,165 @@ case class ArrayContains(left: Expression, right: 
Expression)
 
   override def prettyName: String = "array_contains"
 }
+
+/**
+ * Transforms an array of arrays into a single array.
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(arrayOfArrays) - Transforms an array of arrays into a 
single array.",
+  examples = """
+Examples:
+  > SELECT _FUNC_(array(array(1, 2), array(3, 4));
+   [1,2,3,4]
+  """)
--- End diff --

```
  """,
  since = "2.4.0")
```





[GitHub] spark pull request #20938: [SPARK-23821][SQL] Collection function: flatten

2018-04-08 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20938#discussion_r179970154
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala ---
@@ -413,6 +413,86 @@ class DataFrameFunctionsSuite extends QueryTest with 
SharedSQLContext {
 )
   }
 
+  test("flatten function") {
+val oneRowDF = Seq((1, "a", Seq(1, 2, 3))).toDF("i", "s", "arr")
+
+// Test cases with a primitive type
+val intDF = Seq(
+  (Seq(Seq(1, 2, 3), Seq(4, 5), Seq(6))),
+  (Seq(Seq(1, 2))),
+  (Seq(Seq(1), Seq.empty)),
+  (Seq(Seq.empty, Seq(1))),
+  (Seq(Seq.empty, Seq.empty)),
+  (Seq(Seq(1), null)),
+  (Seq(null, Seq(1))),
+  (Seq(null, null))
+).toDF("i")
+
+val intDFResult = Seq(
+  Row(Seq(1, 2, 3, 4, 5, 6)),
+  Row(Seq(1, 2)),
+  Row(Seq(1)),
+  Row(Seq(1)),
+  Row(Seq.empty),
+  Row(null),
+  Row(null),
+  Row(null)
+)
+
+checkAnswer(intDF.select(flatten($"i")), intDFResult)
+checkAnswer(intDF.selectExpr("flatten(i)"), intDFResult)
+checkAnswer(
+  oneRowDF.selectExpr("flatten(array(arr, array(null, 5), array(6, 
null)))"),
+  Seq(Row(Seq(1, 2, 3, null, 5, 6, null)))
+)
--- End diff --

Let's move it up.





[GitHub] spark pull request #20938: [SPARK-23821][SQL] Collection function: flatten

2018-04-08 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20938#discussion_r179970092
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
@@ -3300,6 +3300,14 @@ object functions {
*/
   def sort_array(e: Column, asc: Boolean): Column = withExpr { 
SortArray(e.expr, lit(asc).expr) }
 
+  /**
+   * Creates a single array from an array of arrays. If a structure of 
nested arrays is deeper than
+   * two levels, only one level of nesting is removed.
+   * @group collection_funcs
+   * @since 2.4.0
+   */
+  def flatten(e: Column): Column = withExpr{ Flatten(e.expr) }
--- End diff --

`r{` -> `r {`





[GitHub] spark issue #20923: [SPARK-23807][BUILD][WIP] Add Hadoop 3 profile with rele...

2018-04-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20923
  
Merged build finished. Test PASSed.





[GitHub] spark pull request #20938: [SPARK-23821][SQL] Collection function: flatten

2018-04-08 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20938#discussion_r179970006
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
 ---
@@ -287,3 +289,165 @@ case class ArrayContains(left: Expression, right: 
Expression)
 
   override def prettyName: String = "array_contains"
 }
+
+/**
+ * Transforms an array of arrays into a single array.
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(arrayOfArrays) - Transforms an array of arrays into a 
single array.",
+  examples = """
+Examples:
+  > SELECT _FUNC_(array(array(1, 2), array(3, 4));
+   [1,2,3,4]
+  """)
+case class Flatten(child: Expression) extends UnaryExpression {
+
+  override def nullable: Boolean = child.nullable || dataType.containsNull
+
+  override def dataType: ArrayType = {
+child
+  .dataType.asInstanceOf[ArrayType]
+  .elementType.asInstanceOf[ArrayType]
+  }
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+if (
+  ArrayType.acceptsType(child.dataType) &&
+  
ArrayType.acceptsType(child.dataType.asInstanceOf[ArrayType].elementType)
+) {
+  TypeCheckResult.TypeCheckSuccess
+} else {
+  TypeCheckResult.TypeCheckFailure(
+s"The argument should be an array of arrays, " +
+s"but '${child.sql}' is of ${child.dataType.simpleString} type."
+  )
+}
+  }
+
+  override def nullSafeEval(array: Any): Any = {
+val elements = array.asInstanceOf[ArrayData].toObjectArray(dataType)
+
+if (elements.contains(null)) {
+  null
+} else {
+  val flattened = elements.flatMap(
+_.asInstanceOf[ArrayData].toObjectArray(dataType.elementType)
+  )
+  new GenericArrayData(flattened)
+}
+  }
+
+  override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
+nullSafeCodeGen(ctx, ev, c => {
+  val code =
+if (CodeGenerator.isPrimitiveType(dataType.elementType)) {
--- End diff --

very tiny nit: shall we move this line up?

```
val code = if (Code...
```





[GitHub] spark pull request #20938: [SPARK-23821][SQL] Collection function: flatten

2018-04-08 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20938#discussion_r179969902
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
 ---
@@ -287,3 +289,165 @@ case class ArrayContains(left: Expression, right: 
Expression)
 
   override def prettyName: String = "array_contains"
 }
+
+/**
+ * Transforms an array of arrays into a single array.
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(arrayOfArrays) - Transforms an array of arrays into a 
single array.",
+  examples = """
+Examples:
+  > SELECT _FUNC_(array(array(1, 2), array(3, 4));
+   [1,2,3,4]
+  """)
+case class Flatten(child: Expression) extends UnaryExpression {
+
+  override def nullable: Boolean = child.nullable || dataType.containsNull
+
+  override def dataType: ArrayType = {
+child
+  .dataType.asInstanceOf[ArrayType]
+  .elementType.asInstanceOf[ArrayType]
+  }
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+if (
+  ArrayType.acceptsType(child.dataType) &&
+  
ArrayType.acceptsType(child.dataType.asInstanceOf[ArrayType].elementType)
+) {
--- End diff --

How about this?

```scala
child.dataType match {
  case ArrayType(_: ArrayType, _) =>
    TypeCheckResult.TypeCheckSuccess
  case _ =>
    TypeCheckResult.TypeCheckFailure(
      "The argument should be an array of arrays, " +
      s"but '${child.sql}' is of ${child.dataType.simpleString} type.")
}
```





[GitHub] spark pull request #20938: [SPARK-23821][SQL] Collection function: flatten

2018-04-08 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20938#discussion_r179970510
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
 ---
@@ -287,3 +289,165 @@ case class ArrayContains(left: Expression, right: 
Expression)
 
   override def prettyName: String = "array_contains"
 }
+
+/**
+ * Transforms an array of arrays into a single array.
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(arrayOfArrays) - Transforms an array of arrays into a 
single array.",
+  examples = """
+Examples:
+  > SELECT _FUNC_(array(array(1, 2), array(3, 4));
+   [1,2,3,4]
+  """)
+case class Flatten(child: Expression) extends UnaryExpression {
+
+  override def nullable: Boolean = child.nullable || dataType.containsNull
+
+  override def dataType: ArrayType = {
+child
+  .dataType.asInstanceOf[ArrayType]
+  .elementType.asInstanceOf[ArrayType]
+  }
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+if (
+  ArrayType.acceptsType(child.dataType) &&
+  
ArrayType.acceptsType(child.dataType.asInstanceOf[ArrayType].elementType)
+) {
+  TypeCheckResult.TypeCheckSuccess
+} else {
+  TypeCheckResult.TypeCheckFailure(
+s"The argument should be an array of arrays, " +
+s"but '${child.sql}' is of ${child.dataType.simpleString} type."
+  )
+}
+  }
+
+  override def nullSafeEval(array: Any): Any = {
+val elements = array.asInstanceOf[ArrayData].toObjectArray(dataType)
+
+if (elements.contains(null)) {
+  null
--- End diff --

Does this mean that if the input array has `null` among its elements, we return `null` 
and ignore the other elements when we are not in codegen?
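
For reference, the test expectations quoted earlier in this thread (DataFrameFunctionsSuite) suggest the answer is yes: any `null` element makes the whole result `null`. A spark-shell style sketch of that behaviour, assuming this PR's `flatten` function is available:

```scala
import spark.implicits._
import org.apache.spark.sql.functions.flatten

val df = Seq(
  Seq(Seq(1, 2, 3), Seq(4, 5)),  // flatten -> [1, 2, 3, 4, 5]
  Seq(Seq(1), null)              // flatten -> null: the non-null element is ignored
).toDF("i")

df.select(flatten($"i")).show()
```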





[GitHub] spark pull request #20938: [SPARK-23821][SQL] Collection function: flatten

2018-04-08 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20938#discussion_r179970035
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
 ---
@@ -287,3 +289,165 @@ case class ArrayContains(left: Expression, right: 
Expression)
 
   override def prettyName: String = "array_contains"
 }
+
+/**
+ * Transforms an array of arrays into a single array.
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(arrayOfArrays) - Transforms an array of arrays into a 
single array.",
+  examples = """
+Examples:
+  > SELECT _FUNC_(array(array(1, 2), array(3, 4));
+   [1,2,3,4]
+  """)
+case class Flatten(child: Expression) extends UnaryExpression {
+
+  override def nullable: Boolean = child.nullable || dataType.containsNull
+
+  override def dataType: ArrayType = {
+child
+  .dataType.asInstanceOf[ArrayType]
+  .elementType.asInstanceOf[ArrayType]
+  }
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+if (
+  ArrayType.acceptsType(child.dataType) &&
+  
ArrayType.acceptsType(child.dataType.asInstanceOf[ArrayType].elementType)
+) {
+  TypeCheckResult.TypeCheckSuccess
+} else {
+  TypeCheckResult.TypeCheckFailure(
+s"The argument should be an array of arrays, " +
+s"but '${child.sql}' is of ${child.dataType.simpleString} type."
+  )
+}
+  }
+
+  override def nullSafeEval(array: Any): Any = {
+val elements = array.asInstanceOf[ArrayData].toObjectArray(dataType)
+
+if (elements.contains(null)) {
+  null
+} else {
+  val flattened = elements.flatMap(
+_.asInstanceOf[ArrayData].toObjectArray(dataType.elementType)
+  )
+  new GenericArrayData(flattened)
+}
+  }
+
+  override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
+nullSafeCodeGen(ctx, ev, c => {
+  val code =
+if (CodeGenerator.isPrimitiveType(dataType.elementType)) {
+  genCodeForConcatOfPrimitiveElements(ctx, c, ev.value)
+} else {
+  genCodeForConcatOfComplexElements(ctx, c, ev.value)
+}
+  nullElementsProtection(ev, c, code)
+})
+  }
+
+  private def nullElementsProtection(
+  ev: ExprCode,
+  childVariableName: String,
+  coreLogic: String): String = {
+s"""
+   |for(int z=0; z < $childVariableName.numElements(); z++) {
+   |  ${ev.isNull} |= $childVariableName.isNullAt(z);
+   |}
+   |if(!${ev.isNull}) {
+   |  $coreLogic
+   |}
+   """.stripMargin
--- End diff --

indentation





[GitHub] spark issue #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pyspark

2018-04-08 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/13599
  
Hey @zjffdu, would you have some time to update this PR? 





[GitHub] spark issue #20923: [SPARK-23807][BUILD][WIP] Add Hadoop 3 profile with rele...

2018-04-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20923
  
**[Test build #89038 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89038/testReport)**
 for PR 20923 at commit 
[`7c93d98`](https://github.com/apache/spark/commit/7c93d98aae8d74e0f0606cb03e68b0ac94bde177).





[GitHub] spark issue #20923: [SPARK-23807][BUILD][WIP] Add Hadoop 3 profile with rele...

2018-04-08 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/20923
  
I think you should also update "test-dependencies.sh" to make the new deps 
file work.





[GitHub] spark issue #20923: [SPARK-23807][BUILD][WIP] Add Hadoop 3 profile with rele...

2018-04-08 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/20923
  
Jenkins, retest this please.





[GitHub] spark issue #20923: [SPARK-23807][BUILD][WIP] Add Hadoop 3 profile with rele...

2018-04-08 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/20923
  
Sorry @steveloughran for the late response. Most of the deps in the file are 
similar to "spark-deps-hadoop-2.7", so copying/renaming it and running 
"test-dependencies.sh" will show you the diffs; updating the deps file based on 
those diffs will give you a new hadoop3 deps file.





[GitHub] spark pull request #21005: [SPARK-23898][SQL] Simplify add & subtract code g...

2018-04-08 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/21005#discussion_r179967381
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala
 ---
@@ -117,16 +117,22 @@ abstract class BinaryArithmetic extends 
BinaryOperator with NullIntolerant {
 
   override def dataType: DataType = left.dataType
 
-  override lazy val resolved = childrenResolved && 
checkInputDataTypes().isSuccess
+  override lazy val resolved: Boolean = childrenResolved && 
checkInputDataTypes().isSuccess
 
   /** Name of the function for this expression on a [[Decimal]] type. */
   def decimalMethod: String =
 sys.error("BinaryArithmetics must override either decimalMethod or 
genCode")
 
+  /** Name of the function for this expression on a [[CalendarInterval]] 
type. */
+  def calendarIntervalMethod: String =
+sys.error("BinaryArithmetics must override either 
calendarIntervalMethod or genCode")
+
   override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = 
dataType match {
-case dt: DecimalType =>
+case _: DecimalType =>
   defineCodeGen(ctx, ev, (eval1, eval2) => 
s"$eval1.$decimalMethod($eval2)")
 // byte and short are casted into int when add, minus, times or divide
--- End diff --

move this comment to 136?





[GitHub] spark pull request #20937: [SPARK-23094][SPARK-23723][SPARK-23724][SQL] Supp...

2018-04-08 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20937#discussion_r179966200
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
 ---
@@ -361,6 +361,15 @@ class JacksonParser(
 // For such records, all fields other than the field configured by
 // `columnNameOfCorruptRecord` are set to `null`.
 throw BadRecordException(() => recordLiteral(record), () => None, 
e)
+  case e: CharConversionException if options.encoding.isEmpty =>
+val msg =
+  """Failed to parse a character. Encoding was detected 
automatically.
--- End diff --

My point is that automatic detection applies only when multiLine is enabled, while 
the message reads as if it were always the case. I don't think we should expose an 
incomplete (or accidental) functionality in any case.





[GitHub] spark issue #21005: [SPARK-23898][SQL] Simplify add & subtract code generati...

2018-04-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21005
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89036/
Test PASSed.





[GitHub] spark issue #21005: [SPARK-23898][SQL] Simplify add & subtract code generati...

2018-04-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21005
  
Merged build finished. Test PASSed.





[GitHub] spark issue #21005: [SPARK-23898][SQL] Simplify add & subtract code generati...

2018-04-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21005
  
**[Test build #89036 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89036/testReport)**
 for PR 21005 at commit 
[`15c13e3`](https://github.com/apache/spark/commit/15c13e356606b936f9755a3e8b1d203f06ffd01a).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #20937: [SPARK-23094][SPARK-23723][SPARK-23724][SQL] Support cus...

2018-04-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20937
  
Merged build finished. Test FAILed.





[GitHub] spark issue #20937: [SPARK-23094][SPARK-23723][SPARK-23724][SQL] Support cus...

2018-04-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20937
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89037/
Test FAILed.





[GitHub] spark issue #20937: [SPARK-23094][SPARK-23723][SPARK-23724][SQL] Support cus...

2018-04-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20937
  
**[Test build #89037 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89037/testReport)**
 for PR 20937 at commit 
[`b817184`](https://github.com/apache/spark/commit/b817184d35d0e2589682f1dcd88b9f29b2063f5b).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

2018-04-08 Thread jsigee87
Github user jsigee87 commented on the issue:

https://github.com/apache/spark/pull/13440
  
Is this still being considered?





[GitHub] spark issue #21004: [SPARK-23896][SQL]Improve PartitioningAwareFileIndex

2018-04-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21004
  
Merged build finished. Test FAILed.





[GitHub] spark issue #21004: [SPARK-23896][SQL]Improve PartitioningAwareFileIndex

2018-04-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21004
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89034/
Test FAILed.





[GitHub] spark issue #21004: [SPARK-23896][SQL]Improve PartitioningAwareFileIndex

2018-04-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21004
  
**[Test build #89034 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89034/testReport)**
 for PR 21004 at commit 
[`35aff24`](https://github.com/apache/spark/commit/35aff24743ff13ccd370a8e3747a3044e8a671c9).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #20937: [SPARK-23094][SPARK-23723][SPARK-23724][SQL] Support cus...

2018-04-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20937
  
**[Test build #89037 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89037/testReport)**
 for PR 20937 at commit 
[`b817184`](https://github.com/apache/spark/commit/b817184d35d0e2589682f1dcd88b9f29b2063f5b).





[GitHub] spark pull request #20937: [SPARK-23094][SPARK-23723][SPARK-23724][SQL] Supp...

2018-04-08 Thread MaxGekk
Github user MaxGekk commented on a diff in the pull request:

https://github.com/apache/spark/pull/20937#discussion_r179958997
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala
 ---
@@ -86,14 +85,34 @@ private[sql] class JSONOptions(
 
   val multiLine = 
parameters.get("multiLine").map(_.toBoolean).getOrElse(false)
 
-  val lineSeparator: Option[String] = parameters.get("lineSep").map { sep 
=>
-require(sep.nonEmpty, "'lineSep' cannot be an empty string.")
-sep
+  /**
+   * A string between two consecutive JSON records.
+   */
+  val lineSeparator: Option[String] = parameters.get("lineSep")
+
+  /**
+   * Standard encoding (charset) name. For example UTF-8, UTF-16LE and 
UTF-32BE.
+   * If the encoding is not specified (None), it will be detected 
automatically.
+   */
+  val encoding: Option[String] = parameters.get("encoding")
+.orElse(parameters.get("charset")).map { enc =>
+  val blacklist = List("UTF16", "UTF32")
+  val isBlacklisted = 
blacklist.contains(enc.toUpperCase.replaceAll("-|_", ""))
+  require(multiLine || !isBlacklisted,
+s"""The ${enc} encoding must not be included in the blacklist:
+   | ${blacklist.mkString(", ")}""".stripMargin)
+
+  val forcingLineSep = !(multiLine == false && enc != "UTF-8" && 
lineSeparator.isEmpty)
+  require(forcingLineSep,
+s"""The lineSep option must be specified for the $enc encoding.
+   |Example: .option("lineSep", "|^|")
--- End diff --

I believe it is better to give users the possibility to just copy-paste 
the text and replace the option's value in the example, instead of searching the docs 
for how to do that.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20937: [SPARK-23094][SPARK-23723][SPARK-23724][SQL] Support cus...

2018-04-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20937
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20937: [SPARK-23094][SPARK-23723][SPARK-23724][SQL] Support cus...

2018-04-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20937
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89035/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20937: [SPARK-23094][SPARK-23723][SPARK-23724][SQL] Support cus...

2018-04-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20937
  
**[Test build #89035 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89035/testReport)**
 for PR 20937 at commit 
[`76dbbed`](https://github.com/apache/spark/commit/76dbbed8522c1b18323c0edc727f8004db013250).
 * This patch **fails to build**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20937: [SPARK-23094][SPARK-23723][SPARK-23724][SQL] Supp...

2018-04-08 Thread MaxGekk
Github user MaxGekk commented on a diff in the pull request:

https://github.com/apache/spark/pull/20937#discussion_r179958858
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala
 ---
@@ -86,14 +85,34 @@ private[sql] class JSONOptions(
 
   val multiLine = 
parameters.get("multiLine").map(_.toBoolean).getOrElse(false)
 
-  val lineSeparator: Option[String] = parameters.get("lineSep").map { sep 
=>
-require(sep.nonEmpty, "'lineSep' cannot be an empty string.")
-sep
+  /**
+   * A string between two consecutive JSON records.
+   */
+  val lineSeparator: Option[String] = parameters.get("lineSep")
+
+  /**
+   * Standard encoding (charset) name. For example UTF-8, UTF-16LE and 
UTF-32BE.
+   * If the encoding is not specified (None), it will be detected 
automatically.
+   */
+  val encoding: Option[String] = parameters.get("encoding")
+.orElse(parameters.get("charset")).map { enc =>
+  val blacklist = List("UTF16", "UTF32")
+  val isBlacklisted = 
blacklist.contains(enc.toUpperCase.replaceAll("-|_", ""))
+  require(multiLine || !isBlacklisted,
+s"""The ${enc} encoding must not be included in the blacklist:
+   | ${blacklist.mkString(", ")}""".stripMargin)
+
+  val forcingLineSep = !(multiLine == false && enc != "UTF-8" && 
lineSeparator.isEmpty)
+  require(forcingLineSep,
+s"""The lineSep option must be specified for the $enc encoding.
+   |Example: .option("lineSep", "|^|")
+   |Note: lineSep can be detected automatically for UTF-8 
only.""".stripMargin)
--- End diff --

Defaulting to UTF-8 doesn't explain why lineSep is required for other encodings.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21005: [SPARK-23898][SQL] Simplify add & subtract code generati...

2018-04-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21005
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21005: [SPARK-23898][SQL] Simplify add & subtract code generati...

2018-04-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21005
  
**[Test build #89036 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89036/testReport)**
 for PR 21005 at commit 
[`15c13e3`](https://github.com/apache/spark/commit/15c13e356606b936f9755a3e8b1d203f06ffd01a).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21005: [SPARK-23898][SQL] Simplify add & subtract code generati...

2018-04-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21005
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2078/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20937: [SPARK-23094][SPARK-23723][SPARK-23724][SQL] Supp...

2018-04-08 Thread MaxGekk
Github user MaxGekk commented on a diff in the pull request:

https://github.com/apache/spark/pull/20937#discussion_r179958785
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
 ---
@@ -361,6 +361,15 @@ class JacksonParser(
 // For such records, all fields other than the field configured by
 // `columnNameOfCorruptRecord` are set to `null`.
 throw BadRecordException(() => recordLiteral(record), () => None, 
e)
+  case e: CharConversionException if options.encoding.isEmpty =>
+val msg =
+  """Failed to parse a character. Encoding was detected 
automatically.
--- End diff --

This message was added explicitly to tell our customers how to resolve issues like https://issues.apache.org/jira/browse/SPARK-23094. In our experience, describing that in the docs is not enough: customers will just create support tickets, and we will have to spend time figuring out the root causes. The tip helps customers solve the problem on their side; see the usage sketch below. /cc @brkyvz
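
For context, a minimal usage sketch of what the tip asks the user to do, i.e. set the encoding explicitly instead of relying on automatic detection. The path, charset, and app name are made-up examples, not taken from the ticket; the option names follow this PR.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch, assuming the "encoding" and "lineSep" options from this PR.
val spark = SparkSession.builder().appName("encoding-example").getOrCreate()
val df = spark.read
  .option("encoding", "UTF-16LE")  // charset of the input JSON files
  .option("lineSep", "\n")         // required here for non-UTF-8 charsets
  .json("/path/to/utf16le-json")
df.show()
```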


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21005: [SPARK-23898][SQL] Simplify add & subtract code g...

2018-04-08 Thread hvanhovell
GitHub user hvanhovell opened a pull request:

https://github.com/apache/spark/pull/21005

[SPARK-23898][SQL] Simplify add & subtract code generation

## What changes were proposed in this pull request?
Code generation for the `Add` and `Subtract` expressions was not done using 
the `BinaryArithmetic.doCodeGen` method because these expressions also support 
`CalendarInterval`. This leads to a bit of duplication.

This PR gets rid of that duplication by adding `calendarIntervalMethod` to 
`BinaryArithmetic` and doing the code generation for `CalendarInterval` in 
`BinaryArithmetic` instead.
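
To illustrate the shape of the refactoring, here is a self-contained toy model (not Spark's actual classes and not the code in this patch): the interval branch lives in the shared base class, and each operator only names the interval method it delegates to.

```scala
// Toy sketch only: models the idea of centralizing CalendarInterval handling.
object CodegenSketch {
  sealed trait DataType
  case object NumericType extends DataType
  case object CalendarIntervalType extends DataType

  abstract class BinaryArithmetic {
    def symbol: String                  // e.g. "+" or "-"
    def calendarIntervalMethod: String  // e.g. "add" or "subtract"

    // One shared code-generation path for both numeric and interval inputs.
    def genCode(left: String, right: String, dataType: DataType): String = dataType match {
      case CalendarIntervalType => s"$left.$calendarIntervalMethod($right)"
      case NumericType          => s"($left) $symbol ($right)"
    }
  }

  object Add extends BinaryArithmetic {
    val symbol = "+"; val calendarIntervalMethod = "add"
  }
  object Subtract extends BinaryArithmetic {
    val symbol = "-"; val calendarIntervalMethod = "subtract"
  }

  def main(args: Array[String]): Unit = {
    println(Add.genCode("a", "b", CalendarIntervalType)) // a.add(b)
    println(Subtract.genCode("a", "b", NumericType))     // (a) - (b)
  }
}
```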

## How was this patch tested?
Existing tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/hvanhovell/spark SPARK-23898

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21005.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21005


commit 15c13e356606b936f9755a3e8b1d203f06ffd01a
Author: Herman van Hovell 
Date:   2018-04-08T19:34:58Z

Simplify add & subtract code generation by centralizing code generation for 
calendar intervals.




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20937: [SPARK-23094][SPARK-23723][SPARK-23724][SQL] Support cus...

2018-04-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20937
  
**[Test build #89035 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89035/testReport)**
 for PR 20937 at commit 
[`76dbbed`](https://github.com/apache/spark/commit/76dbbed8522c1b18323c0edc727f8004db013250).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20937: [SPARK-23094][SPARK-23723][SPARK-23724][SQL] Support cus...

2018-04-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20937
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20937: [SPARK-23094][SPARK-23723][SPARK-23724][SQL] Support cus...

2018-04-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20937
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89033/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20937: [SPARK-23094][SPARK-23723][SPARK-23724][SQL] Support cus...

2018-04-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20937
  
**[Test build #89033 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89033/testReport)**
 for PR 20937 at commit 
[`fcd0a21`](https://github.com/apache/spark/commit/fcd0a214448c74ebb60a0fb92a92a7c3b21166c4).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20987: [SPARK-23816][CORE] Killed tasks should ignore FetchFail...

2018-04-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20987
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89032/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20987: [SPARK-23816][CORE] Killed tasks should ignore FetchFail...

2018-04-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20987
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20987: [SPARK-23816][CORE] Killed tasks should ignore FetchFail...

2018-04-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20987
  
**[Test build #89032 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89032/testReport)**
 for PR 20987 at commit 
[`3860aed`](https://github.com/apache/spark/commit/3860aeda33be3cf1cad5999bfd6fbb274ac0bc24).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21004: [SPARK-23896][SQL]Improve PartitioningAwareFileIndex

2018-04-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21004
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21004: [SPARK-23896][SQL]Improve PartitioningAwareFileIndex

2018-04-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21004
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2077/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20937: [SPARK-23094][SPARK-23723][SPARK-23724][SQL] Supp...

2018-04-08 Thread MaxGekk
Github user MaxGekk commented on a diff in the pull request:

https://github.com/apache/spark/pull/20937#discussion_r179957573
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala ---
@@ -366,6 +366,9 @@ class DataFrameReader private[sql](sparkSession: 
SparkSession) extends Logging {
* `java.text.SimpleDateFormat`. This applies to timestamp type.
* `multiLine` (default `false`): parse one record, which may span 
multiple lines,
* per file
+   * `encoding` (by default it is not set): allows to forcibly set one 
of standard basic
+   * or extended charsets for input jsons. For example UTF-8, UTF-16BE, 
UTF-32. If the encoding
+   * is not specified (by default), it will be detected automatically.
--- End diff --

If `encoding` is not set, it will be detected by Jackson independently of `multiLine`. In particular, this PR addresses cases where the encoding is detected incorrectly; see the ticket https://issues.apache.org/jira/browse/SPARK-23094. Without this PR, users cannot read such files at all.
cannot read such files by the way.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21004: [SPARK-23896][SQL]Improve PartitioningAwareFileIndex

2018-04-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21004
  
**[Test build #89034 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89034/testReport)**
 for PR 21004 at commit 
[`35aff24`](https://github.com/apache/spark/commit/35aff24743ff13ccd370a8e3747a3044e8a671c9).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21004: [SPARK-23896][SQL]Improve PartitioningAwareFileIn...

2018-04-08 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/21004#discussion_r179957491
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala
 ---
@@ -126,35 +126,35 @@ abstract class PartitioningAwareFileIndex(
 val caseInsensitiveOptions = CaseInsensitiveMap(parameters)
 val timeZoneId = 
caseInsensitiveOptions.get(DateTimeUtils.TIMEZONE_OPTION)
   .getOrElse(sparkSession.sessionState.conf.sessionLocalTimeZone)
-
-userPartitionSchema match {
+val inferredPartitionSpec = PartitioningUtils.parsePartitions(
+  leafDirs,
+  typeInference = 
sparkSession.sessionState.conf.partitionColumnTypeInferenceEnabled,
+  basePaths = basePaths,
+  timeZoneId = timeZoneId)
+userSpecifiedSchema match {
   case Some(userProvidedSchema) if userProvidedSchema.nonEmpty =>
-val spec = PartitioningUtils.parsePartitions(
-  leafDirs,
-  typeInference = false,
-  basePaths = basePaths,
-  timeZoneId = timeZoneId)
+val userPartitionSchema =
+  
combineInferredAndUserSpecifiedPartitionSchema(inferredPartitionSpec)
 
-// Without auto inference, all of value in the `row` should be 
null or in StringType,
 // we need to cast into the data type that user specified.
 def castPartitionValuesToUserSchema(row: InternalRow) = {
   InternalRow((0 until row.numFields).map { i =>
+val expr = 
inferredPartitionSpec.partitionColumns.fields(i).dataType match {
+  case StringType => Literal.create(row.getUTF8String(i), 
StringType)
+  case otherType => Literal.create(row.get(i, otherType))
--- End diff --

I am not entirely sure that all the other cases are covered here.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21004: [SPARK-23896][SQL]Improve PartitioningAwareFileIn...

2018-04-08 Thread gengliangwang
GitHub user gengliangwang opened a pull request:

https://github.com/apache/spark/pull/21004

[SPARK-23896][SQL]Improve PartitioningAwareFileIndex

## What changes were proposed in this pull request?

Currently `PartitioningAwareFileIndex` accepts an optional parameter 
`userPartitionSchema`. If provided, it will combine the inferred partition 
schema with the parameter.

However,
1. to get `userPartitionSchema`, we need to combine the inferred partition schema with `userSpecifiedSchema`
2. to get the inferred partition schema, we have to create a temporary file index.

Only after that, a final version of `PartitioningAwareFileIndex` can be 
created.

This can be improved by passing `userSpecifiedSchema` to 
`PartitioningAwareFileIndex`.

With the improvement, we can reduce redundant code and avoid parsing the file partitions twice.
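
A rough sketch of the described change follows. The class and method signatures are illustrative, not the actual patch: the index takes `userSpecifiedSchema` directly and merges it with the inferred partition schema itself.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType

// Illustrative sketch only (not the actual patch): the index receives the raw
// userSpecifiedSchema and merges it with the inferred partition schema itself,
// so callers no longer need a temporary index to compute userPartitionSchema.
abstract class PartitioningAwareFileIndexSketch(
    sparkSession: SparkSession,
    parameters: Map[String, String],
    userSpecifiedSchema: Option[StructType]) {

  protected def combineInferredAndUserSpecifiedPartitionSchema(
      inferred: StructType): StructType = {
    StructType(inferred.fields.map { inferredField =>
      // Prefer the user-declared field (and its type) when the user mentioned
      // this partition column; otherwise keep the inferred field.
      userSpecifiedSchema
        .flatMap(_.find(_.name == inferredField.name))
        .getOrElse(inferredField)
    })
  }
}
```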
## How was this patch tested?
Unit test


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gengliangwang/spark PartitioningAwareFileIndex

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21004.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21004


commit 35aff24743ff13ccd370a8e3747a3044e8a671c9
Author: Gengliang Wang 
Date:   2018-04-08T18:19:48Z

improve PartitioningAwareFileIndex




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21002: [SPARK-23893][Core][SQL] Avoid possible integer o...

2018-04-08 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/21002


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21002: [SPARK-23893][Core][SQL] Avoid possible integer overflow...

2018-04-08 Thread hvanhovell
Github user hvanhovell commented on the issue:

https://github.com/apache/spark/pull/21002
  
LGTM


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20938: [SPARK-23821][SQL] Collection function: flatten

2018-04-08 Thread mn-mikke
Github user mn-mikke commented on the issue:

https://github.com/apache/spark/pull/20938
  
Any other comments?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21000: [SPARK-23892][Test] Improve converge and fix lint...

2018-04-08 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/21000


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21000: [SPARK-23892][Test] Improve converge and fix lint error ...

2018-04-08 Thread hvanhovell
Github user hvanhovell commented on the issue:

https://github.com/apache/spark/pull/21000
  
LGTM - merging to master. Thanks!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20937: [SPARK-23094][SPARK-23723][SPARK-23724][SQL] Supp...

2018-04-08 Thread MaxGekk
Github user MaxGekk commented on a diff in the pull request:

https://github.com/apache/spark/pull/20937#discussion_r179955288
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonDataSource.scala
 ---
@@ -92,26 +93,30 @@ object TextInputJsonDataSource extends JsonDataSource {
   sparkSession: SparkSession,
   inputPaths: Seq[FileStatus],
   parsedOptions: JSONOptions): StructType = {
-val json: Dataset[String] = createBaseDataset(
-  sparkSession, inputPaths, parsedOptions.lineSeparator)
+val json: Dataset[String] = createBaseDataset(sparkSession, 
inputPaths, parsedOptions)
+
 inferFromDataset(json, parsedOptions)
   }
 
   def inferFromDataset(json: Dataset[String], parsedOptions: JSONOptions): 
StructType = {
 val sampled: Dataset[String] = JsonUtils.sample(json, parsedOptions)
-val rdd: RDD[UTF8String] = 
sampled.queryExecution.toRdd.map(_.getUTF8String(0))
-JsonInferSchema.infer(rdd, parsedOptions, 
CreateJacksonParser.utf8String)
+val rdd: RDD[InternalRow] = sampled.queryExecution.toRdd
+val rowParser = parsedOptions.encoding.map { enc =>
+  CreateJacksonParser.internalRow(enc, _: JsonFactory, _: InternalRow, 
0)
--- End diff --

I moved the 0 here to address @cloud-fan's review comment about why the field at position 0 is taken from the InternalRow. It should be clear at this call site that the internal rows have only one field, at position 0. Please tell me how I could omit the 0.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20937: [SPARK-23094][SPARK-23723][SPARK-23724][SQL] Supp...

2018-04-08 Thread MaxGekk
Github user MaxGekk commented on a diff in the pull request:

https://github.com/apache/spark/pull/20937#discussion_r179954099
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala
 ---
@@ -86,14 +85,34 @@ private[sql] class JSONOptions(
 
   val multiLine = 
parameters.get("multiLine").map(_.toBoolean).getOrElse(false)
 
-  val lineSeparator: Option[String] = parameters.get("lineSep").map { sep 
=>
-require(sep.nonEmpty, "'lineSep' cannot be an empty string.")
-sep
+  /**
+   * A string between two consecutive JSON records.
+   */
+  val lineSeparator: Option[String] = parameters.get("lineSep")
+
+  /**
+   * Standard encoding (charset) name. For example UTF-8, UTF-16LE and 
UTF-32BE.
+   * If the encoding is not specified (None), it will be detected 
automatically.
+   */
+  val encoding: Option[String] = parameters.get("encoding")
+.orElse(parameters.get("charset")).map { enc =>
+  val blacklist = List("UTF16", "UTF32")
--- End diff --

I hesitated between `List` and `Set`: `Set` because order is not important here, `List` because `blacklist` is a commonly used name and `List` fits a value with the word `list` in its name. I don't see any reason for `Seq`, but if you believe it is important to have `Seq` instead of `List` here, I will replace it.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20937: [SPARK-23094][SPARK-23723][SPARK-23724][SQL] Supp...

2018-04-08 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20937#discussion_r179952254
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
 ---
@@ -361,6 +361,15 @@ class JacksonParser(
 // For such records, all fields other than the field configured by
 // `columnNameOfCorruptRecord` are set to `null`.
 throw BadRecordException(() => recordLiteral(record), () => None, 
e)
+  case e: CharConversionException if options.encoding.isEmpty =>
+val msg =
+  """Failed to parse a character. Encoding was detected 
automatically.
--- End diff --

I think automatic detection only happens when multiLine is enabled. We can just describe that in the documentation and reword this message, or even remove the message entirely.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20937: [SPARK-23094][SPARK-23723][SPARK-23724][SQL] Supp...

2018-04-08 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20937#discussion_r179951972
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala
 ---
@@ -86,14 +85,34 @@ private[sql] class JSONOptions(
 
   val multiLine = 
parameters.get("multiLine").map(_.toBoolean).getOrElse(false)
 
-  val lineSeparator: Option[String] = parameters.get("lineSep").map { sep 
=>
-require(sep.nonEmpty, "'lineSep' cannot be an empty string.")
-sep
+  /**
+   * A string between two consecutive JSON records.
+   */
+  val lineSeparator: Option[String] = parameters.get("lineSep")
+
+  /**
+   * Standard encoding (charset) name. For example UTF-8, UTF-16LE and 
UTF-32BE.
+   * If the encoding is not specified (None), it will be detected 
automatically.
+   */
+  val encoding: Option[String] = parameters.get("encoding")
+.orElse(parameters.get("charset")).map { enc =>
+  val blacklist = List("UTF16", "UTF32")
--- End diff --

I believe we use `Seq` when there isn't a specific reason for `List`.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20937: [SPARK-23094][SPARK-23723][SPARK-23724][SQL] Supp...

2018-04-08 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20937#discussion_r179952617
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala
 ---
@@ -2162,4 +2162,262 @@ class JsonSuite extends QueryTest with 
SharedSQLContext with TestJsonData {
 
 assert(ds.schema == new StructType().add("f1", LongType))
   }
+
+  def testFile(fileName: String): String = {
+
Thread.currentThread().getContextClassLoader.getResource(fileName).toString
+  }
+
+  test("SPARK-23723: json in UTF-16 with BOM") {
+val fileName = "json-tests/utf16WithBOM.json"
+val schema = new StructType().add("firstName", 
StringType).add("lastName", StringType)
+val jsonDF = spark.read.schema(schema)
+  .option("multiline", "true")
+  .option("encoding", "UTF-16")
+  .json(testFile(fileName))
+
+checkAnswer(jsonDF, Seq(Row("Chris", "Baird"), Row("Doug", "Rood")))
+  }
+
+  test("SPARK-23723: multi-line json in UTF-32BE with BOM") {
+val fileName = "json-tests/utf32BEWithBOM.json"
+val schema = new StructType().add("firstName", 
StringType).add("lastName", StringType)
+val jsonDF = spark.read.schema(schema)
+  .option("multiline", "true")
+  .json(testFile(fileName))
+
+checkAnswer(jsonDF, Seq(Row("Chris", "Baird")))
+  }
+
+  test("SPARK-23723: Use user's encoding in reading of multi-line json in 
UTF-16LE") {
+val fileName = "json-tests/utf16LE.json"
+val schema = new StructType().add("firstName", 
StringType).add("lastName", StringType)
+val jsonDF = spark.read.schema(schema)
+  .option("multiline", "true")
+  .options(Map("encoding" -> "UTF-16LE"))
+  .json(testFile(fileName))
+
+checkAnswer(jsonDF, Seq(Row("Chris", "Baird")))
+  }
+
+  test("SPARK-23723: Unsupported encoding name") {
+val invalidCharset = "UTF-128"
+val exception = intercept[java.io.UnsupportedEncodingException] {
+  spark.read
+.options(Map("encoding" -> invalidCharset, "lineSep" -> "\n"))
+.json(testFile("json-tests/utf16LE.json"))
+.count()
+}
+
+assert(exception.getMessage.contains(invalidCharset))
+  }
+
+  test("SPARK-23723: checking that the encoding option is case agnostic") {
+val fileName = "json-tests/utf16LE.json"
+val schema = new StructType().add("firstName", 
StringType).add("lastName", StringType)
+val jsonDF = spark.read.schema(schema)
+  .option("multiline", "true")
+  .options(Map("encoding" -> "uTf-16lE"))
+  .json(testFile(fileName))
+
+checkAnswer(jsonDF, Seq(Row("Chris", "Baird")))
+  }
+
+
+  test("SPARK-23723: specified encoding is not matched to actual 
encoding") {
+val fileName = "json-tests/utf16LE.json"
+val schema = new StructType().add("firstName", 
StringType).add("lastName", StringType)
+val exception = intercept[SparkException] {
+  spark.read.schema(schema)
+.option("mode", "FAILFAST")
+.option("multiline", "true")
+.options(Map("encoding" -> "UTF-16BE"))
+.json(testFile(fileName))
+.count()
+}
+val errMsg = exception.getMessage
+
+assert(errMsg.contains("Malformed records are detected in record 
parsing"))
+  }
+
+  def checkEncoding(
+  expectedEncoding: String,
+  pathToJsonFiles: String,
+  expectedContent: String): Unit = {
+val jsonFiles = new File(pathToJsonFiles)
+  .listFiles()
+  .filter(_.isFile)
+  .filter(_.getName.endsWith("json"))
+val actualContent = jsonFiles.map { file =>
+  new String(Files.readAllBytes(file.toPath), expectedEncoding)
+}.mkString.trim.replaceAll(" ", "")
+
+assert(actualContent == expectedContent)
+  }
+
+  test("SPARK-23723: save json in UTF-32BE") {
+val encoding = "UTF-32BE"
+withTempPath { path =>
+  val df = spark.createDataset(Seq(("Dog", 42)))
+  df.write
+.options(Map("encoding" -> encoding, "lineSep" -> "\n"))
+.format("json").mode("overwrite")
+.save(path.getCanonicalPath)
+
+  checkEncoding(
+expectedEncoding = encoding,
+pathToJsonFiles = path.getCanonicalPath,
+expectedContent = """{"_1":"Dog","_2":42}""")
+}
+  }
+
+  test("SPARK-23723: save json in default encoding - UTF-8") {
+withTempPath { path =>
+  val df = spark.createDataset(Seq(("Dog", 42)))
+  df.write
+.format("json").mode("overwrite")
+.save(path.getCanonicalPath)
+
+  checkEncoding(

[GitHub] spark pull request #20937: [SPARK-23094][SPARK-23723][SPARK-23724][SQL] Supp...

2018-04-08 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20937#discussion_r179952150
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala
 ---
@@ -86,14 +85,34 @@ private[sql] class JSONOptions(
 
   val multiLine = 
parameters.get("multiLine").map(_.toBoolean).getOrElse(false)
 
-  val lineSeparator: Option[String] = parameters.get("lineSep").map { sep 
=>
-require(sep.nonEmpty, "'lineSep' cannot be an empty string.")
-sep
+  /**
+   * A string between two consecutive JSON records.
+   */
+  val lineSeparator: Option[String] = parameters.get("lineSep")
+
+  /**
+   * Standard encoding (charset) name. For example UTF-8, UTF-16LE and 
UTF-32BE.
+   * If the encoding is not specified (None), it will be detected 
automatically.
+   */
+  val encoding: Option[String] = parameters.get("encoding")
+.orElse(parameters.get("charset")).map { enc =>
+  val blacklist = List("UTF16", "UTF32")
+  val isBlacklisted = 
blacklist.contains(enc.toUpperCase.replaceAll("-|_", ""))
+  require(multiLine || !isBlacklisted,
+s"""The ${enc} encoding must not be included in the blacklist:
+   | ${blacklist.mkString(", ")}""".stripMargin)
--- End diff --

let's add "when multiLine is disabled".


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20937: [SPARK-23094][SPARK-23723][SPARK-23724][SQL] Supp...

2018-04-08 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20937#discussion_r179951938
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala
 ---
@@ -86,14 +85,34 @@ private[sql] class JSONOptions(
 
   val multiLine = 
parameters.get("multiLine").map(_.toBoolean).getOrElse(false)
 
-  val lineSeparator: Option[String] = parameters.get("lineSep").map { sep 
=>
-require(sep.nonEmpty, "'lineSep' cannot be an empty string.")
--- End diff --

It seems this non-empty requirement was removed. Did I miss something, or was it a mistake?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20937: [SPARK-23094][SPARK-23723][SPARK-23724][SQL] Supp...

2018-04-08 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20937#discussion_r179952240
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala
 ---
@@ -86,14 +85,34 @@ private[sql] class JSONOptions(
 
   val multiLine = 
parameters.get("multiLine").map(_.toBoolean).getOrElse(false)
 
-  val lineSeparator: Option[String] = parameters.get("lineSep").map { sep 
=>
-require(sep.nonEmpty, "'lineSep' cannot be an empty string.")
-sep
+  /**
+   * A string between two consecutive JSON records.
+   */
+  val lineSeparator: Option[String] = parameters.get("lineSep")
+
+  /**
+   * Standard encoding (charset) name. For example UTF-8, UTF-16LE and 
UTF-32BE.
+   * If the encoding is not specified (None), it will be detected 
automatically.
+   */
+  val encoding: Option[String] = parameters.get("encoding")
+.orElse(parameters.get("charset")).map { enc =>
+  val blacklist = List("UTF16", "UTF32")
+  val isBlacklisted = 
blacklist.contains(enc.toUpperCase.replaceAll("-|_", ""))
+  require(multiLine || !isBlacklisted,
+s"""The ${enc} encoding must not be included in the blacklist:
+   | ${blacklist.mkString(", ")}""".stripMargin)
+
+  val forcingLineSep = !(multiLine == false && enc != "UTF-8" && 
lineSeparator.isEmpty)
+  require(forcingLineSep,
+s"""The lineSep option must be specified for the $enc encoding.
+   |Example: .option("lineSep", "|^|")
--- End diff --

I think we are fine to remove this example. Can we just use prose, for example, `'lineSep' option must be explicitly set when 'encoding' option is specified.` (feel free not to use it as is, I was just thinking out loud)? It doesn't cover the SQL syntax either.
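
For instance, keeping the same condition but with a prose-only message (just a sketch; the variable names follow the quoted diff above, and the wording is only a suggestion):

```scala
// Same check as in the diff, expressed without an embedded code example.
require(multiLine || enc == "UTF-8" || lineSeparator.nonEmpty,
  "'lineSep' option must be explicitly set when 'encoding' option is specified.")
```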


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20937: [SPARK-23094][SPARK-23723][SPARK-23724][SQL] Supp...

2018-04-08 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20937#discussion_r179952386
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala ---
@@ -366,6 +366,9 @@ class DataFrameReader private[sql](sparkSession: 
SparkSession) extends Logging {
* `java.text.SimpleDateFormat`. This applies to timestamp type.
* `multiLine` (default `false`): parse one record, which may span 
multiple lines,
* per file
+   * `encoding` (by default it is not set): allows to forcibly set one 
of standard basic
+   * or extended charsets for input jsons. For example UTF-8, UTF-16BE, 
UTF-32. If the encoding
--- End diff --

I think `UTF-32` is blacklisted.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20937: [SPARK-23094][SPARK-23723][SPARK-23724][SQL] Supp...

2018-04-08 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20937#discussion_r179951956
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala
 ---
@@ -86,14 +85,34 @@ private[sql] class JSONOptions(
 
   val multiLine = 
parameters.get("multiLine").map(_.toBoolean).getOrElse(false)
 
-  val lineSeparator: Option[String] = parameters.get("lineSep").map { sep 
=>
-require(sep.nonEmpty, "'lineSep' cannot be an empty string.")
-sep
+  /**
+   * A string between two consecutive JSON records.
+   */
+  val lineSeparator: Option[String] = parameters.get("lineSep")
+
+  /**
+   * Standard encoding (charset) name. For example UTF-8, UTF-16LE and 
UTF-32BE.
+   * If the encoding is not specified (None), it will be detected 
automatically.
+   */
+  val encoding: Option[String] = parameters.get("encoding")
+.orElse(parameters.get("charset")).map { enc =>
+  val blacklist = List("UTF16", "UTF32")
--- End diff --

Mind if I ask you to leave a simple comment explaining why these are blocked for now? (Or do you plan to fix them soon?)


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20937: [SPARK-23094][SPARK-23723][SPARK-23724][SQL] Supp...

2018-04-08 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20937#discussion_r179952127
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala
 ---
@@ -86,14 +85,34 @@ private[sql] class JSONOptions(
 
   val multiLine = 
parameters.get("multiLine").map(_.toBoolean).getOrElse(false)
 
-  val lineSeparator: Option[String] = parameters.get("lineSep").map { sep 
=>
-require(sep.nonEmpty, "'lineSep' cannot be an empty string.")
-sep
+  /**
+   * A string between two consecutive JSON records.
+   */
+  val lineSeparator: Option[String] = parameters.get("lineSep")
+
+  /**
+   * Standard encoding (charset) name. For example UTF-8, UTF-16LE and 
UTF-32BE.
+   * If the encoding is not specified (None), it will be detected 
automatically.
+   */
+  val encoding: Option[String] = parameters.get("encoding")
+.orElse(parameters.get("charset")).map { enc =>
+  val blacklist = List("UTF16", "UTF32")
+  val isBlacklisted = 
blacklist.contains(enc.toUpperCase.replaceAll("-|_", ""))
--- End diff --

Can we maybe do

```scala
import java.nio.charset.Charset

val blacklist = Seq(Charset.forName("UTF-16"), Charset.forName("UTF-32"))
val isBlacklisted = blacklist.contains(Charset.forName(enc))
```

?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20937: [SPARK-23094][SPARK-23723][SPARK-23724][SQL] Supp...

2018-04-08 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20937#discussion_r179951859
  
--- Diff: python/pyspark/sql/readwriter.py ---
@@ -237,6 +237,8 @@ def json(self, path, schema=None, 
primitivesAsString=None, prefersDecimal=None,
 :param allowUnquotedControlChars: allows JSON Strings to contain 
unquoted control
   characters (ASCII characters 
with value less than 32,
   including tab and line feed 
characters) or not.
+:param encoding: standard encoding (charset) name, for example 
UTF-8, UTF-16LE and UTF-32BE.
+If None is set, the encoding of input JSON will be 
detected automatically.
--- End diff --

one more leading space for `If None ...`.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20937: [SPARK-23094][SPARK-23723][SPARK-23724][SQL] Supp...

2018-04-08 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20937#discussion_r179952336
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
 ---
@@ -361,6 +361,15 @@ class JacksonParser(
 // For such records, all fields other than the field configured by
 // `columnNameOfCorruptRecord` are set to `null`.
 throw BadRecordException(() => recordLiteral(record), () => None, 
e)
+  case e: CharConversionException if options.encoding.isEmpty =>
+val msg =
+  """Failed to parse a character. Encoding was detected 
automatically.
+|You might want to set it explicitly via the encoding option 
like:
+|  .option("encoding", "UTF-8")
--- End diff --

ditto for prose ...


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20937: [SPARK-23094][SPARK-23723][SPARK-23724][SQL] Supp...

2018-04-08 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20937#discussion_r179952204
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala
 ---
@@ -86,14 +85,34 @@ private[sql] class JSONOptions(
 
   val multiLine = 
parameters.get("multiLine").map(_.toBoolean).getOrElse(false)
 
-  val lineSeparator: Option[String] = parameters.get("lineSep").map { sep 
=>
-require(sep.nonEmpty, "'lineSep' cannot be an empty string.")
-sep
+  /**
+   * A string between two consecutive JSON records.
+   */
+  val lineSeparator: Option[String] = parameters.get("lineSep")
+
+  /**
+   * Standard encoding (charset) name. For example UTF-8, UTF-16LE and 
UTF-32BE.
+   * If the encoding is not specified (None), it will be detected 
automatically.
+   */
+  val encoding: Option[String] = parameters.get("encoding")
+.orElse(parameters.get("charset")).map { enc =>
+  val blacklist = List("UTF16", "UTF32")
+  val isBlacklisted = 
blacklist.contains(enc.toUpperCase.replaceAll("-|_", ""))
+  require(multiLine || !isBlacklisted,
+s"""The ${enc} encoding must not be included in the blacklist:
+   | ${blacklist.mkString(", ")}""".stripMargin)
+
+  val forcingLineSep = !(multiLine == false && enc != "UTF-8" && 
lineSeparator.isEmpty)
+  require(forcingLineSep,
+s"""The lineSep option must be specified for the $enc encoding.
+   |Example: .option("lineSep", "|^|")
+   |Note: lineSep can be detected automatically for UTF-8 
only.""".stripMargin)
--- End diff --

It's UTF-8 by default. I think we don't have to explain this.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20937: [SPARK-23094][SPARK-23723][SPARK-23724][SQL] Supp...

2018-04-08 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20937#discussion_r179952689
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonDataSource.scala
 ---
@@ -175,11 +188,15 @@ object MultiLineJsonDataSource extends JsonDataSource 
{
   .values
   }
 
-  private def createParser(jsonFactory: JsonFactory, record: 
PortableDataStream): JsonParser = {
+  private def createParser(
+  jsonFactory: JsonFactory,
+  record: PortableDataStream,
+  encoding: Option[String] = None): JsonParser = {
--- End diff --

`encoding: Option[String] = None` -> `encoding: Option[String]`. You don't have to worry about the signature for private ones. Theoretically, MiMa will detect signature changes for public APIs, which would make the Jenkins tests fail.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


