[GitHub] spark issue #21501: [SPARK-15064][ML] Locale support in StopWordsRemover

2018-06-12 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/21501
  
@dongjinleekr You are welcome and thanks for your contribution!


---




[GitHub] spark issue #21539: [SPARK-24500][SQL] Make sure streams are materialized du...

2018-06-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21539
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3958/
Test PASSed.


---




[GitHub] spark issue #21539: [SPARK-24500][SQL] Make sure streams are materialized du...

2018-06-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21539
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21527: [SPARK-24519] MapStatus has 2000 hardcoded

2018-06-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21527
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #21527: [SPARK-24519] MapStatus has 2000 hardcoded

2018-06-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21527
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91724/
Test FAILed.


---




[GitHub] spark issue #21527: [SPARK-24519] MapStatus has 2000 hardcoded

2018-06-12 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21527
  
**[Test build #91724 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91724/testReport)** for PR 21527 at commit [`4c8acfa`](https://github.com/apache/spark/commit/4c8acfa5899ccbdafeb630f38ce44b23332b80f2).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #21542: [WIP][SPARK-24529][Build] Add spotbugs into maven build ...

2018-06-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21542
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21542: [WIP][SPARK-24529][Build] Add spotbugs into maven build ...

2018-06-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21542
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91720/
Test PASSed.


---




[GitHub] spark issue #21539: [SPARK-24500][SQL] Make sure streams are materialized du...

2018-06-12 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21539
  
**[Test build #91735 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91735/testReport)** for PR 21539 at commit [`d5832f4`](https://github.com/apache/spark/commit/d5832f4b50d4c8f84feb462291d8da37e87b192f).


---




[GitHub] spark issue #21539: [SPARK-24500][SQL] Make sure streams are materialized du...

2018-06-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21539
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21542: [WIP][SPARK-24529][Build] Add spotbugs into maven build ...

2018-06-12 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21542
  
**[Test build #91720 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91720/testReport)** for PR 21542 at commit [`3a356ad`](https://github.com/apache/spark/commit/3a356ada17a5cad00cb49fea20fb473a9e8392d1).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #21539: [SPARK-24500][SQL] Make sure streams are materialized du...

2018-06-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21539
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/69/
Test PASSed.


---




[GitHub] spark issue #21539: [SPARK-24500][SQL] Make sure streams are materialized du...

2018-06-12 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/21539
  
retest this please


---




[GitHub] spark issue #21543: [SPARK-24531][TESTS] Replace 2.3.0 version with 2.3.1

2018-06-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21543
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21543: [SPARK-24531][TESTS] Replace 2.3.0 version with 2.3.1

2018-06-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21543
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91729/
Test PASSed.


---




[GitHub] spark issue #21543: [SPARK-24531][TESTS] Replace 2.3.0 version with 2.3.1

2018-06-12 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21543
  
**[Test build #91729 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91729/testReport)** for PR 21543 at commit [`08461b4`](https://github.com/apache/spark/commit/08461b45bb4dc11e74976e6335269907476d02ca).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #21082: [SPARK-22239][SQL][Python] Enable grouped aggregate pand...

2018-06-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21082
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21082: [SPARK-22239][SQL][Python] Enable grouped aggregate pand...

2018-06-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21082
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91718/
Test PASSed.


---




[GitHub] spark pull request #21546: [WIP][SPARK-23030][SQL][PYTHON] Use Arrow stream ...

2018-06-12 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/21546#discussion_r194904013
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowConverters.scala ---
@@ -183,34 +182,131 @@ private[sql] object ArrowConverters {
   }
 
   /**
-   * Convert a byte array to an ArrowRecordBatch.
+   * Load a serialized ArrowRecordBatch.
    */
-  private[arrow] def byteArrayToBatch(
+  private[arrow] def loadBatch(
       batchBytes: Array[Byte],
       allocator: BufferAllocator): ArrowRecordBatch = {
-    val in = new ByteArrayReadableSeekableByteChannel(batchBytes)
-    val reader = new ArrowFileReader(in, allocator)
-
-    // Read a batch from a byte stream, ensure the reader is closed
-    Utils.tryWithSafeFinally {
-      val root = reader.getVectorSchemaRoot  // throws IOException
-      val unloader = new VectorUnloader(root)
-      reader.loadNextBatch()  // throws IOException
-      unloader.getRecordBatch
-    } {
-      reader.close()
-    }
+    val in = new ByteArrayInputStream(batchBytes)
+    MessageSerializer.deserializeMessageBatch(new ReadChannel(Channels.newChannel(in)), allocator)
+      .asInstanceOf[ArrowRecordBatch]  // throws IOException
   }
 
+  /**
+   * Create a DataFrame from a JavaRDD of serialized ArrowRecordBatches.
+   */
   private[sql] def toDataFrame(
-      payloadRDD: JavaRDD[Array[Byte]],
+      arrowBatchRDD: JavaRDD[Array[Byte]],
       schemaString: String,
       sqlContext: SQLContext): DataFrame = {
-    val rdd = payloadRDD.rdd.mapPartitions { iter =>
+    val schema = DataType.fromJson(schemaString).asInstanceOf[StructType]
+    val timeZoneId = sqlContext.sessionState.conf.sessionLocalTimeZone
+    val rdd = arrowBatchRDD.rdd.mapPartitions { iter =>
       val context = TaskContext.get()
-      ArrowConverters.fromPayloadIterator(iter.map(new ArrowPayload(_)), context)
+      ArrowConverters.fromBatchIterator(iter, schema, timeZoneId, context)
     }
-    val schema = DataType.fromJson(schemaString).asInstanceOf[StructType]
     sqlContext.internalCreateDataFrame(rdd, schema)
   }
+
+  /**
+   * Read a file as an Arrow stream and return an RDD of serialized ArrowRecordBatches.
+   */
+  private[sql] def readArrowStreamFromFile(sqlContext: SQLContext, filename: String):
+      JavaRDD[Array[Byte]] = {
+    val fileStream = new FileInputStream(filename)
+    try {
+      // Create array so that we can safely close the file
+      val batches = getBatchesFromStream(fileStream.getChannel).toArray
+      // Parallelize the record batches to create an RDD
+      JavaRDD.fromRDD(sqlContext.sparkContext.parallelize(batches, batches.length))
+    } finally {
+      fileStream.close()
+    }
+  }
+
+  /**
+   * Read an Arrow stream input and return an iterator of serialized ArrowRecordBatches.
+   */
+  private[sql] def getBatchesFromStream(in: SeekableByteChannel): Iterator[Array[Byte]] = {
+
+    // TODO: simplify in super class
+    class RecordBatchMessageReader(inputChannel: SeekableByteChannel) {
--- End diff --

made https://issues.apache.org/jira/browse/ARROW-2704 to track


---




[GitHub] spark issue #21082: [SPARK-22239][SQL][Python] Enable grouped aggregate pand...

2018-06-12 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21082
  
**[Test build #91718 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91718/testReport)** for PR 21082 at commit [`328b2c4`](https://github.com/apache/spark/commit/328b2c4e09502a66939d47d6967ceea7ceab6c8c).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #21539: [SPARK-24500][SQL] Make sure streams are material...

2018-06-12 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/21539#discussion_r194902636
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/PlannerSuite.scala ---
@@ -679,6 +679,13 @@ class PlannerSuite extends SharedSQLContext {
     }
     assert(rangeExecInZeroPartition.head.outputPartitioning == UnknownPartitioning(0))
   }
+
+  test("SPARK-24500: create union with stream of children") {
+    val df = Union(Stream(
+      Range(1, 1, 1, 1),
+      Range(1, 2, 1, 1)))
+    df.queryExecution.executedPlan.execute()
--- End diff --

Yeah, it would throw an `UnsupportedOperationException` before.
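For readers outside the diff context, a minimal, self-contained Scala sketch (not from the PR) of the `Stream` laziness involved: `Stream.map` only evaluates the head eagerly, so a transformation mapped over a `Stream` of children silently skips the tail until something forces it, which is consistent with the `UnsupportedOperationException` described above.

```scala
// Hypothetical illustration: count how many elements Stream.map actually visits.
var visited = 0
val children = Stream(1, 2, 3)
val transformed = children.map { c => visited += 1; c * 2 }
assert(visited == 1)   // only the head was transformed eagerly
transformed.toList     // forcing the Stream evaluates the rest
assert(visited == 3)   // now every child has been visited
```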


---




[GitHub] spark pull request #21539: [SPARK-24500][SQL] Make sure streams are material...

2018-06-12 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/21539#discussion_r194902354
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala ---
@@ -301,6 +290,37 @@ abstract class TreeNode[BaseType <: TreeNode[BaseType]] extends Product {
   def mapChildren(f: BaseType => BaseType): BaseType = {
     if (children.nonEmpty) {
       var changed = false
+      def mapChild(child: Any): Any = child match {
+        case arg: TreeNode[_] if containsChild(arg) =>
--- End diff --

Yeah, so I was about to do that, but then I noticed that they handle different cases: `mapChild` handles `TreeNode` and `(TreeNode, TreeNode)`, whereas L326-L349 handles `TreeNode`, `Option[TreeNode]`, and `Map[_, TreeNode]`. I am not sure combining them is useful, and if it is, I'd rather do it in a different PR.
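For illustration, a simplified, hypothetical sketch of the two cases `mapChild` covers (the real `TreeNode` code differs in detail):

```scala
trait Node
def containsChild(n: Node): Boolean = true  // stand-in for TreeNode's containsChild set

// Maps a function over one "child slot", which may be a plain child node or a
// pair of child nodes; anything else passes through untouched.
def mapChild(child: Any, f: Node => Node): Any = child match {
  case arg: Node if containsChild(arg) => f(arg)  // the TreeNode case
  case (l: Node, r: Node) => (f(l), f(r))         // the (TreeNode, TreeNode) case
  case other => other
}
```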


---




[GitHub] spark issue #21319: [SPARK-24267][SQL] explicitly keep DataSourceReader in D...

2018-06-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21319
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91722/
Test PASSed.


---




[GitHub] spark issue #21319: [SPARK-24267][SQL] explicitly keep DataSourceReader in D...

2018-06-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21319
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21319: [SPARK-24267][SQL] explicitly keep DataSourceReader in D...

2018-06-12 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21319
  
**[Test build #91722 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91722/testReport)** for PR 21319 at commit [`91fdedc`](https://github.com/apache/spark/commit/91fdedc4d91a7abde5f6b64dbfcf354b67d89a48).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #21546: [WIP][SPARK-23030][SQL][PYTHON] Use Arrow stream ...

2018-06-12 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/21546#discussion_r194898976
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/arrow/ArrowConvertersSuite.scala ---
@@ -51,11 +51,11 @@ class ArrowConvertersSuite extends SharedSQLContext with BeforeAndAfterAll {
 
   test("collect to arrow record batch") {
     val indexData = (1 to 6).toDF("i")
-    val arrowPayloads = indexData.toArrowPayload.collect()
-    assert(arrowPayloads.nonEmpty)
-    assert(arrowPayloads.length == indexData.rdd.getNumPartitions)
+    val arrowBatches = indexData.getArrowBatchRdd.collect()
+    assert(arrowBatches.nonEmpty)
+    assert(arrowBatches.length == indexData.rdd.getNumPartitions)
--- End diff --

Most of these changes are just renames to be consistent


---




[GitHub] spark pull request #21546: [WIP][SPARK-23030][SQL][PYTHON] Use Arrow stream ...

2018-06-12 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/21546#discussion_r194898793
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowConverters.scala ---
@@ -183,34 +182,131 @@ private[sql] object ArrowConverters {
   }
 
   /**
-   * Convert a byte array to an ArrowRecordBatch.
+   * Load a serialized ArrowRecordBatch.
    */
-  private[arrow] def byteArrayToBatch(
+  private[arrow] def loadBatch(
       batchBytes: Array[Byte],
       allocator: BufferAllocator): ArrowRecordBatch = {
-    val in = new ByteArrayReadableSeekableByteChannel(batchBytes)
-    val reader = new ArrowFileReader(in, allocator)
-
-    // Read a batch from a byte stream, ensure the reader is closed
-    Utils.tryWithSafeFinally {
-      val root = reader.getVectorSchemaRoot  // throws IOException
-      val unloader = new VectorUnloader(root)
-      reader.loadNextBatch()  // throws IOException
-      unloader.getRecordBatch
-    } {
-      reader.close()
-    }
+    val in = new ByteArrayInputStream(batchBytes)
+    MessageSerializer.deserializeMessageBatch(new ReadChannel(Channels.newChannel(in)), allocator)
+      .asInstanceOf[ArrowRecordBatch]  // throws IOException
   }
 
+  /**
+   * Create a DataFrame from a JavaRDD of serialized ArrowRecordBatches.
+   */
   private[sql] def toDataFrame(
-      payloadRDD: JavaRDD[Array[Byte]],
+      arrowBatchRDD: JavaRDD[Array[Byte]],
       schemaString: String,
       sqlContext: SQLContext): DataFrame = {
-    val rdd = payloadRDD.rdd.mapPartitions { iter =>
+    val schema = DataType.fromJson(schemaString).asInstanceOf[StructType]
+    val timeZoneId = sqlContext.sessionState.conf.sessionLocalTimeZone
+    val rdd = arrowBatchRDD.rdd.mapPartitions { iter =>
      val context = TaskContext.get()
-      ArrowConverters.fromPayloadIterator(iter.map(new ArrowPayload(_)), context)
+      ArrowConverters.fromBatchIterator(iter, schema, timeZoneId, context)
    }
-    val schema = DataType.fromJson(schemaString).asInstanceOf[StructType]
    sqlContext.internalCreateDataFrame(rdd, schema)
   }
+
+  /**
+   * Read a file as an Arrow stream and return an RDD of serialized ArrowRecordBatches.
+   */
+  private[sql] def readArrowStreamFromFile(sqlContext: SQLContext, filename: String):
+      JavaRDD[Array[Byte]] = {
+    val fileStream = new FileInputStream(filename)
+    try {
+      // Create array so that we can safely close the file
+      val batches = getBatchesFromStream(fileStream.getChannel).toArray
+      // Parallelize the record batches to create an RDD
+      JavaRDD.fromRDD(sqlContext.sparkContext.parallelize(batches, batches.length))
+    } finally {
+      fileStream.close()
+    }
+  }
+
+  /**
+   * Read an Arrow stream input and return an iterator of serialized ArrowRecordBatches.
+   */
+  private[sql] def getBatchesFromStream(in: SeekableByteChannel): Iterator[Array[Byte]] = {
+
+    // TODO: simplify in super class
+    class RecordBatchMessageReader(inputChannel: SeekableByteChannel) {
--- End diff --

I had to modify the existing Arrow code to allow for this, but I will work 
on getting these changes into Arrow for 0.10.0 and then this class can be 
simplified a lot.


---




[GitHub] spark issue #21515: [SPARK-24372][build] Add scripts to help with preparing ...

2018-06-12 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21515
  
**[Test build #91734 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91734/testReport)** for PR 21515 at commit [`04f6371`](https://github.com/apache/spark/commit/04f6371a3aa0f03ba1c37b5b450bea69923388e9).


---




[GitHub] spark issue #21515: [SPARK-24372][build] Add scripts to help with preparing ...

2018-06-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21515
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3957/
Test PASSed.


---




[GitHub] spark issue #21515: [SPARK-24372][build] Add scripts to help with preparing ...

2018-06-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21515
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21515: [SPARK-24372][build] Add scripts to help with preparing ...

2018-06-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21515
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21515: [SPARK-24372][build] Add scripts to help with preparing ...

2018-06-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21515
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/68/
Test PASSed.


---




[GitHub] spark issue #21546: [WIP][SPARK-23030][SQL][PYTHON] Use Arrow stream format ...

2018-06-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21546
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/67/
Test PASSed.


---




[GitHub] spark issue #21546: [WIP][SPARK-23030][SQL][PYTHON] Use Arrow stream format ...

2018-06-12 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21546
  
**[Test build #91733 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91733/testReport)** for PR 21546 at commit [`a5a1fbe`](https://github.com/apache/spark/commit/a5a1fbe7121c5b0dd93876a56c29ad17dcd9b168).


---




[GitHub] spark pull request #21515: [SPARK-24372][build] Add scripts to help with pre...

2018-06-12 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/21515#discussion_r194897476
  
--- Diff: dev/create-release/vote.tmpl ---
@@ -0,0 +1,64 @@
+Please vote on releasing the following candidate as Apache Spark version {version}.
+
+The vote is open until {deadline} and passes if a majority of at least 3 +1 PMC votes are cast.
--- End diff --

The rules are a little bit more complicated than that. Updated.


---




[GitHub] spark issue #21546: [WIP][SPARK-23030][SQL][PYTHON] Use Arrow stream format ...

2018-06-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21546
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21546: [WIP][SPARK-23030][SQL][PYTHON] Use Arrow stream format ...

2018-06-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21546
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21546: [WIP][SPARK-23030][SQL][PYTHON] Use Arrow stream format ...

2018-06-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21546
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3956/
Test PASSed.


---




[GitHub] spark pull request #21515: [SPARK-24372][build] Add scripts to help with pre...

2018-06-12 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/21515#discussion_r194896688
  
--- Diff: dev/create-release/spark-rm/Dockerfile ---
@@ -0,0 +1,89 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# Image for building Spark releases. Based on Ubuntu 16.04.
+#
+# Includes:
+# * Java 8
+# * Ivy
+# * Python/PyPandoc (2.7.12/3.5.2)
+# * R-base/R-base-dev (3.3.2+)
+# * Ruby 2.3 build utilities
+
+FROM ubuntu:16.04
+
+# These arguments are just for reuse and not really meant to be customized.
+ARG APT_INSTALL="apt-get install --no-install-recommends -y"
+
+# Install extra needed repos and refresh.
+# - CRAN repo
+# - Ruby repo (for doc generation)
+RUN echo 'deb http://cran.cnr.Berkeley.edu/bin/linux/ubuntu xenial/' >> /etc/apt/sources.list && \
+  gpg --keyserver keyserver.ubuntu.com --recv-key E084DAB9 && \
+  gpg -a --export E084DAB9 | apt-key add - && \
+  apt-get clean && \
+  rm -rf /var/lib/apt/lists/* && \
+  apt-get clean && \
+  apt-get update && \
+  $APT_INSTALL software-properties-common && \
+  apt-add-repository -y ppa:brightbox/ruby-ng && \
+  apt-get update
+
+# Install openjdk 8.
+RUN $APT_INSTALL openjdk-8-jdk && \
+  update-alternatives --set java /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java
+
+# Install build / source control tools
+RUN $APT_INSTALL curl wget git maven ivy subversion make gcc libffi-dev \
+pandoc pandoc-citeproc libssl-dev libcurl4-openssl-dev libxml2-dev && \
+  ln -s -T /usr/share/java/ivy.jar /usr/share/ant/lib/ivy.jar && \
+  curl -sL https://deb.nodesource.com/setup_4.x | bash && \
+  $APT_INSTALL nodejs
+
+# Install needed python packages. Use pip for installing packages (for consistency).
+ARG BASE_PIP_PKGS="setuptools wheel virtualenv"
+ARG PIP_PKGS="pyopenssl pypandoc numpy pygments sphinx"
+
+RUN $APT_INSTALL libpython2.7-dev libpython3-dev python-pip python3-pip && \
+  pip install $BASE_PIP_PKGS && \
+  pip install $PIP_PKGS && \
+  cd && \
+  virtualenv -p python3 p35 && \
+  . p35/bin/activate && \
+  pip install $BASE_PIP_PKGS && \
+  pip install $PIP_PKGS
+
+# Install R packages and dependencies used when building.
+# R depends on pandoc*, libssl (which are installed above).
+RUN $APT_INSTALL r-base r-base-dev && \
+  $APT_INSTALL texlive-latex-base texlive texlive-fonts-extra texinfo qpdf && \
+  Rscript -e "install.packages(c('curl', 'xml2', 'httr', 'devtools', 'testthat', 'knitr', 'rmarkdown', 'roxygen2', 'e1071', 'survival'), repos='http://cran.us.r-project.org/')" && \
+  Rscript -e "devtools::install_github('jimhester/lintr')"
+
+# Install tools needed to build the documentation.
+RUN $APT_INSTALL ruby2.3 ruby2.3-dev && \
+  gem install jekyll --no-rdoc --no-ri && \
+  gem install jekyll-redirect-from && \
+  gem install pygments.rb
+
+WORKDIR /opt/spark-rm/output
+
+ARG UID
+RUN useradd -m -s /bin/bash -p spark-rm -u $UID spark-rm
--- End diff --

The gpg signature is completely unrelated to the user running the process.

Also, Docker being the weird thing it is, the user name in the container doesn't really matter much. It's running with the same UID as the host user who built the image, which in this case is expected to be the person preparing the release, and this will only really matter for files that are written back to the host's file system.


---




[GitHub] spark issue #21546: [WIP][SPARK-23030][SQL][PYTHON] Use Arrow stream format ...

2018-06-12 Thread BryanCutler
Github user BryanCutler commented on the issue:

https://github.com/apache/spark/pull/21546
  
This is a WIP because I had to hack up some of the message processing code 
in Arrow.  This should be done in Arrow, and then it can be cleaned up here.  I 
will make these changes for version 0.10.0 and complete this once we have 
upgraded.


---




[GitHub] spark pull request #21546: [WIP][SPARK-23030][SQL][PYTHON] Use Arrow stream ...

2018-06-12 Thread BryanCutler
GitHub user BryanCutler opened a pull request:

https://github.com/apache/spark/pull/21546

[WIP][SPARK-23030][SQL][PYTHON] Use Arrow stream format for creating from 
and collecting Pandas DataFrames

## What changes were proposed in this pull request?

This changes the calls of `toPandas()` and `createDataFrame()` to use the Arrow stream format when Arrow is enabled. Previously, Arrow data was written to byte arrays where each chunk was an output of the Arrow file format. This was mainly due to constraints at the time, and it caused some overhead by writing the schema/footer on each chunk of data and then having to read multiple Arrow file inputs and concatenate them together.

Using the Arrow stream format improves on this by increasing performance, lowering memory overhead in the average case, and simplifying the code. Here are the details of this change:

**toPandas()**

_Before:_
Spark internal rows are converted to the Arrow file format; each group of records is a complete Arrow file containing the schema and other metadata. Next, a collect is done and an array of Arrow files is the result. After that, each Arrow file is sent to the Python driver, which loads each file and concatenates them into a single Arrow DataFrame.

_After:_
Spark internal rows are converted to ArrowRecordBatches directly, which is the simplest Arrow component for IPC data transfers. The driver JVM then immediately starts serving data to Python as an Arrow stream, sending the schema first. It then starts Spark jobs with a custom handler such that when a partition is received (and in the correct order) the ArrowRecordBatches can be sent to Python as soon as possible. This improves performance, simplifies memory usage on executors, and improves the average memory usage on the JVM driver. Since the order of partitions must be preserved, the worst case is that the first partition is the last to arrive and all data must be kept in memory until then. This case is no worse than before when doing a full collect.

**createDataFrame()**

_Before:_
A Pandas DataFrame is split into parts and each part is made into an Arrow 
file.  Then each file is prefixed by the buffer size and written to a temp 
file.  The temp file is read and each Arrow file is parallelized as a byte 
array.

_After:_
A Pandas DataFrame is split into parts, then an Arrow stream is written to a temp file where each part is an ArrowRecordBatch. The temp file is read as a stream and the Arrow messages are examined. If a message is an ArrowRecordBatch, the data is saved as a byte array. After reading the file, each ArrowRecordBatch is parallelized as a byte array. This involves slightly more processing than before because we must inspect each Arrow message to extract the record batches, but performance remains the same. It is cleaner in the sense that IPC from Python to the JVM is done over a single Arrow stream.
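As a point of reference, a hedged Scala sketch of the stream-format write pattern described above, using the Arrow Java API (class locations vary across Arrow releases; this assumes `org.apache.arrow.vector.ipc`, and Spark's actual code differs):

```scala
import java.io.FileOutputStream

import org.apache.arrow.vector.VectorSchemaRoot
import org.apache.arrow.vector.ipc.ArrowStreamWriter

// Sketch only: with the stream format the schema is written once up front and
// each ArrowRecordBatch is appended to the same stream, avoiding the per-chunk
// schema/footer overhead of the file format.
def writeAsStream(root: VectorSchemaRoot, path: String): Unit = {
  val out = new FileOutputStream(path)
  val writer = new ArrowStreamWriter(root, null, out.getChannel)
  try {
    writer.start()       // writes the schema once
    writer.writeBatch()  // appends the current contents of `root` as one batch
    writer.end()         // writes the end-of-stream marker
  } finally {
    writer.close()
    out.close()
  }
}
```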

## How was this patch tested?

Added new unit tests for the additions to ArrowConverters in Scala; existing tests cover the Python side.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/BryanCutler/spark arrow-toPandas-stream-SPARK-23030

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21546.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21546


commit 9af482170ee95831bbda139e6e931ba2631df386
Author: Bryan Cutler 
Date:   2018-01-10T22:02:15Z

change ArrowConverters to stream format

commit d617f0da8eff1509da465bb707340e391314bec4
Author: Bryan Cutler 
Date:   2018-01-10T22:14:07Z

Change ArrowSerializer to use stream format

commit f10d5d9cd3cece7f56749e1de7fe01699e4759a0
Author: Bryan Cutler 
Date:   2018-01-12T00:40:36Z

toPandas is working with RecordBatch payloads, using custom handler to stream ordered partitions

commit 03653c687473b82bbfb6653504479498a2a3c63b
Author: Bryan Cutler 
Date:   2018-02-10T00:23:17Z

cleanup and removed ArrowPayload, createDataFrame now working

commit 1b932463bca0815e79f3a8d61d1c816e62949698
Author: Bryan Cutler 
Date:   2018-03-09T00:14:06Z

toPandas and createDataFrame working but tests fail with date cols

commit ce22d8ad18e052d150528752b727c6cfe11485f7
Author: Bryan Cutler 
Date:   2018-03-27T00:32:03Z

removed usage of seekableByteChannel

commit dede0bd96921c439747a9176f24c9ecbb9c8ce0a
Author: Bryan Cutler 
Date:   2018-03-28T00:28:54Z

for toPandas, set old collection result to null and add comments

commit 9e29b092cb7d45fa486db0215c3bd4a99c5f8d98
Author: Bryan Cutler 
Date:   2018-03-28T18:28:18Z

cleanup, not yet passing ArrowConvertersSuite

commit ceb8d38a6c83c3b6dae040c9e8d860811ecad0cc
Author: Bryan Cutler 
Date:   2018-03-29T21:14:03Z

fix

[GitHub] spark pull request #21515: [SPARK-24372][build] Add scripts to help with pre...

2018-06-12 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/21515#discussion_r194896168
  
--- Diff: dev/.rat-excludes ---
@@ -106,3 +106,4 @@ spark-warehouse
 structured-streaming/*
 kafka-source-initial-offset-version-2.1.0.bin
 kafka-source-initial-offset-future-version.bin
+vote.tmpl
--- End diff --

Are you saying this file should not be packaged in the source release? Not 
sure I see why that would be the case. There's a lot of stuff in 
`.rat-excludes` that is still packaged.


---




[GitHub] spark pull request #21503: [SPARK-24478][SQL] Move projection and filter pus...

2018-06-12 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21503#discussion_r194895594
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala ---
@@ -17,15 +17,56 @@
 
 package org.apache.spark.sql.execution.datasources.v2
 
-import org.apache.spark.sql.Strategy
+import org.apache.spark.sql.{execution, Strategy}
+import org.apache.spark.sql.catalyst.expressions.{And, AttributeReference, AttributeSet}
+import org.apache.spark.sql.catalyst.planning.PhysicalOperation
 import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
 import org.apache.spark.sql.execution.SparkPlan
 import org.apache.spark.sql.execution.streaming.continuous.{WriteToContinuousDataSource, WriteToContinuousDataSourceExec}
 
 object DataSourceV2Strategy extends Strategy {
   override def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
-    case r: DataSourceV2Relation =>
-      DataSourceV2ScanExec(r.output, r.source, r.options, r.pushedFilters, r.reader) :: Nil
+    case PhysicalOperation(project, filters, relation: DataSourceV2Relation) =>
+      val projectSet = AttributeSet(project.flatMap(_.references))
+      val filterSet = AttributeSet(filters.flatMap(_.references))
+
+      val projection = if (filterSet.subsetOf(projectSet) &&
+          AttributeSet(relation.output) == projectSet) {
+        // When the required projection contains all of the filter columns and column pruning
+        // alone can produce the required projection, push the required projection.
+        // A final projection may still be needed if the data source produces a different column
+        // order or if it cannot prune all of the nested columns.
+        relation.output
+      } else {
+        // When there are filter columns not already in the required projection or when the
+        // required projection is more complicated than column pruning, base column pruning
+        // on the set of all columns needed by both.
+        (projectSet ++ filterSet).toSeq
+      }
+
+      val reader = relation.newReader
--- End diff --

Yea, the second proposal is what happens for the v1 data sources. For file-based data sources we kind of pick the third proposal and add an optimizer rule, `PruneFileSourcePartitions`, to push down some of the filters to the data source at the logical phase, to get precise stats.

I'd like to pick from the 2nd and 3rd proposals (the 3rd proposal is also temporary, before we move stats to the physical plan). Applying pushdown twice is hard to work around (we'd need to cache), while we can keep the `PruneFileSourcePartitions` rule to work around the issue in the 2nd proposal for file-based data sources.

Let's also get more input from other people.


---




[GitHub] spark issue #21531: [SPARK-24521][SQL][TEST] Fix ineffective test in CachedT...

2018-06-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21531
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21050: [SPARK-23912][SQL]add array_distinct

2018-06-12 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21050
  
**[Test build #91732 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91732/testReport)** for PR 21050 at commit [`ba0d60f`](https://github.com/apache/spark/commit/ba0d60f2e7fd7de4e571d78af362b7f06935b3e8).


---




[GitHub] spark issue #21531: [SPARK-24521][SQL][TEST] Fix ineffective test in CachedT...

2018-06-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21531
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91719/
Test PASSed.


---




[GitHub] spark issue #21050: [SPARK-23912][SQL]add array_distinct

2018-06-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21050
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21050: [SPARK-23912][SQL]add array_distinct

2018-06-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21050
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21050: [SPARK-23912][SQL]add array_distinct

2018-06-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21050
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3955/
Test PASSed.


---




[GitHub] spark issue #21050: [SPARK-23912][SQL]add array_distinct

2018-06-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21050
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/66/
Test PASSed.


---




[GitHub] spark issue #21531: [SPARK-24521][SQL][TEST] Fix ineffective test in CachedT...

2018-06-12 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21531
  
**[Test build #91719 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91719/testReport)** for PR 21531 at commit [`658539a`](https://github.com/apache/spark/commit/658539a71971bdea62e89afd11775bf6f51d766d).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #21050: [SPARK-23912][SQL]add array_distinct

2018-06-12 Thread huaxingao
Github user huaxingao commented on a diff in the pull request:

https://github.com/apache/spark/pull/21050#discussion_r194893615
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
@@ -1882,3 +1883,134 @@ case class ArrayRepeat(left: Expression, right: Expression)
   }
 
 }
+
+/**
+ * Removes duplicate values from the array.
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(array) - Removes duplicate values from the array.",
+  examples = """
+    Examples:
+      > SELECT _FUNC_(array(1, 2, 3, null, 3));
+       [1,2,3,null]
+  """, since = "2.4.0")
+case class ArrayDistinct(child: Expression)
+  extends UnaryExpression with ExpectsInputTypes {
+
+  override def inputTypes: Seq[AbstractDataType] = Seq(ArrayType)
+
+  override def dataType: DataType = child.dataType
+
+  lazy val elementType: DataType = dataType.asInstanceOf[ArrayType].elementType
+
+  override def nullSafeEval(array: Any): Any = {
+    val elementType = child.dataType.asInstanceOf[ArrayType].elementType
+    val data = array.asInstanceOf[ArrayData].toArray[AnyRef](elementType).distinct
+    new GenericArrayData(data.asInstanceOf[Array[Any]])
+  }
+
+  override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
+    nullSafeCodeGen(ctx, ev, (array) => {
+      val i = ctx.freshName("i")
+      val j = ctx.freshName("j")
+      val hs = ctx.freshName("hs")
+      val foundNullElement = ctx.freshName("foundNullElement")
+      val distinctArrayLen = ctx.freshName("distinctArrayLen")
+      val getValue = CodeGenerator.getValue(array, elementType, i)
+      val openHashSet = classOf[OpenHashSet[_]].getName
+      val classTag = s"scala.reflect.ClassTag$$.MODULE$$.Object()"
+      s"""
+         |int $distinctArrayLen = 0;
+         |boolean $foundNullElement = false;
+         |$openHashSet $hs = new $openHashSet($classTag);
+         |for (int $i = 0; $i < $array.numElements(); $i++) {
+         |  if ($array.isNullAt($i)) {
+         |    if (!($foundNullElement)) {
+         |      $distinctArrayLen = $distinctArrayLen + 1;
+         |      $foundNullElement = true;
+         |    }
+         |  }
+         |  else {
+         |    if (!($hs.contains($getValue))) {
+         |      $hs.add($getValue);
+         |      $distinctArrayLen = $distinctArrayLen + 1;
+         |    }
+         |  }
+         |}
--- End diff --

@ueshin Thanks for your review. I made some changes based on your comments. 
Could you please review one more time? Thanks!
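For reference, a hedged Scala sketch (not from the PR, and assuming Spark's internal `OpenHashSet` API) of the single-pass dedup the generated code above performs; nulls are tracked separately because `OpenHashSet` does not take null keys:

```scala
import org.apache.spark.util.collection.OpenHashSet

def distinctKeepingOneNull(values: Array[AnyRef]): Array[AnyRef] = {
  val seen = new OpenHashSet[AnyRef]()
  var foundNull = false
  val out = Array.newBuilder[AnyRef]
  for (v <- values) {
    if (v == null) {
      if (!foundNull) { foundNull = true; out += null }  // keep only the first null
    } else if (!seen.contains(v)) {
      seen.add(v)  // first occurrence of v
      out += v
    }
  }
  out.result()
}
```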


---




[GitHub] spark issue #21366: [SPARK-24248][K8S] Use level triggering and state reconc...

2018-06-12 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21366
  
Kubernetes integration test status success
URL: 
https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/3807/



---




[GitHub] spark issue #21398: [SPARK-24338][SQL] Fixed Hive CREATETABLE error in Sentr...

2018-06-12 Thread skonto
Github user skonto commented on the issue:

https://github.com/apache/spark/pull/21398
  
@vanzin @cloud-fan Going back to the facts: the fix/work-around in Cloudera's distro obviously exists because the distro integrates with Sentry anyway. Removing the location did fix the Spark job on the customer side. To re-phrase the last request from @cloud-fan: if we remove the explicit location, what are the consequences? When do you need the explicit location for managed tables? Can this be the default behavior in general?


---




[GitHub] spark issue #21545: [SPARK-23010][BUILD][FOLLOWUP] Fix java checkstyle failu...

2018-06-12 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21545
  
Kubernetes integration test status success
URL: 
https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/3806/



---




[GitHub] spark issue #21366: [SPARK-24248][K8S] Use level triggering and state reconc...

2018-06-12 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21366
  
Kubernetes integration test starting
URL: 
https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/3807/



---




[GitHub] spark issue #21366: [SPARK-24248][K8S] Use level triggering and state reconc...

2018-06-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21366
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21366: [SPARK-24248][K8S] Use level triggering and state reconc...

2018-06-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21366
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/65/
Test PASSed.


---




[GitHub] spark issue #21545: [SPARK-23010][BUILD][FOLLOWUP] Fix java checkstyle failu...

2018-06-12 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21545
  
Kubernetes integration test starting
URL: 
https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/3806/



---




[GitHub] spark issue #21366: [SPARK-24248][K8S] Use level triggering and state reconc...

2018-06-12 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21366
  
Kubernetes integration test status failure
URL: 
https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/3804/



---




[GitHub] spark pull request #21503: [SPARK-24478][SQL] Move projection and filter pus...

2018-06-12 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/21503#discussion_r194875888
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala ---
@@ -17,15 +17,56 @@
 
 package org.apache.spark.sql.execution.datasources.v2
 
-import org.apache.spark.sql.Strategy
+import org.apache.spark.sql.{execution, Strategy}
+import org.apache.spark.sql.catalyst.expressions.{And, AttributeReference, AttributeSet}
+import org.apache.spark.sql.catalyst.planning.PhysicalOperation
 import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
 import org.apache.spark.sql.execution.SparkPlan
 import org.apache.spark.sql.execution.streaming.continuous.{WriteToContinuousDataSource, WriteToContinuousDataSourceExec}
 
 object DataSourceV2Strategy extends Strategy {
   override def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
-    case r: DataSourceV2Relation =>
-      DataSourceV2ScanExec(r.output, r.source, r.options, r.pushedFilters, r.reader) :: Nil
+    case PhysicalOperation(project, filters, relation: DataSourceV2Relation) =>
+      val projectSet = AttributeSet(project.flatMap(_.references))
+      val filterSet = AttributeSet(filters.flatMap(_.references))
+
+      val projection = if (filterSet.subsetOf(projectSet) &&
+          AttributeSet(relation.output) == projectSet) {
+        // When the required projection contains all of the filter columns and column pruning
+        // alone can produce the required projection, push the required projection.
+        // A final projection may still be needed if the data source produces a different column
+        // order or if it cannot prune all of the nested columns.
+        relation.output
+      } else {
+        // When there are filter columns not already in the required projection or when the
+        // required projection is more complicated than column pruning, base column pruning
+        // on the set of all columns needed by both.
+        (projectSet ++ filterSet).toSeq
+      }
+
+      val reader = relation.newReader
--- End diff --

I don't mind either option #1 or #2. #2 is basically what happens for non-v2 data sources right now. Plus, both should be temporary.

I think it is a bad idea to continue with hacky code that uses the reader in the logical plan. It is much cleaner otherwise, and we've spent too much time making sure that everything still works. The main example that comes to mind is setting the requested projection and finding out what the output is after pushdown. I think hacks are slowing progress on the v2 sources.


---




[GitHub] spark issue #21366: [SPARK-24248][K8S] Use level triggering and state reconc...

2018-06-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21366
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21366: [SPARK-24248][K8S] Use level triggering and state reconc...

2018-06-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21366
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3954/
Test PASSed.


---




[GitHub] spark issue #21545: [SPARK-23010][BUILD][FOLLOWUP] Fix java checkstyle failu...

2018-06-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21545
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21545: [SPARK-23010][BUILD][FOLLOWUP] Fix java checkstyle failu...

2018-06-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21545
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/64/
Test PASSed.


---




[GitHub] spark pull request #21503: [SPARK-24478][SQL] Move projection and filter pus...

2018-06-12 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21503#discussion_r194871032
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala ---
@@ -17,15 +17,56 @@
 
 package org.apache.spark.sql.execution.datasources.v2
 
-import org.apache.spark.sql.Strategy
+import org.apache.spark.sql.{execution, Strategy}
+import org.apache.spark.sql.catalyst.expressions.{And, AttributeReference, AttributeSet}
+import org.apache.spark.sql.catalyst.planning.PhysicalOperation
 import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
 import org.apache.spark.sql.execution.SparkPlan
 import org.apache.spark.sql.execution.streaming.continuous.{WriteToContinuousDataSource, WriteToContinuousDataSourceExec}
 
 object DataSourceV2Strategy extends Strategy {
   override def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
-    case r: DataSourceV2Relation =>
-      DataSourceV2ScanExec(r.output, r.source, r.options, r.pushedFilters, r.reader) :: Nil
+    case PhysicalOperation(project, filters, relation: DataSourceV2Relation) =>
+      val projectSet = AttributeSet(project.flatMap(_.references))
+      val filterSet = AttributeSet(filters.flatMap(_.references))
+
+      val projection = if (filterSet.subsetOf(projectSet) &&
+          AttributeSet(relation.output) == projectSet) {
+        // When the required projection contains all of the filter columns and column pruning
+        // alone can produce the required projection, push the required projection.
+        // A final projection may still be needed if the data source produces a different column
+        // order or if it cannot prune all of the nested columns.
+        relation.output
+      } else {
+        // When there are filter columns not already in the required projection or when the
+        // required projection is more complicated than column pruning, base column pruning
+        // on the set of all columns needed by both.
+        (projectSet ++ filterSet).toSeq
+      }
+
+      val reader = relation.newReader
--- End diff --

OK we need to make a decision here:
1. Apply operator pushdown twice (proposed by this PR). This moves the
pushdown logic to the planner, which is more ideal and cleaner. The drawback
is, before moving statistics to physical plan, we have some duplicated
pushdown code in `DataSourceV2Relation`, and applying pushdown twice has a
performance penalty.
2. Apply operator pushdown only once, in the planner. Same as 1, it's
cleaner. The drawback is, before moving statistics to physical plan, data
source v2 can't report statistics after filter.
3. Apply operator pushdown only once, in the optimizer (proposed by
https://github.com/apache/spark/pull/21319). It has no performance penalty and
we can report statistics after filter. The drawback is, before moving
statistics to physical plan, we have a temporary `DataSourceReader` in
`DataSourceV2Relation`, which is hacky.

The tradeoff is: shall we bear with hacky code and move forward with the
data source v2 operator pushdown support? Or shall we make the code cleaner
and bear with some performance penalty (apply pushdown twice, or don't
report stats after filter)? Or shall we just hold back and think about how
to move stats to the physical plan?

cc @marmbrus @jose-torres 
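
For context, a minimal sketch of what "apply pushdown in the planner" means
here, using the DSv2 reader mix-ins of this era; `configureReader` and its
parameters are illustrative names, not code from this PR:

```scala
import org.apache.spark.sql.sources.Filter
import org.apache.spark.sql.sources.v2.reader.{DataSourceReader,
  SupportsPushDownFilters, SupportsPushDownRequiredColumns}
import org.apache.spark.sql.types.StructType

// Configure a freshly created reader with the pruned schema and the
// translated filters. Returns the filters the source could NOT handle,
// which Spark must still evaluate on top of the scan. Options 1 and 2
// run this during planning; option 3 runs it in the optimizer.
def configureReader(
    reader: DataSourceReader,
    prunedSchema: StructType,
    translated: Array[Filter]): Array[Filter] = {
  reader match {
    case r: SupportsPushDownRequiredColumns => r.pruneColumns(prunedSchema)
    case _ =>
  }
  reader match {
    case r: SupportsPushDownFilters => r.pushFilters(translated)
    case _ => translated
  }
}
```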


---




[GitHub] spark issue #21366: [SPARK-24248][K8S] Use level triggering and state reconc...

2018-06-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21366
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3951/
Test PASSed.


---




[GitHub] spark issue #21545: [SPARK-23010][BUILD][FOLLOWUP] Fix java checkstyle failu...

2018-06-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21545
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3953/
Test PASSed.


---




[GitHub] spark issue #21545: [SPARK-23010][BUILD][FOLLOWUP] Fix java checkstyle failu...

2018-06-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21545
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21543: [SPARK-24531][TESTS] Replace 2.3.0 version with 2.3.1

2018-06-12 Thread mgaido91
Github user mgaido91 commented on the issue:

https://github.com/apache/spark/pull/21543
  
cc @vanzin 


---




[GitHub] spark issue #21366: [SPARK-24248][K8S] Use level triggering and state reconc...

2018-06-12 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21366
  
Kubernetes integration test starting
URL: 
https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/3804/



---




[GitHub] spark issue #21366: [SPARK-24248][K8S] Use level triggering and state reconc...

2018-06-12 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21366
  
Kubernetes integration test status failure
URL: 
https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/3803/



---




[GitHub] spark issue #21366: [SPARK-24248][K8S] Use level triggering and state reconc...

2018-06-12 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21366
  
**[Test build #91731 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91731/testReport)**
 for PR 21366 at commit 
[`8b0a211`](https://github.com/apache/spark/commit/8b0a211f4c5046065b0e2abcac0ee71119ff6bd6).


---




[GitHub] spark issue #21545: [SPARK-23010][BUILD][FOLLOWUP] Fix java checkstyle failu...

2018-06-12 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21545
  
**[Test build #91730 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91730/testReport)**
 for PR 21545 at commit 
[`00d9850`](https://github.com/apache/spark/commit/00d9850698350abba2c6ac10a1c753e332e47751).


---




[GitHub] spark issue #21504: [SPARK-24479][SS] Added config for registering streaming...

2018-06-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21504
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91715/
Test FAILed.


---




[GitHub] spark issue #21504: [SPARK-24479][SS] Added config for registering streaming...

2018-06-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21504
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #21504: [SPARK-24479][SS] Added config for registering streaming...

2018-06-12 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21504
  
**[Test build #91715 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91715/testReport)**
 for PR 21504 at commit 
[`b87c90b`](https://github.com/apache/spark/commit/b87c90b6bb356e0faaf8230cb6b20fbcdd65c858).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #21545: [SPARK-23010][BUILD] Fix java checkstyle failure ...

2018-06-12 Thread jiangxb1987
GitHub user jiangxb1987 opened a pull request:

https://github.com/apache/spark/pull/21545

[SPARK-23010][BUILD] Fix java checkstyle failure of 
kubernetes-integration-tests

## What changes were proposed in this pull request?

Fix java checkstyle failure of kubernetes-integration-tests

## How was this patch tested?

Checked manually on my local environment.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jiangxb1987/spark k8s-checkstyle

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21545.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21545


commit 00d9850698350abba2c6ac10a1c753e332e47751
Author: Xingbo Jiang 
Date:   2018-06-12T19:39:50Z

fix java checkstyle failure broken by kubernetes integration tests




---




[GitHub] spark issue #21544: add one supported type missing from the javadoc

2018-06-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21544
  
Can one of the admins verify this patch?


---







[GitHub] spark pull request #21529: [SPARK-24495][SQL] EnsureRequirement returns wron...

2018-06-12 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21529#discussion_r194863357
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/PlannerSuite.scala ---
@@ -679,6 +679,17 @@ class PlannerSuite extends SharedSQLContext {
 }
 assert(rangeExecInZeroPartition.head.outputPartitioning == 
UnknownPartitioning(0))
   }
+
+  test("SPARK-24495: EnsureRequirements can return wrong plan when reusing 
the same key in join") {
+withSQLConf(("spark.sql.shuffle.partitions", "1"),
+  ("spark.sql.constraintPropagation.enabled", "false"),
+  ("spark.sql.autoBroadcastJoinThreshold", "-1")) {
+  val df1 = spark.range(100).repartition(2, $"id", $"id")
--- End diff --

no, the issue can happen with range partitioning (because of the double
transformation issue): the code in the ticket can reproduce the bug, and it
has no hash partitioning.
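
A sketch of the shape being discussed, adapted from the test in the diff
above (assuming `spark.implicits._` is in scope; not the exact code from
the JIRA ticket):

```scala
// The same key is used twice in repartition; with broadcast joins disabled,
// EnsureRequirements resolves the join's shuffle requirements and could
// return a wrong plan before this fix.
val df1 = spark.range(100).repartition(2, $"id", $"id")
val df2 = spark.range(100)
df1.join(df2, df1("id") === df2("id")).collect()
```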


---




[GitHub] spark pull request #21544: add one supported type missing from the javadoc

2018-06-12 Thread yuj
GitHub user yuj opened a pull request:

https://github.com/apache/spark/pull/21544

add one supported type missing from the javadoc

## What changes were proposed in this pull request?

The supported java.math.BigInteger type is not mentioned in the javadoc of 
Encoders.bean()
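
For illustration, a bean with a `java.math.BigInteger` field already works
with `Encoders.bean()` (it maps to a decimal type); the `Account` class
below is made up for the example:

```scala
import java.math.BigInteger
import org.apache.spark.sql.Encoders

class Account extends Serializable {
  private var balance: BigInteger = BigInteger.ZERO
  def getBalance: BigInteger = balance
  def setBalance(b: BigInteger): Unit = { balance = b }
}

// Compiles and runs today; the javadoc just didn't list BigInteger.
val accountEncoder = Encoders.bean(classOf[Account])
```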

## How was this patch tested?

only Javadoc fix

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/yuj/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21544.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21544


commit 635378aa616f535ee77a3ac9d4236d54c7ef093e
Author: James Yu 
Date:   2018-06-12T19:32:19Z

add one supported type missing from the javadoc




---




[GitHub] spark pull request #21258: [SPARK-23933][SQL] Add map_from_arrays function

2018-06-12 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/21258


---




[GitHub] spark issue #21529: [SPARK-24495][SQL] EnsureRequirement returns wrong plan ...

2018-06-12 Thread mgaido91
Github user mgaido91 commented on the issue:

https://github.com/apache/spark/pull/21529
  
> I think we need to support both. testing different physical operators 
needs same result, testing something like type coercion mode needs different 
result. Anyway let's discuss it in the followup.

ok @cloud-fan, I'll try to send a proposal in the next few days.

> @mgaido91 Could you help improve the test coverage of joins 
org.apache.spark.sql.JoinSuite? Due to the incomplete test case coverage, we 
did not discover this at the very beginning. We need to add more test cases to 
cover different join algorithms.

Sure, @gatorsmile, I am happy to. Do you mean running the existing tests 
for every type of join or do you have something different in mind? Thanks.


---




[GitHub] spark issue #21543: [SPARK-24531][TESTS] Replace 2.3.0 version with 2.3.1

2018-06-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21543
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/63/
Test PASSed.


---




[GitHub] spark issue #21543: [SPARK-24531][TESTS] Replace 2.3.0 version with 2.3.1

2018-06-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21543
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21543: [SPARK-24531][TESTS] Replace 2.3.0 version with 2.3.1

2018-06-12 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21543
  
**[Test build #91729 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91729/testReport)**
 for PR 21543 at commit 
[`08461b4`](https://github.com/apache/spark/commit/08461b45bb4dc11e74976e6335269907476d02ca).


---




[GitHub] spark issue #21366: [SPARK-24248][K8S] Use level triggering and state reconc...

2018-06-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21366
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21366: [SPARK-24248][K8S] Use level triggering and state reconc...

2018-06-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21366
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/62/
Test PASSed.


---




[GitHub] spark pull request #21067: [SPARK-23980][K8S] Resilient Spark driver on Kube...

2018-06-12 Thread mccheah
Github user mccheah commented on a diff in the pull request:

https://github.com/apache/spark/pull/21067#discussion_r194862373
  
--- Diff: 
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicExecutorFeatureStep.scala
 ---
@@ -67,12 +68,19 @@ private[spark] class BasicExecutorFeatureStep(
 }
   private val executorLimitCores = 
kubernetesConf.get(KUBERNETES_EXECUTOR_LIMIT_CORES)
 
-  override def configurePod(pod: SparkPod): SparkPod = {
-val name = 
s"$executorPodNamePrefix-exec-${kubernetesConf.roleSpecificConf.executorId}"
+  // If the driver pod is killed, the new driver pod will try to
+  // create new executors with the same name, but it will fail
+  // and hang indefinitely because a terminating executor blocks
+  // the creation of the new ones; to avoid that, apply a salt
+  private val executorNameSalt = 
Random.alphanumeric.take(4).mkString("").toLowerCase
--- End diff --

Four alphanumeric characters don't provide enough randomness to be a
reliable salt, which is why I suggest this to avoid collisions.
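
To put a rough number on that (an editor's back-of-the-envelope, not from
the thread):

```scala
// Random.alphanumeric.take(4).mkString.toLowerCase collapses to 36 symbols
// (26 letters + 10 digits), so the salt space is 36^4 = 1,679,616 values.
val space = math.pow(36, 4)
// Birthday bound: ~50% collision probability after sqrt(2 * space * ln 2)
// draws, i.e. only about 1,500 salted names.
val drawsFor50Percent = math.sqrt(2 * space * math.log(2)) // ≈ 1526
```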


---




[GitHub] spark pull request #21067: [SPARK-23980][K8S] Resilient Spark driver on Kube...

2018-06-12 Thread mccheah
Github user mccheah commented on a diff in the pull request:

https://github.com/apache/spark/pull/21067#discussion_r194862246
  
--- Diff: 
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicExecutorFeatureStep.scala
 ---
@@ -67,12 +68,19 @@ private[spark] class BasicExecutorFeatureStep(
 }
   private val executorLimitCores = 
kubernetesConf.get(KUBERNETES_EXECUTOR_LIMIT_CORES)
 
-  override def configurePod(pod: SparkPod): SparkPod = {
-val name = 
s"$executorPodNamePrefix-exec-${kubernetesConf.roleSpecificConf.executorId}"
+  // If the driver pod is killed, the new driver pod will try to
+  // create new executors with the same name, but it will fail
+  // and hang indefinitely because a terminating executor blocks
+  // the creation of the new ones; to avoid that, apply a salt
+  private val executorNameSalt = 
Random.alphanumeric.take(4).mkString("").toLowerCase
--- End diff --

We should just have the executor pod name be
`s"$applicationId-exec-$executorId"` then. I don't think the pod name prefix
has to strictly be tied to the application name. The application name should
be applied as a label so the executor pod can be located with it from the
dashboard, kubectl, etc.
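
A minimal sketch of that scheme (names assumed to come from the scheduler
backend; the label keys are illustrative, not Spark's actual constants):

```scala
// Executor pod names built from the application id are unique within an
// application, so no random salt is needed as long as each driver instance
// gets its own application id (an assumption of this sketch).
def executorPodName(applicationId: String, executorId: String): String =
  s"$applicationId-exec-$executorId"

// Locate executors by app name via labels instead of encoding it in the name.
def executorLabels(appName: String, applicationId: String): Map[String, String] =
  Map("spark-app-name" -> appName, "spark-app-id" -> applicationId)
```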


---




[GitHub] spark issue #21543: [SPARK-24531][TESTS] Replace 2.3.0 version with 2.3.1

2018-06-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21543
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3952/
Test PASSed.


---




[GitHub] spark issue #21543: [SPARK-24531][TESTS] Replace 2.3.0 version with 2.3.1

2018-06-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21543
  
Merged build finished. Test PASSed.


---




[GitHub] spark pull request #21529: [SPARK-24495][SQL] EnsureRequirement returns wron...

2018-06-12 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21529#discussion_r194861969
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/PlannerSuite.scala ---
@@ -679,6 +679,17 @@ class PlannerSuite extends SharedSQLContext {
 }
 assert(rangeExecInZeroPartition.head.outputPartitioning == 
UnknownPartitioning(0))
   }
+
+  test("SPARK-24495: EnsureRequirements can return wrong plan when reusing 
the same key in join") {
+withSQLConf(("spark.sql.shuffle.partitions", "1"),
+  ("spark.sql.constraintPropagation.enabled", "false"),
+  ("spark.sql.autoBroadcastJoinThreshold", "-1")) {
+  val df1 = spark.range(100).repartition(2, $"id", $"id")
--- End diff --

this way would not be ok, as we would have a `RangePartitioning` while the 
issue appears only with `HashPartitioning`


---




[GitHub] spark issue #21366: [SPARK-24248][K8S] Use level triggering and state reconc...

2018-06-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21366
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21258: [SPARK-23933][SQL] Add map_from_arrays function

2018-06-12 Thread ueshin
Github user ueshin commented on the issue:

https://github.com/apache/spark/pull/21258
  
Thanks! merging to master.


---




[GitHub] spark pull request #21503: [SPARK-24478][SQL] Move projection and filter pus...

2018-06-12 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/21503#discussion_r194861645
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala
 ---
@@ -17,15 +17,56 @@
 
 package org.apache.spark.sql.execution.datasources.v2
 
-import org.apache.spark.sql.Strategy
+import org.apache.spark.sql.{execution, Strategy}
+import org.apache.spark.sql.catalyst.expressions.{And, AttributeReference, 
AttributeSet}
+import org.apache.spark.sql.catalyst.planning.PhysicalOperation
 import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
 import org.apache.spark.sql.execution.SparkPlan
 import 
org.apache.spark.sql.execution.streaming.continuous.{WriteToContinuousDataSource,
 WriteToContinuousDataSourceExec}
 
 object DataSourceV2Strategy extends Strategy {
   override def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
-case r: DataSourceV2Relation =>
-  DataSourceV2ScanExec(r.output, r.source, r.options, r.pushedFilters, 
r.reader) :: Nil
+case PhysicalOperation(project, filters, relation: 
DataSourceV2Relation) =>
+  val projectSet = AttributeSet(project.flatMap(_.references))
+  val filterSet = AttributeSet(filters.flatMap(_.references))
+
+  val projection = if (filterSet.subsetOf(projectSet) &&
+  AttributeSet(relation.output) == projectSet) {
+// When the required projection contains all of the filter columns 
and column pruning alone
+// can produce the required projection, push the required 
projection.
+// A final projection may still be needed if the data source 
produces a different column
+// order or if it cannot prune all of the nested columns.
+relation.output
+  } else {
+// When there are filter columns not already in the required 
projection or when the required
+// projection is more complicated than column pruning, base column 
pruning on the set of
+// all columns needed by both.
+(projectSet ++ filterSet).toSeq
+  }
+
+  val reader = relation.newReader
--- End diff --

I didn't realize you were talking about other v2 sources. Yes, two readers
would be configured for v2. If you wanted to avoid that, the implementation
could cache results when pushdown is expensive, or we could add something
else that prevents that case.

We need to do *something* to fix the current behavior of doing pushdown in 
the optimizer. I'm perfectly happy with less accurate stats for v2 until stats 
use the physical plan, or a solution like this where pushdown happens twice. I 
just don't think it is a good idea to continue with the design where the 
logical plan needs to use the v2 reader APIs. I think we agree that that should 
happen once, and conversion to physical plan is where it makes the most sense.


---




[GitHub] spark pull request #21543: [SPARK-24531][TESTS] Replace 2.3.0 version with 2...

2018-06-12 Thread mgaido91
GitHub user mgaido91 opened a pull request:

https://github.com/apache/spark/pull/21543

[SPARK-24531][TESTS] Replace 2.3.0 version with 2.3.1

## What changes were proposed in this pull request?

The PR updates the 2.3 version tested to the new release 2.3.1.

## How was this patch tested?

existing UTs

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mgaido91/spark patch-1

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21543.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21543


commit 08461b45bb4dc11e74976e6335269907476d02ca
Author: Marco Gaido 
Date:   2018-06-12T19:28:03Z

[SPARK-24531][TESTS] Replace 2.3.0 version with 2.3.1




---




[GitHub] spark pull request #21067: [SPARK-23980][K8S] Resilient Spark driver on Kube...

2018-06-12 Thread mccheah
Github user mccheah commented on a diff in the pull request:

https://github.com/apache/spark/pull/21067#discussion_r194860793
  
--- Diff: 
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/KubernetesFeatureConfigStep.scala
 ---
@@ -18,54 +18,87 @@ package org.apache.spark.deploy.k8s.features
 
 import io.fabric8.kubernetes.api.model.HasMetadata
 
-import org.apache.spark.deploy.k8s.SparkPod
+import org.apache.spark.deploy.k8s.{SparkExecutorPod, SparkJob}
 
 /**
- * A collection of functions that together represent a "feature" in pods 
that are launched for
+ * A collection of functions that together represent a
+ * "feature" in jobs and pods that are launched for
  * Spark drivers and executors.
  */
 private[spark] trait KubernetesFeatureConfigStep {
 
   /**
-   * Apply modifications on the given pod in accordance to this feature. 
This can include attaching
+   * Apply modifications on the given job in accordance to this feature. 
This can include attaching
* volumes, adding environment variables, and adding labels/annotations.
* 
-   * Note that we should return a SparkPod that keeps all of the 
properties of the passed SparkPod
+   * Note that we should return a SparkJob that keeps all of the 
properties of the passed SparkJob
* object. So this is correct:
* 
-   * {@code val configuredPod = new PodBuilder(pod.pod)
+   * {@code val configuredJob = new JobBuilder(job.job)
* .editSpec()
* ...
* .build()
-   *   val configuredContainer = new ContainerBuilder(pod.container)
+   *   val configuredContainer = new ContainerBuilder(job.container)
* ...
* .build()
-   *   SparkPod(configuredPod, configuredContainer)
+   *   SparkJob(configuredJob, configuredContainer)
*  }
* 
* This is incorrect:
* 
-   * {@code val configuredPod = new PodBuilder() // Loses the original 
state
+   * {@code val configuredJob = new JobBuilder() // Loses the original 
state
* .editSpec()
* ...
* .build()
*   val configuredContainer = new ContainerBuilder() // Loses the 
original state
* ...
* .build()
-   *   SparkPod(configuredPod, configuredContainer)
+   *   SparkJob(configuredJob, configuredContainer)
*  }
* 
*/
-  def configurePod(pod: SparkPod): SparkPod
-
+  def configureJob(job: SparkJob): SparkJob = SparkJob.initialJob()
+ /**
+  * Apply modifications on the given executor pod in
+  * accordance to this feature. This can include attaching
+  * volumes, adding environment variables, and adding labels/annotations.
+  * 
+  * Note that we should return a SparkExecutorPod that keeps all
+  * of the properties of the passed SparkExecutor
+  * object. So this is correct:
+  * 
+  * {@code val configuredExecutorPod = new PodBuilder(pod.pod)
+  * .editSpec()
+  * ...
+  * .build()
+  *   val configuredContainer = new ContainerBuilder(pod.container)
+  * ...
+  * .build()
+  *   SparkExecutorPod(configuredPod, configuredContainer)
+  *  }
+  * 
+  * This is incorrect:
+  * 
+  * {@code val configuredExecutorPod = new PodBuilder() // Loses the 
original state
+  * .editSpec()
+  * ...
+  * .build()
+  *   val configuredContainer = new ContainerBuilder() // Loses the 
original state
+  * ...
+  * .build()
+  *   SparkExecutorPod(configuredPod, configuredContainer)
+  *  }
+  * 
+  */
+  def configureExecutorPod(pod: SparkExecutorPod): SparkExecutorPod = 
SparkExecutorPod.initialPod()
--- End diff --

Ok looking a bit more closely at the API - I think what we want is for all 
feature steps to be configuring a `SparkPod`, where:

```
case class SparkPod(podTemplate: PodTemplateSpec, container: Container)
```

Then the `KubernetesDriverBuilder` adapts the pod template into a Job, but
the `KubernetesExecutorBuilder` adapts the pod template into just a `Pod`.
For the Job, the adapting part is trivial
(`JobBuilder.withNewJobSpec().withSpec(...)`). We can translate a pod
template spec into a pod as well (see the sketch below):
* inject `PodTemplateSpec.getMetadata()` into `PodBuilder.withMetadata()`,
* inject `PodTemplateSpec.getSpec()` into `PodBuilder.withSpec()`, and
* set the pod's own name via `PodBuilder.editMetadata().withName()`.
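
A rough sketch of both adaptations, assuming fabric8's builder API (exact
import paths vary across fabric8 versions; `templateToJob`/`templateToPod`
are illustrative names, not code from this PR):

```scala
import io.fabric8.kubernetes.api.model.{Pod, PodBuilder, PodTemplateSpec}
import io.fabric8.kubernetes.api.model.batch.{Job, JobBuilder}

// Driver side: the Job embeds the configured pod template as-is.
def templateToJob(template: PodTemplateSpec): Job =
  new JobBuilder()
    .withNewSpec()
      .withTemplate(template)
    .endSpec()
    .build()

// Executor side: copy the template's metadata and spec into a concrete Pod,
// then give the pod its own name (a template carries no pod name).
def templateToPod(template: PodTemplateSpec, podName: String): Pod =
  new PodBuilder()
    .withMetadata(template.getMetadata)
    .withSpec(template.getSpec)
    .editMetadata()
      .withName(podName)
    .endMetadata()
    .build()
```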


---



