date:20171226

[GitHub] spark issue #20026: [SPARK-22838][Core] Avoid unnecessary copying of data

2017-12-26 Thread ConeyLiu

Github user ConeyLiu commented on the issue:

https://github.com/apache/spark/pull/20026
  
cc @jiangxb1987 any comments on this?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20036: [SPARK-18016][SQL][FOLLOW-UP] Code Generation: Constant ...

2017-12-26 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20036
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20036: [SPARK-18016][SQL][FOLLOW-UP] Code Generation: Constant ...

2017-12-26 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20036
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85423/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20036: [SPARK-18016][SQL][FOLLOW-UP] Code Generation: Constant ...

2017-12-26 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20036
  
**[Test build #85423 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85423/testReport)**
 for PR 20036 at commit 
[`05da9d7`](https://github.com/apache/spark/commit/05da9d7dfa2aca359630e70eee96db5abf96c9e4).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #20082: [SPARK-22897][CORE]: Expose stageAttemptId in Tas...

2017-12-26 Thread advancedxy

Github user advancedxy commented on a diff in the pull request:

https://github.com/apache/spark/pull/20082#discussion_r158774364
  
--- Diff: 
core/src/test/scala/org/apache/spark/scheduler/TaskContextSuite.scala ---
@@ -158,6 +159,28 @@ class TaskContextSuite extends SparkFunSuite with 
BeforeAndAfter with LocalSpark
 assert(attemptIdsWithFailedTask.toSet === Set(0, 1))
   }
 
+  test("TaskContext.stageAttemptId getter") {
+sc = new SparkContext("local[1,2]", "test")
+
+// Check stage attemptIds are 0 for initial stage
+val stageAttemptIds = sc.parallelize(Seq(1, 2), 2).mapPartitions { _ =>
+  Seq(TaskContext.get().stageAttemptId()).iterator
+}.collect()
+assert(stageAttemptIds.toSet === Set(0))
+
+// Check stage attemptIds that are resubmitted when task fails
+val stageAttemptIdsWithFailedStage =
+  sc.parallelize(Seq(1, 2, 3, 4), 4).repartition(1).mapPartitions { _ 
=>
+  val stageAttemptId = TaskContext.get().stageAttemptId()
+  if (stageAttemptId < 2) {
+throw new FetchFailedException(null, 0, 0, 0, "Fake")
--- End diff --

Will do.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #20082: [SPARK-22897][CORE]: Expose stageAttemptId in Tas...

2017-12-26 Thread jiangxb1987

Github user jiangxb1987 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20082#discussion_r158774143
  
--- Diff: 
core/src/test/scala/org/apache/spark/scheduler/TaskContextSuite.scala ---
@@ -158,6 +159,28 @@ class TaskContextSuite extends SparkFunSuite with 
BeforeAndAfter with LocalSpark
 assert(attemptIdsWithFailedTask.toSet === Set(0, 1))
   }
 
+  test("TaskContext.stageAttemptId getter") {
+sc = new SparkContext("local[1,2]", "test")
+
+// Check stage attemptIds are 0 for initial stage
+val stageAttemptIds = sc.parallelize(Seq(1, 2), 2).mapPartitions { _ =>
+  Seq(TaskContext.get().stageAttemptId()).iterator
+}.collect()
+assert(stageAttemptIds.toSet === Set(0))
+
+// Check stage attemptIds that are resubmitted when task fails
+val stageAttemptIdsWithFailedStage =
+  sc.parallelize(Seq(1, 2, 3, 4), 4).repartition(1).mapPartitions { _ 
=>
+  val stageAttemptId = TaskContext.get().stageAttemptId()
+  if (stageAttemptId < 2) {
+throw new FetchFailedException(null, 0, 0, 0, "Fake")
--- End diff --

Please add comment to explain that `FetchFailedException` will trigger a 
new stage attempt, while a common `Exception` will only trigger a task retry.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #20089: [SPARK-22324][SQL][PYTHON][FOLLOW-UP] Update setu...

2017-12-26 Thread HyukjinKwon

Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20089#discussion_r158774077
  
--- Diff: python/README.md ---
@@ -29,4 +29,4 @@ The Python packaging for Spark is not intended to replace 
all of the other use c
 
 ## Python Requirements
 
-At its core PySpark depends on Py4J (currently version 0.10.6), but 
additional sub-packages have their own requirements (including numpy and 
pandas).
+At its core PySpark depends on Py4J (currently version 0.10.6), but 
additional sub-packages have their own requirements (including numpy, pandas, 
and pyarrow).
--- End diff --

Yea, Pandas and PyArrow are optional. Maybe, it's nicer if we have some 
more details here too.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #20082: [SPARK-22897][CORE]: Expose stageAttemptId in Tas...

2017-12-26 Thread jiangxb1987

Github user jiangxb1987 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20082#discussion_r158773971
  
--- Diff: 
core/src/test/scala/org/apache/spark/scheduler/TaskContextSuite.scala ---
@@ -158,6 +159,28 @@ class TaskContextSuite extends SparkFunSuite with 
BeforeAndAfter with LocalSpark
 assert(attemptIdsWithFailedTask.toSet === Set(0, 1))
   }
 
+  test("TaskContext.stageAttemptId getter") {
+sc = new SparkContext("local[1,2]", "test")
+
+// Check stage attemptIds are 0 for initial stage
+val stageAttemptIds = sc.parallelize(Seq(1, 2), 2).mapPartitions { _ =>
+  Seq(TaskContext.get().stageAttemptId()).iterator
+}.collect()
+assert(stageAttemptIds.toSet === Set(0))
+
+// Check stage attemptIds that are resubmitted when task fails
+val stageAttemptIdsWithFailedStage =
+  sc.parallelize(Seq(1, 2, 3, 4), 4).repartition(1).mapPartitions { _ 
=>
+  val stageAttemptId = TaskContext.get().stageAttemptId()
+  if (stageAttemptId < 2) {
+throw new FetchFailedException(null, 0, 0, 0, "Fake")
--- End diff --

oh, right~


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #20089: [SPARK-22324][SQL][PYTHON][FOLLOW-UP] Update setu...

2017-12-26 Thread HyukjinKwon

Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20089#discussion_r158773775
  
--- Diff: python/setup.py ---
@@ -201,7 +201,7 @@ def _supports_symlinks():
 extras_require={
 'ml': ['numpy>=1.7'],
 'mllib': ['numpy>=1.7'],
-'sql': ['pandas>=0.19.2']
+'sql': ['pandas>=0.19.2', 'pyarrow>=0.8.0']
--- End diff --

Nope, `extras_require` does not do anything in normal cases but they can be 
installed together with a dev option via pip IIRC.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #20089: [SPARK-22324][SQL][PYTHON][FOLLOW-UP] Update setu...

2017-12-26 Thread viirya

Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/20089#discussion_r158773551
  
--- Diff: python/setup.py ---
@@ -201,7 +201,7 @@ def _supports_symlinks():
 extras_require={
 'ml': ['numpy>=1.7'],
 'mllib': ['numpy>=1.7'],
-'sql': ['pandas>=0.19.2']
+'sql': ['pandas>=0.19.2', 'pyarrow>=0.8.0']
--- End diff --

If no pyarrow is installed, will setup force users to install it?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #20089: [SPARK-22324][SQL][PYTHON][FOLLOW-UP] Update setu...

2017-12-26 Thread viirya

Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/20089#discussion_r158773507
  
--- Diff: python/README.md ---
@@ -29,4 +29,4 @@ The Python packaging for Spark is not intended to replace 
all of the other use c
 
 ## Python Requirements
 
-At its core PySpark depends on Py4J (currently version 0.10.6), but 
additional sub-packages have their own requirements (including numpy and 
pandas).
+At its core PySpark depends on Py4J (currently version 0.10.6), but 
additional sub-packages have their own requirements (including numpy, pandas, 
and pyarrow).
--- End diff --

This sounds like mandatory, but I think pyarrow is still an optional 
choice. Right?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #20082: [SPARK-22897][CORE]: Expose stageAttemptId in Tas...

2017-12-26 Thread advancedxy

Github user advancedxy commented on a diff in the pull request:

https://github.com/apache/spark/pull/20082#discussion_r158773445
  
--- Diff: 
core/src/test/scala/org/apache/spark/scheduler/TaskContextSuite.scala ---
@@ -158,6 +159,28 @@ class TaskContextSuite extends SparkFunSuite with 
BeforeAndAfter with LocalSpark
 assert(attemptIdsWithFailedTask.toSet === Set(0, 1))
   }
 
+  test("TaskContext.stageAttemptId getter") {
+sc = new SparkContext("local[1,2]", "test")
+
+// Check stage attemptIds are 0 for initial stage
+val stageAttemptIds = sc.parallelize(Seq(1, 2), 2).mapPartitions { _ =>
+  Seq(TaskContext.get().stageAttemptId()).iterator
+}.collect()
+assert(stageAttemptIds.toSet === Set(0))
+
+// Check stage attemptIds that are resubmitted when task fails
+val stageAttemptIdsWithFailedStage =
+  sc.parallelize(Seq(1, 2, 3, 4), 4).repartition(1).mapPartitions { _ 
=>
+  val stageAttemptId = TaskContext.get().stageAttemptId()
+  if (stageAttemptId < 2) {
+throw new FetchFailedException(null, 0, 0, 0, "Fake")
--- End diff --

Related to repartition part.

I use FetchFailedException to explicitly trigger a stage resubmission.  
Otherwise, the task would be resubmitted in the same stage if IIRC.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20057: [SPARK-22880][SQL] Add cascadeTruncate option to JDBC da...

2017-12-26 Thread danielvdende

Github user danielvdende commented on the issue:

https://github.com/apache/spark/pull/20057
  
@gatorsmile would be great to hear why you doubt the value of the feature 
:). I know that for us it would be extremely valuable (at the moment we have to 
do an extra step in our data pipeline because this feature is missing in 
Spark), but of course we're not the only ones using Spark.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #20082: [SPARK-22897][CORE]: Expose stageAttemptId in Tas...

2017-12-26 Thread jiangxb1987

Github user jiangxb1987 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20082#discussion_r158772547
  
--- Diff: 
core/src/test/scala/org/apache/spark/scheduler/TaskContextSuite.scala ---
@@ -158,6 +159,28 @@ class TaskContextSuite extends SparkFunSuite with 
BeforeAndAfter with LocalSpark
 assert(attemptIdsWithFailedTask.toSet === Set(0, 1))
   }
 
+  test("TaskContext.stageAttemptId getter") {
+sc = new SparkContext("local[1,2]", "test")
+
+// Check stage attemptIds are 0 for initial stage
+val stageAttemptIds = sc.parallelize(Seq(1, 2), 2).mapPartitions { _ =>
+  Seq(TaskContext.get().stageAttemptId()).iterator
+}.collect()
+assert(stageAttemptIds.toSet === Set(0))
+
+// Check stage attemptIds that are resubmitted when task fails
+val stageAttemptIdsWithFailedStage =
+  sc.parallelize(Seq(1, 2, 3, 4), 4).repartition(1).mapPartitions { _ 
=>
+  val stageAttemptId = TaskContext.get().stageAttemptId()
+  if (stageAttemptId < 2) {
+throw new FetchFailedException(null, 0, 0, 0, "Fake")
--- End diff --

Emmm... just throw an `Exception` is enough here?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #20082: [SPARK-22897][CORE]: Expose stageAttemptId in Tas...

2017-12-26 Thread jiangxb1987

Github user jiangxb1987 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20082#discussion_r158772359
  
--- Diff: 
core/src/test/scala/org/apache/spark/scheduler/TaskContextSuite.scala ---
@@ -158,6 +159,28 @@ class TaskContextSuite extends SparkFunSuite with 
BeforeAndAfter with LocalSpark
 assert(attemptIdsWithFailedTask.toSet === Set(0, 1))
   }
 
+  test("TaskContext.stageAttemptId getter") {
+sc = new SparkContext("local[1,2]", "test")
+
+// Check stage attemptIds are 0 for initial stage
+val stageAttemptIds = sc.parallelize(Seq(1, 2), 2).mapPartitions { _ =>
+  Seq(TaskContext.get().stageAttemptId()).iterator
+}.collect()
+assert(stageAttemptIds.toSet === Set(0))
+
+// Check stage attemptIds that are resubmitted when task fails
+val stageAttemptIdsWithFailedStage =
+  sc.parallelize(Seq(1, 2, 3, 4), 4).repartition(1).mapPartitions { _ 
=>
--- End diff --

You don't need `repartition` here, just `sc.parallelize(Seq(1, 2, 3, 4), 
1).mapPartitions {...}`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19943: [SPARK-16060][SQL] Support Vectorized ORC Reader

2017-12-26 Thread viirya

Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19943#discussion_r158769195
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcColumnarBatchReader.scala
 ---
@@ -0,0 +1,432 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.orc
+
+import org.apache.hadoop.mapreduce.{InputSplit, RecordReader, 
TaskAttemptContext}
+import org.apache.hadoop.mapreduce.lib.input.FileSplit
+import org.apache.orc._
+import org.apache.orc.mapred.OrcInputFormat
+import org.apache.orc.storage.ql.exec.vector._
+import org.apache.orc.storage.serde2.io.HiveDecimalWritable
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.memory.MemoryMode
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.execution.vectorized._
+import org.apache.spark.sql.types._
+
+
+/**
+ * To support vectorization in WholeStageCodeGen, this reader returns 
ColumnarBatch.
+ */
+private[orc] class OrcColumnarBatchReader extends RecordReader[Void, 
ColumnarBatch] with Logging {
+  import OrcColumnarBatchReader._
+
+  /**
+   * ORC File Reader.
+   */
+  private var reader: Reader = _
+
+  /**
+   * Vectorized Row Batch.
+   */
+  private var batch: VectorizedRowBatch = _
+
+  /**
+   * Requested Column IDs.
+   */
+  private var requestedColIds: Array[Int] = _
+
+  /**
+   * Record reader from row batch.
+   */
+  private var rows: org.apache.orc.RecordReader = _
+
+  /**
+   * Required Schema.
+   */
+  private var requiredSchema: StructType = _
+
+  /**
+   * ColumnarBatch for vectorized execution by whole-stage codegen.
+   */
+  private var columnarBatch: ColumnarBatch = _
+
+  /**
+   * Writable columnVectors of ColumnarBatch.
+   */
+  private var columnVectors: Seq[WritableColumnVector] = _
+
+  /**
+   * The number of rows read and considered to be returned.
+   */
+  private var rowsReturned: Long = 0L
+
+  /**
+   * Total number of rows.
+   */
+  private var totalRowCount: Long = 0L
+
+  override def getCurrentKey: Void = null
+
+  override def getCurrentValue: ColumnarBatch = columnarBatch
+
+  override def getProgress: Float = rowsReturned.toFloat / totalRowCount
+
+  override def nextKeyValue(): Boolean = nextBatch()
+
+  override def close(): Unit = {
+if (columnarBatch != null) {
+  columnarBatch.close()
+  columnarBatch = null
+}
+if (rows != null) {
+  rows.close()
+  rows = null
+}
+  }
+
+  /**
+   * Initialize ORC file reader and batch record reader.
+   * Please note that `setRequiredSchema` is needed to be called after 
this.
+   */
+  override def initialize(inputSplit: InputSplit, taskAttemptContext: 
TaskAttemptContext): Unit = {
+val fileSplit = inputSplit.asInstanceOf[FileSplit]
+val conf = taskAttemptContext.getConfiguration
+reader = OrcFile.createReader(
+  fileSplit.getPath,
+  OrcFile.readerOptions(conf)
+.maxLength(OrcConf.MAX_FILE_LENGTH.getLong(conf))
+.filesystem(fileSplit.getPath.getFileSystem(conf)))
+
+val options = OrcInputFormat.buildOptions(conf, reader, 
fileSplit.getStart, fileSplit.getLength)
+rows = reader.rows(options)
+  }
+
+  /**
+   * Set required schema and partition information.
+   * With this information, this creates ColumnarBatch with the full 
schema.
+   */
+  def setRequiredSchema(
+  orcSchema: TypeDescription,
+  requestedColIds: Array[Int],
+  resultSchema: StructType,
+  requiredSchema: StructType,
+  partitionValues: InternalRow): Unit = {
+batch = orcSchema.createRowBatch(DEFAULT_SIZE)
+assert(!batch.selectedInUse)
+

[GitHub] spark pull request #19943: [SPARK-16060][SQL] Support Vectorized ORC Reader

2017-12-26 Thread viirya

Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19943#discussion_r158768650
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcColumnarBatchReader.scala
 ---
@@ -0,0 +1,432 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.orc
+
+import org.apache.hadoop.mapreduce.{InputSplit, RecordReader, 
TaskAttemptContext}
+import org.apache.hadoop.mapreduce.lib.input.FileSplit
+import org.apache.orc._
+import org.apache.orc.mapred.OrcInputFormat
+import org.apache.orc.storage.ql.exec.vector._
+import org.apache.orc.storage.serde2.io.HiveDecimalWritable
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.memory.MemoryMode
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.execution.vectorized._
+import org.apache.spark.sql.types._
+
+
+/**
+ * To support vectorization in WholeStageCodeGen, this reader returns 
ColumnarBatch.
+ */
+private[orc] class OrcColumnarBatchReader extends RecordReader[Void, 
ColumnarBatch] with Logging {
+  import OrcColumnarBatchReader._
+
+  /**
+   * ORC File Reader.
+   */
+  private var reader: Reader = _
+
+  /**
+   * Vectorized Row Batch.
+   */
+  private var batch: VectorizedRowBatch = _
+
+  /**
+   * Requested Column IDs.
+   */
+  private var requestedColIds: Array[Int] = _
+
+  /**
+   * Record reader from row batch.
+   */
+  private var rows: org.apache.orc.RecordReader = _
+
+  /**
+   * Required Schema.
+   */
+  private var requiredSchema: StructType = _
+
+  /**
+   * ColumnarBatch for vectorized execution by whole-stage codegen.
+   */
+  private var columnarBatch: ColumnarBatch = _
+
+  /**
+   * Writable columnVectors of ColumnarBatch.
--- End diff --

`columnVectors` -> `ColumnVector`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19943: [SPARK-16060][SQL] Support Vectorized ORC Reader

2017-12-26 Thread viirya

Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19943#discussion_r158769663
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcColumnarBatchReader.scala
 ---
@@ -0,0 +1,432 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.orc
+
+import org.apache.hadoop.mapreduce.{InputSplit, RecordReader, 
TaskAttemptContext}
+import org.apache.hadoop.mapreduce.lib.input.FileSplit
+import org.apache.orc._
+import org.apache.orc.mapred.OrcInputFormat
+import org.apache.orc.storage.ql.exec.vector._
+import org.apache.orc.storage.serde2.io.HiveDecimalWritable
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.memory.MemoryMode
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.execution.vectorized._
+import org.apache.spark.sql.types._
+
+
+/**
+ * To support vectorization in WholeStageCodeGen, this reader returns 
ColumnarBatch.
+ */
+private[orc] class OrcColumnarBatchReader extends RecordReader[Void, 
ColumnarBatch] with Logging {
+  import OrcColumnarBatchReader._
+
+  /**
+   * ORC File Reader.
+   */
+  private var reader: Reader = _
+
+  /**
+   * Vectorized Row Batch.
+   */
+  private var batch: VectorizedRowBatch = _
+
+  /**
+   * Requested Column IDs.
+   */
+  private var requestedColIds: Array[Int] = _
+
+  /**
+   * Record reader from row batch.
+   */
+  private var rows: org.apache.orc.RecordReader = _
+
+  /**
+   * Required Schema.
+   */
+  private var requiredSchema: StructType = _
+
+  /**
+   * ColumnarBatch for vectorized execution by whole-stage codegen.
+   */
+  private var columnarBatch: ColumnarBatch = _
+
+  /**
+   * Writable columnVectors of ColumnarBatch.
+   */
+  private var columnVectors: Seq[WritableColumnVector] = _
+
+  /**
+   * The number of rows read and considered to be returned.
+   */
+  private var rowsReturned: Long = 0L
+
+  /**
+   * Total number of rows.
+   */
+  private var totalRowCount: Long = 0L
+
+  override def getCurrentKey: Void = null
+
+  override def getCurrentValue: ColumnarBatch = columnarBatch
+
+  override def getProgress: Float = rowsReturned.toFloat / totalRowCount
+
+  override def nextKeyValue(): Boolean = nextBatch()
+
+  override def close(): Unit = {
+if (columnarBatch != null) {
+  columnarBatch.close()
+  columnarBatch = null
+}
+if (rows != null) {
+  rows.close()
+  rows = null
+}
+  }
+
+  /**
+   * Initialize ORC file reader and batch record reader.
+   * Please note that `setRequiredSchema` is needed to be called after 
this.
+   */
+  override def initialize(inputSplit: InputSplit, taskAttemptContext: 
TaskAttemptContext): Unit = {
+val fileSplit = inputSplit.asInstanceOf[FileSplit]
+val conf = taskAttemptContext.getConfiguration
+reader = OrcFile.createReader(
+  fileSplit.getPath,
+  OrcFile.readerOptions(conf)
+.maxLength(OrcConf.MAX_FILE_LENGTH.getLong(conf))
+.filesystem(fileSplit.getPath.getFileSystem(conf)))
+
+val options = OrcInputFormat.buildOptions(conf, reader, 
fileSplit.getStart, fileSplit.getLength)
+rows = reader.rows(options)
+  }
+
+  /**
+   * Set required schema and partition information.
+   * With this information, this creates ColumnarBatch with the full 
schema.
+   */
+  def setRequiredSchema(
+  orcSchema: TypeDescription,
+  requestedColIds: Array[Int],
+  resultSchema: StructType,
+  requiredSchema: StructType,
+  partitionValues: InternalRow): Unit = {
+batch = orcSchema.createRowBatch(DEFAULT_SIZE)
+assert(!batch.selectedInUse)
+

[GitHub] spark pull request #19943: [SPARK-16060][SQL] Support Vectorized ORC Reader

2017-12-26 Thread viirya

Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19943#discussion_r158768585
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcColumnarBatchReader.scala
 ---
@@ -0,0 +1,432 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.orc
+
+import org.apache.hadoop.mapreduce.{InputSplit, RecordReader, 
TaskAttemptContext}
+import org.apache.hadoop.mapreduce.lib.input.FileSplit
+import org.apache.orc._
+import org.apache.orc.mapred.OrcInputFormat
+import org.apache.orc.storage.ql.exec.vector._
+import org.apache.orc.storage.serde2.io.HiveDecimalWritable
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.memory.MemoryMode
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.execution.vectorized._
+import org.apache.spark.sql.types._
+
+
+/**
+ * To support vectorization in WholeStageCodeGen, this reader returns 
ColumnarBatch.
+ */
+private[orc] class OrcColumnarBatchReader extends RecordReader[Void, 
ColumnarBatch] with Logging {
+  import OrcColumnarBatchReader._
+
+  /**
+   * ORC File Reader.
+   */
+  private var reader: Reader = _
+
+  /**
+   * Vectorized Row Batch.
+   */
+  private var batch: VectorizedRowBatch = _
+
+  /**
+   * Requested Column IDs.
+   */
+  private var requestedColIds: Array[Int] = _
+
+  /**
+   * Record reader from row batch.
+   */
+  private var rows: org.apache.orc.RecordReader = _
--- End diff --

`recordReader` or `rowReader`?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20059: [SPARK-22648][K8s] Add documentation covering init conta...

2017-12-26 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20059
  
**[Test build #85428 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85428/testReport)**
 for PR 20059 at commit 
[`f4b5c03`](https://github.com/apache/spark/commit/f4b5c03a7d1947cceafcf2ac16ddb0318778b387).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #20059: [SPARK-22648][K8s] Add documentation covering ini...

2017-12-26 Thread liyinan926

Github user liyinan926 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20059#discussion_r158768984
  
--- Diff: docs/running-on-kubernetes.md ---
@@ -528,51 +576,91 @@ specific to Spark on Kubernetes.
   
 
 
-   spark.kubernetes.driver.limit.cores
-   (none)
-   
- Specify the hard CPU 
[limit](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container)
 for the driver pod.
-   
- 
- 
-   spark.kubernetes.executor.limit.cores
-   (none)
-   
- Specify the hard CPU 
[limit](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container)
 for each executor pod launched for the Spark Application.
-   
- 
- 
-   spark.kubernetes.node.selector.[labelKey]
-   (none)
-   
- Adds to the node selector of the driver pod and executor pods, with 
key labelKey and the value as the
- configuration's value. For example, setting 
spark.kubernetes.node.selector.identifier to 
myIdentifier
- will result in the driver pod and executors having a node selector 
with key identifier and value
-  myIdentifier. Multiple node selector keys can be added 
by setting multiple configurations with this prefix.
-
-  
- 
-   
spark.kubernetes.driverEnv.[EnvironmentVariableName]
-   (none)
-   
- Add the environment variable specified by 
EnvironmentVariableName to
- the Driver process. The user can specify multiple of these to set 
multiple environment variables.
-   
- 
-  
-
spark.kubernetes.mountDependencies.jarsDownloadDir
-/var/spark-data/spark-jars
-
-  Location to download jars to in the driver and executors.
-  This directory must be empty and will be mounted as an empty 
directory volume on the driver and executor pods.
-
-  
-   
- 
spark.kubernetes.mountDependencies.filesDownloadDir
- /var/spark-data/spark-files
- 
-   Location to download jars to in the driver and executors.
-   This directory must be empty and will be mounted as an empty 
directory volume on the driver and executor pods.
- 
-   
+  spark.kubernetes.driver.limit.cores
+  (none)
+  
+Specify the hard CPU 
[limit](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container)
 for the driver pod.
+  
+
+
+  spark.kubernetes.executor.limit.cores
+  (none)
+  
+Specify the hard CPU 
[limit](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container)
 for each executor pod launched for the Spark Application.
+  
+
+
+  spark.kubernetes.node.selector.[labelKey]
+  (none)
+  
+Adds to the node selector of the driver pod and executor pods, with 
key labelKey and the value as the
+configuration's value. For example, setting 
spark.kubernetes.node.selector.identifier to 
myIdentifier
+will result in the driver pod and executors having a node selector 
with key identifier and value
+ myIdentifier. Multiple node selector keys can be added 
by setting multiple configurations with this prefix.
+  
+
+
+  
spark.kubernetes.driverEnv.[EnvironmentVariableName]
+  (none)
+  
+Add the environment variable specified by 
EnvironmentVariableName to
+the Driver process. The user can specify multiple of these to set 
multiple environment variables.
+  
+
+
+  spark.kubernetes.mountDependencies.jarsDownloadDir
+  /var/spark-data/spark-jars
+  
+Location to download jars to in the driver and executors.
+This directory must be empty and will be mounted as an empty directory 
volume on the driver and executor pods.
+  
+
+
+  spark.kubernetes.mountDependencies.filesDownloadDir
+  /var/spark-data/spark-files
+  
+Location to download jars to in the driver and executors.
+This directory must be empty and will be mounted as an empty directory 
volume on the driver and executor pods.
+  
+
+
+  spark.kubernetes.mountDependencies.timeout
+  300 seconds
--- End diff --

Done.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #20059: [SPARK-22648][K8s] Add documentation covering ini...

2017-12-26 Thread liyinan926

Github user liyinan926 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20059#discussion_r158768975
  
--- Diff: docs/running-on-kubernetes.md ---
@@ -120,6 +120,54 @@ by their appropriate remote URIs. Also, application 
dependencies can be pre-moun
 Those dependencies can be added to the classpath by referencing them with 
`local://` URIs and/or setting the
 `SPARK_EXTRA_CLASSPATH` environment variable in your Dockerfiles.
 
+### Using Remote Dependencies
+When there are application dependencies hosted in remote locations like 
HDFS or HTTP servers, the driver and executor pods
+need a Kubernetes 
[init-container](https://kubernetes.io/docs/concepts/workloads/pods/init-containers/)
 for downloading
+the dependencies so the driver and executor containers can use them 
locally. This requires users to specify the container
+image for the init-container using the configuration property 
`spark.kubernetes.initContainer.image`. For example, users
+simply add the following option to the `spark-submit` command to specify 
the init-container image:
+
+```
+--conf spark.kubernetes.initContainer.image=
+```
+
+The init-container handles remote dependencies specified in `spark.jars` 
(or the `--jars` option of `spark-submit`) and
+`spark.files` (or the `--files` option of `spark-submit`). It also handles 
remotely hosted main application resources, e.g.,
+the main application jar. The following shows an example of using remote 
dependencies with the `spark-submit` command:
+
+```bash
+$ bin/spark-submit \
+--master k8s://https://: \
+--deploy-mode cluster \
+--name spark-pi \
+--class org.apache.spark.examples.SparkPi \
+--jars https://path/to/dependency1.jar,https://path/to/dependency2.jar
+--files hdfs://host:port/path/to/file1,hdfs://host:port/path/to/file2
+--conf spark.executor.instances=5 \
+--conf spark.kubernetes.driver.docker.image= \
+--conf spark.kubernetes.executor.docker.image= \
--- End diff --

Done.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19683: [SPARK-21657][SQL] optimize explode quadratic memory con...

2017-12-26 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19683
  
**[Test build #85427 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85427/testReport)**
 for PR 19683 at commit 
[`b6b8694`](https://github.com/apache/spark/commit/b6b8694a16554a272c96f1cdf4fc2695097d).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19943: [SPARK-16060][SQL] Support Vectorized ORC Reader

2017-12-26 Thread viirya

Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19943#discussion_r158768353
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala 
---
@@ -170,6 +171,8 @@ case class FileSourceScanExec(
 
   val needsUnsafeRowConversion: Boolean = if 
(relation.fileFormat.isInstanceOf[ParquetSource]) {
 
SparkSession.getActiveSession.get.sessionState.conf.parquetVectorizedReaderEnabled
+  } else if (relation.fileFormat.isInstanceOf[OrcFileFormat]) {
+
SparkSession.getActiveSession.get.sessionState.conf.orcVectorizedReaderEnabled
--- End diff --

Different than Parquet, for now we enable vectorized ORC reader when batch 
output is supported. We don't need unsafe row conversion at all for ORC. 
Because once it supports batch, we go batch-based approach. If it doesn't 
support batch, we don't enable vectorized ORC reader at all, so we don't need 
unsafe row conversion too.

Once we can enable vectorized ORC even batch is not supported, we need to 
add this.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #20059: [SPARK-22648][K8s] Add documentation covering ini...

2017-12-26 Thread ueshin

Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/20059#discussion_r158766743
  
--- Diff: docs/running-on-kubernetes.md ---
@@ -120,6 +120,54 @@ by their appropriate remote URIs. Also, application 
dependencies can be pre-moun
 Those dependencies can be added to the classpath by referencing them with 
`local://` URIs and/or setting the
 `SPARK_EXTRA_CLASSPATH` environment variable in your Dockerfiles.
 
+### Using Remote Dependencies
+When there are application dependencies hosted in remote locations like 
HDFS or HTTP servers, the driver and executor pods
+need a Kubernetes 
[init-container](https://kubernetes.io/docs/concepts/workloads/pods/init-containers/)
 for downloading
+the dependencies so the driver and executor containers can use them 
locally. This requires users to specify the container
+image for the init-container using the configuration property 
`spark.kubernetes.initContainer.image`. For example, users
+simply add the following option to the `spark-submit` command to specify 
the init-container image:
+
+```
+--conf spark.kubernetes.initContainer.image=
+```
+
+The init-container handles remote dependencies specified in `spark.jars` 
(or the `--jars` option of `spark-submit`) and
+`spark.files` (or the `--files` option of `spark-submit`). It also handles 
remotely hosted main application resources, e.g.,
+the main application jar. The following shows an example of using remote 
dependencies with the `spark-submit` command:
+
+```bash
+$ bin/spark-submit \
+--master k8s://https://: \
+--deploy-mode cluster \
+--name spark-pi \
+--class org.apache.spark.examples.SparkPi \
+--jars https://path/to/dependency1.jar,https://path/to/dependency2.jar
+--files hdfs://host:port/path/to/file1,hdfs://host:port/path/to/file2
+--conf spark.executor.instances=5 \
+--conf spark.kubernetes.driver.docker.image= \
+--conf spark.kubernetes.executor.docker.image= \
--- End diff --

`container.image` instead of `docker.image`. We need to modify line 79-80 
as well.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #20059: [SPARK-22648][K8s] Add documentation covering ini...

2017-12-26 Thread ueshin

Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/20059#discussion_r158767810
  
--- Diff: docs/running-on-kubernetes.md ---
@@ -528,51 +576,91 @@ specific to Spark on Kubernetes.
   
 
 
-   spark.kubernetes.driver.limit.cores
-   (none)
-   
- Specify the hard CPU 
[limit](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container)
 for the driver pod.
-   
- 
- 
-   spark.kubernetes.executor.limit.cores
-   (none)
-   
- Specify the hard CPU 
[limit](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container)
 for each executor pod launched for the Spark Application.
-   
- 
- 
-   spark.kubernetes.node.selector.[labelKey]
-   (none)
-   
- Adds to the node selector of the driver pod and executor pods, with 
key labelKey and the value as the
- configuration's value. For example, setting 
spark.kubernetes.node.selector.identifier to 
myIdentifier
- will result in the driver pod and executors having a node selector 
with key identifier and value
-  myIdentifier. Multiple node selector keys can be added 
by setting multiple configurations with this prefix.
-
-  
- 
-   
spark.kubernetes.driverEnv.[EnvironmentVariableName]
-   (none)
-   
- Add the environment variable specified by 
EnvironmentVariableName to
- the Driver process. The user can specify multiple of these to set 
multiple environment variables.
-   
- 
-  
-
spark.kubernetes.mountDependencies.jarsDownloadDir
-/var/spark-data/spark-jars
-
-  Location to download jars to in the driver and executors.
-  This directory must be empty and will be mounted as an empty 
directory volume on the driver and executor pods.
-
-  
-   
- 
spark.kubernetes.mountDependencies.filesDownloadDir
- /var/spark-data/spark-files
- 
-   Location to download jars to in the driver and executors.
-   This directory must be empty and will be mounted as an empty 
directory volume on the driver and executor pods.
- 
-   
+  spark.kubernetes.driver.limit.cores
+  (none)
+  
+Specify the hard CPU 
[limit](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container)
 for the driver pod.
+  
+
+
+  spark.kubernetes.executor.limit.cores
+  (none)
+  
+Specify the hard CPU 
[limit](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container)
 for each executor pod launched for the Spark Application.
+  
+
+
+  spark.kubernetes.node.selector.[labelKey]
+  (none)
+  
+Adds to the node selector of the driver pod and executor pods, with 
key labelKey and the value as the
+configuration's value. For example, setting 
spark.kubernetes.node.selector.identifier to 
myIdentifier
+will result in the driver pod and executors having a node selector 
with key identifier and value
+ myIdentifier. Multiple node selector keys can be added 
by setting multiple configurations with this prefix.
+  
+
+
+  
spark.kubernetes.driverEnv.[EnvironmentVariableName]
+  (none)
+  
+Add the environment variable specified by 
EnvironmentVariableName to
+the Driver process. The user can specify multiple of these to set 
multiple environment variables.
+  
+
+
+  spark.kubernetes.mountDependencies.jarsDownloadDir
+  /var/spark-data/spark-jars
+  
+Location to download jars to in the driver and executors.
+This directory must be empty and will be mounted as an empty directory 
volume on the driver and executor pods.
+  
+
+
+  spark.kubernetes.mountDependencies.filesDownloadDir
+  /var/spark-data/spark-files
+  
+Location to download jars to in the driver and executors.
+This directory must be empty and will be mounted as an empty directory 
volume on the driver and executor pods.
+  
+
+
+  spark.kubernetes.mountDependencies.timeout
+  300 seconds
--- End diff --

`300s` instead of `300 seconds`, which should be the form we can specify to 
the config string.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20089: [SPARK-22324][SQL][PYTHON][FOLLOW-UP] Update setup.py fi...

2017-12-26 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20089
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20089: [SPARK-22324][SQL][PYTHON][FOLLOW-UP] Update setup.py fi...

2017-12-26 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20089
  
**[Test build #85426 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85426/testReport)**
 for PR 20089 at commit 
[`896f752`](https://github.com/apache/spark/commit/896f752a01c96b09ede5ae9d6fc924d4898bfb70).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20089: [SPARK-22324][SQL][PYTHON][FOLLOW-UP] Update setup.py fi...

2017-12-26 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20089
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85426/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20082: [SPARK-22897][CORE]: Expose stageAttemptId in TaskContex...

2017-12-26 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20082
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20082: [SPARK-22897][CORE]: Expose stageAttemptId in TaskContex...

2017-12-26 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20082
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85421/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20082: [SPARK-22897][CORE]: Expose stageAttemptId in TaskContex...

2017-12-26 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20082
  
**[Test build #85421 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85421/testReport)**
 for PR 20082 at commit 
[`59e4a9c`](https://github.com/apache/spark/commit/59e4a9c70c037729f3eb60b47b2e625208687385).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19943: [SPARK-16060][SQL] Support Vectorized ORC Reader

2017-12-26 Thread viirya

Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19943#discussion_r158767130
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcColumnarBatchReader.scala
 ---
@@ -0,0 +1,432 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.orc
+
+import org.apache.hadoop.mapreduce.{InputSplit, RecordReader, 
TaskAttemptContext}
+import org.apache.hadoop.mapreduce.lib.input.FileSplit
+import org.apache.orc._
+import org.apache.orc.mapred.OrcInputFormat
+import org.apache.orc.storage.ql.exec.vector._
+import org.apache.orc.storage.serde2.io.HiveDecimalWritable
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.memory.MemoryMode
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.execution.vectorized._
+import org.apache.spark.sql.types._
+
+
+/**
+ * To support vectorization in WholeStageCodeGen, this reader returns 
ColumnarBatch.
+ */
+private[orc] class OrcColumnarBatchReader extends RecordReader[Void, 
ColumnarBatch] with Logging {
--- End diff --

+1


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20089: [SPARK-22324][SQL][PYTHON][FOLLOW-UP] Update setup.py fi...

2017-12-26 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20089
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20089: [SPARK-22324][SQL][PYTHON][FOLLOW-UP] Update setup.py fi...

2017-12-26 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20089
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85425/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20089: [SPARK-22324][SQL][PYTHON][FOLLOW-UP] Update setup.py fi...

2017-12-26 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20089
  
**[Test build #85425 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85425/testReport)**
 for PR 20089 at commit 
[`bee3c69`](https://github.com/apache/spark/commit/bee3c69b4b559f6bf7aa74366ad2178eb3dd299e).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20089: [SPARK-22324][SQL][PYTHON][FOLLOW-UP] Update setup.py fi...

2017-12-26 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20089
  
**[Test build #85426 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85426/testReport)**
 for PR 20089 at commit 
[`896f752`](https://github.com/apache/spark/commit/896f752a01c96b09ede5ae9d6fc924d4898bfb70).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20088: [SPARK-22905][ML][MLLIB][CORE] Fix ChiSqSelectorModel sa...

2017-12-26 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20088
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85422/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20088: [SPARK-22905][ML][MLLIB][CORE] Fix ChiSqSelectorModel sa...

2017-12-26 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20088
  
**[Test build #85422 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85422/testReport)**
 for PR 20088 at commit 
[`5fd56c3`](https://github.com/apache/spark/commit/5fd56c350cb410740209fba32c19489847a1d019).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20088: [SPARK-22905][ML][MLLIB][CORE] Fix ChiSqSelectorModel sa...

2017-12-26 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20088
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20089: [SPARK-22324][SQL][PYTHON][FOLLOW-UP] Update setup.py fi...

2017-12-26 Thread ueshin

Github user ueshin commented on the issue:

https://github.com/apache/spark/pull/20089
  
@HyukjinKwon I'll update it as well.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20089: [SPARK-22324][SQL][PYTHON][FOLLOW-UP] Update setup.py fi...

2017-12-26 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20089
  
**[Test build #85425 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85425/testReport)**
 for PR 20089 at commit 
[`bee3c69`](https://github.com/apache/spark/commit/bee3c69b4b559f6bf7aa74366ad2178eb3dd299e).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19943: [SPARK-16060][SQL] Support Vectorized ORC Reader

2017-12-26 Thread kiszk

Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/19943#discussion_r158764605
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcColumnarBatchReader.scala
 ---
@@ -0,0 +1,432 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.orc
+
+import org.apache.hadoop.mapreduce.{InputSplit, RecordReader, 
TaskAttemptContext}
+import org.apache.hadoop.mapreduce.lib.input.FileSplit
+import org.apache.orc._
+import org.apache.orc.mapred.OrcInputFormat
+import org.apache.orc.storage.ql.exec.vector._
+import org.apache.orc.storage.serde2.io.HiveDecimalWritable
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.memory.MemoryMode
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.execution.vectorized._
+import org.apache.spark.sql.types._
+
+
+/**
+ * To support vectorization in WholeStageCodeGen, this reader returns 
ColumnarBatch.
+ */
+private[orc] class OrcColumnarBatchReader extends RecordReader[Void, 
ColumnarBatch] with Logging {
+  import OrcColumnarBatchReader._
+
+  /**
+   * ORC File Reader.
+   */
+  private var reader: Reader = _
+
+  /**
+   * Vectorized Row Batch.
+   */
+  private var batch: VectorizedRowBatch = _
+
+  /**
+   * Requested Column IDs.
+   */
+  private var requestedColIds: Array[Int] = _
+
+  /**
+   * Record reader from row batch.
+   */
+  private var rows: org.apache.orc.RecordReader = _
+
+  /**
+   * Required Schema.
+   */
+  private var requiredSchema: StructType = _
+
+  /**
+   * ColumnarBatch for vectorized execution by whole-stage codegen.
+   */
+  private var columnarBatch: ColumnarBatch = _
+
+  /**
+   * Writable columnVectors of ColumnarBatch.
+   */
+  private var columnVectors: Seq[WritableColumnVector] = _
+
+  /**
+   * The number of rows read and considered to be returned.
+   */
+  private var rowsReturned: Long = 0L
+
+  /**
+   * Total number of rows.
+   */
+  private var totalRowCount: Long = 0L
+
+  override def getCurrentKey: Void = null
+
+  override def getCurrentValue: ColumnarBatch = columnarBatch
+
+  override def getProgress: Float = rowsReturned.toFloat / totalRowCount
+
+  override def nextKeyValue(): Boolean = nextBatch()
+
+  override def close(): Unit = {
+if (columnarBatch != null) {
+  columnarBatch.close()
+  columnarBatch = null
+}
+if (rows != null) {
+  rows.close()
+  rows = null
+}
+  }
+
+  /**
+   * Initialize ORC file reader and batch record reader.
+   * Please note that `setRequiredSchema` is needed to be called after 
this.
+   */
+  override def initialize(inputSplit: InputSplit, taskAttemptContext: 
TaskAttemptContext): Unit = {
+val fileSplit = inputSplit.asInstanceOf[FileSplit]
+val conf = taskAttemptContext.getConfiguration
+reader = OrcFile.createReader(
+  fileSplit.getPath,
+  OrcFile.readerOptions(conf)
+.maxLength(OrcConf.MAX_FILE_LENGTH.getLong(conf))
+.filesystem(fileSplit.getPath.getFileSystem(conf)))
+
+val options = OrcInputFormat.buildOptions(conf, reader, 
fileSplit.getStart, fileSplit.getLength)
+rows = reader.rows(options)
+  }
+
+  /**
+   * Set required schema and partition information.
+   * With this information, this creates ColumnarBatch with the full 
schema.
+   */
+  def setRequiredSchema(
+  orcSchema: TypeDescription,
+  requestedColIds: Array[Int],
+  resultSchema: StructType,
+  requiredSchema: StructType,
+  partitionValues: InternalRow): Unit = {
+batch = orcSchema.createRowBatch(DEFAULT_SIZE)
+assert(!batch.selectedInUse)
+

[GitHub] spark pull request #19943: [SPARK-16060][SQL] Support Vectorized ORC Reader

2017-12-26 Thread kiszk

Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/19943#discussion_r158764584
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcColumnarBatchReader.scala
 ---
@@ -0,0 +1,432 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.orc
+
+import org.apache.hadoop.mapreduce.{InputSplit, RecordReader, 
TaskAttemptContext}
+import org.apache.hadoop.mapreduce.lib.input.FileSplit
+import org.apache.orc._
+import org.apache.orc.mapred.OrcInputFormat
+import org.apache.orc.storage.ql.exec.vector._
+import org.apache.orc.storage.serde2.io.HiveDecimalWritable
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.memory.MemoryMode
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.execution.vectorized._
+import org.apache.spark.sql.types._
+
+
+/**
+ * To support vectorization in WholeStageCodeGen, this reader returns 
ColumnarBatch.
+ */
+private[orc] class OrcColumnarBatchReader extends RecordReader[Void, 
ColumnarBatch] with Logging {
+  import OrcColumnarBatchReader._
+
+  /**
+   * ORC File Reader.
+   */
+  private var reader: Reader = _
+
+  /**
+   * Vectorized Row Batch.
+   */
+  private var batch: VectorizedRowBatch = _
+
+  /**
+   * Requested Column IDs.
+   */
+  private var requestedColIds: Array[Int] = _
+
+  /**
+   * Record reader from row batch.
+   */
+  private var rows: org.apache.orc.RecordReader = _
+
+  /**
+   * Required Schema.
+   */
+  private var requiredSchema: StructType = _
+
+  /**
+   * ColumnarBatch for vectorized execution by whole-stage codegen.
+   */
+  private var columnarBatch: ColumnarBatch = _
+
+  /**
+   * Writable columnVectors of ColumnarBatch.
+   */
+  private var columnVectors: Seq[WritableColumnVector] = _
+
+  /**
+   * The number of rows read and considered to be returned.
+   */
+  private var rowsReturned: Long = 0L
+
+  /**
+   * Total number of rows.
+   */
+  private var totalRowCount: Long = 0L
+
+  override def getCurrentKey: Void = null
+
+  override def getCurrentValue: ColumnarBatch = columnarBatch
+
+  override def getProgress: Float = rowsReturned.toFloat / totalRowCount
+
+  override def nextKeyValue(): Boolean = nextBatch()
+
+  override def close(): Unit = {
+if (columnarBatch != null) {
+  columnarBatch.close()
+  columnarBatch = null
+}
+if (rows != null) {
+  rows.close()
+  rows = null
+}
+  }
+
+  /**
+   * Initialize ORC file reader and batch record reader.
+   * Please note that `setRequiredSchema` is needed to be called after 
this.
+   */
+  override def initialize(inputSplit: InputSplit, taskAttemptContext: 
TaskAttemptContext): Unit = {
+val fileSplit = inputSplit.asInstanceOf[FileSplit]
+val conf = taskAttemptContext.getConfiguration
+reader = OrcFile.createReader(
+  fileSplit.getPath,
+  OrcFile.readerOptions(conf)
+.maxLength(OrcConf.MAX_FILE_LENGTH.getLong(conf))
+.filesystem(fileSplit.getPath.getFileSystem(conf)))
+
+val options = OrcInputFormat.buildOptions(conf, reader, 
fileSplit.getStart, fileSplit.getLength)
+rows = reader.rows(options)
+  }
+
+  /**
+   * Set required schema and partition information.
+   * With this information, this creates ColumnarBatch with the full 
schema.
+   */
+  def setRequiredSchema(
+  orcSchema: TypeDescription,
+  requestedColIds: Array[Int],
+  resultSchema: StructType,
+  requiredSchema: StructType,
+  partitionValues: InternalRow): Unit = {
+batch = orcSchema.createRowBatch(DEFAULT_SIZE)
+assert(!batch.selectedInUse)
+

[GitHub] spark pull request #19943: [SPARK-16060][SQL] Support Vectorized ORC Reader

2017-12-26 Thread kiszk

Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/19943#discussion_r158764569
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcColumnarBatchReader.scala
 ---
@@ -0,0 +1,432 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.orc
+
+import org.apache.hadoop.mapreduce.{InputSplit, RecordReader, 
TaskAttemptContext}
+import org.apache.hadoop.mapreduce.lib.input.FileSplit
+import org.apache.orc._
+import org.apache.orc.mapred.OrcInputFormat
+import org.apache.orc.storage.ql.exec.vector._
+import org.apache.orc.storage.serde2.io.HiveDecimalWritable
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.memory.MemoryMode
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.execution.vectorized._
+import org.apache.spark.sql.types._
+
+
+/**
+ * To support vectorization in WholeStageCodeGen, this reader returns 
ColumnarBatch.
+ */
+private[orc] class OrcColumnarBatchReader extends RecordReader[Void, 
ColumnarBatch] with Logging {
+  import OrcColumnarBatchReader._
+
+  /**
+   * ORC File Reader.
+   */
+  private var reader: Reader = _
+
+  /**
+   * Vectorized Row Batch.
+   */
+  private var batch: VectorizedRowBatch = _
+
+  /**
+   * Requested Column IDs.
+   */
+  private var requestedColIds: Array[Int] = _
+
+  /**
+   * Record reader from row batch.
+   */
+  private var rows: org.apache.orc.RecordReader = _
+
+  /**
+   * Required Schema.
+   */
+  private var requiredSchema: StructType = _
+
+  /**
+   * ColumnarBatch for vectorized execution by whole-stage codegen.
+   */
+  private var columnarBatch: ColumnarBatch = _
+
+  /**
+   * Writable columnVectors of ColumnarBatch.
+   */
+  private var columnVectors: Seq[WritableColumnVector] = _
+
+  /**
+   * The number of rows read and considered to be returned.
+   */
+  private var rowsReturned: Long = 0L
+
+  /**
+   * Total number of rows.
+   */
+  private var totalRowCount: Long = 0L
+
+  override def getCurrentKey: Void = null
+
+  override def getCurrentValue: ColumnarBatch = columnarBatch
+
+  override def getProgress: Float = rowsReturned.toFloat / totalRowCount
+
+  override def nextKeyValue(): Boolean = nextBatch()
+
+  override def close(): Unit = {
+if (columnarBatch != null) {
+  columnarBatch.close()
+  columnarBatch = null
+}
+if (rows != null) {
+  rows.close()
+  rows = null
+}
+  }
+
+  /**
+   * Initialize ORC file reader and batch record reader.
+   * Please note that `setRequiredSchema` is needed to be called after 
this.
+   */
+  override def initialize(inputSplit: InputSplit, taskAttemptContext: 
TaskAttemptContext): Unit = {
+val fileSplit = inputSplit.asInstanceOf[FileSplit]
+val conf = taskAttemptContext.getConfiguration
+reader = OrcFile.createReader(
+  fileSplit.getPath,
+  OrcFile.readerOptions(conf)
+.maxLength(OrcConf.MAX_FILE_LENGTH.getLong(conf))
+.filesystem(fileSplit.getPath.getFileSystem(conf)))
+
+val options = OrcInputFormat.buildOptions(conf, reader, 
fileSplit.getStart, fileSplit.getLength)
+rows = reader.rows(options)
+  }
+
+  /**
+   * Set required schema and partition information.
+   * With this information, this creates ColumnarBatch with the full 
schema.
+   */
+  def setRequiredSchema(
+  orcSchema: TypeDescription,
+  requestedColIds: Array[Int],
+  resultSchema: StructType,
+  requiredSchema: StructType,
+  partitionValues: InternalRow): Unit = {
+batch = orcSchema.createRowBatch(DEFAULT_SIZE)
+assert(!batch.selectedInUse)
+

[GitHub] spark pull request #19943: [SPARK-16060][SQL] Support Vectorized ORC Reader

2017-12-26 Thread kiszk

Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/19943#discussion_r158764558
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcColumnarBatchReader.scala
 ---
@@ -0,0 +1,432 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.orc
+
+import org.apache.hadoop.mapreduce.{InputSplit, RecordReader, 
TaskAttemptContext}
+import org.apache.hadoop.mapreduce.lib.input.FileSplit
+import org.apache.orc._
+import org.apache.orc.mapred.OrcInputFormat
+import org.apache.orc.storage.ql.exec.vector._
+import org.apache.orc.storage.serde2.io.HiveDecimalWritable
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.memory.MemoryMode
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.execution.vectorized._
+import org.apache.spark.sql.types._
+
+
+/**
+ * To support vectorization in WholeStageCodeGen, this reader returns 
ColumnarBatch.
+ */
+private[orc] class OrcColumnarBatchReader extends RecordReader[Void, 
ColumnarBatch] with Logging {
+  import OrcColumnarBatchReader._
+
+  /**
+   * ORC File Reader.
+   */
+  private var reader: Reader = _
+
+  /**
+   * Vectorized Row Batch.
+   */
+  private var batch: VectorizedRowBatch = _
+
+  /**
+   * Requested Column IDs.
+   */
+  private var requestedColIds: Array[Int] = _
+
+  /**
+   * Record reader from row batch.
+   */
+  private var rows: org.apache.orc.RecordReader = _
+
+  /**
+   * Required Schema.
+   */
+  private var requiredSchema: StructType = _
+
+  /**
+   * ColumnarBatch for vectorized execution by whole-stage codegen.
+   */
+  private var columnarBatch: ColumnarBatch = _
+
+  /**
+   * Writable columnVectors of ColumnarBatch.
+   */
+  private var columnVectors: Seq[WritableColumnVector] = _
+
+  /**
+   * The number of rows read and considered to be returned.
+   */
+  private var rowsReturned: Long = 0L
+
+  /**
+   * Total number of rows.
+   */
+  private var totalRowCount: Long = 0L
+
+  override def getCurrentKey: Void = null
+
+  override def getCurrentValue: ColumnarBatch = columnarBatch
+
+  override def getProgress: Float = rowsReturned.toFloat / totalRowCount
+
+  override def nextKeyValue(): Boolean = nextBatch()
+
+  override def close(): Unit = {
+if (columnarBatch != null) {
+  columnarBatch.close()
+  columnarBatch = null
+}
+if (rows != null) {
+  rows.close()
+  rows = null
+}
+  }
+
+  /**
+   * Initialize ORC file reader and batch record reader.
+   * Please note that `setRequiredSchema` is needed to be called after 
this.
+   */
+  override def initialize(inputSplit: InputSplit, taskAttemptContext: 
TaskAttemptContext): Unit = {
+val fileSplit = inputSplit.asInstanceOf[FileSplit]
+val conf = taskAttemptContext.getConfiguration
+reader = OrcFile.createReader(
+  fileSplit.getPath,
+  OrcFile.readerOptions(conf)
+.maxLength(OrcConf.MAX_FILE_LENGTH.getLong(conf))
+.filesystem(fileSplit.getPath.getFileSystem(conf)))
+
+val options = OrcInputFormat.buildOptions(conf, reader, 
fileSplit.getStart, fileSplit.getLength)
+rows = reader.rows(options)
+  }
+
+  /**
+   * Set required schema and partition information.
+   * With this information, this creates ColumnarBatch with the full 
schema.
+   */
+  def setRequiredSchema(
+  orcSchema: TypeDescription,
+  requestedColIds: Array[Int],
+  resultSchema: StructType,
+  requiredSchema: StructType,
+  partitionValues: InternalRow): Unit = {
+batch = orcSchema.createRowBatch(DEFAULT_SIZE)
+assert(!batch.selectedInUse)
+

[GitHub] spark pull request #19943: [SPARK-16060][SQL] Support Vectorized ORC Reader

2017-12-26 Thread kiszk

Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/19943#discussion_r158764573
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcColumnarBatchReader.scala
 ---
@@ -0,0 +1,432 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.orc
+
+import org.apache.hadoop.mapreduce.{InputSplit, RecordReader, 
TaskAttemptContext}
+import org.apache.hadoop.mapreduce.lib.input.FileSplit
+import org.apache.orc._
+import org.apache.orc.mapred.OrcInputFormat
+import org.apache.orc.storage.ql.exec.vector._
+import org.apache.orc.storage.serde2.io.HiveDecimalWritable
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.memory.MemoryMode
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.execution.vectorized._
+import org.apache.spark.sql.types._
+
+
+/**
+ * To support vectorization in WholeStageCodeGen, this reader returns 
ColumnarBatch.
+ */
+private[orc] class OrcColumnarBatchReader extends RecordReader[Void, 
ColumnarBatch] with Logging {
+  import OrcColumnarBatchReader._
+
+  /**
+   * ORC File Reader.
+   */
+  private var reader: Reader = _
+
+  /**
+   * Vectorized Row Batch.
+   */
+  private var batch: VectorizedRowBatch = _
+
+  /**
+   * Requested Column IDs.
+   */
+  private var requestedColIds: Array[Int] = _
+
+  /**
+   * Record reader from row batch.
+   */
+  private var rows: org.apache.orc.RecordReader = _
+
+  /**
+   * Required Schema.
+   */
+  private var requiredSchema: StructType = _
+
+  /**
+   * ColumnarBatch for vectorized execution by whole-stage codegen.
+   */
+  private var columnarBatch: ColumnarBatch = _
+
+  /**
+   * Writable columnVectors of ColumnarBatch.
+   */
+  private var columnVectors: Seq[WritableColumnVector] = _
+
+  /**
+   * The number of rows read and considered to be returned.
+   */
+  private var rowsReturned: Long = 0L
+
+  /**
+   * Total number of rows.
+   */
+  private var totalRowCount: Long = 0L
+
+  override def getCurrentKey: Void = null
+
+  override def getCurrentValue: ColumnarBatch = columnarBatch
+
+  override def getProgress: Float = rowsReturned.toFloat / totalRowCount
+
+  override def nextKeyValue(): Boolean = nextBatch()
+
+  override def close(): Unit = {
+if (columnarBatch != null) {
+  columnarBatch.close()
+  columnarBatch = null
+}
+if (rows != null) {
+  rows.close()
+  rows = null
+}
+  }
+
+  /**
+   * Initialize ORC file reader and batch record reader.
+   * Please note that `setRequiredSchema` is needed to be called after 
this.
+   */
+  override def initialize(inputSplit: InputSplit, taskAttemptContext: 
TaskAttemptContext): Unit = {
+val fileSplit = inputSplit.asInstanceOf[FileSplit]
+val conf = taskAttemptContext.getConfiguration
+reader = OrcFile.createReader(
+  fileSplit.getPath,
+  OrcFile.readerOptions(conf)
+.maxLength(OrcConf.MAX_FILE_LENGTH.getLong(conf))
+.filesystem(fileSplit.getPath.getFileSystem(conf)))
+
+val options = OrcInputFormat.buildOptions(conf, reader, 
fileSplit.getStart, fileSplit.getLength)
+rows = reader.rows(options)
+  }
+
+  /**
+   * Set required schema and partition information.
+   * With this information, this creates ColumnarBatch with the full 
schema.
+   */
+  def setRequiredSchema(
+  orcSchema: TypeDescription,
+  requestedColIds: Array[Int],
+  resultSchema: StructType,
+  requiredSchema: StructType,
+  partitionValues: InternalRow): Unit = {
+batch = orcSchema.createRowBatch(DEFAULT_SIZE)
+assert(!batch.selectedInUse)
+

[GitHub] spark pull request #19943: [SPARK-16060][SQL] Support Vectorized ORC Reader

2017-12-26 Thread kiszk

Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/19943#discussion_r158764338
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcColumnarBatchReader.scala
 ---
@@ -0,0 +1,432 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.orc
+
+import org.apache.hadoop.mapreduce.{InputSplit, RecordReader, 
TaskAttemptContext}
+import org.apache.hadoop.mapreduce.lib.input.FileSplit
+import org.apache.orc._
+import org.apache.orc.mapred.OrcInputFormat
+import org.apache.orc.storage.ql.exec.vector._
+import org.apache.orc.storage.serde2.io.HiveDecimalWritable
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.memory.MemoryMode
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.execution.vectorized._
+import org.apache.spark.sql.types._
+
+
+/**
+ * To support vectorization in WholeStageCodeGen, this reader returns 
ColumnarBatch.
+ */
+private[orc] class OrcColumnarBatchReader extends RecordReader[Void, 
ColumnarBatch] with Logging {
+  import OrcColumnarBatchReader._
+
+  /**
+   * ORC File Reader.
+   */
+  private var reader: Reader = _
+
+  /**
+   * Vectorized Row Batch.
+   */
+  private var batch: VectorizedRowBatch = _
+
+  /**
+   * Requested Column IDs.
+   */
+  private var requestedColIds: Array[Int] = _
+
+  /**
+   * Record reader from row batch.
+   */
+  private var rows: org.apache.orc.RecordReader = _
+
+  /**
+   * Required Schema.
+   */
+  private var requiredSchema: StructType = _
+
+  /**
+   * ColumnarBatch for vectorized execution by whole-stage codegen.
+   */
+  private var columnarBatch: ColumnarBatch = _
+
+  /**
+   * Writable columnVectors of ColumnarBatch.
+   */
+  private var columnVectors: Seq[WritableColumnVector] = _
+
+  /**
+   * The number of rows read and considered to be returned.
+   */
+  private var rowsReturned: Long = 0L
+
+  /**
+   * Total number of rows.
+   */
+  private var totalRowCount: Long = 0L
+
+  override def getCurrentKey: Void = null
+
+  override def getCurrentValue: ColumnarBatch = columnarBatch
+
+  override def getProgress: Float = rowsReturned.toFloat / totalRowCount
+
+  override def nextKeyValue(): Boolean = nextBatch()
+
+  override def close(): Unit = {
+if (columnarBatch != null) {
+  columnarBatch.close()
+  columnarBatch = null
+}
+if (rows != null) {
+  rows.close()
+  rows = null
+}
+  }
+
+  /**
+   * Initialize ORC file reader and batch record reader.
+   * Please note that `setRequiredSchema` is needed to be called after 
this.
+   */
+  override def initialize(inputSplit: InputSplit, taskAttemptContext: 
TaskAttemptContext): Unit = {
+val fileSplit = inputSplit.asInstanceOf[FileSplit]
+val conf = taskAttemptContext.getConfiguration
+reader = OrcFile.createReader(
+  fileSplit.getPath,
+  OrcFile.readerOptions(conf)
+.maxLength(OrcConf.MAX_FILE_LENGTH.getLong(conf))
+.filesystem(fileSplit.getPath.getFileSystem(conf)))
+
+val options = OrcInputFormat.buildOptions(conf, reader, 
fileSplit.getStart, fileSplit.getLength)
+rows = reader.rows(options)
+  }
+
+  /**
+   * Set required schema and partition information.
+   * With this information, this creates ColumnarBatch with the full 
schema.
+   */
+  def setRequiredSchema(
+  orcSchema: TypeDescription,
+  requestedColIds: Array[Int],
+  resultSchema: StructType,
+  requiredSchema: StructType,
+  partitionValues: InternalRow): Unit = {
+batch = orcSchema.createRowBatch(DEFAULT_SIZE)
+assert(!batch.selectedInUse)
+

[GitHub] spark issue #20089: [SPARK-22324][SQL][PYTHON][FOLLOW-UP] Update setup.py fi...

2017-12-26 Thread ueshin

Github user ueshin commented on the issue:

https://github.com/apache/spark/pull/20089
  
@HyukjinKwon Thanks! I'll add it soon.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20089: [SPARK-22324][SQL][PYTHON][FOLLOW-UP] Update setup.py fi...

2017-12-26 Thread HyukjinKwon

Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20089
  
Yea, I think we could. I added the support and tested it before - 
SPARK-19019. I think it's okay to add it they are just metadata AFAIK.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20089: [SPARK-22324][SQL][PYTHON][FOLLOW-UP] Update setup.py fi...

2017-12-26 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20089
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20089: [SPARK-22324][SQL][PYTHON][FOLLOW-UP] Update setup.py fi...

2017-12-26 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20089
  
**[Test build #85424 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85424/testReport)**
 for PR 20089 at commit 
[`36614af`](https://github.com/apache/spark/commit/36614af4d8e00bb9564ef834a341859a0e96dfe4).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20089: [SPARK-22324][SQL][PYTHON][FOLLOW-UP] Update setup.py fi...

2017-12-26 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20089
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85424/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #18906: [SPARK-21692][PYSPARK][SQL] Add nullability support to P...

2017-12-26 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18906
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85419/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #18906: [SPARK-21692][PYSPARK][SQL] Add nullability support to P...

2017-12-26 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18906
  
Build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20089: [SPARK-22324][SQL][PYTHON][FOLLOW-UP] Update setup.py fi...

2017-12-26 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20089
  
**[Test build #85424 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85424/testReport)**
 for PR 20089 at commit 
[`36614af`](https://github.com/apache/spark/commit/36614af4d8e00bb9564ef834a341859a0e96dfe4).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #18906: [SPARK-21692][PYSPARK][SQL] Add nullability support to P...

2017-12-26 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18906
  
**[Test build #85419 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85419/testReport)**
 for PR 18906 at commit 
[`d992f93`](https://github.com/apache/spark/commit/d992f939886c488d00bad7ac0d43c4e8e1eb41b7).
 * This patch **fails Spark unit tests**.
 * This patch **does not merge cleanly**.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20089: [SPARK-22324][SQL][PYTHON][FOLLOW-UP] Update setup.py fi...

2017-12-26 Thread ueshin

Github user ueshin commented on the issue:

https://github.com/apache/spark/pull/20089
  
Btw, should we add `'Programming Language :: Python :: 3.6'` to 
`classifiers`?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #20089: [SPARK-22324][SQL][PYTHON][FOLLOW-UP] Update setu...

2017-12-26 Thread ueshin

GitHub user ueshin opened a pull request:

https://github.com/apache/spark/pull/20089

[SPARK-22324][SQL][PYTHON][FOLLOW-UP] Update setup.py file.

## What changes were proposed in this pull request?

This is a follow-up pr of #19884 updating setup.py file to add pyarrow 
dependency.

## How was this patch tested?

Existing tests.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ueshin/apache-spark issues/SPARK-22324/fup1

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20089.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20089


commit 36614af4d8e00bb9564ef834a341859a0e96dfe4
Author: Takuya UESHIN 
Date:   2017-12-27T04:33:59Z

Add pyarrow to setup.py.




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20036: [SPARK-18016][SQL][FOLLOW-UP] Code Generation: Constant ...

2017-12-26 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20036
  
**[Test build #85423 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85423/testReport)**
 for PR 20036 at commit 
[`05da9d7`](https://github.com/apache/spark/commit/05da9d7dfa2aca359630e70eee96db5abf96c9e4).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #20036: [SPARK-18016][SQL][FOLLOW-UP] Code Generation: Co...

2017-12-26 Thread kiszk

Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/20036#discussion_r158761989
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala
 ---
@@ -283,7 +283,7 @@ case class InputAdapter(child: SparkPlan) extends 
UnaryExecNode with CodegenSupp
 
   override def doProduce(ctx: CodegenContext): String = {
 // Right now, InputAdapter is only used when there is one input RDD.
-// inline mutable state since an inputAdaptor in a task
+// inline mutable state since an InputAdapter in a task
--- End diff --

sure, done


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19943: [SPARK-16060][SQL] Support Vectorized ORC Reader

2017-12-26 Thread henrify

Github user henrify commented on the issue:

https://github.com/apache/spark/pull/19943
  
If i've understood Spark development process correctly, the 2.3 branch cut 
date is in couple of days, and if this PR doesn't get merged to master real 
soon, it'll have to wait until 2.4, about 6 months?

@dongjoon-hyun @cloud-fan Considering that the benchmarks show almost order 
of magnitude improvement in performance, it would be really great to get this 
in for Spark 2.3, and worry about the details of copy vs wrapper approach later.

Also, as this is anyway opt-in feature that needs to be enabled with config 
option, merging this shouldn't be "dangerous".. Thanks for your efforts & 
looking forward to get this PR to production!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20088: [SPARK-22905][ML][MLLIB][CORE] Fix ChiSqSelectorModel sa...

2017-12-26 Thread WeichenXu123

Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/20088
  
Currently I cannot construct a failed test for this issue, but the future 
PR (changing `RoundRobinPartitioning`) by @jiangxb1987 will trigger this bug.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20088: [SPARK-22905][ML][MLLIB][CORE] Fix ChiSqSelectorModel sa...

2017-12-26 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20088
  
**[Test build #85422 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85422/testReport)**
 for PR 20088 at commit 
[`5fd56c3`](https://github.com/apache/spark/commit/5fd56c350cb410740209fba32c19489847a1d019).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #20088: [SPARK-22905][ML][MLLIB][CORE] Fix ChiSqSelectorM...

2017-12-26 Thread WeichenXu123

GitHub user WeichenXu123 opened a pull request:

https://github.com/apache/spark/pull/20088

[SPARK-22905][ML][MLLIB][CORE] Fix ChiSqSelectorModel save implementation

## What changes were proposed in this pull request?

Currently, in `ChiSqSelectorModel`, save:

spark.createDataFrame(dataArray).repartition(1).write...
The default partition number used by createDataFrame is 
"defaultParallelism",
Current RoundRobinPartitioning won't guarantee the "repartition" generating 
the same order result with local array. We need fix it.

## How was this patch tested?

N/A


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/WeichenXu123/spark fix_chisq_model_save

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20088.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20088


commit 5fd56c350cb410740209fba32c19489847a1d019
Author: WeichenXu 
Date:   2017-12-27T04:31:25Z

init pr




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #20085: [SPARK-22739][Catalyst][WIP] Additional Expressio...

2017-12-26 Thread bdrillard

Github user bdrillard commented on a diff in the pull request:

https://github.com/apache/spark/pull/20085#discussion_r158761511
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CodeGenerationSuite.scala
 ---
@@ -390,8 +391,8 @@ class CodeGenerationSuite extends SparkFunSuite with 
ExpressionEvalHelper {
 
   test("SPARK-22696: InitializeJavaBean should not use global variables") {
--- End diff --

Fixed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #20085: [SPARK-22739][Catalyst][WIP] Additional Expressio...

2017-12-26 Thread bdrillard

Github user bdrillard commented on a diff in the pull request:

https://github.com/apache/spark/pull/20085#discussion_r158761155
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
 ---
@@ -106,27 +106,27 @@ trait InvokeLike extends Expression with 
NonSQLExpression {
 }
 
 /**
- * Invokes a static function, returning the result.  By default, any of 
the arguments being null
- * will result in returning null instead of calling the function.
- *
- * @param staticObject The target of the static call.  This can either be 
the object itself
- * (methods defined on scala objects), or the class 
object
- * (static methods defined in java).
- * @param dataType The expected return type of the function call
- * @param functionName The name of the method to call.
- * @param arguments An optional list of expressions to pass as arguments 
to the function.
- * @param propagateNull When true, and any of the arguments is null, null 
will be returned instead
- *  of calling the function.
- * @param returnNullable When false, indicating the invoked method will 
always return
- *   non-null value.
- */
+  * Invokes a static function, returning the result.  By default, any of 
the arguments being null
--- End diff --

Those additional spaces shouldn't be there, I've fixed them.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20061: [SPARK-22890][TEST] Basic tests for DateTimeOperations

2017-12-26 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20061
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85418/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20061: [SPARK-22890][TEST] Basic tests for DateTimeOperations

2017-12-26 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20061
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20061: [SPARK-22890][TEST] Basic tests for DateTimeOperations

2017-12-26 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20061
  
**[Test build #85418 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85418/testReport)**
 for PR 20061 at commit 
[`cfef0f1`](https://github.com/apache/spark/commit/cfef0f1511bb1ec8f9bd99ca41effce347968dd0).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19683: [SPARK-21657][SQL] optimize explode quadratic memory con...

2017-12-26 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19683
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85416/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19683: [SPARK-21657][SQL] optimize explode quadratic memory con...

2017-12-26 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19683
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19683: [SPARK-21657][SQL] optimize explode quadratic memory con...

2017-12-26 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19683
  
**[Test build #85416 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85416/testReport)**
 for PR 19683 at commit 
[`c3183d0`](https://github.com/apache/spark/commit/c3183d0cba092d1308d62d06cbfa16e6d8f97498).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #20085: [SPARK-22739][Catalyst][WIP] Additional Expressio...

2017-12-26 Thread viirya

Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/20085#discussion_r158760302
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
 ---
@@ -182,6 +182,114 @@ case class StaticInvoke(
   }
 }
 
+/**
+ * Invokes a call to reference to a static field.
+ *
+ * @param staticObject The target of the static call.  This can either be 
the object itself
+ * (methods defined on scala objects), or the class 
object
+ * (static methods defined in java).
+ * @param dataType The expected return type of the function call.
+ * @param fieldName The field to reference.
+ */
+case class StaticField(
+  staticObject: Class[_],
+  dataType: DataType,
+  fieldName: String) extends Expression with NonSQLExpression {
+
+  val objectName = staticObject.getName.stripSuffix("$")
+
+  override def nullable: Boolean = false
+  override def children: Seq[Expression] = Nil
+
+  override def eval(input: InternalRow): Any =
+throw new UnsupportedOperationException("Only code-generated 
evaluation is supported.")
+
+  override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
+val javaType = ctx.javaType(dataType)
+
+val code = s"""
+  final $javaType ${ev.value} = $objectName.$fieldName;
+"""
+
+ev.copy(code = code, isNull = "false")
+  }
+}
+
+/**
+ * Wraps an expression in a try-catch block, which can be used if the body 
expression may throw a
+ * exception.
+ *
+ * @param body The expression body to wrap in a try-catch block.
+ * @param dataType The return type of the try block.
+ * @param returnNullable When false, indicating the invoked method will 
always return
+ *   non-null value.
+  */
+case class WrapException(
+body: Expression,
+dataType: DataType,
+returnNullable: Boolean = true) extends Expression with 
NonSQLExpression {
+
+  override def nullable: Boolean = returnNullable
+  override def children: Seq[Expression] = Seq(body)
+
+  override def eval(input: InternalRow): Any =
+throw new UnsupportedOperationException("Only code-generated 
evaluation is supported.")
+
+  override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
+val javaType = ctx.javaType(dataType)
+val returnName = ctx.freshName("returnName")
+
+val bodyExpr = body.genCode(ctx)
+
+val code =
+  s"""
+ |final $javaType $returnName;
+ |try {
+ |  ${bodyExpr.code}
+ |  $returnName = ${bodyExpr.value};
+ |} catch (Exception e) {
+ |  org.apache.spark.unsafe.Platform.throwException(e);
+ |}
+   """.stripMargin
+
+ev.copy(code = code, isNull = bodyExpr.isNull, value = returnName)
+  }
+}
+
+/**
+ * Returns the value if it is of the specified type, or null otherwise
+ *
+ * @param value   The value to returned
+ * @param checkedType The type to check against the value via instanceOf
+ * @param dataTypeThe type returned by the expression
+  */
+case class ValueIfType(
+  value: Expression,
--- End diff --

Should we limit the data type of `value` to `ObjectType`?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #20085: [SPARK-22739][Catalyst][WIP] Additional Expressio...

2017-12-26 Thread viirya

Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/20085#discussion_r158760292
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
 ---
@@ -182,6 +182,114 @@ case class StaticInvoke(
   }
 }
 
+/**
+ * Invokes a call to reference to a static field.
+ *
+ * @param staticObject The target of the static call.  This can either be 
the object itself
+ * (methods defined on scala objects), or the class 
object
+ * (static methods defined in java).
+ * @param dataType The expected return type of the function call.
+ * @param fieldName The field to reference.
+ */
+case class StaticField(
+  staticObject: Class[_],
+  dataType: DataType,
+  fieldName: String) extends Expression with NonSQLExpression {
+
+  val objectName = staticObject.getName.stripSuffix("$")
+
+  override def nullable: Boolean = false
+  override def children: Seq[Expression] = Nil
+
+  override def eval(input: InternalRow): Any =
+throw new UnsupportedOperationException("Only code-generated 
evaluation is supported.")
+
+  override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
+val javaType = ctx.javaType(dataType)
+
+val code = s"""
+  final $javaType ${ev.value} = $objectName.$fieldName;
+"""
+
+ev.copy(code = code, isNull = "false")
+  }
+}
+
+/**
+ * Wraps an expression in a try-catch block, which can be used if the body 
expression may throw a
+ * exception.
+ *
+ * @param body The expression body to wrap in a try-catch block.
+ * @param dataType The return type of the try block.
+ * @param returnNullable When false, indicating the invoked method will 
always return
+ *   non-null value.
+  */
+case class WrapException(
+body: Expression,
+dataType: DataType,
+returnNullable: Boolean = true) extends Expression with 
NonSQLExpression {
+
+  override def nullable: Boolean = returnNullable
+  override def children: Seq[Expression] = Seq(body)
+
+  override def eval(input: InternalRow): Any =
+throw new UnsupportedOperationException("Only code-generated 
evaluation is supported.")
+
+  override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
+val javaType = ctx.javaType(dataType)
+val returnName = ctx.freshName("returnName")
+
+val bodyExpr = body.genCode(ctx)
+
+val code =
+  s"""
+ |final $javaType $returnName;
+ |try {
+ |  ${bodyExpr.code}
+ |  $returnName = ${bodyExpr.value};
+ |} catch (Exception e) {
+ |  org.apache.spark.unsafe.Platform.throwException(e);
+ |}
+   """.stripMargin
+
+ev.copy(code = code, isNull = bodyExpr.isNull, value = returnName)
+  }
+}
+
+/**
+ * Returns the value if it is of the specified type, or null otherwise
+ *
+ * @param value   The value to returned
+ * @param checkedType The type to check against the value via instanceOf
+ * @param dataTypeThe type returned by the expression
+  */
+case class ValueIfType(
+  value: Expression,
+  checkedType: Class[_],
+  dataType: DataType) extends Expression with NonSQLExpression {
--- End diff --

Will we have different data type other than `value.dataType`?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...

2017-12-26 Thread cloud-fan

Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/19813
  
This is a pretty cool idea that can work with the current string based 
codegen framework, LGTM!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #20085: [SPARK-22739][Catalyst][WIP] Additional Expressio...

2017-12-26 Thread viirya

Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/20085#discussion_r158759848
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CodeGenerationSuite.scala
 ---
@@ -390,8 +391,8 @@ class CodeGenerationSuite extends SparkFunSuite with 
ExpressionEvalHelper {
 
   test("SPARK-22696: InitializeJavaBean should not use global variables") {
--- End diff --

`InitializeJavaBean` -> `InitializeObject`.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19977: [SPARK-22771][SQL] Concatenate binary inputs into a bina...

2017-12-26 Thread maropu

Github user maropu commented on the issue:

https://github.com/apache/spark/pull/19977
  
I'll update tonight.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #20085: [SPARK-22739][Catalyst][WIP] Additional Expressio...

2017-12-26 Thread viirya

Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/20085#discussion_r158759244
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
 ---
@@ -106,27 +106,27 @@ trait InvokeLike extends Expression with 
NonSQLExpression {
 }
 
 /**
- * Invokes a static function, returning the result.  By default, any of 
the arguments being null
- * will result in returning null instead of calling the function.
- *
- * @param staticObject The target of the static call.  This can either be 
the object itself
- * (methods defined on scala objects), or the class 
object
- * (static methods defined in java).
- * @param dataType The expected return type of the function call
- * @param functionName The name of the method to call.
- * @param arguments An optional list of expressions to pass as arguments 
to the function.
- * @param propagateNull When true, and any of the arguments is null, null 
will be returned instead
- *  of calling the function.
- * @param returnNullable When false, indicating the invoked method will 
always return
- *   non-null value.
- */
+  * Invokes a static function, returning the result.  By default, any of 
the arguments being null
--- End diff --

Why we change the comment style? Looks not consistent with others.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20056: [SPARK-22878] [CORE] Count totalDroppedEvents for LiveLi...

2017-12-26 Thread maropu

Github user maropu commented on the issue:

https://github.com/apache/spark/pull/20056
  
Probably, you better pass all the tests by yourself to save committers' 
bandwidth.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compres...

2017-12-26 Thread fjh100456

Github user fjh100456 closed the pull request at:

https://github.com/apache/spark/pull/19218


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...

2017-12-26 Thread fjh100456

Github user fjh100456 commented on the issue:

https://github.com/apache/spark/pull/19218
  
Please go to #20087


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20087: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...

2017-12-26 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20087
  
Can one of the admins verify this patch?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #20087: [SPARK-21786][SQL] The 'spark.sql.parquet.compres...

2017-12-26 Thread fjh100456

GitHub user fjh100456 opened a pull request:

https://github.com/apache/spark/pull/20087

[SPARK-21786][SQL] The 'spark.sql.parquet.compression.codec' and 
'spark.sql.orc.compression.codec' configuration doesn't take effect on hive 
table writing

[SPARK-21786][SQL] The 'spark.sql.parquet.compression.codec' and 
'spark.sql.orc.compression.codec' configuration doesn't take effect on hive 
table writing

What changes were proposed in this pull request?

Pass âspark.sql.parquet.compression.codecâ value to 
âparquet.compressionâ.
Pass âspark.sql.orc.compression.codecâ value to âorc.compressâ.

How was this patch tested?

Add test.

Note: 
This is the same issue mentioned in #19218 . That branch was deleted 
mistakenly, so make a new pr instead.

@gatorsmile @maropu @dongjoon-hyun @discipleforteen


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/fjh100456/spark HiveTableWriting

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20087.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20087


commit 9bbfe6ef4b5a418373c2250ad676233fb05df7f7
Author: fjh100456 
Date:   2017-12-25T02:29:53Z

[SPARK-21786][SQL] When acquiring 'compressionCodecClassName' in 
'ParquetOptions', `parquet.compression` needs to be considered.

## What changes were proposed in this pull request?
1.Increased acquiring 'compressionCodecClassName' from 
`parquet.compression`,and the order is 
`compression`,`parquet.compression`,`spark.sql.parquet.compression.codec`, just 
like what we do in `OrcOptions`.
2.Change `spark.sql.parquet.compression.codec` to support "none".Actually 
in `ParquetOptions`,we do support "none" as equivalent to "uncompressed", but 
it does not allowed to configured to "none".

## How was this patch tested?
Manual test.

commit 48cf108ed5c3298eb860d9735b439ac89d65765e
Author: fjh100456 
Date:   2017-12-25T02:30:24Z

[SPARK-21786][SQL] When acquiring 'compressionCodecClassName' in 
'ParquetOptions', `parquet.compression` needs to be considered.

## What changes were proposed in this pull request?
1.Increased acquiring 'compressionCodecClassName' from 
`parquet.compression`,and the order is 
`compression`,`parquet.compression`,`spark.sql.parquet.compression.codec`, just 
like what we do in `OrcOptions`.
2.Change `spark.sql.parquet.compression.codec` to support "none".Actually 
in `ParquetOptions`,we do support "none" as equivalent to "uncompressed", but 
it does not allowed to configured to "none".

## How was this patch tested?
Manual test.

commit 5dbd3edf9e086433d3d3fe9c0ead887d799c61d3
Author: fjh100456 
Date:   2017-12-25T02:34:29Z

spark.sql.parquet.compression.codec[SPARK-21786][SQL] When acquiring 
'compressionCodecClassName' in 'ParquetOptions', `parquet.compression` needs to 
be considered.

## What changes were proposed in this pull request?
1.Increased acquiring 'compressionCodecClassName' from 
`parquet.compression`,and the order is 
`compression`,`parquet.compression`,`spark.sql.parquet.compression.codec`, just 
like what we do in `OrcOptions`.
2.Change `spark.sql.parquet.compression.codec` to support "none".Actually 
in `ParquetOptions`,we do support "none" as equivalent to "uncompressed", but 
it does not allowed to configured to "none".

## How was this patch tested?
Manual test.

commit 5124f1b560e942c0dc23af31336317a4b995dd8f
Author: fjh100456 
Date:   2017-12-25T07:06:26Z

spark.sql.parquet.compression.codec[SPARK-21786][SQL] When acquiring 
'compressionCodecClassName' in 'ParquetOptions', `parquet.compression` needs to 
be considered.

## What changes were proposed in this pull request?
1.Increased acquiring 'compressionCodecClassName' from 
`parquet.compression`,and the order is 
`compression`,`parquet.compression`,`spark.sql.parquet.compression.codec`, just 
like what we do in `OrcOptions`.
2.Change `spark.sql.parquet.compression.codec` to support "none".Actually 
in `ParquetOptions`,we do support "none" as equivalent to "uncompressed", but 
it does not allowed to configured to "none".
3.Change `compressionCode` to `compressionCodecClassName`.

## How was this patch tested?
Manual test.

commit 6907a3ef86a2546fae91c22754796490a80e
Author: fjh100456 
Date:   2017-12-25T09:26:33Z

Make comression codec take effect in hive table writing.

commit 67e40d4d7fd3b6a9e4526ce17bf6d4eadb05b2b8
Author: fjh100456 
Date:   2017-12-25T12:08:11Z

Modify test

commit e2526ca1bb72e54c03d977b8678bd14b28c83585
Author: fjh100456 
Date:   2017-12-26T05:38:10Z

[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...

2017-12-26 Thread viirya

Github user viirya commented on the issue:

https://github.com/apache/spark/pull/19813
  
I agree that the best is we can have both of them.

I have a proposal to replace statement output in split methods. Maybe you 
can check if it sounds good.

By #20043, we have a `StatementValue` wrapping statement output. Instead of 
immediately embedding the statement in codes, we use a special replacement like 
`%STATEMENT_1%` for it. Normally we replace this with actual statement. If we 
need split methods, we replace this with a generated variable name. As it is 
special replacement, I think it should be safer.

This is the idea to more safely replace statement with generate variable 
name under the string based framework.




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20056: [SPARK-22878] [CORE] Count totalDroppedEvents for LiveLi...

2017-12-26 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20056
  
**[Test build #85415 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85415/testReport)**
 for PR 20056 at commit 
[`34cfb46`](https://github.com/apache/spark/commit/34cfb46b7b5eca0a1785880fee792418507aac89).
 * This patch **fails from timeout after a configured wait of \`250m\`**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20056: [SPARK-22878] [CORE] Count totalDroppedEvents for LiveLi...

2017-12-26 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20056
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85415/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20056: [SPARK-22878] [CORE] Count totalDroppedEvents for LiveLi...

2017-12-26 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20056
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20056: [SPARK-22878] [CORE] Count totalDroppedEvents for LiveLi...

2017-12-26 Thread Ngone51

Github user Ngone51 commented on the issue:

https://github.com/apache/spark/pull/20056
  
@maropu Actually, I didn't modify this unit test ever. And my unit test 
locate in SparkListenerSuite haven'been started according to the "Console 
Output".


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19943: [SPARK-16060][SQL] Support Vectorized ORC Reader

2017-12-26 Thread cloud-fan

Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19943#discussion_r158756552
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcColumnarBatchReader.scala
 ---
@@ -0,0 +1,432 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.orc
+
+import org.apache.hadoop.mapreduce.{InputSplit, RecordReader, 
TaskAttemptContext}
+import org.apache.hadoop.mapreduce.lib.input.FileSplit
+import org.apache.orc._
+import org.apache.orc.mapred.OrcInputFormat
+import org.apache.orc.storage.ql.exec.vector._
+import org.apache.orc.storage.serde2.io.HiveDecimalWritable
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.memory.MemoryMode
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.execution.vectorized._
+import org.apache.spark.sql.types._
+
+
+/**
+ * To support vectorization in WholeStageCodeGen, this reader returns 
ColumnarBatch.
+ */
+private[orc] class OrcColumnarBatchReader extends RecordReader[Void, 
ColumnarBatch] with Logging {
--- End diff --

I think it's fine to follow parquet and write data to Spark column vector 
now. Later we can try the wrapper approach and compare.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20082: [SPARK-22897][CORE]: Expose stageAttemptId in TaskContex...

2017-12-26 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20082
  
**[Test build #85421 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85421/testReport)**
 for PR 20082 at commit 
[`59e4a9c`](https://github.com/apache/spark/commit/59e4a9c70c037729f3eb60b47b2e625208687385).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20082: [SPARK-22897][CORE]: Expose stageAttemptId in TaskContex...

2017-12-26 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20082
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20082: [SPARK-22897][CORE]: Expose stageAttemptId in TaskContex...

2017-12-26 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20082
  
**[Test build #85420 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85420/testReport)**
 for PR 20082 at commit 
[`0f86e95`](https://github.com/apache/spark/commit/0f86e95fd9c8703516e23b5b3bd370a89b27bc69).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20082: [SPARK-22897][CORE]: Expose stageAttemptId in TaskContex...

2017-12-26 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20082
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85420/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20082: [SPARK-22897][CORE]: Expose stageAttemptId in TaskContex...

2017-12-26 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20082
  
**[Test build #85420 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85420/testReport)**
 for PR 20082 at commit 
[`0f86e95`](https://github.com/apache/spark/commit/0f86e95fd9c8703516e23b5b3bd370a89b27bc69).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20056: [SPARK-22878] [CORE] Count totalDroppedEvents for LiveLi...

2017-12-26 Thread maropu

Github user maropu commented on the issue:

https://github.com/apache/spark/pull/20056
  
This test passed in your local env?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19675: [SPARK-14540][BUILD] Support Scala 2.12 closures and Jav...

2017-12-26 Thread srowen

Github user srowen commented on the issue:

https://github.com/apache/spark/pull/19675
  
Did you have a look at the JIRA? lots more detail there. 
https://issues.apache.org/jira/browse/SPARK-14540


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20086: [SPARK-22903]Fix already being created exception in stag...

2017-12-26 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20086
  
Can one of the admins verify this patch?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #20086: [SPARK-22903]Fix already being created exception ...

2017-12-26 Thread liupc

GitHub user liupc opened a pull request:

https://github.com/apache/spark/pull/20086

[SPARK-22903]Fix already being created exception in stage retry caused by 
wrong atâ¦

â¦temptNumber

## What changes were proposed in this pull request?

This PR fix the wrong attemptNumber in stage retry, it will solve the 
probem of AlreadyBeingCreatedException thrown by executor when failedStages 
already created the taskAttemptPath.
Details see: https://issues.apache.org/jira/browse/SPARK-22903

(Please fill in changes proposed in this fix)

## How was this patch tested?

manual
(Please explain how this patch was tested. E.g. unit tests, integration 
tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/liupc/spark 
Fix-ready-beging-created-exception-in-stage-retry

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20086.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20086


commit 1b11eca92e5b79ef0757c19de3acc17c1c047965
Author: liupengcheng 
Date:   2017-12-27T02:11:51Z

Fix already being created exception in stage retry caused by wrong 
attemptNumber




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #20034: [SPARK-22846][SQL] Fix table owner is null when c...

2017-12-26 Thread asfgit

Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/20034


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

1 2 3 4 >

1 - 100 of 312 matches

Mail list logo