[GitHub] spark issue #20026: [SPARK-22838][Core] Avoid unnecessary copying of data
Github user ConeyLiu commented on the issue: https://github.com/apache/spark/pull/20026 cc @jiangxb1987 any comments on this? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20036: [SPARK-18016][SQL][FOLLOW-UP] Code Generation: Constant ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20036 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20036: [SPARK-18016][SQL][FOLLOW-UP] Code Generation: Constant ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20036 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85423/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20036: [SPARK-18016][SQL][FOLLOW-UP] Code Generation: Constant ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20036 **[Test build #85423 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85423/testReport)** for PR 20036 at commit [`05da9d7`](https://github.com/apache/spark/commit/05da9d7dfa2aca359630e70eee96db5abf96c9e4). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20082: [SPARK-22897][CORE]: Expose stageAttemptId in Tas...
Github user advancedxy commented on a diff in the pull request: https://github.com/apache/spark/pull/20082#discussion_r158774364 --- Diff: core/src/test/scala/org/apache/spark/scheduler/TaskContextSuite.scala --- @@ -158,6 +159,28 @@ class TaskContextSuite extends SparkFunSuite with BeforeAndAfter with LocalSpark assert(attemptIdsWithFailedTask.toSet === Set(0, 1)) } + test("TaskContext.stageAttemptId getter") { +sc = new SparkContext("local[1,2]", "test") + +// Check stage attemptIds are 0 for initial stage +val stageAttemptIds = sc.parallelize(Seq(1, 2), 2).mapPartitions { _ => + Seq(TaskContext.get().stageAttemptId()).iterator +}.collect() +assert(stageAttemptIds.toSet === Set(0)) + +// Check stage attemptIds that are resubmitted when task fails +val stageAttemptIdsWithFailedStage = + sc.parallelize(Seq(1, 2, 3, 4), 4).repartition(1).mapPartitions { _ => + val stageAttemptId = TaskContext.get().stageAttemptId() + if (stageAttemptId < 2) { +throw new FetchFailedException(null, 0, 0, 0, "Fake") --- End diff -- Will do. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20082: [SPARK-22897][CORE]: Expose stageAttemptId in Tas...
Github user jiangxb1987 commented on a diff in the pull request: https://github.com/apache/spark/pull/20082#discussion_r158774143 --- Diff: core/src/test/scala/org/apache/spark/scheduler/TaskContextSuite.scala --- @@ -158,6 +159,28 @@ class TaskContextSuite extends SparkFunSuite with BeforeAndAfter with LocalSpark assert(attemptIdsWithFailedTask.toSet === Set(0, 1)) } + test("TaskContext.stageAttemptId getter") { +sc = new SparkContext("local[1,2]", "test") + +// Check stage attemptIds are 0 for initial stage +val stageAttemptIds = sc.parallelize(Seq(1, 2), 2).mapPartitions { _ => + Seq(TaskContext.get().stageAttemptId()).iterator +}.collect() +assert(stageAttemptIds.toSet === Set(0)) + +// Check stage attemptIds that are resubmitted when task fails +val stageAttemptIdsWithFailedStage = + sc.parallelize(Seq(1, 2, 3, 4), 4).repartition(1).mapPartitions { _ => + val stageAttemptId = TaskContext.get().stageAttemptId() + if (stageAttemptId < 2) { +throw new FetchFailedException(null, 0, 0, 0, "Fake") --- End diff -- Please add comment to explain that `FetchFailedException` will trigger a new stage attempt, while a common `Exception` will only trigger a task retry. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20089: [SPARK-22324][SQL][PYTHON][FOLLOW-UP] Update setu...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20089#discussion_r158774077 --- Diff: python/README.md --- @@ -29,4 +29,4 @@ The Python packaging for Spark is not intended to replace all of the other use c ## Python Requirements -At its core PySpark depends on Py4J (currently version 0.10.6), but additional sub-packages have their own requirements (including numpy and pandas). +At its core PySpark depends on Py4J (currently version 0.10.6), but additional sub-packages have their own requirements (including numpy, pandas, and pyarrow). --- End diff -- Yea, Pandas and PyArrow are optional. Maybe, it's nicer if we have some more details here too. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20082: [SPARK-22897][CORE]: Expose stageAttemptId in Tas...
Github user jiangxb1987 commented on a diff in the pull request: https://github.com/apache/spark/pull/20082#discussion_r158773971 --- Diff: core/src/test/scala/org/apache/spark/scheduler/TaskContextSuite.scala --- @@ -158,6 +159,28 @@ class TaskContextSuite extends SparkFunSuite with BeforeAndAfter with LocalSpark assert(attemptIdsWithFailedTask.toSet === Set(0, 1)) } + test("TaskContext.stageAttemptId getter") { +sc = new SparkContext("local[1,2]", "test") + +// Check stage attemptIds are 0 for initial stage +val stageAttemptIds = sc.parallelize(Seq(1, 2), 2).mapPartitions { _ => + Seq(TaskContext.get().stageAttemptId()).iterator +}.collect() +assert(stageAttemptIds.toSet === Set(0)) + +// Check stage attemptIds that are resubmitted when task fails +val stageAttemptIdsWithFailedStage = + sc.parallelize(Seq(1, 2, 3, 4), 4).repartition(1).mapPartitions { _ => + val stageAttemptId = TaskContext.get().stageAttemptId() + if (stageAttemptId < 2) { +throw new FetchFailedException(null, 0, 0, 0, "Fake") --- End diff -- oh, right~ --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20089: [SPARK-22324][SQL][PYTHON][FOLLOW-UP] Update setu...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20089#discussion_r158773775 --- Diff: python/setup.py --- @@ -201,7 +201,7 @@ def _supports_symlinks(): extras_require={ 'ml': ['numpy>=1.7'], 'mllib': ['numpy>=1.7'], -'sql': ['pandas>=0.19.2'] +'sql': ['pandas>=0.19.2', 'pyarrow>=0.8.0'] --- End diff -- Nope, `extras_require` does not do anything in normal cases but they can be installed together with a dev option via pip IIRC. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20089: [SPARK-22324][SQL][PYTHON][FOLLOW-UP] Update setu...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20089#discussion_r158773551 --- Diff: python/setup.py --- @@ -201,7 +201,7 @@ def _supports_symlinks(): extras_require={ 'ml': ['numpy>=1.7'], 'mllib': ['numpy>=1.7'], -'sql': ['pandas>=0.19.2'] +'sql': ['pandas>=0.19.2', 'pyarrow>=0.8.0'] --- End diff -- If no pyarrow is installed, will setup force users to install it? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20089: [SPARK-22324][SQL][PYTHON][FOLLOW-UP] Update setu...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20089#discussion_r158773507 --- Diff: python/README.md --- @@ -29,4 +29,4 @@ The Python packaging for Spark is not intended to replace all of the other use c ## Python Requirements -At its core PySpark depends on Py4J (currently version 0.10.6), but additional sub-packages have their own requirements (including numpy and pandas). +At its core PySpark depends on Py4J (currently version 0.10.6), but additional sub-packages have their own requirements (including numpy, pandas, and pyarrow). --- End diff -- This sounds like mandatory, but I think pyarrow is still an optional choice. Right? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20082: [SPARK-22897][CORE]: Expose stageAttemptId in Tas...
Github user advancedxy commented on a diff in the pull request: https://github.com/apache/spark/pull/20082#discussion_r158773445 --- Diff: core/src/test/scala/org/apache/spark/scheduler/TaskContextSuite.scala --- @@ -158,6 +159,28 @@ class TaskContextSuite extends SparkFunSuite with BeforeAndAfter with LocalSpark assert(attemptIdsWithFailedTask.toSet === Set(0, 1)) } + test("TaskContext.stageAttemptId getter") { +sc = new SparkContext("local[1,2]", "test") + +// Check stage attemptIds are 0 for initial stage +val stageAttemptIds = sc.parallelize(Seq(1, 2), 2).mapPartitions { _ => + Seq(TaskContext.get().stageAttemptId()).iterator +}.collect() +assert(stageAttemptIds.toSet === Set(0)) + +// Check stage attemptIds that are resubmitted when task fails +val stageAttemptIdsWithFailedStage = + sc.parallelize(Seq(1, 2, 3, 4), 4).repartition(1).mapPartitions { _ => + val stageAttemptId = TaskContext.get().stageAttemptId() + if (stageAttemptId < 2) { +throw new FetchFailedException(null, 0, 0, 0, "Fake") --- End diff -- Related to repartition part. I use FetchFailedException to explicitly trigger a stage resubmission. Otherwise, the task would be resubmitted in the same stage if IIRC. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20057: [SPARK-22880][SQL] Add cascadeTruncate option to JDBC da...
Github user danielvdende commented on the issue: https://github.com/apache/spark/pull/20057 @gatorsmile would be great to hear why you doubt the value of the feature :). I know that for us it would be extremely valuable (at the moment we have to do an extra step in our data pipeline because this feature is missing in Spark), but of course we're not the only ones using Spark. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20082: [SPARK-22897][CORE]: Expose stageAttemptId in Tas...
Github user jiangxb1987 commented on a diff in the pull request: https://github.com/apache/spark/pull/20082#discussion_r158772547 --- Diff: core/src/test/scala/org/apache/spark/scheduler/TaskContextSuite.scala --- @@ -158,6 +159,28 @@ class TaskContextSuite extends SparkFunSuite with BeforeAndAfter with LocalSpark assert(attemptIdsWithFailedTask.toSet === Set(0, 1)) } + test("TaskContext.stageAttemptId getter") { +sc = new SparkContext("local[1,2]", "test") + +// Check stage attemptIds are 0 for initial stage +val stageAttemptIds = sc.parallelize(Seq(1, 2), 2).mapPartitions { _ => + Seq(TaskContext.get().stageAttemptId()).iterator +}.collect() +assert(stageAttemptIds.toSet === Set(0)) + +// Check stage attemptIds that are resubmitted when task fails +val stageAttemptIdsWithFailedStage = + sc.parallelize(Seq(1, 2, 3, 4), 4).repartition(1).mapPartitions { _ => + val stageAttemptId = TaskContext.get().stageAttemptId() + if (stageAttemptId < 2) { +throw new FetchFailedException(null, 0, 0, 0, "Fake") --- End diff -- Emmm... just throw an `Exception` is enough here? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20082: [SPARK-22897][CORE]: Expose stageAttemptId in Tas...
Github user jiangxb1987 commented on a diff in the pull request: https://github.com/apache/spark/pull/20082#discussion_r158772359 --- Diff: core/src/test/scala/org/apache/spark/scheduler/TaskContextSuite.scala --- @@ -158,6 +159,28 @@ class TaskContextSuite extends SparkFunSuite with BeforeAndAfter with LocalSpark assert(attemptIdsWithFailedTask.toSet === Set(0, 1)) } + test("TaskContext.stageAttemptId getter") { +sc = new SparkContext("local[1,2]", "test") + +// Check stage attemptIds are 0 for initial stage +val stageAttemptIds = sc.parallelize(Seq(1, 2), 2).mapPartitions { _ => + Seq(TaskContext.get().stageAttemptId()).iterator +}.collect() +assert(stageAttemptIds.toSet === Set(0)) + +// Check stage attemptIds that are resubmitted when task fails +val stageAttemptIdsWithFailedStage = + sc.parallelize(Seq(1, 2, 3, 4), 4).repartition(1).mapPartitions { _ => --- End diff -- You don't need `repartition` here, just `sc.parallelize(Seq(1, 2, 3, 4), 1).mapPartitions {...}` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19943: [SPARK-16060][SQL] Support Vectorized ORC Reader
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19943#discussion_r158769195 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcColumnarBatchReader.scala --- @@ -0,0 +1,432 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources.orc + +import org.apache.hadoop.mapreduce.{InputSplit, RecordReader, TaskAttemptContext} +import org.apache.hadoop.mapreduce.lib.input.FileSplit +import org.apache.orc._ +import org.apache.orc.mapred.OrcInputFormat +import org.apache.orc.storage.ql.exec.vector._ +import org.apache.orc.storage.serde2.io.HiveDecimalWritable + +import org.apache.spark.internal.Logging +import org.apache.spark.memory.MemoryMode +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.execution.vectorized._ +import org.apache.spark.sql.types._ + + +/** + * To support vectorization in WholeStageCodeGen, this reader returns ColumnarBatch. + */ +private[orc] class OrcColumnarBatchReader extends RecordReader[Void, ColumnarBatch] with Logging { + import OrcColumnarBatchReader._ + + /** + * ORC File Reader. + */ + private var reader: Reader = _ + + /** + * Vectorized Row Batch. + */ + private var batch: VectorizedRowBatch = _ + + /** + * Requested Column IDs. + */ + private var requestedColIds: Array[Int] = _ + + /** + * Record reader from row batch. + */ + private var rows: org.apache.orc.RecordReader = _ + + /** + * Required Schema. + */ + private var requiredSchema: StructType = _ + + /** + * ColumnarBatch for vectorized execution by whole-stage codegen. + */ + private var columnarBatch: ColumnarBatch = _ + + /** + * Writable columnVectors of ColumnarBatch. + */ + private var columnVectors: Seq[WritableColumnVector] = _ + + /** + * The number of rows read and considered to be returned. + */ + private var rowsReturned: Long = 0L + + /** + * Total number of rows. + */ + private var totalRowCount: Long = 0L + + override def getCurrentKey: Void = null + + override def getCurrentValue: ColumnarBatch = columnarBatch + + override def getProgress: Float = rowsReturned.toFloat / totalRowCount + + override def nextKeyValue(): Boolean = nextBatch() + + override def close(): Unit = { +if (columnarBatch != null) { + columnarBatch.close() + columnarBatch = null +} +if (rows != null) { + rows.close() + rows = null +} + } + + /** + * Initialize ORC file reader and batch record reader. + * Please note that `setRequiredSchema` is needed to be called after this. + */ + override def initialize(inputSplit: InputSplit, taskAttemptContext: TaskAttemptContext): Unit = { +val fileSplit = inputSplit.asInstanceOf[FileSplit] +val conf = taskAttemptContext.getConfiguration +reader = OrcFile.createReader( + fileSplit.getPath, + OrcFile.readerOptions(conf) +.maxLength(OrcConf.MAX_FILE_LENGTH.getLong(conf)) +.filesystem(fileSplit.getPath.getFileSystem(conf))) + +val options = OrcInputFormat.buildOptions(conf, reader, fileSplit.getStart, fileSplit.getLength) +rows = reader.rows(options) + } + + /** + * Set required schema and partition information. + * With this information, this creates ColumnarBatch with the full schema. + */ + def setRequiredSchema( + orcSchema: TypeDescription, + requestedColIds: Array[Int], + resultSchema: StructType, + requiredSchema: StructType, + partitionValues: InternalRow): Unit = { +batch = orcSchema.createRowBatch(DEFAULT_SIZE) +assert(!batch.selectedInUse) +
[GitHub] spark pull request #19943: [SPARK-16060][SQL] Support Vectorized ORC Reader
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19943#discussion_r158768650 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcColumnarBatchReader.scala --- @@ -0,0 +1,432 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources.orc + +import org.apache.hadoop.mapreduce.{InputSplit, RecordReader, TaskAttemptContext} +import org.apache.hadoop.mapreduce.lib.input.FileSplit +import org.apache.orc._ +import org.apache.orc.mapred.OrcInputFormat +import org.apache.orc.storage.ql.exec.vector._ +import org.apache.orc.storage.serde2.io.HiveDecimalWritable + +import org.apache.spark.internal.Logging +import org.apache.spark.memory.MemoryMode +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.execution.vectorized._ +import org.apache.spark.sql.types._ + + +/** + * To support vectorization in WholeStageCodeGen, this reader returns ColumnarBatch. + */ +private[orc] class OrcColumnarBatchReader extends RecordReader[Void, ColumnarBatch] with Logging { + import OrcColumnarBatchReader._ + + /** + * ORC File Reader. + */ + private var reader: Reader = _ + + /** + * Vectorized Row Batch. + */ + private var batch: VectorizedRowBatch = _ + + /** + * Requested Column IDs. + */ + private var requestedColIds: Array[Int] = _ + + /** + * Record reader from row batch. + */ + private var rows: org.apache.orc.RecordReader = _ + + /** + * Required Schema. + */ + private var requiredSchema: StructType = _ + + /** + * ColumnarBatch for vectorized execution by whole-stage codegen. + */ + private var columnarBatch: ColumnarBatch = _ + + /** + * Writable columnVectors of ColumnarBatch. --- End diff -- `columnVectors` -> `ColumnVector` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19943: [SPARK-16060][SQL] Support Vectorized ORC Reader
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19943#discussion_r158769663 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcColumnarBatchReader.scala --- @@ -0,0 +1,432 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources.orc + +import org.apache.hadoop.mapreduce.{InputSplit, RecordReader, TaskAttemptContext} +import org.apache.hadoop.mapreduce.lib.input.FileSplit +import org.apache.orc._ +import org.apache.orc.mapred.OrcInputFormat +import org.apache.orc.storage.ql.exec.vector._ +import org.apache.orc.storage.serde2.io.HiveDecimalWritable + +import org.apache.spark.internal.Logging +import org.apache.spark.memory.MemoryMode +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.execution.vectorized._ +import org.apache.spark.sql.types._ + + +/** + * To support vectorization in WholeStageCodeGen, this reader returns ColumnarBatch. + */ +private[orc] class OrcColumnarBatchReader extends RecordReader[Void, ColumnarBatch] with Logging { + import OrcColumnarBatchReader._ + + /** + * ORC File Reader. + */ + private var reader: Reader = _ + + /** + * Vectorized Row Batch. + */ + private var batch: VectorizedRowBatch = _ + + /** + * Requested Column IDs. + */ + private var requestedColIds: Array[Int] = _ + + /** + * Record reader from row batch. + */ + private var rows: org.apache.orc.RecordReader = _ + + /** + * Required Schema. + */ + private var requiredSchema: StructType = _ + + /** + * ColumnarBatch for vectorized execution by whole-stage codegen. + */ + private var columnarBatch: ColumnarBatch = _ + + /** + * Writable columnVectors of ColumnarBatch. + */ + private var columnVectors: Seq[WritableColumnVector] = _ + + /** + * The number of rows read and considered to be returned. + */ + private var rowsReturned: Long = 0L + + /** + * Total number of rows. + */ + private var totalRowCount: Long = 0L + + override def getCurrentKey: Void = null + + override def getCurrentValue: ColumnarBatch = columnarBatch + + override def getProgress: Float = rowsReturned.toFloat / totalRowCount + + override def nextKeyValue(): Boolean = nextBatch() + + override def close(): Unit = { +if (columnarBatch != null) { + columnarBatch.close() + columnarBatch = null +} +if (rows != null) { + rows.close() + rows = null +} + } + + /** + * Initialize ORC file reader and batch record reader. + * Please note that `setRequiredSchema` is needed to be called after this. + */ + override def initialize(inputSplit: InputSplit, taskAttemptContext: TaskAttemptContext): Unit = { +val fileSplit = inputSplit.asInstanceOf[FileSplit] +val conf = taskAttemptContext.getConfiguration +reader = OrcFile.createReader( + fileSplit.getPath, + OrcFile.readerOptions(conf) +.maxLength(OrcConf.MAX_FILE_LENGTH.getLong(conf)) +.filesystem(fileSplit.getPath.getFileSystem(conf))) + +val options = OrcInputFormat.buildOptions(conf, reader, fileSplit.getStart, fileSplit.getLength) +rows = reader.rows(options) + } + + /** + * Set required schema and partition information. + * With this information, this creates ColumnarBatch with the full schema. + */ + def setRequiredSchema( + orcSchema: TypeDescription, + requestedColIds: Array[Int], + resultSchema: StructType, + requiredSchema: StructType, + partitionValues: InternalRow): Unit = { +batch = orcSchema.createRowBatch(DEFAULT_SIZE) +assert(!batch.selectedInUse) +
[GitHub] spark pull request #19943: [SPARK-16060][SQL] Support Vectorized ORC Reader
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19943#discussion_r158768585 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcColumnarBatchReader.scala --- @@ -0,0 +1,432 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources.orc + +import org.apache.hadoop.mapreduce.{InputSplit, RecordReader, TaskAttemptContext} +import org.apache.hadoop.mapreduce.lib.input.FileSplit +import org.apache.orc._ +import org.apache.orc.mapred.OrcInputFormat +import org.apache.orc.storage.ql.exec.vector._ +import org.apache.orc.storage.serde2.io.HiveDecimalWritable + +import org.apache.spark.internal.Logging +import org.apache.spark.memory.MemoryMode +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.execution.vectorized._ +import org.apache.spark.sql.types._ + + +/** + * To support vectorization in WholeStageCodeGen, this reader returns ColumnarBatch. + */ +private[orc] class OrcColumnarBatchReader extends RecordReader[Void, ColumnarBatch] with Logging { + import OrcColumnarBatchReader._ + + /** + * ORC File Reader. + */ + private var reader: Reader = _ + + /** + * Vectorized Row Batch. + */ + private var batch: VectorizedRowBatch = _ + + /** + * Requested Column IDs. + */ + private var requestedColIds: Array[Int] = _ + + /** + * Record reader from row batch. + */ + private var rows: org.apache.orc.RecordReader = _ --- End diff -- `recordReader` or `rowReader`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20059: [SPARK-22648][K8s] Add documentation covering init conta...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20059 **[Test build #85428 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85428/testReport)** for PR 20059 at commit [`f4b5c03`](https://github.com/apache/spark/commit/f4b5c03a7d1947cceafcf2ac16ddb0318778b387). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20059: [SPARK-22648][K8s] Add documentation covering ini...
Github user liyinan926 commented on a diff in the pull request: https://github.com/apache/spark/pull/20059#discussion_r158768984 --- Diff: docs/running-on-kubernetes.md --- @@ -528,51 +576,91 @@ specific to Spark on Kubernetes. - spark.kubernetes.driver.limit.cores - (none) - - Specify the hard CPU [limit](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container) for the driver pod. - - - - spark.kubernetes.executor.limit.cores - (none) - - Specify the hard CPU [limit](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container) for each executor pod launched for the Spark Application. - - - - spark.kubernetes.node.selector.[labelKey] - (none) - - Adds to the node selector of the driver pod and executor pods, with key labelKey and the value as the - configuration's value. For example, setting spark.kubernetes.node.selector.identifier to myIdentifier - will result in the driver pod and executors having a node selector with key identifier and value - myIdentifier. Multiple node selector keys can be added by setting multiple configurations with this prefix. - - - - spark.kubernetes.driverEnv.[EnvironmentVariableName] - (none) - - Add the environment variable specified by EnvironmentVariableName to - the Driver process. The user can specify multiple of these to set multiple environment variables. - - - - spark.kubernetes.mountDependencies.jarsDownloadDir -/var/spark-data/spark-jars - - Location to download jars to in the driver and executors. - This directory must be empty and will be mounted as an empty directory volume on the driver and executor pods. - - - - spark.kubernetes.mountDependencies.filesDownloadDir - /var/spark-data/spark-files - - Location to download jars to in the driver and executors. - This directory must be empty and will be mounted as an empty directory volume on the driver and executor pods. - - + spark.kubernetes.driver.limit.cores + (none) + +Specify the hard CPU [limit](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container) for the driver pod. + + + + spark.kubernetes.executor.limit.cores + (none) + +Specify the hard CPU [limit](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container) for each executor pod launched for the Spark Application. + + + + spark.kubernetes.node.selector.[labelKey] + (none) + +Adds to the node selector of the driver pod and executor pods, with key labelKey and the value as the +configuration's value. For example, setting spark.kubernetes.node.selector.identifier to myIdentifier +will result in the driver pod and executors having a node selector with key identifier and value + myIdentifier. Multiple node selector keys can be added by setting multiple configurations with this prefix. + + + + spark.kubernetes.driverEnv.[EnvironmentVariableName] + (none) + +Add the environment variable specified by EnvironmentVariableName to +the Driver process. The user can specify multiple of these to set multiple environment variables. + + + + spark.kubernetes.mountDependencies.jarsDownloadDir + /var/spark-data/spark-jars + +Location to download jars to in the driver and executors. +This directory must be empty and will be mounted as an empty directory volume on the driver and executor pods. + + + + spark.kubernetes.mountDependencies.filesDownloadDir + /var/spark-data/spark-files + +Location to download jars to in the driver and executors. +This directory must be empty and will be mounted as an empty directory volume on the driver and executor pods. + + + + spark.kubernetes.mountDependencies.timeout + 300 seconds --- End diff -- Done. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20059: [SPARK-22648][K8s] Add documentation covering ini...
Github user liyinan926 commented on a diff in the pull request: https://github.com/apache/spark/pull/20059#discussion_r158768975 --- Diff: docs/running-on-kubernetes.md --- @@ -120,6 +120,54 @@ by their appropriate remote URIs. Also, application dependencies can be pre-moun Those dependencies can be added to the classpath by referencing them with `local://` URIs and/or setting the `SPARK_EXTRA_CLASSPATH` environment variable in your Dockerfiles. +### Using Remote Dependencies +When there are application dependencies hosted in remote locations like HDFS or HTTP servers, the driver and executor pods +need a Kubernetes [init-container](https://kubernetes.io/docs/concepts/workloads/pods/init-containers/) for downloading +the dependencies so the driver and executor containers can use them locally. This requires users to specify the container +image for the init-container using the configuration property `spark.kubernetes.initContainer.image`. For example, users +simply add the following option to the `spark-submit` command to specify the init-container image: + +``` +--conf spark.kubernetes.initContainer.image= +``` + +The init-container handles remote dependencies specified in `spark.jars` (or the `--jars` option of `spark-submit`) and +`spark.files` (or the `--files` option of `spark-submit`). It also handles remotely hosted main application resources, e.g., +the main application jar. The following shows an example of using remote dependencies with the `spark-submit` command: + +```bash +$ bin/spark-submit \ +--master k8s://https://: \ +--deploy-mode cluster \ +--name spark-pi \ +--class org.apache.spark.examples.SparkPi \ +--jars https://path/to/dependency1.jar,https://path/to/dependency2.jar +--files hdfs://host:port/path/to/file1,hdfs://host:port/path/to/file2 +--conf spark.executor.instances=5 \ +--conf spark.kubernetes.driver.docker.image= \ +--conf spark.kubernetes.executor.docker.image= \ --- End diff -- Done. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19683: [SPARK-21657][SQL] optimize explode quadratic memory con...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19683 **[Test build #85427 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85427/testReport)** for PR 19683 at commit [`b6b8694`](https://github.com/apache/spark/commit/b6b8694a16554a272c96f1cdf4fc2695097d). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19943: [SPARK-16060][SQL] Support Vectorized ORC Reader
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19943#discussion_r158768353 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala --- @@ -170,6 +171,8 @@ case class FileSourceScanExec( val needsUnsafeRowConversion: Boolean = if (relation.fileFormat.isInstanceOf[ParquetSource]) { SparkSession.getActiveSession.get.sessionState.conf.parquetVectorizedReaderEnabled + } else if (relation.fileFormat.isInstanceOf[OrcFileFormat]) { + SparkSession.getActiveSession.get.sessionState.conf.orcVectorizedReaderEnabled --- End diff -- Different than Parquet, for now we enable vectorized ORC reader when batch output is supported. We don't need unsafe row conversion at all for ORC. Because once it supports batch, we go batch-based approach. If it doesn't support batch, we don't enable vectorized ORC reader at all, so we don't need unsafe row conversion too. Once we can enable vectorized ORC even batch is not supported, we need to add this. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20059: [SPARK-22648][K8s] Add documentation covering ini...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/20059#discussion_r158766743 --- Diff: docs/running-on-kubernetes.md --- @@ -120,6 +120,54 @@ by their appropriate remote URIs. Also, application dependencies can be pre-moun Those dependencies can be added to the classpath by referencing them with `local://` URIs and/or setting the `SPARK_EXTRA_CLASSPATH` environment variable in your Dockerfiles. +### Using Remote Dependencies +When there are application dependencies hosted in remote locations like HDFS or HTTP servers, the driver and executor pods +need a Kubernetes [init-container](https://kubernetes.io/docs/concepts/workloads/pods/init-containers/) for downloading +the dependencies so the driver and executor containers can use them locally. This requires users to specify the container +image for the init-container using the configuration property `spark.kubernetes.initContainer.image`. For example, users +simply add the following option to the `spark-submit` command to specify the init-container image: + +``` +--conf spark.kubernetes.initContainer.image= +``` + +The init-container handles remote dependencies specified in `spark.jars` (or the `--jars` option of `spark-submit`) and +`spark.files` (or the `--files` option of `spark-submit`). It also handles remotely hosted main application resources, e.g., +the main application jar. The following shows an example of using remote dependencies with the `spark-submit` command: + +```bash +$ bin/spark-submit \ +--master k8s://https://: \ +--deploy-mode cluster \ +--name spark-pi \ +--class org.apache.spark.examples.SparkPi \ +--jars https://path/to/dependency1.jar,https://path/to/dependency2.jar +--files hdfs://host:port/path/to/file1,hdfs://host:port/path/to/file2 +--conf spark.executor.instances=5 \ +--conf spark.kubernetes.driver.docker.image= \ +--conf spark.kubernetes.executor.docker.image= \ --- End diff -- `container.image` instead of `docker.image`. We need to modify line 79-80 as well. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20059: [SPARK-22648][K8s] Add documentation covering ini...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/20059#discussion_r158767810 --- Diff: docs/running-on-kubernetes.md --- @@ -528,51 +576,91 @@ specific to Spark on Kubernetes. - spark.kubernetes.driver.limit.cores - (none) - - Specify the hard CPU [limit](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container) for the driver pod. - - - - spark.kubernetes.executor.limit.cores - (none) - - Specify the hard CPU [limit](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container) for each executor pod launched for the Spark Application. - - - - spark.kubernetes.node.selector.[labelKey] - (none) - - Adds to the node selector of the driver pod and executor pods, with key labelKey and the value as the - configuration's value. For example, setting spark.kubernetes.node.selector.identifier to myIdentifier - will result in the driver pod and executors having a node selector with key identifier and value - myIdentifier. Multiple node selector keys can be added by setting multiple configurations with this prefix. - - - - spark.kubernetes.driverEnv.[EnvironmentVariableName] - (none) - - Add the environment variable specified by EnvironmentVariableName to - the Driver process. The user can specify multiple of these to set multiple environment variables. - - - - spark.kubernetes.mountDependencies.jarsDownloadDir -/var/spark-data/spark-jars - - Location to download jars to in the driver and executors. - This directory must be empty and will be mounted as an empty directory volume on the driver and executor pods. - - - - spark.kubernetes.mountDependencies.filesDownloadDir - /var/spark-data/spark-files - - Location to download jars to in the driver and executors. - This directory must be empty and will be mounted as an empty directory volume on the driver and executor pods. - - + spark.kubernetes.driver.limit.cores + (none) + +Specify the hard CPU [limit](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container) for the driver pod. + + + + spark.kubernetes.executor.limit.cores + (none) + +Specify the hard CPU [limit](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container) for each executor pod launched for the Spark Application. + + + + spark.kubernetes.node.selector.[labelKey] + (none) + +Adds to the node selector of the driver pod and executor pods, with key labelKey and the value as the +configuration's value. For example, setting spark.kubernetes.node.selector.identifier to myIdentifier +will result in the driver pod and executors having a node selector with key identifier and value + myIdentifier. Multiple node selector keys can be added by setting multiple configurations with this prefix. + + + + spark.kubernetes.driverEnv.[EnvironmentVariableName] + (none) + +Add the environment variable specified by EnvironmentVariableName to +the Driver process. The user can specify multiple of these to set multiple environment variables. + + + + spark.kubernetes.mountDependencies.jarsDownloadDir + /var/spark-data/spark-jars + +Location to download jars to in the driver and executors. +This directory must be empty and will be mounted as an empty directory volume on the driver and executor pods. + + + + spark.kubernetes.mountDependencies.filesDownloadDir + /var/spark-data/spark-files + +Location to download jars to in the driver and executors. +This directory must be empty and will be mounted as an empty directory volume on the driver and executor pods. + + + + spark.kubernetes.mountDependencies.timeout + 300 seconds --- End diff -- `300s` instead of `300 seconds`, which should be the form we can specify to the config string. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20089: [SPARK-22324][SQL][PYTHON][FOLLOW-UP] Update setup.py fi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20089 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20089: [SPARK-22324][SQL][PYTHON][FOLLOW-UP] Update setup.py fi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20089 **[Test build #85426 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85426/testReport)** for PR 20089 at commit [`896f752`](https://github.com/apache/spark/commit/896f752a01c96b09ede5ae9d6fc924d4898bfb70). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20089: [SPARK-22324][SQL][PYTHON][FOLLOW-UP] Update setup.py fi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20089 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85426/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20082: [SPARK-22897][CORE]: Expose stageAttemptId in TaskContex...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20082 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20082: [SPARK-22897][CORE]: Expose stageAttemptId in TaskContex...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20082 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85421/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20082: [SPARK-22897][CORE]: Expose stageAttemptId in TaskContex...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20082 **[Test build #85421 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85421/testReport)** for PR 20082 at commit [`59e4a9c`](https://github.com/apache/spark/commit/59e4a9c70c037729f3eb60b47b2e625208687385). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19943: [SPARK-16060][SQL] Support Vectorized ORC Reader
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19943#discussion_r158767130 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcColumnarBatchReader.scala --- @@ -0,0 +1,432 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources.orc + +import org.apache.hadoop.mapreduce.{InputSplit, RecordReader, TaskAttemptContext} +import org.apache.hadoop.mapreduce.lib.input.FileSplit +import org.apache.orc._ +import org.apache.orc.mapred.OrcInputFormat +import org.apache.orc.storage.ql.exec.vector._ +import org.apache.orc.storage.serde2.io.HiveDecimalWritable + +import org.apache.spark.internal.Logging +import org.apache.spark.memory.MemoryMode +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.execution.vectorized._ +import org.apache.spark.sql.types._ + + +/** + * To support vectorization in WholeStageCodeGen, this reader returns ColumnarBatch. + */ +private[orc] class OrcColumnarBatchReader extends RecordReader[Void, ColumnarBatch] with Logging { --- End diff -- +1 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20089: [SPARK-22324][SQL][PYTHON][FOLLOW-UP] Update setup.py fi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20089 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20089: [SPARK-22324][SQL][PYTHON][FOLLOW-UP] Update setup.py fi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20089 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85425/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20089: [SPARK-22324][SQL][PYTHON][FOLLOW-UP] Update setup.py fi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20089 **[Test build #85425 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85425/testReport)** for PR 20089 at commit [`bee3c69`](https://github.com/apache/spark/commit/bee3c69b4b559f6bf7aa74366ad2178eb3dd299e). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20089: [SPARK-22324][SQL][PYTHON][FOLLOW-UP] Update setup.py fi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20089 **[Test build #85426 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85426/testReport)** for PR 20089 at commit [`896f752`](https://github.com/apache/spark/commit/896f752a01c96b09ede5ae9d6fc924d4898bfb70). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20088: [SPARK-22905][ML][MLLIB][CORE] Fix ChiSqSelectorModel sa...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20088 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85422/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20088: [SPARK-22905][ML][MLLIB][CORE] Fix ChiSqSelectorModel sa...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20088 **[Test build #85422 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85422/testReport)** for PR 20088 at commit [`5fd56c3`](https://github.com/apache/spark/commit/5fd56c350cb410740209fba32c19489847a1d019). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20088: [SPARK-22905][ML][MLLIB][CORE] Fix ChiSqSelectorModel sa...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20088 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20089: [SPARK-22324][SQL][PYTHON][FOLLOW-UP] Update setup.py fi...
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/20089 @HyukjinKwon I'll update it as well. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20089: [SPARK-22324][SQL][PYTHON][FOLLOW-UP] Update setup.py fi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20089 **[Test build #85425 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85425/testReport)** for PR 20089 at commit [`bee3c69`](https://github.com/apache/spark/commit/bee3c69b4b559f6bf7aa74366ad2178eb3dd299e). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19943: [SPARK-16060][SQL] Support Vectorized ORC Reader
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/19943#discussion_r158764605 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcColumnarBatchReader.scala --- @@ -0,0 +1,432 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources.orc + +import org.apache.hadoop.mapreduce.{InputSplit, RecordReader, TaskAttemptContext} +import org.apache.hadoop.mapreduce.lib.input.FileSplit +import org.apache.orc._ +import org.apache.orc.mapred.OrcInputFormat +import org.apache.orc.storage.ql.exec.vector._ +import org.apache.orc.storage.serde2.io.HiveDecimalWritable + +import org.apache.spark.internal.Logging +import org.apache.spark.memory.MemoryMode +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.execution.vectorized._ +import org.apache.spark.sql.types._ + + +/** + * To support vectorization in WholeStageCodeGen, this reader returns ColumnarBatch. + */ +private[orc] class OrcColumnarBatchReader extends RecordReader[Void, ColumnarBatch] with Logging { + import OrcColumnarBatchReader._ + + /** + * ORC File Reader. + */ + private var reader: Reader = _ + + /** + * Vectorized Row Batch. + */ + private var batch: VectorizedRowBatch = _ + + /** + * Requested Column IDs. + */ + private var requestedColIds: Array[Int] = _ + + /** + * Record reader from row batch. + */ + private var rows: org.apache.orc.RecordReader = _ + + /** + * Required Schema. + */ + private var requiredSchema: StructType = _ + + /** + * ColumnarBatch for vectorized execution by whole-stage codegen. + */ + private var columnarBatch: ColumnarBatch = _ + + /** + * Writable columnVectors of ColumnarBatch. + */ + private var columnVectors: Seq[WritableColumnVector] = _ + + /** + * The number of rows read and considered to be returned. + */ + private var rowsReturned: Long = 0L + + /** + * Total number of rows. + */ + private var totalRowCount: Long = 0L + + override def getCurrentKey: Void = null + + override def getCurrentValue: ColumnarBatch = columnarBatch + + override def getProgress: Float = rowsReturned.toFloat / totalRowCount + + override def nextKeyValue(): Boolean = nextBatch() + + override def close(): Unit = { +if (columnarBatch != null) { + columnarBatch.close() + columnarBatch = null +} +if (rows != null) { + rows.close() + rows = null +} + } + + /** + * Initialize ORC file reader and batch record reader. + * Please note that `setRequiredSchema` is needed to be called after this. + */ + override def initialize(inputSplit: InputSplit, taskAttemptContext: TaskAttemptContext): Unit = { +val fileSplit = inputSplit.asInstanceOf[FileSplit] +val conf = taskAttemptContext.getConfiguration +reader = OrcFile.createReader( + fileSplit.getPath, + OrcFile.readerOptions(conf) +.maxLength(OrcConf.MAX_FILE_LENGTH.getLong(conf)) +.filesystem(fileSplit.getPath.getFileSystem(conf))) + +val options = OrcInputFormat.buildOptions(conf, reader, fileSplit.getStart, fileSplit.getLength) +rows = reader.rows(options) + } + + /** + * Set required schema and partition information. + * With this information, this creates ColumnarBatch with the full schema. + */ + def setRequiredSchema( + orcSchema: TypeDescription, + requestedColIds: Array[Int], + resultSchema: StructType, + requiredSchema: StructType, + partitionValues: InternalRow): Unit = { +batch = orcSchema.createRowBatch(DEFAULT_SIZE) +assert(!batch.selectedInUse) +
[GitHub] spark pull request #19943: [SPARK-16060][SQL] Support Vectorized ORC Reader
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/19943#discussion_r158764584 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcColumnarBatchReader.scala --- @@ -0,0 +1,432 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources.orc + +import org.apache.hadoop.mapreduce.{InputSplit, RecordReader, TaskAttemptContext} +import org.apache.hadoop.mapreduce.lib.input.FileSplit +import org.apache.orc._ +import org.apache.orc.mapred.OrcInputFormat +import org.apache.orc.storage.ql.exec.vector._ +import org.apache.orc.storage.serde2.io.HiveDecimalWritable + +import org.apache.spark.internal.Logging +import org.apache.spark.memory.MemoryMode +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.execution.vectorized._ +import org.apache.spark.sql.types._ + + +/** + * To support vectorization in WholeStageCodeGen, this reader returns ColumnarBatch. + */ +private[orc] class OrcColumnarBatchReader extends RecordReader[Void, ColumnarBatch] with Logging { + import OrcColumnarBatchReader._ + + /** + * ORC File Reader. + */ + private var reader: Reader = _ + + /** + * Vectorized Row Batch. + */ + private var batch: VectorizedRowBatch = _ + + /** + * Requested Column IDs. + */ + private var requestedColIds: Array[Int] = _ + + /** + * Record reader from row batch. + */ + private var rows: org.apache.orc.RecordReader = _ + + /** + * Required Schema. + */ + private var requiredSchema: StructType = _ + + /** + * ColumnarBatch for vectorized execution by whole-stage codegen. + */ + private var columnarBatch: ColumnarBatch = _ + + /** + * Writable columnVectors of ColumnarBatch. + */ + private var columnVectors: Seq[WritableColumnVector] = _ + + /** + * The number of rows read and considered to be returned. + */ + private var rowsReturned: Long = 0L + + /** + * Total number of rows. + */ + private var totalRowCount: Long = 0L + + override def getCurrentKey: Void = null + + override def getCurrentValue: ColumnarBatch = columnarBatch + + override def getProgress: Float = rowsReturned.toFloat / totalRowCount + + override def nextKeyValue(): Boolean = nextBatch() + + override def close(): Unit = { +if (columnarBatch != null) { + columnarBatch.close() + columnarBatch = null +} +if (rows != null) { + rows.close() + rows = null +} + } + + /** + * Initialize ORC file reader and batch record reader. + * Please note that `setRequiredSchema` is needed to be called after this. + */ + override def initialize(inputSplit: InputSplit, taskAttemptContext: TaskAttemptContext): Unit = { +val fileSplit = inputSplit.asInstanceOf[FileSplit] +val conf = taskAttemptContext.getConfiguration +reader = OrcFile.createReader( + fileSplit.getPath, + OrcFile.readerOptions(conf) +.maxLength(OrcConf.MAX_FILE_LENGTH.getLong(conf)) +.filesystem(fileSplit.getPath.getFileSystem(conf))) + +val options = OrcInputFormat.buildOptions(conf, reader, fileSplit.getStart, fileSplit.getLength) +rows = reader.rows(options) + } + + /** + * Set required schema and partition information. + * With this information, this creates ColumnarBatch with the full schema. + */ + def setRequiredSchema( + orcSchema: TypeDescription, + requestedColIds: Array[Int], + resultSchema: StructType, + requiredSchema: StructType, + partitionValues: InternalRow): Unit = { +batch = orcSchema.createRowBatch(DEFAULT_SIZE) +assert(!batch.selectedInUse) +
[GitHub] spark pull request #19943: [SPARK-16060][SQL] Support Vectorized ORC Reader
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/19943#discussion_r158764569 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcColumnarBatchReader.scala --- @@ -0,0 +1,432 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources.orc + +import org.apache.hadoop.mapreduce.{InputSplit, RecordReader, TaskAttemptContext} +import org.apache.hadoop.mapreduce.lib.input.FileSplit +import org.apache.orc._ +import org.apache.orc.mapred.OrcInputFormat +import org.apache.orc.storage.ql.exec.vector._ +import org.apache.orc.storage.serde2.io.HiveDecimalWritable + +import org.apache.spark.internal.Logging +import org.apache.spark.memory.MemoryMode +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.execution.vectorized._ +import org.apache.spark.sql.types._ + + +/** + * To support vectorization in WholeStageCodeGen, this reader returns ColumnarBatch. + */ +private[orc] class OrcColumnarBatchReader extends RecordReader[Void, ColumnarBatch] with Logging { + import OrcColumnarBatchReader._ + + /** + * ORC File Reader. + */ + private var reader: Reader = _ + + /** + * Vectorized Row Batch. + */ + private var batch: VectorizedRowBatch = _ + + /** + * Requested Column IDs. + */ + private var requestedColIds: Array[Int] = _ + + /** + * Record reader from row batch. + */ + private var rows: org.apache.orc.RecordReader = _ + + /** + * Required Schema. + */ + private var requiredSchema: StructType = _ + + /** + * ColumnarBatch for vectorized execution by whole-stage codegen. + */ + private var columnarBatch: ColumnarBatch = _ + + /** + * Writable columnVectors of ColumnarBatch. + */ + private var columnVectors: Seq[WritableColumnVector] = _ + + /** + * The number of rows read and considered to be returned. + */ + private var rowsReturned: Long = 0L + + /** + * Total number of rows. + */ + private var totalRowCount: Long = 0L + + override def getCurrentKey: Void = null + + override def getCurrentValue: ColumnarBatch = columnarBatch + + override def getProgress: Float = rowsReturned.toFloat / totalRowCount + + override def nextKeyValue(): Boolean = nextBatch() + + override def close(): Unit = { +if (columnarBatch != null) { + columnarBatch.close() + columnarBatch = null +} +if (rows != null) { + rows.close() + rows = null +} + } + + /** + * Initialize ORC file reader and batch record reader. + * Please note that `setRequiredSchema` is needed to be called after this. + */ + override def initialize(inputSplit: InputSplit, taskAttemptContext: TaskAttemptContext): Unit = { +val fileSplit = inputSplit.asInstanceOf[FileSplit] +val conf = taskAttemptContext.getConfiguration +reader = OrcFile.createReader( + fileSplit.getPath, + OrcFile.readerOptions(conf) +.maxLength(OrcConf.MAX_FILE_LENGTH.getLong(conf)) +.filesystem(fileSplit.getPath.getFileSystem(conf))) + +val options = OrcInputFormat.buildOptions(conf, reader, fileSplit.getStart, fileSplit.getLength) +rows = reader.rows(options) + } + + /** + * Set required schema and partition information. + * With this information, this creates ColumnarBatch with the full schema. + */ + def setRequiredSchema( + orcSchema: TypeDescription, + requestedColIds: Array[Int], + resultSchema: StructType, + requiredSchema: StructType, + partitionValues: InternalRow): Unit = { +batch = orcSchema.createRowBatch(DEFAULT_SIZE) +assert(!batch.selectedInUse) +
[GitHub] spark pull request #19943: [SPARK-16060][SQL] Support Vectorized ORC Reader
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/19943#discussion_r158764558 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcColumnarBatchReader.scala --- @@ -0,0 +1,432 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources.orc + +import org.apache.hadoop.mapreduce.{InputSplit, RecordReader, TaskAttemptContext} +import org.apache.hadoop.mapreduce.lib.input.FileSplit +import org.apache.orc._ +import org.apache.orc.mapred.OrcInputFormat +import org.apache.orc.storage.ql.exec.vector._ +import org.apache.orc.storage.serde2.io.HiveDecimalWritable + +import org.apache.spark.internal.Logging +import org.apache.spark.memory.MemoryMode +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.execution.vectorized._ +import org.apache.spark.sql.types._ + + +/** + * To support vectorization in WholeStageCodeGen, this reader returns ColumnarBatch. + */ +private[orc] class OrcColumnarBatchReader extends RecordReader[Void, ColumnarBatch] with Logging { + import OrcColumnarBatchReader._ + + /** + * ORC File Reader. + */ + private var reader: Reader = _ + + /** + * Vectorized Row Batch. + */ + private var batch: VectorizedRowBatch = _ + + /** + * Requested Column IDs. + */ + private var requestedColIds: Array[Int] = _ + + /** + * Record reader from row batch. + */ + private var rows: org.apache.orc.RecordReader = _ + + /** + * Required Schema. + */ + private var requiredSchema: StructType = _ + + /** + * ColumnarBatch for vectorized execution by whole-stage codegen. + */ + private var columnarBatch: ColumnarBatch = _ + + /** + * Writable columnVectors of ColumnarBatch. + */ + private var columnVectors: Seq[WritableColumnVector] = _ + + /** + * The number of rows read and considered to be returned. + */ + private var rowsReturned: Long = 0L + + /** + * Total number of rows. + */ + private var totalRowCount: Long = 0L + + override def getCurrentKey: Void = null + + override def getCurrentValue: ColumnarBatch = columnarBatch + + override def getProgress: Float = rowsReturned.toFloat / totalRowCount + + override def nextKeyValue(): Boolean = nextBatch() + + override def close(): Unit = { +if (columnarBatch != null) { + columnarBatch.close() + columnarBatch = null +} +if (rows != null) { + rows.close() + rows = null +} + } + + /** + * Initialize ORC file reader and batch record reader. + * Please note that `setRequiredSchema` is needed to be called after this. + */ + override def initialize(inputSplit: InputSplit, taskAttemptContext: TaskAttemptContext): Unit = { +val fileSplit = inputSplit.asInstanceOf[FileSplit] +val conf = taskAttemptContext.getConfiguration +reader = OrcFile.createReader( + fileSplit.getPath, + OrcFile.readerOptions(conf) +.maxLength(OrcConf.MAX_FILE_LENGTH.getLong(conf)) +.filesystem(fileSplit.getPath.getFileSystem(conf))) + +val options = OrcInputFormat.buildOptions(conf, reader, fileSplit.getStart, fileSplit.getLength) +rows = reader.rows(options) + } + + /** + * Set required schema and partition information. + * With this information, this creates ColumnarBatch with the full schema. + */ + def setRequiredSchema( + orcSchema: TypeDescription, + requestedColIds: Array[Int], + resultSchema: StructType, + requiredSchema: StructType, + partitionValues: InternalRow): Unit = { +batch = orcSchema.createRowBatch(DEFAULT_SIZE) +assert(!batch.selectedInUse) +
[GitHub] spark pull request #19943: [SPARK-16060][SQL] Support Vectorized ORC Reader
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/19943#discussion_r158764573 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcColumnarBatchReader.scala --- @@ -0,0 +1,432 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources.orc + +import org.apache.hadoop.mapreduce.{InputSplit, RecordReader, TaskAttemptContext} +import org.apache.hadoop.mapreduce.lib.input.FileSplit +import org.apache.orc._ +import org.apache.orc.mapred.OrcInputFormat +import org.apache.orc.storage.ql.exec.vector._ +import org.apache.orc.storage.serde2.io.HiveDecimalWritable + +import org.apache.spark.internal.Logging +import org.apache.spark.memory.MemoryMode +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.execution.vectorized._ +import org.apache.spark.sql.types._ + + +/** + * To support vectorization in WholeStageCodeGen, this reader returns ColumnarBatch. + */ +private[orc] class OrcColumnarBatchReader extends RecordReader[Void, ColumnarBatch] with Logging { + import OrcColumnarBatchReader._ + + /** + * ORC File Reader. + */ + private var reader: Reader = _ + + /** + * Vectorized Row Batch. + */ + private var batch: VectorizedRowBatch = _ + + /** + * Requested Column IDs. + */ + private var requestedColIds: Array[Int] = _ + + /** + * Record reader from row batch. + */ + private var rows: org.apache.orc.RecordReader = _ + + /** + * Required Schema. + */ + private var requiredSchema: StructType = _ + + /** + * ColumnarBatch for vectorized execution by whole-stage codegen. + */ + private var columnarBatch: ColumnarBatch = _ + + /** + * Writable columnVectors of ColumnarBatch. + */ + private var columnVectors: Seq[WritableColumnVector] = _ + + /** + * The number of rows read and considered to be returned. + */ + private var rowsReturned: Long = 0L + + /** + * Total number of rows. + */ + private var totalRowCount: Long = 0L + + override def getCurrentKey: Void = null + + override def getCurrentValue: ColumnarBatch = columnarBatch + + override def getProgress: Float = rowsReturned.toFloat / totalRowCount + + override def nextKeyValue(): Boolean = nextBatch() + + override def close(): Unit = { +if (columnarBatch != null) { + columnarBatch.close() + columnarBatch = null +} +if (rows != null) { + rows.close() + rows = null +} + } + + /** + * Initialize ORC file reader and batch record reader. + * Please note that `setRequiredSchema` is needed to be called after this. + */ + override def initialize(inputSplit: InputSplit, taskAttemptContext: TaskAttemptContext): Unit = { +val fileSplit = inputSplit.asInstanceOf[FileSplit] +val conf = taskAttemptContext.getConfiguration +reader = OrcFile.createReader( + fileSplit.getPath, + OrcFile.readerOptions(conf) +.maxLength(OrcConf.MAX_FILE_LENGTH.getLong(conf)) +.filesystem(fileSplit.getPath.getFileSystem(conf))) + +val options = OrcInputFormat.buildOptions(conf, reader, fileSplit.getStart, fileSplit.getLength) +rows = reader.rows(options) + } + + /** + * Set required schema and partition information. + * With this information, this creates ColumnarBatch with the full schema. + */ + def setRequiredSchema( + orcSchema: TypeDescription, + requestedColIds: Array[Int], + resultSchema: StructType, + requiredSchema: StructType, + partitionValues: InternalRow): Unit = { +batch = orcSchema.createRowBatch(DEFAULT_SIZE) +assert(!batch.selectedInUse) +
[GitHub] spark pull request #19943: [SPARK-16060][SQL] Support Vectorized ORC Reader
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/19943#discussion_r158764338 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcColumnarBatchReader.scala --- @@ -0,0 +1,432 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources.orc + +import org.apache.hadoop.mapreduce.{InputSplit, RecordReader, TaskAttemptContext} +import org.apache.hadoop.mapreduce.lib.input.FileSplit +import org.apache.orc._ +import org.apache.orc.mapred.OrcInputFormat +import org.apache.orc.storage.ql.exec.vector._ +import org.apache.orc.storage.serde2.io.HiveDecimalWritable + +import org.apache.spark.internal.Logging +import org.apache.spark.memory.MemoryMode +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.execution.vectorized._ +import org.apache.spark.sql.types._ + + +/** + * To support vectorization in WholeStageCodeGen, this reader returns ColumnarBatch. + */ +private[orc] class OrcColumnarBatchReader extends RecordReader[Void, ColumnarBatch] with Logging { + import OrcColumnarBatchReader._ + + /** + * ORC File Reader. + */ + private var reader: Reader = _ + + /** + * Vectorized Row Batch. + */ + private var batch: VectorizedRowBatch = _ + + /** + * Requested Column IDs. + */ + private var requestedColIds: Array[Int] = _ + + /** + * Record reader from row batch. + */ + private var rows: org.apache.orc.RecordReader = _ + + /** + * Required Schema. + */ + private var requiredSchema: StructType = _ + + /** + * ColumnarBatch for vectorized execution by whole-stage codegen. + */ + private var columnarBatch: ColumnarBatch = _ + + /** + * Writable columnVectors of ColumnarBatch. + */ + private var columnVectors: Seq[WritableColumnVector] = _ + + /** + * The number of rows read and considered to be returned. + */ + private var rowsReturned: Long = 0L + + /** + * Total number of rows. + */ + private var totalRowCount: Long = 0L + + override def getCurrentKey: Void = null + + override def getCurrentValue: ColumnarBatch = columnarBatch + + override def getProgress: Float = rowsReturned.toFloat / totalRowCount + + override def nextKeyValue(): Boolean = nextBatch() + + override def close(): Unit = { +if (columnarBatch != null) { + columnarBatch.close() + columnarBatch = null +} +if (rows != null) { + rows.close() + rows = null +} + } + + /** + * Initialize ORC file reader and batch record reader. + * Please note that `setRequiredSchema` is needed to be called after this. + */ + override def initialize(inputSplit: InputSplit, taskAttemptContext: TaskAttemptContext): Unit = { +val fileSplit = inputSplit.asInstanceOf[FileSplit] +val conf = taskAttemptContext.getConfiguration +reader = OrcFile.createReader( + fileSplit.getPath, + OrcFile.readerOptions(conf) +.maxLength(OrcConf.MAX_FILE_LENGTH.getLong(conf)) +.filesystem(fileSplit.getPath.getFileSystem(conf))) + +val options = OrcInputFormat.buildOptions(conf, reader, fileSplit.getStart, fileSplit.getLength) +rows = reader.rows(options) + } + + /** + * Set required schema and partition information. + * With this information, this creates ColumnarBatch with the full schema. + */ + def setRequiredSchema( + orcSchema: TypeDescription, + requestedColIds: Array[Int], + resultSchema: StructType, + requiredSchema: StructType, + partitionValues: InternalRow): Unit = { +batch = orcSchema.createRowBatch(DEFAULT_SIZE) +assert(!batch.selectedInUse) +
[GitHub] spark issue #20089: [SPARK-22324][SQL][PYTHON][FOLLOW-UP] Update setup.py fi...
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/20089 @HyukjinKwon Thanks! I'll add it soon. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20089: [SPARK-22324][SQL][PYTHON][FOLLOW-UP] Update setup.py fi...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20089 Yea, I think we could. I added the support and tested it before - SPARK-19019. I think it's okay to add it they are just metadata AFAIK. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20089: [SPARK-22324][SQL][PYTHON][FOLLOW-UP] Update setup.py fi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20089 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20089: [SPARK-22324][SQL][PYTHON][FOLLOW-UP] Update setup.py fi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20089 **[Test build #85424 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85424/testReport)** for PR 20089 at commit [`36614af`](https://github.com/apache/spark/commit/36614af4d8e00bb9564ef834a341859a0e96dfe4). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20089: [SPARK-22324][SQL][PYTHON][FOLLOW-UP] Update setup.py fi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20089 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85424/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18906: [SPARK-21692][PYSPARK][SQL] Add nullability support to P...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18906 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85419/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18906: [SPARK-21692][PYSPARK][SQL] Add nullability support to P...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18906 Build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20089: [SPARK-22324][SQL][PYTHON][FOLLOW-UP] Update setup.py fi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20089 **[Test build #85424 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85424/testReport)** for PR 20089 at commit [`36614af`](https://github.com/apache/spark/commit/36614af4d8e00bb9564ef834a341859a0e96dfe4). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18906: [SPARK-21692][PYSPARK][SQL] Add nullability support to P...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18906 **[Test build #85419 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85419/testReport)** for PR 18906 at commit [`d992f93`](https://github.com/apache/spark/commit/d992f939886c488d00bad7ac0d43c4e8e1eb41b7). * This patch **fails Spark unit tests**. * This patch **does not merge cleanly**. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20089: [SPARK-22324][SQL][PYTHON][FOLLOW-UP] Update setup.py fi...
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/20089 Btw, should we add `'Programming Language :: Python :: 3.6'` to `classifiers`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20089: [SPARK-22324][SQL][PYTHON][FOLLOW-UP] Update setu...
GitHub user ueshin opened a pull request: https://github.com/apache/spark/pull/20089 [SPARK-22324][SQL][PYTHON][FOLLOW-UP] Update setup.py file. ## What changes were proposed in this pull request? This is a follow-up pr of #19884 updating setup.py file to add pyarrow dependency. ## How was this patch tested? Existing tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/ueshin/apache-spark issues/SPARK-22324/fup1 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20089.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20089 commit 36614af4d8e00bb9564ef834a341859a0e96dfe4 Author: Takuya UESHINDate: 2017-12-27T04:33:59Z Add pyarrow to setup.py. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20036: [SPARK-18016][SQL][FOLLOW-UP] Code Generation: Constant ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20036 **[Test build #85423 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85423/testReport)** for PR 20036 at commit [`05da9d7`](https://github.com/apache/spark/commit/05da9d7dfa2aca359630e70eee96db5abf96c9e4). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20036: [SPARK-18016][SQL][FOLLOW-UP] Code Generation: Co...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/20036#discussion_r158761989 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala --- @@ -283,7 +283,7 @@ case class InputAdapter(child: SparkPlan) extends UnaryExecNode with CodegenSupp override def doProduce(ctx: CodegenContext): String = { // Right now, InputAdapter is only used when there is one input RDD. -// inline mutable state since an inputAdaptor in a task +// inline mutable state since an InputAdapter in a task --- End diff -- sure, done --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19943: [SPARK-16060][SQL] Support Vectorized ORC Reader
Github user henrify commented on the issue: https://github.com/apache/spark/pull/19943 If i've understood Spark development process correctly, the 2.3 branch cut date is in couple of days, and if this PR doesn't get merged to master real soon, it'll have to wait until 2.4, about 6 months? @dongjoon-hyun @cloud-fan Considering that the benchmarks show almost order of magnitude improvement in performance, it would be really great to get this in for Spark 2.3, and worry about the details of copy vs wrapper approach later. Also, as this is anyway opt-in feature that needs to be enabled with config option, merging this shouldn't be "dangerous".. Thanks for your efforts & looking forward to get this PR to production! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20088: [SPARK-22905][ML][MLLIB][CORE] Fix ChiSqSelectorModel sa...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/20088 Currently I cannot construct a failed test for this issue, but the future PR (changing `RoundRobinPartitioning`) by @jiangxb1987 will trigger this bug. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20088: [SPARK-22905][ML][MLLIB][CORE] Fix ChiSqSelectorModel sa...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20088 **[Test build #85422 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85422/testReport)** for PR 20088 at commit [`5fd56c3`](https://github.com/apache/spark/commit/5fd56c350cb410740209fba32c19489847a1d019). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20088: [SPARK-22905][ML][MLLIB][CORE] Fix ChiSqSelectorM...
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/20088 [SPARK-22905][ML][MLLIB][CORE] Fix ChiSqSelectorModel save implementation ## What changes were proposed in this pull request? Currently, in `ChiSqSelectorModel`, save: spark.createDataFrame(dataArray).repartition(1).write... The default partition number used by createDataFrame is "defaultParallelism", Current RoundRobinPartitioning won't guarantee the "repartition" generating the same order result with local array. We need fix it. ## How was this patch tested? N/A You can merge this pull request into a Git repository by running: $ git pull https://github.com/WeichenXu123/spark fix_chisq_model_save Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20088.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20088 commit 5fd56c350cb410740209fba32c19489847a1d019 Author: WeichenXuDate: 2017-12-27T04:31:25Z init pr --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20085: [SPARK-22739][Catalyst][WIP] Additional Expressio...
Github user bdrillard commented on a diff in the pull request: https://github.com/apache/spark/pull/20085#discussion_r158761511 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CodeGenerationSuite.scala --- @@ -390,8 +391,8 @@ class CodeGenerationSuite extends SparkFunSuite with ExpressionEvalHelper { test("SPARK-22696: InitializeJavaBean should not use global variables") { --- End diff -- Fixed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20085: [SPARK-22739][Catalyst][WIP] Additional Expressio...
Github user bdrillard commented on a diff in the pull request: https://github.com/apache/spark/pull/20085#discussion_r158761155 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala --- @@ -106,27 +106,27 @@ trait InvokeLike extends Expression with NonSQLExpression { } /** - * Invokes a static function, returning the result. By default, any of the arguments being null - * will result in returning null instead of calling the function. - * - * @param staticObject The target of the static call. This can either be the object itself - * (methods defined on scala objects), or the class object - * (static methods defined in java). - * @param dataType The expected return type of the function call - * @param functionName The name of the method to call. - * @param arguments An optional list of expressions to pass as arguments to the function. - * @param propagateNull When true, and any of the arguments is null, null will be returned instead - * of calling the function. - * @param returnNullable When false, indicating the invoked method will always return - * non-null value. - */ + * Invokes a static function, returning the result. By default, any of the arguments being null --- End diff -- Those additional spaces shouldn't be there, I've fixed them. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20061: [SPARK-22890][TEST] Basic tests for DateTimeOperations
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20061 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85418/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20061: [SPARK-22890][TEST] Basic tests for DateTimeOperations
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20061 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20061: [SPARK-22890][TEST] Basic tests for DateTimeOperations
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20061 **[Test build #85418 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85418/testReport)** for PR 20061 at commit [`cfef0f1`](https://github.com/apache/spark/commit/cfef0f1511bb1ec8f9bd99ca41effce347968dd0). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19683: [SPARK-21657][SQL] optimize explode quadratic memory con...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19683 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85416/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19683: [SPARK-21657][SQL] optimize explode quadratic memory con...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19683 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19683: [SPARK-21657][SQL] optimize explode quadratic memory con...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19683 **[Test build #85416 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85416/testReport)** for PR 19683 at commit [`c3183d0`](https://github.com/apache/spark/commit/c3183d0cba092d1308d62d06cbfa16e6d8f97498). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20085: [SPARK-22739][Catalyst][WIP] Additional Expressio...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20085#discussion_r158760302 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala --- @@ -182,6 +182,114 @@ case class StaticInvoke( } } +/** + * Invokes a call to reference to a static field. + * + * @param staticObject The target of the static call. This can either be the object itself + * (methods defined on scala objects), or the class object + * (static methods defined in java). + * @param dataType The expected return type of the function call. + * @param fieldName The field to reference. + */ +case class StaticField( + staticObject: Class[_], + dataType: DataType, + fieldName: String) extends Expression with NonSQLExpression { + + val objectName = staticObject.getName.stripSuffix("$") + + override def nullable: Boolean = false + override def children: Seq[Expression] = Nil + + override def eval(input: InternalRow): Any = +throw new UnsupportedOperationException("Only code-generated evaluation is supported.") + + override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = { +val javaType = ctx.javaType(dataType) + +val code = s""" + final $javaType ${ev.value} = $objectName.$fieldName; +""" + +ev.copy(code = code, isNull = "false") + } +} + +/** + * Wraps an expression in a try-catch block, which can be used if the body expression may throw a + * exception. + * + * @param body The expression body to wrap in a try-catch block. + * @param dataType The return type of the try block. + * @param returnNullable When false, indicating the invoked method will always return + * non-null value. + */ +case class WrapException( +body: Expression, +dataType: DataType, +returnNullable: Boolean = true) extends Expression with NonSQLExpression { + + override def nullable: Boolean = returnNullable + override def children: Seq[Expression] = Seq(body) + + override def eval(input: InternalRow): Any = +throw new UnsupportedOperationException("Only code-generated evaluation is supported.") + + override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = { +val javaType = ctx.javaType(dataType) +val returnName = ctx.freshName("returnName") + +val bodyExpr = body.genCode(ctx) + +val code = + s""" + |final $javaType $returnName; + |try { + | ${bodyExpr.code} + | $returnName = ${bodyExpr.value}; + |} catch (Exception e) { + | org.apache.spark.unsafe.Platform.throwException(e); + |} + """.stripMargin + +ev.copy(code = code, isNull = bodyExpr.isNull, value = returnName) + } +} + +/** + * Returns the value if it is of the specified type, or null otherwise + * + * @param value The value to returned + * @param checkedType The type to check against the value via instanceOf + * @param dataTypeThe type returned by the expression + */ +case class ValueIfType( + value: Expression, --- End diff -- Should we limit the data type of `value` to `ObjectType`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20085: [SPARK-22739][Catalyst][WIP] Additional Expressio...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20085#discussion_r158760292 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala --- @@ -182,6 +182,114 @@ case class StaticInvoke( } } +/** + * Invokes a call to reference to a static field. + * + * @param staticObject The target of the static call. This can either be the object itself + * (methods defined on scala objects), or the class object + * (static methods defined in java). + * @param dataType The expected return type of the function call. + * @param fieldName The field to reference. + */ +case class StaticField( + staticObject: Class[_], + dataType: DataType, + fieldName: String) extends Expression with NonSQLExpression { + + val objectName = staticObject.getName.stripSuffix("$") + + override def nullable: Boolean = false + override def children: Seq[Expression] = Nil + + override def eval(input: InternalRow): Any = +throw new UnsupportedOperationException("Only code-generated evaluation is supported.") + + override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = { +val javaType = ctx.javaType(dataType) + +val code = s""" + final $javaType ${ev.value} = $objectName.$fieldName; +""" + +ev.copy(code = code, isNull = "false") + } +} + +/** + * Wraps an expression in a try-catch block, which can be used if the body expression may throw a + * exception. + * + * @param body The expression body to wrap in a try-catch block. + * @param dataType The return type of the try block. + * @param returnNullable When false, indicating the invoked method will always return + * non-null value. + */ +case class WrapException( +body: Expression, +dataType: DataType, +returnNullable: Boolean = true) extends Expression with NonSQLExpression { + + override def nullable: Boolean = returnNullable + override def children: Seq[Expression] = Seq(body) + + override def eval(input: InternalRow): Any = +throw new UnsupportedOperationException("Only code-generated evaluation is supported.") + + override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = { +val javaType = ctx.javaType(dataType) +val returnName = ctx.freshName("returnName") + +val bodyExpr = body.genCode(ctx) + +val code = + s""" + |final $javaType $returnName; + |try { + | ${bodyExpr.code} + | $returnName = ${bodyExpr.value}; + |} catch (Exception e) { + | org.apache.spark.unsafe.Platform.throwException(e); + |} + """.stripMargin + +ev.copy(code = code, isNull = bodyExpr.isNull, value = returnName) + } +} + +/** + * Returns the value if it is of the specified type, or null otherwise + * + * @param value The value to returned + * @param checkedType The type to check against the value via instanceOf + * @param dataTypeThe type returned by the expression + */ +case class ValueIfType( + value: Expression, + checkedType: Class[_], + dataType: DataType) extends Expression with NonSQLExpression { --- End diff -- Will we have different data type other than `value.dataType`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19813 This is a pretty cool idea that can work with the current string based codegen framework, LGTM! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20085: [SPARK-22739][Catalyst][WIP] Additional Expressio...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20085#discussion_r158759848 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CodeGenerationSuite.scala --- @@ -390,8 +391,8 @@ class CodeGenerationSuite extends SparkFunSuite with ExpressionEvalHelper { test("SPARK-22696: InitializeJavaBean should not use global variables") { --- End diff -- `InitializeJavaBean` -> `InitializeObject`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19977: [SPARK-22771][SQL] Concatenate binary inputs into a bina...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/19977 I'll update tonight. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20085: [SPARK-22739][Catalyst][WIP] Additional Expressio...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20085#discussion_r158759244 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala --- @@ -106,27 +106,27 @@ trait InvokeLike extends Expression with NonSQLExpression { } /** - * Invokes a static function, returning the result. By default, any of the arguments being null - * will result in returning null instead of calling the function. - * - * @param staticObject The target of the static call. This can either be the object itself - * (methods defined on scala objects), or the class object - * (static methods defined in java). - * @param dataType The expected return type of the function call - * @param functionName The name of the method to call. - * @param arguments An optional list of expressions to pass as arguments to the function. - * @param propagateNull When true, and any of the arguments is null, null will be returned instead - * of calling the function. - * @param returnNullable When false, indicating the invoked method will always return - * non-null value. - */ + * Invokes a static function, returning the result. By default, any of the arguments being null --- End diff -- Why we change the comment style? Looks not consistent with others. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20056: [SPARK-22878] [CORE] Count totalDroppedEvents for LiveLi...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/20056 Probably, you better pass all the tests by yourself to save committers' bandwidth. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compres...
Github user fjh100456 closed the pull request at: https://github.com/apache/spark/pull/19218 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...
Github user fjh100456 commented on the issue: https://github.com/apache/spark/pull/19218 Please go to #20087 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20087: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20087 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20087: [SPARK-21786][SQL] The 'spark.sql.parquet.compres...
GitHub user fjh100456 opened a pull request: https://github.com/apache/spark/pull/20087 [SPARK-21786][SQL] The 'spark.sql.parquet.compression.codec' and 'spark.sql.orc.compression.codec' configuration doesn't take effect on hive table writing [SPARK-21786][SQL] The 'spark.sql.parquet.compression.codec' and 'spark.sql.orc.compression.codec' configuration doesn't take effect on hive table writing What changes were proposed in this pull request? Pass âspark.sql.parquet.compression.codecâ value to âparquet.compressionâ. Pass âspark.sql.orc.compression.codecâ value to âorc.compressâ. How was this patch tested? Add test. Note: This is the same issue mentioned in #19218 . That branch was deleted mistakenly, so make a new pr instead. @gatorsmile @maropu @dongjoon-hyun @discipleforteen You can merge this pull request into a Git repository by running: $ git pull https://github.com/fjh100456/spark HiveTableWriting Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20087.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20087 commit 9bbfe6ef4b5a418373c2250ad676233fb05df7f7 Author: fjh100456Date: 2017-12-25T02:29:53Z [SPARK-21786][SQL] When acquiring 'compressionCodecClassName' in 'ParquetOptions', `parquet.compression` needs to be considered. ## What changes were proposed in this pull request? 1.Increased acquiring 'compressionCodecClassName' from `parquet.compression`,and the order is `compression`,`parquet.compression`,`spark.sql.parquet.compression.codec`, just like what we do in `OrcOptions`. 2.Change `spark.sql.parquet.compression.codec` to support "none".Actually in `ParquetOptions`,we do support "none" as equivalent to "uncompressed", but it does not allowed to configured to "none". ## How was this patch tested? Manual test. commit 48cf108ed5c3298eb860d9735b439ac89d65765e Author: fjh100456 Date: 2017-12-25T02:30:24Z [SPARK-21786][SQL] When acquiring 'compressionCodecClassName' in 'ParquetOptions', `parquet.compression` needs to be considered. ## What changes were proposed in this pull request? 1.Increased acquiring 'compressionCodecClassName' from `parquet.compression`,and the order is `compression`,`parquet.compression`,`spark.sql.parquet.compression.codec`, just like what we do in `OrcOptions`. 2.Change `spark.sql.parquet.compression.codec` to support "none".Actually in `ParquetOptions`,we do support "none" as equivalent to "uncompressed", but it does not allowed to configured to "none". ## How was this patch tested? Manual test. commit 5dbd3edf9e086433d3d3fe9c0ead887d799c61d3 Author: fjh100456 Date: 2017-12-25T02:34:29Z spark.sql.parquet.compression.codec[SPARK-21786][SQL] When acquiring 'compressionCodecClassName' in 'ParquetOptions', `parquet.compression` needs to be considered. ## What changes were proposed in this pull request? 1.Increased acquiring 'compressionCodecClassName' from `parquet.compression`,and the order is `compression`,`parquet.compression`,`spark.sql.parquet.compression.codec`, just like what we do in `OrcOptions`. 2.Change `spark.sql.parquet.compression.codec` to support "none".Actually in `ParquetOptions`,we do support "none" as equivalent to "uncompressed", but it does not allowed to configured to "none". ## How was this patch tested? Manual test. commit 5124f1b560e942c0dc23af31336317a4b995dd8f Author: fjh100456 Date: 2017-12-25T07:06:26Z spark.sql.parquet.compression.codec[SPARK-21786][SQL] When acquiring 'compressionCodecClassName' in 'ParquetOptions', `parquet.compression` needs to be considered. ## What changes were proposed in this pull request? 1.Increased acquiring 'compressionCodecClassName' from `parquet.compression`,and the order is `compression`,`parquet.compression`,`spark.sql.parquet.compression.codec`, just like what we do in `OrcOptions`. 2.Change `spark.sql.parquet.compression.codec` to support "none".Actually in `ParquetOptions`,we do support "none" as equivalent to "uncompressed", but it does not allowed to configured to "none". 3.Change `compressionCode` to `compressionCodecClassName`. ## How was this patch tested? Manual test. commit 6907a3ef86a2546fae91c22754796490a80e Author: fjh100456 Date: 2017-12-25T09:26:33Z Make comression codec take effect in hive table writing. commit 67e40d4d7fd3b6a9e4526ce17bf6d4eadb05b2b8 Author: fjh100456 Date: 2017-12-25T12:08:11Z Modify test commit e2526ca1bb72e54c03d977b8678bd14b28c83585 Author: fjh100456 Date: 2017-12-26T05:38:10Z
[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19813 I agree that the best is we can have both of them. I have a proposal to replace statement output in split methods. Maybe you can check if it sounds good. By #20043, we have a `StatementValue` wrapping statement output. Instead of immediately embedding the statement in codes, we use a special replacement like `%STATEMENT_1%` for it. Normally we replace this with actual statement. If we need split methods, we replace this with a generated variable name. As it is special replacement, I think it should be safer. This is the idea to more safely replace statement with generate variable name under the string based framework. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20056: [SPARK-22878] [CORE] Count totalDroppedEvents for LiveLi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20056 **[Test build #85415 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85415/testReport)** for PR 20056 at commit [`34cfb46`](https://github.com/apache/spark/commit/34cfb46b7b5eca0a1785880fee792418507aac89). * This patch **fails from timeout after a configured wait of \`250m\`**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20056: [SPARK-22878] [CORE] Count totalDroppedEvents for LiveLi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20056 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85415/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20056: [SPARK-22878] [CORE] Count totalDroppedEvents for LiveLi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20056 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20056: [SPARK-22878] [CORE] Count totalDroppedEvents for LiveLi...
Github user Ngone51 commented on the issue: https://github.com/apache/spark/pull/20056 @maropu Actually, I didn't modify this unit test ever. And my unit test locate in SparkListenerSuite haven'been started according to the "Console Output". --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19943: [SPARK-16060][SQL] Support Vectorized ORC Reader
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19943#discussion_r158756552 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcColumnarBatchReader.scala --- @@ -0,0 +1,432 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources.orc + +import org.apache.hadoop.mapreduce.{InputSplit, RecordReader, TaskAttemptContext} +import org.apache.hadoop.mapreduce.lib.input.FileSplit +import org.apache.orc._ +import org.apache.orc.mapred.OrcInputFormat +import org.apache.orc.storage.ql.exec.vector._ +import org.apache.orc.storage.serde2.io.HiveDecimalWritable + +import org.apache.spark.internal.Logging +import org.apache.spark.memory.MemoryMode +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.execution.vectorized._ +import org.apache.spark.sql.types._ + + +/** + * To support vectorization in WholeStageCodeGen, this reader returns ColumnarBatch. + */ +private[orc] class OrcColumnarBatchReader extends RecordReader[Void, ColumnarBatch] with Logging { --- End diff -- I think it's fine to follow parquet and write data to Spark column vector now. Later we can try the wrapper approach and compare. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20082: [SPARK-22897][CORE]: Expose stageAttemptId in TaskContex...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20082 **[Test build #85421 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85421/testReport)** for PR 20082 at commit [`59e4a9c`](https://github.com/apache/spark/commit/59e4a9c70c037729f3eb60b47b2e625208687385). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20082: [SPARK-22897][CORE]: Expose stageAttemptId in TaskContex...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20082 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20082: [SPARK-22897][CORE]: Expose stageAttemptId in TaskContex...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20082 **[Test build #85420 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85420/testReport)** for PR 20082 at commit [`0f86e95`](https://github.com/apache/spark/commit/0f86e95fd9c8703516e23b5b3bd370a89b27bc69). * This patch **fails Scala style tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20082: [SPARK-22897][CORE]: Expose stageAttemptId in TaskContex...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20082 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85420/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20082: [SPARK-22897][CORE]: Expose stageAttemptId in TaskContex...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20082 **[Test build #85420 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85420/testReport)** for PR 20082 at commit [`0f86e95`](https://github.com/apache/spark/commit/0f86e95fd9c8703516e23b5b3bd370a89b27bc69). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20056: [SPARK-22878] [CORE] Count totalDroppedEvents for LiveLi...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/20056 This test passed in your local env? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19675: [SPARK-14540][BUILD] Support Scala 2.12 closures and Jav...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/19675 Did you have a look at the JIRA? lots more detail there. https://issues.apache.org/jira/browse/SPARK-14540 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20086: [SPARK-22903]Fix already being created exception in stag...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20086 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20086: [SPARK-22903]Fix already being created exception ...
GitHub user liupc opened a pull request: https://github.com/apache/spark/pull/20086 [SPARK-22903]Fix already being created exception in stage retry caused by wrong at⦠â¦temptNumber ## What changes were proposed in this pull request? This PR fix the wrong attemptNumber in stage retry, it will solve the probem of AlreadyBeingCreatedException thrown by executor when failedStages already created the taskAttemptPath. Details see: https://issues.apache.org/jira/browse/SPARK-22903 (Please fill in changes proposed in this fix) ## How was this patch tested? manual (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running: $ git pull https://github.com/liupc/spark Fix-ready-beging-created-exception-in-stage-retry Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20086.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20086 commit 1b11eca92e5b79ef0757c19de3acc17c1c047965 Author: liupengchengDate: 2017-12-27T02:11:51Z Fix already being created exception in stage retry caused by wrong attemptNumber --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20034: [SPARK-22846][SQL] Fix table owner is null when c...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20034 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org