[GitHub] spark pull request #21546: [SPARK-23030][SQL][PYTHON] Use Arrow stream forma...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21546#discussion_r212163122 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowConverters.scala --- @@ -183,34 +178,106 @@ private[sql] object ArrowConverters { } /** - * Convert a byte array to an ArrowRecordBatch. + * Load a serialized ArrowRecordBatch. */ - private[arrow] def byteArrayToBatch( + private[arrow] def loadBatch( batchBytes: Array[Byte], allocator: BufferAllocator): ArrowRecordBatch = { -val in = new ByteArrayReadableSeekableByteChannel(batchBytes) -val reader = new ArrowFileReader(in, allocator) - -// Read a batch from a byte stream, ensure the reader is closed -Utils.tryWithSafeFinally { - val root = reader.getVectorSchemaRoot // throws IOException - val unloader = new VectorUnloader(root) - reader.loadNextBatch() // throws IOException - unloader.getRecordBatch -} { - reader.close() -} +val in = new ByteArrayInputStream(batchBytes) +MessageSerializer.deserializeRecordBatch( + new ReadChannel(Channels.newChannel(in)), allocator) // throws IOException } + /** + * Create a DataFrame from a JavaRDD of serialized ArrowRecordBatches. + */ private[sql] def toDataFrame( - payloadRDD: JavaRDD[Array[Byte]], + arrowBatchRDD: JavaRDD[Array[Byte]], schemaString: String, sqlContext: SQLContext): DataFrame = { -val rdd = payloadRDD.rdd.mapPartitions { iter => +val schema = DataType.fromJson(schemaString).asInstanceOf[StructType] +val timeZoneId = sqlContext.sessionState.conf.sessionLocalTimeZone +val rdd = arrowBatchRDD.rdd.mapPartitions { iter => val context = TaskContext.get() - ArrowConverters.fromPayloadIterator(iter.map(new ArrowPayload(_)), context) + ArrowConverters.fromBatchIterator(iter, schema, timeZoneId, context) } -val schema = DataType.fromJson(schemaString).asInstanceOf[StructType] sqlContext.internalCreateDataFrame(rdd, schema) } + + /** + * Read a file as an Arrow stream and parallelize as an RDD of serialized ArrowRecordBatches. + */ + private[sql] def readArrowStreamFromFile( + sqlContext: SQLContext, + filename: String): JavaRDD[Array[Byte]] = { +val fileStream = new FileInputStream(filename) +try { + // Create array so that we can safely close the file + val batches = getBatchesFromStream(fileStream.getChannel).toArray + // Parallelize the record batches to create an RDD + JavaRDD.fromRDD(sqlContext.sparkContext.parallelize(batches, batches.length)) --- End diff -- @BryanCutler, why did this parallelize with the length of batches size? I thought the data size is usually small and wondering if it necessarily speeds up in general. Did I misread? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21546: [SPARK-23030][SQL][PYTHON] Use Arrow stream forma...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21546#discussion_r212162307 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowConverters.scala --- @@ -183,34 +178,106 @@ private[sql] object ArrowConverters { } /** - * Convert a byte array to an ArrowRecordBatch. + * Load a serialized ArrowRecordBatch. */ - private[arrow] def byteArrayToBatch( + private[arrow] def loadBatch( batchBytes: Array[Byte], allocator: BufferAllocator): ArrowRecordBatch = { -val in = new ByteArrayReadableSeekableByteChannel(batchBytes) -val reader = new ArrowFileReader(in, allocator) - -// Read a batch from a byte stream, ensure the reader is closed -Utils.tryWithSafeFinally { - val root = reader.getVectorSchemaRoot // throws IOException - val unloader = new VectorUnloader(root) - reader.loadNextBatch() // throws IOException - unloader.getRecordBatch -} { - reader.close() -} +val in = new ByteArrayInputStream(batchBytes) +MessageSerializer.deserializeRecordBatch( + new ReadChannel(Channels.newChannel(in)), allocator) // throws IOException } + /** + * Create a DataFrame from a JavaRDD of serialized ArrowRecordBatches. + */ private[sql] def toDataFrame( - payloadRDD: JavaRDD[Array[Byte]], + arrowBatchRDD: JavaRDD[Array[Byte]], schemaString: String, sqlContext: SQLContext): DataFrame = { -val rdd = payloadRDD.rdd.mapPartitions { iter => +val schema = DataType.fromJson(schemaString).asInstanceOf[StructType] +val timeZoneId = sqlContext.sessionState.conf.sessionLocalTimeZone +val rdd = arrowBatchRDD.rdd.mapPartitions { iter => val context = TaskContext.get() - ArrowConverters.fromPayloadIterator(iter.map(new ArrowPayload(_)), context) + ArrowConverters.fromBatchIterator(iter, schema, timeZoneId, context) } -val schema = DataType.fromJson(schemaString).asInstanceOf[StructType] sqlContext.internalCreateDataFrame(rdd, schema) } + + /** + * Read a file as an Arrow stream and parallelize as an RDD of serialized ArrowRecordBatches. + */ + private[sql] def readArrowStreamFromFile( + sqlContext: SQLContext, + filename: String): JavaRDD[Array[Byte]] = { +val fileStream = new FileInputStream(filename) --- End diff -- nit: i would do `tryWithResource` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
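For readers following the thread, a minimal sketch of what the suggested `tryWithResource` change might look like. It assumes `Utils.tryWithResource` from `org.apache.spark.util.Utils` and the `getBatchesFromStream` helper introduced in this diff; otherwise the body follows the quoted code.

```scala
import java.io.FileInputStream

import org.apache.spark.api.java.JavaRDD
import org.apache.spark.sql.SQLContext
import org.apache.spark.util.Utils

// Sketch only: getBatchesFromStream comes from the diff above.
private[sql] def readArrowStreamFromFile(
    sqlContext: SQLContext,
    filename: String): JavaRDD[Array[Byte]] = {
  Utils.tryWithResource(new FileInputStream(filename)) { fileStream =>
    // Materialize the batches so the file can be closed safely.
    val batches = getBatchesFromStream(fileStream.getChannel).toArray
    // Parallelize the record batches to create an RDD.
    JavaRDD.fromRDD(sqlContext.sparkContext.parallelize(batches, batches.length))
  }
}
```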
[GitHub] spark pull request #22163: [SPARK-25166][CORE]Reduce the number of write ope...
Github user 10110346 commented on a diff in the pull request: https://github.com/apache/spark/pull/22163#discussion_r212162206 --- Diff: core/src/main/java/org/apache/spark/shuffle/sort/ShuffleExternalSorter.java --- @@ -206,14 +211,21 @@ private void writeSortedFile(boolean isLastFile) { long recordReadPosition = recordOffsetInPage + uaoSize; // skip over record length while (dataRemaining > 0) { final int toTransfer = Math.min(diskWriteBufferSize, dataRemaining); -Platform.copyMemory( - recordPage, recordReadPosition, writeBuffer, Platform.BYTE_ARRAY_OFFSET, toTransfer); -writer.write(writeBuffer, 0, toTransfer); +if (bufferOffset > 0 && bufferOffset + toTransfer > DISK_WRITE_BUFFER_SIZE) { --- End diff -- Actually, it doesn't need to change. The result `numRecordsWritten` is not affected; the data was only written to `diskWriteBuffer` before, but now it's written to `writeBuffer`. The bytes that have been written will be updated in `commitAndGet`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22180: [SPARK-25174][YARN]Limit the size of diagnostic message ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22180 **[Test build #95132 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95132/testReport)** for PR 22180 at commit [`3271c3f`](https://github.com/apache/spark/commit/3271c3f4100e0d69fe30400a42ab35aaab1c7c48). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22180: [SPARK-25174][YARN]Limit the size of diagnostic message ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22180 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22180: [SPARK-25174][YARN]Limit the size of diagnostic message ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22180 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2468/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22180: [SPARK-25174][YARN]Limit the size of diagnostic m...
Github user yaooqinn commented on a diff in the pull request: https://github.com/apache/spark/pull/22180#discussion_r212161599 --- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala --- @@ -368,7 +369,11 @@ private[spark] class ApplicationMaster(args: ApplicationMasterArguments) extends } logInfo(s"Final app status: $finalStatus, exitCode: $exitCode" + Option(msg).map(msg => s", (reason: $msg)").getOrElse("")) -finalMsg = msg +finalMsg = if (msg == null || msg.length <= finalMsgLimitSize) { --- End diff -- that's better, thanks --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21546: [SPARK-23030][SQL][PYTHON] Use Arrow stream forma...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21546#discussion_r212161411 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala --- @@ -3268,13 +3268,49 @@ class Dataset[T] private[sql]( } /** - * Collect a Dataset as ArrowPayload byte arrays and serve to PySpark. + * Collect a Dataset as Arrow batches and serve stream to PySpark. */ private[sql] def collectAsArrowToPython(): Array[Any] = { +val timeZoneId = sparkSession.sessionState.conf.sessionLocalTimeZone + withAction("collectAsArrowToPython", queryExecution) { plan => - val iter: Iterator[Array[Byte]] = -toArrowPayload(plan).collect().iterator.map(_.asPythonSerializable) - PythonRDD.serveIterator(iter, "serve-Arrow") + PythonRDD.serveToStream("serve-Arrow") { out => +val batchWriter = new ArrowBatchStreamWriter(schema, out, timeZoneId) +val arrowBatchRdd = toArrowBatchRdd(plan) +val numPartitions = arrowBatchRdd.partitions.length + +// Store collection results for worst case of 1 to N-1 partitions --- End diff -- Is it better `0 to N-1 partitions`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21546: [SPARK-23030][SQL][PYTHON] Use Arrow stream forma...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21546#discussion_r212158131 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowConverters.scala --- @@ -111,65 +113,58 @@ private[sql] object ArrowConverters { rowCount += 1 } arrowWriter.finish() - writer.writeBatch() + val batch = unloader.getRecordBatch() + MessageSerializer.serialize(writeChannel, batch) + batch.close() --- End diff -- Should we `tryWithResource` here too? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
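One possible shape for that suggestion, with variable names following the quoted diff. Since `ArrowRecordBatch` is `AutoCloseable` rather than `java.io.Closeable` (which `Utils.tryWithResource` expects), `Utils.tryWithSafeFinally` may be the closer fit — this is a sketch, not the merged change.

```scala
val batch = unloader.getRecordBatch()
Utils.tryWithSafeFinally {
  // Serialize the record batch into the output channel.
  MessageSerializer.serialize(writeChannel, batch)
} {
  // Release the batch's buffers even if serialization throws.
  batch.close()
}
```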
[GitHub] spark issue #22187: [SPARK-25178][SQL] Directly ship the StructType objects ...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/22187 Thanks, updated both titles --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22163: [SPARK-25166][CORE]Reduce the number of write ope...
Github user Ngone51 commented on a diff in the pull request: https://github.com/apache/spark/pull/22163#discussion_r212160161 --- Diff: core/src/main/java/org/apache/spark/shuffle/sort/ShuffleExternalSorter.java --- @@ -206,14 +211,21 @@ private void writeSortedFile(boolean isLastFile) { long recordReadPosition = recordOffsetInPage + uaoSize; // skip over record length while (dataRemaining > 0) { final int toTransfer = Math.min(diskWriteBufferSize, dataRemaining); -Platform.copyMemory( - recordPage, recordReadPosition, writeBuffer, Platform.BYTE_ARRAY_OFFSET, toTransfer); -writer.write(writeBuffer, 0, toTransfer); +if (bufferOffset > 0 && bufferOffset + toTransfer > DISK_WRITE_BUFFER_SIZE) { --- End diff -- Oh, I see. If so, I'm afraid you may have to change `writer.recordWritten()`'s behaviour, which just counts records one by one right now. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22188: [SPARK-25164][SQL] Avoid rebuilding column and path list...
Github user bersprockets commented on the issue: https://github.com/apache/spark/pull/22188 OK, I reran the tests for the lower column count cases, and the runs with the patch consistently show a tiny (1-3%) improvement compared to the master branch. So even the lower column count cases benefit a little. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20345: [SPARK-23172][SQL] Expand the ReorderJoin rule to handle...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20345 **[Test build #95131 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95131/testReport)** for PR 20345 at commit [`39462fb`](https://github.com/apache/spark/commit/39462fbee952ec574b4c04d7718fd73bb5f56d9d). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20345: [SPARK-23172][SQL] Expand the ReorderJoin rule to handle...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20345 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20345: [SPARK-23172][SQL] Expand the ReorderJoin rule to handle...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20345 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2467/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22163: [SPARK-25166][CORE]Reduce the number of write operations...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22163 **[Test build #95130 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95130/testReport)** for PR 22163 at commit [`f91e18c`](https://github.com/apache/spark/commit/f91e18c7d4b8eab53c4983320a0eab0403c37a48). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22163: [SPARK-25166][CORE]Reduce the number of write operations...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22163 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2466/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20345: [SPARK-23172][SQL] Expand the ReorderJoin rule to handle...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/20345 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22163: [SPARK-25166][CORE]Reduce the number of write operations...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22163 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22163: [SPARK-25166][CORE]Reduce the number of write operations...
Github user 10110346 commented on the issue: https://github.com/apache/spark/pull/22163 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22163: [SPARK-25166][CORE]Reduce the number of write ope...
Github user 10110346 commented on a diff in the pull request: https://github.com/apache/spark/pull/22163#discussion_r212154100 --- Diff: core/src/main/java/org/apache/spark/shuffle/sort/ShuffleExternalSorter.java --- @@ -206,14 +211,21 @@ private void writeSortedFile(boolean isLastFile) { long recordReadPosition = recordOffsetInPage + uaoSize; // skip over record length while (dataRemaining > 0) { final int toTransfer = Math.min(diskWriteBufferSize, dataRemaining); -Platform.copyMemory( - recordPage, recordReadPosition, writeBuffer, Platform.BYTE_ARRAY_OFFSET, toTransfer); -writer.write(writeBuffer, 0, toTransfer); +if (bufferOffset > 0 && bufferOffset + toTransfer > DISK_WRITE_BUFFER_SIZE) { --- End diff -- Thanks @Ngone51. If we get `n` records whose combined size n*X < diskWriteBufferSize (the same as DISK_WRITE_BUFFER_SIZE), then we will only call writer.write() once. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
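To make the behaviour being discussed concrete, here is a self-contained Scala rendering of the buffering idea (the real code is Java in ShuffleExternalSorter; the flush condition mirrors the line shown in the diff, while the rest of the loop is an illustrative assumption):

```scala
// Copy one record into writeBuffer, flushing via `flush` only when the pending
// bytes plus the next chunk would overflow the buffer, so several small records
// can share a single write() call. Returns the new buffer offset; the caller
// flushes whatever remains after the last record.
def copyRecord(
    recordPage: Array[Byte],
    recordOffset: Int,
    recordLength: Int,
    writeBuffer: Array[Byte],
    startOffset: Int,
    flush: (Array[Byte], Int) => Unit): Int = {
  var bufferOffset = startOffset
  var readPosition = recordOffset
  var dataRemaining = recordLength
  while (dataRemaining > 0) {
    val toTransfer = math.min(writeBuffer.length, dataRemaining)
    if (bufferOffset > 0 && bufferOffset + toTransfer > writeBuffer.length) {
      flush(writeBuffer, bufferOffset)
      bufferOffset = 0
    }
    System.arraycopy(recordPage, readPosition, writeBuffer, bufferOffset, toTransfer)
    bufferOffset += toTransfer
    readPosition += toTransfer
    dataRemaining -= toTransfer
  }
  bufferOffset
}
```

Under this scheme, `n` small records only trigger a write once their combined size exceeds the buffer, which matches the description above; the open question in the thread is how `writer.recordWritten()` accounting interacts with the deferred writes.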
[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22112 **[Test build #95129 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95129/testReport)** for PR 22112 at commit [`93f37fa`](https://github.com/apache/spark/commit/93f37fa585462b9ee2fb9e179eab736fbc416d3e). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22112 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2465/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22112 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22112 **[Test build #95128 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95128/testReport)** for PR 22112 at commit [`097092b`](https://github.com/apache/spark/commit/097092be4b2967689082af62715ecc4f78086c30). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22112 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22112 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2464/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21332: [SPARK-24236][SS] Continuous replacement for ShuffleExch...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21332 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95122/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21332: [SPARK-24236][SS] Continuous replacement for ShuffleExch...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21332 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21332: [SPARK-24236][SS] Continuous replacement for ShuffleExch...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21332 **[Test build #95122 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95122/testReport)** for PR 21332 at commit [`06c6cfb`](https://github.com/apache/spark/commit/06c6cfb5e88ffcb742830dc6f38c8bdc9057552d). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22112 BTW, I think a cleaner fix is to make shuffle files reliable (e.g. put them on HDFS), so that Spark will never retry a task from a finished shuffle map stage. Then all the problems go away: the randomness is materialized in the shuffle files and we will not hit correctness issues. This is a big project and maybe we can consider it in Spark 3.0. For now (2.4) I think failing and asking users to checkpoint is better than just documenting that `repartition`/`zip` may return wrong results. We also have a plan to reduce the possibility of failing later, by marking RDD actions as "repeatable". --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
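As a rough illustration of the "ask users to checkpoint" workaround mentioned above (the checkpoint directory, input RDD, and transformation are placeholders, not part of this PR):

```scala
// Checkpointing materializes the parent RDD, so if a later stage is retried it
// re-reads the same rows instead of recomputing output in a possibly different
// order, which is what makes round-robin repartition nondeterministic.
sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints") // example location
val parent = data.map(orderSensitiveTransform)       // placeholder transformation
parent.checkpoint()
parent.count()                                       // force materialization
val repartitioned = parent.repartition(100)
```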
[GitHub] spark pull request #22185: [SPARK-25127] DataSourceV2: Remove SupportsPushDo...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/22185 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22185: [SPARK-25127] DataSourceV2: Remove SupportsPushDownCatal...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22185 LGTM, merging to master! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22193: [SPARK-25186][SQL] Remove v2 save mode.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22193 **[Test build #95127 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95127/testReport)** for PR 22193 at commit [`394c9d7`](https://github.com/apache/spark/commit/394c9d71ea46ce9d90d5c1c5a0309f29d0f6f5e6). * This patch **fails to build**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22193: [SPARK-25186][SQL] Remove v2 save mode.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22193 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95127/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22193: [SPARK-25186][SQL] Remove v2 save mode.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22193 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21756: [SPARK-24764] [CORE] Add ServiceLoader implementation fo...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/21756 From what I discussed with @shrutig offline, the use case her team has is: when `UserGroupInformation.getCurrentUser.getAuthenticationMethod == AuthenticationMethod.KERBEROS`, they want to set the remote ugi's authentication method to `AuthMethod.KERBEROS`, since the default remote ugi uses `AuthMethod.SIMPLE`. I'd suggest adding ```scala ugi.setAuthenticationMethod( UserGroupInformation.getCurrentUser.getAuthenticationMethod.getAuthMethod) ``` into `def createSparkUser(): UserGroupInformation` instead of customizing `SparkHadoopUtil`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
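A hedged sketch of how that suggestion could land. The surrounding method body is a paraphrase of `SparkHadoopUtil.createSparkUser`, not the exact source, and the public `AuthenticationMethod` overload of `setAuthenticationMethod` is used here for illustration:

```scala
import org.apache.hadoop.security.UserGroupInformation

def createSparkUser(): UserGroupInformation = {
  val user = Utils.getCurrentUserName()
  val ugi = UserGroupInformation.createRemoteUser(user)
  // Carry over the current user's authentication method (e.g. KERBEROS) so the
  // remote UGI does not default to SIMPLE.
  ugi.setAuthenticationMethod(
    UserGroupInformation.getCurrentUser.getAuthenticationMethod)
  transferCredentials(UserGroupInformation.getCurrentUser, ugi)
  ugi
}
```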
[GitHub] spark issue #22193: [SPARK-25186][SQL] Remove v2 save mode.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22193 **[Test build #95127 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95127/testReport)** for PR 22193 at commit [`394c9d7`](https://github.com/apache/spark/commit/394c9d71ea46ce9d90d5c1c5a0309f29d0f6f5e6). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22193: [SPARK-25186][SQL] Remove v2 save mode.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22193 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2463/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22193: [SPARK-25186][SQL] Remove v2 save mode.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22193 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22193: [SPARK-25186][SQL] Remove v2 save mode.
GitHub user rdblue opened a pull request: https://github.com/apache/spark/pull/22193 [SPARK-25186][SQL] Remove v2 save mode. ## What changes were proposed in this pull request? This removes `SaveMode` from the v2 write API. Overwrite is temporarily implemented by deleting path-based and name-based tables and appending. These appends implicitly create tables and should be replaced with CTAS and RTAS. ## How was this patch tested? Existing tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/rdblue/spark remove-v2-save-mode Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22193.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22193 commit e3fcc83a4a55576821573ceb9a3a56b89218a187 Author: Ryan Blue Date: 2018-08-22T21:17:11Z SPARK-25188: Add WriteConfig to v2 write API. commit 6dbc6e09b5a50764b29050ac30eb9c4f9f8970b8 Author: Ryan Blue Date: 2018-08-22T21:45:17Z SPARK-25188: BatchPartitionOverwriteSupport extends BatchWriteSupport. commit 394c9d71ea46ce9d90d5c1c5a0309f29d0f6f5e6 Author: Ryan Blue Date: 2018-08-22T23:51:24Z Remove SaveMode from the v2 write API. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22187: [SPARK-25178][SQL] change the generated code of the keyS...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/22187 Yeah, I agreed with @rednaxelafx to directly ship the StructType objects looks like a better solution. +1 for that. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22063: [WIP][SPARK-25044][SQL] Address translation of LMF closu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22063 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95121/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22063: [WIP][SPARK-25044][SQL] Address translation of LMF closu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22063 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22063: [WIP][SPARK-25044][SQL] Address translation of LMF closu...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22063 **[Test build #95121 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95121/testReport)** for PR 22063 at commit [`09c3a3b`](https://github.com/apache/spark/commit/09c3a3b5c45fb2a356b191d5699c6ba497694031). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22192: [SPARK-24918] Executor Plugin API
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/22192 Please remove all of the PR template text from the description. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22192: [SPARK-24918] Executor Plugin API
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22192 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22192: [SPARK-24918] Executor Plugin API
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/22192 ok to test --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22192: [SPARK-24918] Executor Plugin API
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22192 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22192: [SPARK-24918] Executor Plugin API
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22192 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22192: [SPARK-24918] Executor Plugin API
GitHub user NiharS opened a pull request: https://github.com/apache/spark/pull/22192 [SPARK-24918] Executor Plugin API A continuation of @squito's executor plugin task. By his request I took his code and added testing and moved the plugin initialization to a separate thread. ## What changes were proposed in this pull request? Executor plugins now run on one separate thread, so the executor does not wait on them. Added testing. ## How was this patch tested? Added test cases that test using a sample plugin. Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running: $ git pull https://github.com/NiharS/spark executorPlugin Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22192.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22192 commit 44454dd586e35bdf16492c4a8969494bd3b7f8f5 Author: Nihar Sheth Date: 2018-08-20T17:53:37Z [SPARK-24918] Executor Plugin API This commit adds testing and moves the plugin initialization to a separate thread. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22185: [SPARK-25127] DataSourceV2: Remove SupportsPushDownCatal...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/22185 +1, LGTM. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22188: [SPARK-25164][SQL] Avoid rebuilding column and path list...
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/22188 That does seem counter intuitive, but no idea what could explain that since the new code seems like a straight-forward better version. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22190: [SPARK-25188][SQL] Add WriteConfig to v2 write API.
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/22190 This is related to #21308, which adds `DeleteSupport`. Both `BatchOverwriteSupport` and `DeleteSupport` use the same input to remove data (`Filter[]`) and can reject deletes that don't align with partition boundaries. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
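For readers who have not seen #21308, a purely illustrative Scala rendering of the shared contract the comment describes — the actual proposals define Java interfaces, and the name and signature below are not the merged Spark API:

```scala
import org.apache.spark.sql.sources.Filter

// Hypothetical: both delete and overwrite take the same Filter[] input and may
// reject filters that do not align with partition boundaries.
trait DeleteByFilter {
  def deleteWhere(filters: Array[Filter]): Unit
}
```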
[GitHub] spark pull request #21749: [SPARK-24785] [SHELL] Making sure REPL prints Spa...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21749 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22188: [SPARK-25164][SQL] Avoid rebuilding column and path list...
Github user bersprockets commented on the issue: https://github.com/apache/spark/pull/22188 Thanks @vanzin. In my benchmark tests, the tiny degradation (0.5%) in the lower column count cases is pretty consistent, which concerns me a little. I am going to re-run those tests in a different environment and see what happens. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21749: [SPARK-24785] [SHELL] Making sure REPL prints Spark UI i...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/21749 @vanzin Thanks for the review. I'll merge it as is, and have another PR that tries to merge the 2.11 and 2.12 branches. I had a positive conversation with the Scala community about adding a proper hook for us, and they agreed to it. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22187: [SPARK-25178][SQL] change the generated code of the keyS...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22187 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95119/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22187: [SPARK-25178][SQL] change the generated code of the keyS...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22187 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22187: [SPARK-25178][SQL] change the generated code of the keyS...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22187 **[Test build #95119 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95119/testReport)** for PR 22187 at commit [`86c9c5b`](https://github.com/apache/spark/commit/86c9c5b31f8a7f8ec722ceeedc8e4812ac7159ef). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21749: [SPARK-24785] [SHELL] Making sure REPL prints Spark UI i...
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/21749 The patch LGTM. I'd add a comment where Spark-specific code was inserted, so it's easy to find later. But ok as is too. Not sure why it seems like a big deal to have the Scala libraries add an initialization hook to that class... --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22191: [SPARK-25204][SS] Fix race in rate source test.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22191 **[Test build #95126 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95126/testReport)** for PR 22191 at commit [`ade656a`](https://github.com/apache/spark/commit/ade656a3cf3edb32d80e66d03276584c0c86c4d0). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22177: [SPARK-25119][Web UI] stages in wrong order within job p...
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/22177 You're sorting on stage ID but your description says task ID. Just a small inconsistency. This is also against branch-2.3, it should be against master. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22177: [SPARK-25119][Web UI] stages in wrong order withi...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/22177#discussion_r212134537 --- Diff: core/src/main/scala/org/apache/spark/ui/jobs/JobPage.scala --- @@ -337,7 +337,9 @@ private[ui] class JobPage(parent: JobsTab, store: AppStatusStore) extends WebUIP store.executorList(false), appStartTime) val operationGraphContent = store.asOption(store.operationGraphForJob(jobId)) match { - case Some(operationGraph) => UIUtils.showDagVizForJob(jobId, operationGraph) + case Some(operationGraph) => UIUtils.showDagVizForJob(jobId, operationGraph.sortWith( + _.rootCluster.id.replaceAll(RDDOperationGraph.STAGE_CLUSTER_PREFIX, "").toInt --- End diff -- +1. `replaceAll` also seems like the wrong API to use, `substring` seems more correct. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
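To illustrate the point, assuming the cluster ids look like `stage_<id>` and `RDDOperationGraph.STAGE_CLUSTER_PREFIX` is "stage_" (an assumption for this example):

```scala
val stageClusterPrefix = "stage_" // assumed value of STAGE_CLUSTER_PREFIX
val clusterId = "stage_42"

// replaceAll treats the prefix as a regex and scans the whole string.
val viaReplaceAll = clusterId.replaceAll(stageClusterPrefix, "").toInt

// substring just drops the known-length prefix: cheaper, and it cannot
// accidentally match the pattern elsewhere in the id.
val viaSubstring = clusterId.substring(stageClusterPrefix.length).toInt
// both yield 42
```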
[GitHub] spark pull request #22177: [SPARK-25119][Web UI] stages in wrong order withi...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/22177#discussion_r212134229 --- Diff: core/src/main/scala/org/apache/spark/ui/jobs/JobPage.scala --- @@ -18,18 +18,18 @@ package org.apache.spark.ui.jobs import java.util.Locale + import javax.servlet.http.HttpServletRequest import scala.collection.mutable.{Buffer, ListBuffer} import scala.xml.{Node, NodeSeq, Unparsed, Utility} - --- End diff -- +1 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22188: [SPARK-25164][SQL] Avoid rebuilding column and path list...
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/22188 LGTM. Will leave here for a bit to see if anyone else comments... --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22191: [SPARK-25204][SS] Fix race in rate source test.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22191 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95125/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22191: [SPARK-25204][SS] Fix race in rate source test.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22191 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22191: [SPARK-25204][SS] Fix race in rate source test.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22191 **[Test build #95125 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95125/testReport)** for PR 22191 at commit [`eec0ad0`](https://github.com/apache/spark/commit/eec0ad08e390831717203f6d002e3b1218de6d36). * This patch **fails to build**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22191: [SPARK-25204][SS] Fix race in rate source test.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22191 **[Test build #95125 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95125/testReport)** for PR 22191 at commit [`eec0ad0`](https://github.com/apache/spark/commit/eec0ad08e390831717203f6d002e3b1218de6d36). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22191: [SPARK-25204][SS] Fix race in rate source test.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22191 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22191: [SPARK-25204][SS] Fix race in rate source test.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22191 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22191: [SPARK-25204][SS] Fix race in rate source test.
GitHub user jose-torres opened a pull request: https://github.com/apache/spark/pull/22191 [SPARK-25204][SS] Fix race in rate source test. ## What changes were proposed in this pull request? Fix a race in the rate source tests. We need a better way of testing restart behavior. ## How was this patch tested? unit test You can merge this pull request into a Git repository by running: $ git pull https://github.com/jose-torres/spark racetest Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22191.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22191 commit eec0ad08e390831717203f6d002e3b1218de6d36 Author: Jose Torres Date: 2018-08-22T22:10:32Z fix test --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22188: [SPARK-25164][SQL] Avoid rebuilding column and path list...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22188 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95118/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22188: [SPARK-25164][SQL] Avoid rebuilding column and path list...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22188 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22188: [SPARK-25164][SQL] Avoid rebuilding column and path list...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22188 **[Test build #95118 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95118/testReport)** for PR 22188 at commit [`697de21`](https://github.com/apache/spark/commit/697de21501acbda3dbcd8ccc13a35ad3723a652e). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22146: [SPARK-24434][K8S] pod template files
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/22146#discussion_r212126595 --- Diff: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/KubernetesDriverBuilder.scala --- @@ -51,7 +57,13 @@ private[spark] class KubernetesDriverBuilder( provideJavaStep: ( KubernetesConf[KubernetesDriverSpecificConf] => JavaDriverFeatureStep) = -new JavaDriverFeatureStep(_)) { +new JavaDriverFeatureStep(_), +provideTemplateVolumeStep: (KubernetesConf[_ <: KubernetesRoleSpecificConf] + => TemplateVolumeStep) = +new TemplateVolumeStep(_), +provideInitialSpec: KubernetesConf[KubernetesDriverSpecificConf] --- End diff -- You can make the initial pod and wrap it in the `KubernetesDriverSpec` object. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22187: [SPARK-25178][SQL] change the generated code of the keyS...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22187 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95117/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22187: [SPARK-25178][SQL] change the generated code of the keyS...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22187 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22187: [SPARK-25178][SQL] change the generated code of the keyS...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22187 **[Test build #95117 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95117/testReport)** for PR 22187 at commit [`0626de7`](https://github.com/apache/spark/commit/0626de74f726622ac3eb251fc9f66aaa3de002d3). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22190: [SPARK-25188][SQL] Add WriteConfig to v2 write API.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22190 **[Test build #95124 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95124/testReport)** for PR 22190 at commit [`6dbc6e0`](https://github.com/apache/spark/commit/6dbc6e09b5a50764b29050ac30eb9c4f9f8970b8). * This patch **fails to generate documentation**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22190: [SPARK-25188][SQL] Add WriteConfig to v2 write API.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22190 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22190: [SPARK-25188][SQL] Add WriteConfig to v2 write API.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22190 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95124/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22157: [SPARK-25126][SQL] Avoid creating Reader for all orc fil...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22157 **[Test build #4285 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4285/testReport)** for PR 22157 at commit [`9afdac6`](https://github.com/apache/spark/commit/9afdac6da180b9c8959696941701c734e9a3fe8e). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22190: [SPARK-25188][SQL] Add WriteConfig to v2 write API.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22190 **[Test build #95124 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95124/testReport)** for PR 22190 at commit [`6dbc6e0`](https://github.com/apache/spark/commit/6dbc6e09b5a50764b29050ac30eb9c4f9f8970b8). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22190: [SPARK-25188][SQL] Add WriteConfig to v2 write AP...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/22190#discussion_r212121224 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/sources/MicroBatchWriteSupport.scala --- @@ -18,27 +18,38 @@ package org.apache.spark.sql.execution.streaming.sources import org.apache.spark.sql.catalyst.InternalRow -import org.apache.spark.sql.sources.v2.writer.{BatchWriteSupport, DataWriter, DataWriterFactory, WriterCommitMessage} -import org.apache.spark.sql.sources.v2.writer.streaming.{StreamingDataWriterFactory, StreamingWriteSupport} +import org.apache.spark.sql.sources.v2.DataSourceOptions +import org.apache.spark.sql.sources.v2.writer.{BatchWriteSupport, DataWriter, DataWriterFactory, WriteConfig, WriterCommitMessage} +import org.apache.spark.sql.sources.v2.writer.streaming.{StreamingDataWriterFactory, StreamingWriteConfig, StreamingWriteSupport} +import org.apache.spark.sql.types.StructType /** * A [[BatchWriteSupport]] used to hook V2 stream writers into a microbatch plan. It implements * the non-streaming interface, forwarding the epoch ID determined at construction to a wrapped * streaming write support. */ -class MicroBatchWritSupport(eppchId: Long, val writeSupport: StreamingWriteSupport) +class MicroBatchWriteSupport(eppchId: Long, val writeSupport: StreamingWriteSupport) --- End diff -- This fixed a typo in the class name. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22190: [SPARK-25188][SQL] Add WriteConfig to v2 write API.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22190 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2462/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22190: [SPARK-25188][SQL] Add WriteConfig to v2 write API.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22190 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22190: [SPARK-25188][SQL] Add WriteConfig to v2 write AP...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/22190#discussion_r212120878 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/BatchPartitionOverwriteSupport.java --- @@ -0,0 +1,44 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.sources.v2.writer; + +import org.apache.spark.sql.sources.v2.DataSourceOptions; +import org.apache.spark.sql.types.StructType; + +/** + * An interface that adds support to {@link BatchWriteSupport} for a replace data operation that + * replaces partitions dynamically with the output of a write operation. + * + * Data source implementations can implement this interface in addition to {@link BatchWriteSupport} + * to support write operations that replace all partitions in the output table that are present + * in the write's output data. + * + * This is used to implement INSERT OVERWRITE ... PARTITIONS. + */ +public interface BatchPartitionOverwriteSupport extends BatchWriteSupport { --- End diff -- This class will be used to create a `WriteConfig` that instructs the data source to replace partitions in the existing data with partitions of a dataframe. The logical plan would be `DynamicPartitionOverwrite`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22190: [SPARK-25188][SQL] Add WriteConfig to v2 write API.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22190 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22190: [SPARK-25188][SQL] Add WriteConfig to v2 write API.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22190 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95123/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22190: [SPARK-25188][SQL] Add WriteConfig to v2 write API.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22190 **[Test build #95123 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95123/testReport)** for PR 22190 at commit [`e3fcc83`](https://github.com/apache/spark/commit/e3fcc83a4a55576821573ceb9a3a56b89218a187). * This patch **fails to generate documentation**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22190: [SPARK-25188][SQL] Add WriteConfig to v2 write AP...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/22190#discussion_r212120411 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/BatchPartitionOverwriteSupport.java --- @@ -0,0 +1,44 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.sources.v2.writer; + +import org.apache.spark.sql.sources.v2.DataSourceOptions; +import org.apache.spark.sql.types.StructType; + +/** + * An interface that adds support to {@link BatchWriteSupport} for a replace data operation that + * replaces partitions dynamically with the output of a write operation. + * + * Data source implementations can implement this interface in addition to {@link BatchWriteSupport} + * to support write operations that replace all partitions in the output table that are present + * in the write's output data. + * + * This is used to implement INSERT OVERWRITE ... PARTITIONS. + */ +public interface BatchPartitionOverwriteSupport { --- End diff -- This class will be used to create a `WriteConfig` that instructs the data source to replace partitions in the existing data with partitions of a dataframe. The logical plan would be `DynamicPartitionOverwrite`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
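A rough sketch of how a planner rule or exec node might pick the right config once a `DynamicPartitionOverwrite` plan is involved; again, every name below is an illustrative stand-in rather than the PR's API, and the capability check is the interesting part.

```scala
// Sketch only: an exec node planning a dynamic partition overwrite would check for the
// capability and fail fast if the source cannot replace partitions dynamically.
import org.apache.spark.sql.types.StructType

trait WriteConfig
trait ExampleBatchWriteSupport {
  def createWriteConfig(schema: StructType, options: Map[String, String]): WriteConfig
}
trait ExamplePartitionOverwriteSupport extends ExampleBatchWriteSupport {
  def createDynamicOverwriteConfig(schema: StructType, options: Map[String, String]): WriteConfig
}

object ExamplePlanner {
  def configFor(
      support: ExampleBatchWriteSupport,
      schema: StructType,
      options: Map[String, String],
      dynamicOverwrite: Boolean): WriteConfig = support match {
    case s: ExamplePartitionOverwriteSupport if dynamicOverwrite =>
      s.createDynamicOverwriteConfig(schema, options)
    case _ if dynamicOverwrite =>
      throw new UnsupportedOperationException(
        "Source does not support dynamic partition overwrite")
    case s =>
      s.createWriteConfig(schema, options)
  }
}
```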
[GitHub] spark pull request #22190: [SPARK-25188][SQL] Add WriteConfig to v2 write AP...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/22190#discussion_r212119716 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/BatchOverwriteSupport.java --- @@ -0,0 +1,61 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.sources.v2.writer; + +import org.apache.spark.sql.catalyst.plans.logical.Filter; +import org.apache.spark.sql.sources.v2.DataSourceOptions; +import org.apache.spark.sql.types.StructType; + +/** + * An interface that adds support to {@link BatchWriteSupport} for a replace data operation that + * replaces a subset of the output table with the output of a write operation. The subset removed is + * determined by a set of filter expressions. + * + * Data source implementations can implement this interface in addition to {@link BatchWriteSupport} + * to support idempotent write operations that replace data matched by a set of delete filters with + * the result of the write operation. + * + * This is used to build idempotent writes. For example, a query that produces a daily summary + * may be run several times as new data arrives. Each run should replace the output of the last + * run for a particular day in the partitioned output table. Such a job would write using this + * WriteSupport and would pass a filter matching the previous job's output, like + * $"day" === '2018-08-22', to remove that data and commit the replacement data at + * the same time. + */ +public interface BatchOverwriteSupport extends BatchWriteSupport { --- End diff -- This class will be used to create the `WriteConfig` for idempotent overwrite operations. This would be triggered by an overwrite like this (the API could be different). ``` df.writeTo("table").overwrite($"day" === "2018-08-22") ``` That would produce an `OverwriteData(source, deleteFilter, query)` logical plan, which would result in the exec node calling this to create the write config. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
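The filter-based variant could be sketched as follows. The stand-in below expresses the delete condition with the public `org.apache.spark.sql.sources.Filter` API purely for illustration; the PR itself may settle on a different filter representation, and the names are not its actual signatures.

```scala
// Sketch only: idempotent "delete rows matching the filters, then commit the new data" write.
import org.apache.spark.sql.sources.{EqualTo, Filter}
import org.apache.spark.sql.types.StructType

trait WriteConfig

trait ExampleOverwriteSupport {
  def createOverwriteConfig(
      schema: StructType,
      options: Map[String, String],
      deleteFilters: Array[Filter]): WriteConfig
}

object ExampleOverwriteUsage {
  // A daily-summary job replacing one day's output would pass a filter for that day,
  // so reruns overwrite the previous run's data instead of appending to it.
  def dailyOverwriteFilters(day: String): Array[Filter] = Array[Filter](EqualTo("day", day))
}
```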
[GitHub] spark pull request #22190: [SPARK-25188][SQL] Add WriteConfig to v2 write AP...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/22190#discussion_r212118021 --- Diff: external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceProvider.scala --- @@ -279,10 +277,7 @@ private[kafka010] class KafkaSourceProvider extends DataSourceRegister // We convert the options argument from V2 -> Java map -> scala mutable -> scala immutable. val producerParams = kafkaParamsForProducer(options.asMap.asScala.toMap) -KafkaWriter.validateQuery( --- End diff -- This query validation happens in `KafkaStreamingWriteSupport`. It was duplicated here and in that class. Now it happens just once, when the write config is created. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
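The validate-once pattern referred to here looks roughly like the following; the names and the specific check are simplified stand-ins for the Kafka writer's actual validation, shown only to illustrate where the single validation call now lives.

```scala
// Minimal sketch of "validate once, where the per-write config is created".
import org.apache.spark.sql.types.StructType

case class ExampleWriteConfig(schema: StructType, options: Map[String, String])

object ExampleKafkaStyleWriteSupport {
  // Stand-in for the query validation (e.g. the Kafka writer requires a 'value' column).
  private def validateQuery(schema: StructType): Unit =
    require(schema.fieldNames.contains("value"), "Required attribute 'value' not found")

  // Validation runs exactly once, here, instead of being duplicated in the provider
  // and in the write-support class.
  def createWriteConfig(schema: StructType, options: Map[String, String]): ExampleWriteConfig = {
    validateQuery(schema)
    ExampleWriteConfig(schema, options)
  }
}
```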
[GitHub] spark issue #22190: SPARK-25188: Add WriteConfig to v2 write API.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22190 **[Test build #95123 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95123/testReport)** for PR 22190 at commit [`e3fcc83`](https://github.com/apache/spark/commit/e3fcc83a4a55576821573ceb9a3a56b89218a187). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22190: SPARK-25188: Add WriteConfig to v2 write API.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22190 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22190: SPARK-25188: Add WriteConfig to v2 write API.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22190 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2461/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22190: SPARK-25188: Add WriteConfig to v2 write API.
GitHub user rdblue opened a pull request: https://github.com/apache/spark/pull/22190 SPARK-25188: Add WriteConfig to v2 write API. ## What changes were proposed in this pull request? This updates the v2 write path to a structure similar to the v2 read path. Individual writes are configured and tracked using `WriteConfig` (analogous to `ScanConfig`) and this config is passed to the methods of `WriteSupport` that are specific to a single write, like `commit` and `abort`. This new config will be used to communicate overwrite options to data sources that implement new support classes, `BatchOverwriteSupport` and `BatchPartitionOverwriteSupport`. The new config could also be used by implementations to get and hold locks to make operations atomic. Streaming is also updated to use a `StreamingWriteConfig`, which carries the options that are specific to a write, like the schema, output mode, and write options. ## How was this patch tested? This is primarily an API change and should pass existing tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/rdblue/spark SPARK-25188-add-write-config Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22190.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22190 commit e3fcc83a4a55576821573ceb9a3a56b89218a187 Author: Ryan Blue Date: 2018-08-22T21:17:11Z SPARK-25188: Add WriteConfig to v2 write API. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
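As a rough picture of the lifecycle this description outlines, here is a self-contained Scala sketch with stand-in names rather than the PR's actual interfaces: the per-write config is created once and then threaded through commit and abort, which is what lets a source carry overwrite filters, locks, or transaction ids for a single write.

```scala
// Sketch only: stand-in traits, not the proposed Java interfaces or their signatures.
import org.apache.spark.sql.types.StructType

trait WriteConfig
trait CommitMessage

trait ExampleBatchWriteSupport {
  def createWriteConfig(schema: StructType, options: Map[String, String]): WriteConfig
  def commit(config: WriteConfig, messages: Array[CommitMessage]): Unit
  def abort(config: WriteConfig, messages: Array[CommitMessage]): Unit
}

object ExampleWriteDriver {
  // Driver-side flow, roughly: configure once, run the writer tasks, then commit or abort
  // with the same config.
  def runBatchWrite(
      support: ExampleBatchWriteSupport,
      schema: StructType,
      options: Map[String, String])(
      runTasks: WriteConfig => Array[CommitMessage]): Unit = {
    val config = support.createWriteConfig(schema, options)
    try {
      support.commit(config, runTasks(config))
    } catch {
      case e: Throwable =>
        support.abort(config, Array.empty[CommitMessage])
        throw e
    }
  }
}
```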
[GitHub] spark issue #22187: [SPARK-25178][SQL] change the generated code of the keyS...
Github user rednaxelafx commented on the issue: https://github.com/apache/spark/pull/22187 So the new solution is to ship the `StructType` object directly as a reference object. Why not? ;-) I'm +1 on shipping the object directly instead of generating code to recreate it on the executor. If other reviewers are also on board with this approach, let's update the PR title and the JIRA ticket title to reflect that (since we're no longer generating any "names" at all now). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
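For context, the reference-object approach being endorsed looks roughly like the sketch below. It assumes `CodegenContext.addReferenceObj` behaves as it does on current master (returning a code fragment that reads the object back from the generated class's references array); the exact accessor string is an implementation detail, and the schema here is made up for illustration.

```scala
// Sketch of "ship the StructType as a reference object" vs. regenerating it in codegen.
import org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object ReferenceObjSketch {
  def main(args: Array[String]): Unit = {
    val ctx = new CodegenContext
    val keySchema = StructType(Seq(StructField("k1", StringType), StructField("k2", StringType)))

    // Ship the object: the generated code only contains an accessor into the references
    // array, and the StructType instance itself travels with the generated class's references.
    val schemaRef: String = ctx.addReferenceObj("keySchema", keySchema)
    println(schemaRef)

    // The alternative -- emitting Java source that rebuilds the StructType field by field --
    // would inline every field name into the generated code, which is what the PR avoids.
  }
}
```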