[GitHub] [spark] dongjoon-hyun commented on issue #28090: [SPARK-31321][SQL] Remove SaveMode check in v2 FileWriteBuilder
dongjoon-hyun commented on issue #28090: [SPARK-31321][SQL] Remove SaveMode check in v2 FileWriteBuilder URL: https://github.com/apache/spark/pull/28090#issuecomment-607965303 Thank you, @cloud-fan and @yaooqinn . This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28102: [SPARK-31327][SQL] Write Spark version into Avro file metadata
dongjoon-hyun commented on a change in pull request #28102: [SPARK-31327][SQL] Write Spark version into Avro file metadata URL: https://github.com/apache/spark/pull/28102#discussion_r402460915

## File path: external/avro/src/main/scala/org/apache/spark/sql/avro/AvroOutputWriter.scala

## @@ -45,20 +48,19 @@
   * Overrides the couple of methods responsible for generating the output streams / files so
   * that the data can be correctly partitioned
   */
-  private val recordWriter: RecordWriter[AvroKey[GenericRecord], NullWritable] =
-    new AvroKeyOutputFormat[GenericRecord]() {
-
+  private val recordWriter: RecordWriter[AvroKey[GenericRecord], NullWritable] = {
+    val sparkVersion = Map(SPARK_VERSION_METADATA_KEY -> SPARK_VERSION_SHORT).asJava

Review comment: +1 for @MaxGekk 's comment.
[GitHub] [spark] AmplabJenkins removed a comment on issue #28100: [SPARK-31249][CORE]Fix flaky CoarseGrainedSchedulerBackendSuite.custom log url for Spark UI is applied
AmplabJenkins removed a comment on issue #28100: [SPARK-31249][CORE]Fix flaky CoarseGrainedSchedulerBackendSuite.custom log url for Spark UI is applied URL: https://github.com/apache/spark/pull/28100#issuecomment-607961900 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #28100: [SPARK-31249][CORE]Fix flaky CoarseGrainedSchedulerBackendSuite.custom log url for Spark UI is applied
AmplabJenkins removed a comment on issue #28100: [SPARK-31249][CORE]Fix flaky CoarseGrainedSchedulerBackendSuite.custom log url for Spark UI is applied URL: https://github.com/apache/spark/pull/28100#issuecomment-607961915 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/120724/ Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #28100: [SPARK-31249][CORE]Fix flaky CoarseGrainedSchedulerBackendSuite.custom log url for Spark UI is applied
AmplabJenkins commented on issue #28100: [SPARK-31249][CORE]Fix flaky CoarseGrainedSchedulerBackendSuite.custom log url for Spark UI is applied URL: https://github.com/apache/spark/pull/28100#issuecomment-607961900 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #28100: [SPARK-31249][CORE]Fix flaky CoarseGrainedSchedulerBackendSuite.custom log url for Spark UI is applied
AmplabJenkins commented on issue #28100: [SPARK-31249][CORE]Fix flaky CoarseGrainedSchedulerBackendSuite.custom log url for Spark UI is applied URL: https://github.com/apache/spark/pull/28100#issuecomment-607961915 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/120724/ Test PASSed.
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28102: [SPARK-31327][SQL] Write Spark version into Avro file metadata
dongjoon-hyun commented on a change in pull request #28102: [SPARK-31327][SQL] Write Spark version into Avro file metadata URL: https://github.com/apache/spark/pull/28102#discussion_r402457869

## File path: external/avro/src/main/java/org/apache/spark/sql/avro/SparkAvroKeyOutputFormat.java

## @@ -0,0 +1,92 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.avro;
+
+import java.io.IOException;
+import java.io.OutputStream;
+import java.util.Map;
+
+import org.apache.avro.Schema;
+import org.apache.avro.file.CodecFactory;
+import org.apache.avro.file.DataFileWriter;
+import org.apache.avro.generic.GenericData;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.avro.mapred.AvroKey;
+import org.apache.avro.mapreduce.AvroKeyOutputFormat;
+import org.apache.avro.mapreduce.Syncable;
+import org.apache.hadoop.io.NullWritable;
+import org.apache.hadoop.mapreduce.RecordWriter;
+import org.apache.hadoop.mapreduce.TaskAttemptContext;
+
+// A variant of `AvroKeyOutputFormat`, which is used to inject the custom `RecordWriterFactory` so
+// that we can set avro file metadata.
+public class SparkAvroKeyOutputFormat extends AvroKeyOutputFormat<GenericRecord> {
+  public SparkAvroKeyOutputFormat(Map<String, String> metadata) {
+    super(new SparkRecordWriterFactory(metadata));
+  }
+
+  static class SparkRecordWriterFactory extends RecordWriterFactory<GenericRecord> {
+    private final Map<String, String> metadata;
+
+    SparkRecordWriterFactory(Map<String, String> metadata) {
+      this.metadata = metadata;
+    }
+
+    protected RecordWriter<AvroKey<GenericRecord>, NullWritable> create(
+        Schema writerSchema,
+        GenericData dataModel,
+        CodecFactory compressionCodec,
+        OutputStream outputStream,
+        int syncInterval) throws IOException {
+      return new SparkAvroKeyRecordWriter<>(
+        writerSchema, dataModel, compressionCodec, outputStream, syncInterval, metadata);
+    }
+  }
+}
+
+// This is a fork of org.apache.avro.mapreduce.AvroKeyRecordWriter, in order to set file metadata.
+class SparkAvroKeyRecordWriter<T> extends RecordWriter<AvroKey<T>, NullWritable>
+    implements Syncable {
+
+  private final DataFileWriter<T> mAvroFileWriter;
+
+  public SparkAvroKeyRecordWriter(
+      Schema writerSchema,
+      GenericData dataModel,
+      CodecFactory compressionCodec,
+      OutputStream outputStream,
+      int syncInterval,
+      Map<String, String> metadata) throws IOException {

Review comment: This looks good to generalize later. Do you have another candidate in mind for adding later?
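The pattern under review, reduced to its essentials: `AvroKeyOutputFormat` builds its own record writer through a factory hook with a fixed argument list, so the only way to thread extra state (the metadata map) into the writer is to capture it in a custom factory. A minimal, self-contained sketch of that injection pattern follows; the names `HeaderWriter` and `HeaderWriterFactory` are hypothetical stand-ins, not part of the Avro or Spark APIs.

```java
import java.io.StringWriter;
import java.util.LinkedHashMap;
import java.util.Map;

// Stand-in for a record writer that must emit file-level metadata
// before any records are written (hypothetical class, not an Avro API).
class HeaderWriter {
    private final StringWriter out = new StringWriter();

    HeaderWriter(Map<String, String> metadata) {
        // Analogous to DataFileWriter.setMeta(): metadata lands in the
        // file header, ahead of the data blocks.
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            out.write(e.getKey() + "=" + e.getValue() + "\n");
        }
    }

    void writeRecord(String record) {
        out.write(record + "\n");
    }

    String contents() {
        return out.toString();
    }
}

// The injection point: create() cannot take extra arguments, so the
// metadata is captured in the factory's state, the same way
// SparkRecordWriterFactory holds it in the patch above.
class HeaderWriterFactory {
    private final Map<String, String> metadata;

    HeaderWriterFactory(Map<String, String> metadata) {
        this.metadata = metadata;
    }

    HeaderWriter create() {
        return new HeaderWriter(metadata);
    }
}

public class FactoryInjectionSketch {
    public static void main(String[] args) {
        Map<String, String> meta = new LinkedHashMap<>();
        meta.put("org.apache.spark.version", "3.0.0");
        HeaderWriter w = new HeaderWriterFactory(meta).create();
        w.writeRecord("record-1");
        System.out.println(w.contents());
    }
}
```

The same shape appears in the real patch: the output format's constructor passes the factory to `super(...)`, and the factory's `create(...)` forwards the captured map into the forked record writer.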
[GitHub] [spark] SparkQA removed a comment on issue #28100: [SPARK-31249][CORE]Fix flaky CoarseGrainedSchedulerBackendSuite.custom log url for Spark UI is applied
SparkQA removed a comment on issue #28100: [SPARK-31249][CORE]Fix flaky CoarseGrainedSchedulerBackendSuite.custom log url for Spark UI is applied URL: https://github.com/apache/spark/pull/28100#issuecomment-607854256 **[Test build #120724 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120724/testReport)** for PR 28100 at commit [`e32cc30`](https://github.com/apache/spark/commit/e32cc30e470edc149427e0988a3c20c2220303c2).
[GitHub] [spark] SparkQA commented on issue #28100: [SPARK-31249][CORE]Fix flaky CoarseGrainedSchedulerBackendSuite.custom log url for Spark UI is applied
SparkQA commented on issue #28100: [SPARK-31249][CORE]Fix flaky CoarseGrainedSchedulerBackendSuite.custom log url for Spark UI is applied URL: https://github.com/apache/spark/pull/28100#issuecomment-607960654 **[Test build #120724 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120724/testReport)** for PR 28100 at commit [`e32cc30`](https://github.com/apache/spark/commit/e32cc30e470edc149427e0988a3c20c2220303c2).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] [spark] AmplabJenkins removed a comment on issue #28101: [SPARK-31328][SQL] Fix rebasing of overlapped local timestamps during daylight saving time
AmplabJenkins removed a comment on issue #28101: [SPARK-31328][SQL] Fix rebasing of overlapped local timestamps during daylight saving time URL: https://github.com/apache/spark/pull/28101#issuecomment-607958830 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #27864: [SPARK-20732][CORE] Decommission cache blocks to other executors when an executor is decommissioned
AmplabJenkins removed a comment on issue #27864: [SPARK-20732][CORE] Decommission cache blocks to other executors when an executor is decommissioned URL: https://github.com/apache/spark/pull/27864#issuecomment-607958185 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/120727/ Test FAILed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #28101: [SPARK-31328][SQL] Fix rebasing of overlapped local timestamps during daylight saving time
AmplabJenkins removed a comment on issue #28101: [SPARK-31328][SQL] Fix rebasing of overlapped local timestamps during daylight saving time URL: https://github.com/apache/spark/pull/28101#issuecomment-607958837 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/25429/ Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #28101: [SPARK-31328][SQL] Fix rebasing of overlapped local timestamps during daylight saving time
AmplabJenkins commented on issue #28101: [SPARK-31328][SQL] Fix rebasing of overlapped local timestamps during daylight saving time URL: https://github.com/apache/spark/pull/28101#issuecomment-607958830 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #28101: [SPARK-31328][SQL] Fix rebasing of overlapped local timestamps during daylight saving time
AmplabJenkins commented on issue #28101: [SPARK-31328][SQL] Fix rebasing of overlapped local timestamps during daylight saving time URL: https://github.com/apache/spark/pull/28101#issuecomment-607958837 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/25429/ Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #27864: [SPARK-20732][CORE] Decommission cache blocks to other executors when an executor is decommissioned
AmplabJenkins removed a comment on issue #27864: [SPARK-20732][CORE] Decommission cache blocks to other executors when an executor is decommissioned URL: https://github.com/apache/spark/pull/27864#issuecomment-607958176 Merged build finished. Test FAILed.
[GitHub] [spark] AmplabJenkins commented on issue #27864: [SPARK-20732][CORE] Decommission cache blocks to other executors when an executor is decommissioned
AmplabJenkins commented on issue #27864: [SPARK-20732][CORE] Decommission cache blocks to other executors when an executor is decommissioned URL: https://github.com/apache/spark/pull/27864#issuecomment-607958176 Merged build finished. Test FAILed.
[GitHub] [spark] AmplabJenkins commented on issue #27864: [SPARK-20732][CORE] Decommission cache blocks to other executors when an executor is decommissioned
AmplabJenkins commented on issue #27864: [SPARK-20732][CORE] Decommission cache blocks to other executors when an executor is decommissioned URL: https://github.com/apache/spark/pull/27864#issuecomment-607958185 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/120727/ Test FAILed.
[GitHub] [spark] SparkQA commented on issue #28101: [SPARK-31328][SQL] Fix rebasing of overlapped local timestamps during daylight saving time
SparkQA commented on issue #28101: [SPARK-31328][SQL] Fix rebasing of overlapped local timestamps during daylight saving time URL: https://github.com/apache/spark/pull/28101#issuecomment-607958209 **[Test build #120731 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120731/testReport)** for PR 28101 at commit [`29d4599`](https://github.com/apache/spark/commit/29d4599bc542d3c87a1489933e679a31080801a6).
[GitHub] [spark] AmplabJenkins removed a comment on issue #27207: [SPARK-18886][CORE] Make Locality wait time measure resource under utilization due to delay scheduling.
AmplabJenkins removed a comment on issue #27207: [SPARK-18886][CORE] Make Locality wait time measure resource under utilization due to delay scheduling. URL: https://github.com/apache/spark/pull/27207#issuecomment-607956174 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/120728/ Test FAILed.
[GitHub] [spark] SparkQA commented on issue #27864: [SPARK-20732][CORE] Decommission cache blocks to other executors when an executor is decommissioned
SparkQA commented on issue #27864: [SPARK-20732][CORE] Decommission cache blocks to other executors when an executor is decommissioned URL: https://github.com/apache/spark/pull/27864#issuecomment-607957637 **[Test build #120727 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120727/testReport)** for PR 27864 at commit [`076dd67`](https://github.com/apache/spark/commit/076dd671d9a9ac3c8d5926279f54517b8b39426a).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28102: [SPARK-31327][SQL] Write Spark version into Avro file metadata
dongjoon-hyun commented on a change in pull request #28102: [SPARK-31327][SQL] Write Spark version into Avro file metadata URL: https://github.com/apache/spark/pull/28102#discussion_r402453349

## File path: external/avro/src/main/java/org/apache/spark/sql/avro/SparkAvroKeyOutputFormat.java

## @@ -0,0 +1,92 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.avro;
+
+import java.io.IOException;
+import java.io.OutputStream;
+import java.util.Map;
+
+import org.apache.avro.Schema;
+import org.apache.avro.file.CodecFactory;
+import org.apache.avro.file.DataFileWriter;
+import org.apache.avro.generic.GenericData;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.avro.mapred.AvroKey;
+import org.apache.avro.mapreduce.AvroKeyOutputFormat;
+import org.apache.avro.mapreduce.Syncable;
+import org.apache.hadoop.io.NullWritable;
+import org.apache.hadoop.mapreduce.RecordWriter;
+import org.apache.hadoop.mapreduce.TaskAttemptContext;
+
+// A variant of `AvroKeyOutputFormat`, which is used to inject the custom `RecordWriterFactory` so
+// that we can set avro file metadata.
+public class SparkAvroKeyOutputFormat extends AvroKeyOutputFormat<GenericRecord> {
+  public SparkAvroKeyOutputFormat(Map<String, String> metadata) {
+    super(new SparkRecordWriterFactory(metadata));
+  }
+
+  static class SparkRecordWriterFactory extends RecordWriterFactory<GenericRecord> {
+    private final Map<String, String> metadata;
+
+    SparkRecordWriterFactory(Map<String, String> metadata) {
+      this.metadata = metadata;
+    }
+
+    protected RecordWriter<AvroKey<GenericRecord>, NullWritable> create(
+        Schema writerSchema,
+        GenericData dataModel,
+        CodecFactory compressionCodec,
+        OutputStream outputStream,
+        int syncInterval) throws IOException {
+      return new SparkAvroKeyRecordWriter<>(
+        writerSchema, dataModel, compressionCodec, outputStream, syncInterval, metadata);
+    }
+  }
+}
+
+// This is a fork of org.apache.avro.mapreduce.AvroKeyRecordWriter, in order to set file metadata.
+class SparkAvroKeyRecordWriter<T> extends RecordWriter<AvroKey<T>, NullWritable>
+    implements Syncable {
+
+  private final DataFileWriter<T> mAvroFileWriter;
+
+  public SparkAvroKeyRecordWriter(
+      Schema writerSchema,
+      GenericData dataModel,
+      CodecFactory compressionCodec,
+      OutputStream outputStream,
+      int syncInterval,
+      Map<String, String> metadata) throws IOException {
+    this.mAvroFileWriter = new DataFileWriter<>(dataModel.createDatumWriter(writerSchema));
+    for (Map.Entry<String, String> entry : metadata.entrySet()) {
+      this.mAvroFileWriter.setMeta(entry.getKey(), entry.getValue());
+    }
+    this.mAvroFileWriter.setCodec(compressionCodec);
+    this.mAvroFileWriter.setSyncInterval(syncInterval);
+    this.mAvroFileWriter.create(writerSchema, outputStream);
+  }
+
+  public void write(AvroKey<T> record, NullWritable ignore) throws IOException {

Review comment: @cloud-fan . ~Do we need to check the effect in terms of performance because this is an additional wrapper technically?~ Oh, never mind.
[GitHub] [spark] SparkQA removed a comment on issue #27864: [SPARK-20732][CORE] Decommission cache blocks to other executors when an executor is decommissioned
SparkQA removed a comment on issue #27864: [SPARK-20732][CORE] Decommission cache blocks to other executors when an executor is decommissioned URL: https://github.com/apache/spark/pull/27864#issuecomment-607871644 **[Test build #120727 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120727/testReport)** for PR 27864 at commit [`076dd67`](https://github.com/apache/spark/commit/076dd671d9a9ac3c8d5926279f54517b8b39426a).
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28102: [SPARK-31327][SQL] Write Spark version into Avro file metadata
dongjoon-hyun commented on a change in pull request #28102: [SPARK-31327][SQL] Write Spark version into Avro file metadata URL: https://github.com/apache/spark/pull/28102#discussion_r402453349

## File path: external/avro/src/main/java/org/apache/spark/sql/avro/SparkAvroKeyOutputFormat.java

## @@ -0,0 +1,92 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.avro;
+
+import java.io.IOException;
+import java.io.OutputStream;
+import java.util.Map;
+
+import org.apache.avro.Schema;
+import org.apache.avro.file.CodecFactory;
+import org.apache.avro.file.DataFileWriter;
+import org.apache.avro.generic.GenericData;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.avro.mapred.AvroKey;
+import org.apache.avro.mapreduce.AvroKeyOutputFormat;
+import org.apache.avro.mapreduce.Syncable;
+import org.apache.hadoop.io.NullWritable;
+import org.apache.hadoop.mapreduce.RecordWriter;
+import org.apache.hadoop.mapreduce.TaskAttemptContext;
+
+// A variant of `AvroKeyOutputFormat`, which is used to inject the custom `RecordWriterFactory` so
+// that we can set avro file metadata.
+public class SparkAvroKeyOutputFormat extends AvroKeyOutputFormat<GenericRecord> {
+  public SparkAvroKeyOutputFormat(Map<String, String> metadata) {
+    super(new SparkRecordWriterFactory(metadata));
+  }
+
+  static class SparkRecordWriterFactory extends RecordWriterFactory<GenericRecord> {
+    private final Map<String, String> metadata;
+
+    SparkRecordWriterFactory(Map<String, String> metadata) {
+      this.metadata = metadata;
+    }
+
+    protected RecordWriter<AvroKey<GenericRecord>, NullWritable> create(
+        Schema writerSchema,
+        GenericData dataModel,
+        CodecFactory compressionCodec,
+        OutputStream outputStream,
+        int syncInterval) throws IOException {
+      return new SparkAvroKeyRecordWriter<>(
+        writerSchema, dataModel, compressionCodec, outputStream, syncInterval, metadata);
+    }
+  }
+}
+
+// This is a fork of org.apache.avro.mapreduce.AvroKeyRecordWriter, in order to set file metadata.
+class SparkAvroKeyRecordWriter<T> extends RecordWriter<AvroKey<T>, NullWritable>
+    implements Syncable {
+
+  private final DataFileWriter<T> mAvroFileWriter;
+
+  public SparkAvroKeyRecordWriter(
+      Schema writerSchema,
+      GenericData dataModel,
+      CodecFactory compressionCodec,
+      OutputStream outputStream,
+      int syncInterval,
+      Map<String, String> metadata) throws IOException {
+    this.mAvroFileWriter = new DataFileWriter<>(dataModel.createDatumWriter(writerSchema));
+    for (Map.Entry<String, String> entry : metadata.entrySet()) {
+      this.mAvroFileWriter.setMeta(entry.getKey(), entry.getValue());
+    }
+    this.mAvroFileWriter.setCodec(compressionCodec);
+    this.mAvroFileWriter.setSyncInterval(syncInterval);
+    this.mAvroFileWriter.create(writerSchema, outputStream);
+  }
+
+  public void write(AvroKey<T> record, NullWritable ignore) throws IOException {

Review comment: @cloud-fan . Do we need to check the effect in terms of performance because this is an additional wrapper technically?
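The point of the patch being reviewed is that a key/value pair written into the file header at create time can be read back cheaply later, e.g. to learn which Spark version produced a file without scanning its data blocks. A sketch of that round trip, using `java.util.Properties` purely as a stand-in for the Avro container header; the metadata key mirrors Spark's version key, but this code is illustrative, not the Spark or Avro implementation:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Properties;

public class VersionMetadataRoundTrip {
    // Writer side: the metadata is fixed before any data is written,
    // just as DataFileWriter.setMeta(...) must precede create(...).
    static byte[] writeWithVersion(String version) throws IOException {
        Properties header = new Properties();
        header.setProperty("org.apache.spark.version", version);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        header.store(out, null);
        return out.toByteArray();
    }

    // Reader side: recover the writer's version from the header alone.
    static String readVersion(byte[] file) throws IOException {
        Properties header = new Properties();
        header.load(new ByteArrayInputStream(file));
        return header.getProperty("org.apache.spark.version");
    }

    public static void main(String[] args) throws IOException {
        byte[] file = writeWithVersion("3.0.0");
        System.out.println(readVersion(file)); // prints 3.0.0
    }
}
```

In the real patch the same write-once constraint is why the metadata map must reach the record writer's constructor: once `DataFileWriter.create(...)` has emitted the header, the metadata can no longer be changed.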
[GitHub] [spark] AmplabJenkins removed a comment on issue #27207: [SPARK-18886][CORE] Make Locality wait time measure resource under utilization due to delay scheduling.
AmplabJenkins removed a comment on issue #27207: [SPARK-18886][CORE] Make Locality wait time measure resource under utilization due to delay scheduling. URL: https://github.com/apache/spark/pull/27207#issuecomment-607956166 Merged build finished. Test FAILed.
[GitHub] [spark] AmplabJenkins commented on issue #27207: [SPARK-18886][CORE] Make Locality wait time measure resource under utilization due to delay scheduling.
AmplabJenkins commented on issue #27207: [SPARK-18886][CORE] Make Locality wait time measure resource under utilization due to delay scheduling. URL: https://github.com/apache/spark/pull/27207#issuecomment-607956166 Merged build finished. Test FAILed.
[GitHub] [spark] AmplabJenkins commented on issue #27207: [SPARK-18886][CORE] Make Locality wait time measure resource under utilization due to delay scheduling.
AmplabJenkins commented on issue #27207: [SPARK-18886][CORE] Make Locality wait time measure resource under utilization due to delay scheduling. URL: https://github.com/apache/spark/pull/27207#issuecomment-607956174 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/120728/ Test FAILed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #27724: [SPARK-30973][SQL] ScriptTransformationExec should wait for the termination …
AmplabJenkins removed a comment on issue #27724: [SPARK-30973][SQL] ScriptTransformationExec should wait for the termination … URL: https://github.com/apache/spark/pull/27724#issuecomment-607955698 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #27724: [SPARK-30973][SQL] ScriptTransformationExec should wait for the termination …
AmplabJenkins removed a comment on issue #27724: [SPARK-30973][SQL] ScriptTransformationExec should wait for the termination … URL: https://github.com/apache/spark/pull/27724#issuecomment-607955709 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/120726/ Test PASSed.
[GitHub] [spark] SparkQA removed a comment on issue #27207: [SPARK-18886][CORE] Make Locality wait time measure resource under utilization due to delay scheduling.
SparkQA removed a comment on issue #27207: [SPARK-18886][CORE] Make Locality wait time measure resource under utilization due to delay scheduling. URL: https://github.com/apache/spark/pull/27207#issuecomment-607884832 **[Test build #120728 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120728/testReport)** for PR 27207 at commit [`8d35ea1`](https://github.com/apache/spark/commit/8d35ea14edfb8bd7cc276020784c13caf030532a).
[GitHub] [spark] AmplabJenkins commented on issue #27724: [SPARK-30973][SQL] ScriptTransformationExec should wait for the termination …
AmplabJenkins commented on issue #27724: [SPARK-30973][SQL] ScriptTransformationExec should wait for the termination … URL: https://github.com/apache/spark/pull/27724#issuecomment-607955698 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #27724: [SPARK-30973][SQL] ScriptTransformationExec should wait for the termination …
AmplabJenkins commented on issue #27724: [SPARK-30973][SQL] ScriptTransformationExec should wait for the termination … URL: https://github.com/apache/spark/pull/27724#issuecomment-607955709 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/120726/ Test PASSed.
[GitHub] [spark] SparkQA removed a comment on issue #27724: [SPARK-30973][SQL] ScriptTransformationExec should wait for the termination …
SparkQA removed a comment on issue #27724: [SPARK-30973][SQL] ScriptTransformationExec should wait for the termination … URL: https://github.com/apache/spark/pull/27724#issuecomment-607867530 **[Test build #120726 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120726/testReport)** for PR 27724 at commit [`99010e0`](https://github.com/apache/spark/commit/99010e0a020c7e72cfc0ac89e50b8a5b819a65ba).
[GitHub] [spark] SparkQA commented on issue #27207: [SPARK-18886][CORE] Make Locality wait time measure resource under utilization due to delay scheduling.
SparkQA commented on issue #27207: [SPARK-18886][CORE] Make Locality wait time measure resource under utilization due to delay scheduling. URL: https://github.com/apache/spark/pull/27207#issuecomment-607955673 **[Test build #120728 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120728/testReport)** for PR 27207 at commit [`8d35ea1`](https://github.com/apache/spark/commit/8d35ea14edfb8bd7cc276020784c13caf030532a).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] [spark] SparkQA commented on issue #27724: [SPARK-30973][SQL] ScriptTransformationExec should wait for the termination …
SparkQA commented on issue #27724: [SPARK-30973][SQL] ScriptTransformationExec should wait for the termination … URL: https://github.com/apache/spark/pull/27724#issuecomment-607954854 **[Test build #120726 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120726/testReport)** for PR 27724 at commit [`99010e0`](https://github.com/apache/spark/commit/99010e0a020c7e72cfc0ac89e50b8a5b819a65ba).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] [spark] dongjoon-hyun commented on issue #28102: [SPARK-31327][SQL] Write Spark version into Avro file metadata
dongjoon-hyun commented on issue #28102: [SPARK-31327][SQL] Write Spark version into Avro file metadata URL: https://github.com/apache/spark/pull/28102#issuecomment-607954215 Thank you for pinging me, @cloud-fan .
[GitHub] [spark] MaxGekk commented on a change in pull request #28102: [SPARK-31327][SQL] Write Spark version into Avro file metadata
MaxGekk commented on a change in pull request #28102: [SPARK-31327][SQL] Write Spark version into Avro file metadata URL: https://github.com/apache/spark/pull/28102#discussion_r402442002

## File path: external/avro/src/main/java/org/apache/spark/sql/avro/SparkAvroKeyOutputFormat.java ##

@@ -0,0 +1,92 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.avro;
+
+import java.io.IOException;
+import java.io.OutputStream;
+import java.util.Map;
+
+import org.apache.avro.Schema;
+import org.apache.avro.file.CodecFactory;
+import org.apache.avro.file.DataFileWriter;
+import org.apache.avro.generic.GenericData;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.avro.mapred.AvroKey;
+import org.apache.avro.mapreduce.AvroKeyOutputFormat;
+import org.apache.avro.mapreduce.Syncable;
+import org.apache.hadoop.io.NullWritable;
+import org.apache.hadoop.mapreduce.RecordWriter;
+import org.apache.hadoop.mapreduce.TaskAttemptContext;
+
+// A variant of `AvroKeyOutputFormat`, which is used to inject the custom `RecordWriterFactory` so
+// that we can set avro file metadata.
+public class SparkAvroKeyOutputFormat extends AvroKeyOutputFormat<GenericRecord> {
+  public SparkAvroKeyOutputFormat(Map<String, String> metadata) {
+    super(new SparkRecordWriterFactory(metadata));
+  }
+
+  static class SparkRecordWriterFactory extends RecordWriterFactory<GenericRecord> {
+    private final Map<String, String> metadata;
+
+    SparkRecordWriterFactory(Map<String, String> metadata) {
+      this.metadata = metadata;
+    }
+
+    protected RecordWriter<AvroKey<GenericRecord>, NullWritable> create(
+        Schema writerSchema,
+        GenericData dataModel,
+        CodecFactory compressionCodec,
+        OutputStream outputStream,
+        int syncInterval) throws IOException {
+      return new SparkAvroKeyRecordWriter(
+        writerSchema, dataModel, compressionCodec, outputStream, syncInterval, metadata);
+    }
+  }
+}
+
+// This is a fork of org.apache.avro.mapreduce.AvroKeyRecordWriter, in order to set file metadata.
+class SparkAvroKeyRecordWriter extends RecordWriter<AvroKey<GenericRecord>, NullWritable>
+    implements Syncable {
+  private final DataFileWriter<GenericRecord> mAvroFileWriter;
+
+  public SparkAvroKeyRecordWriter(
+      Schema writerSchema,
+      GenericData dataModel,
+      CodecFactory compressionCodec,
+      OutputStream outputStream,
+      int syncInterval,
+      Map<String, String> metadata) throws IOException {
+    this.mAvroFileWriter = new DataFileWriter<>(dataModel.createDatumWriter(writerSchema));
+    for (Map.Entry<String, String> entry : metadata.entrySet()) {
+      this.mAvroFileWriter.setMeta(entry.getKey(), entry.getValue());
+    }

Review comment: this is the diff, right?
[GitHub] [spark] MaxGekk commented on a change in pull request #28102: [SPARK-31327][SQL] Write Spark version into Avro file metadata
MaxGekk commented on a change in pull request #28102: [SPARK-31327][SQL] Write Spark version into Avro file metadata URL: https://github.com/apache/spark/pull/28102#discussion_r402443422

## File path: external/avro/src/main/scala/org/apache/spark/sql/avro/AvroOutputWriter.scala ##

@@ -45,20 +48,19 @@
   * Overrides the couple of methods responsible for generating the output streams / files so
   * that the data can be correctly partitioned
   */
-  private val recordWriter: RecordWriter[AvroKey[GenericRecord], NullWritable] =
-    new AvroKeyOutputFormat[GenericRecord]() {
-
+  private val recordWriter: RecordWriter[AvroKey[GenericRecord], NullWritable] = {
+    val sparkVersion = Map(SPARK_VERSION_METADATA_KEY -> SPARK_VERSION_SHORT).asJava

Review comment: Could you update the comment https://github.com/apache/spark/blob/86cc907448f0102ad0c185e87fcc897d0a32707f/sql/core/src/main/scala/org/apache/spark/sql/package.scala#L50-L51
[GitHub] [spark] AmplabJenkins commented on issue #28097: [SPARK-31325][SQL][Web UI] Control a plan explain mode in the events of SQL listeners via SQLConf
AmplabJenkins commented on issue #28097: [SPARK-31325][SQL][Web UI] Control a plan explain mode in the events of SQL listeners via SQLConf URL: https://github.com/apache/spark/pull/28097#issuecomment-607949595 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/120722/ Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #28097: [SPARK-31325][SQL][Web UI] Control a plan explain mode in the events of SQL listeners via SQLConf
AmplabJenkins removed a comment on issue #28097: [SPARK-31325][SQL][Web UI] Control a plan explain mode in the events of SQL listeners via SQLConf URL: https://github.com/apache/spark/pull/28097#issuecomment-607949595 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/120722/ Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #28097: [SPARK-31325][SQL][Web UI] Control a plan explain mode in the events of SQL listeners via SQLConf
AmplabJenkins removed a comment on issue #28097: [SPARK-31325][SQL][Web UI] Control a plan explain mode in the events of SQL listeners via SQLConf URL: https://github.com/apache/spark/pull/28097#issuecomment-607949582 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #28097: [SPARK-31325][SQL][Web UI] Control a plan explain mode in the events of SQL listeners via SQLConf
AmplabJenkins commented on issue #28097: [SPARK-31325][SQL][Web UI] Control a plan explain mode in the events of SQL listeners via SQLConf URL: https://github.com/apache/spark/pull/28097#issuecomment-607949582 Merged build finished. Test PASSed.
[GitHub] [spark] SparkQA removed a comment on issue #28097: [SPARK-31325][SQL][Web UI] Control a plan explain mode in the events of SQL listeners via SQLConf
SparkQA removed a comment on issue #28097: [SPARK-31325][SQL][Web UI] Control a plan explain mode in the events of SQL listeners via SQLConf URL: https://github.com/apache/spark/pull/28097#issuecomment-607800148 **[Test build #120722 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120722/testReport)** for PR 28097 at commit [`6ba5d18`](https://github.com/apache/spark/commit/6ba5d187436ea0d307ff8bfd5780539a57084548).
[GitHub] [spark] SparkQA commented on issue #28097: [SPARK-31325][SQL][Web UI] Control a plan explain mode in the events of SQL listeners via SQLConf
SparkQA commented on issue #28097: [SPARK-31325][SQL][Web UI] Control a plan explain mode in the events of SQL listeners via SQLConf URL: https://github.com/apache/spark/pull/28097#issuecomment-607948487 **[Test build #120722 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120722/testReport)** for PR 28097 at commit [`6ba5d18`](https://github.com/apache/spark/commit/6ba5d187436ea0d307ff8bfd5780539a57084548).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] [spark] huaxingao commented on a change in pull request #28099: [SPARK-31326][SQL][DOCS] Create Function docs structure for SQL Reference
huaxingao commented on a change in pull request #28099: [SPARK-31326][SQL][DOCS] Create Function docs structure for SQL Reference URL: https://github.com/apache/spark/pull/28099#discussion_r402423891 ## File path: docs/sql-ref-functions-builtin.md ## @@ -1,25 +1,26 @@ --- layout: global -title: Reference -displayTitle: Reference +title: Built-in Functions +displayTitle: Built-in Functions license: | Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file to You under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at - + http://www.apache.org/licenses/LICENSE-2.0 - + Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. --- -Spark SQL is Apache Spark's module for working with structured data. -This guide is a reference for Structured Query Language (SQL) for Apache -Spark. This document describes the SQL constructs supported by Spark in detail -along with usage examples when applicable. +Spark SQL defines built-in functions to use, a complete list of which can be found [here](api/sql/). Among them, Spark SQL has several special categories of built-in functions: [Aggregate Functions](sql-ref-functions-builtin-aggregate.html) to operate on a group of rows, [Array Functions](sql-ref-functions-builtin-array.html) to operate on Array columns, and [Date and Time Functions](sql-ref-functions-builtin-date-time.html) to operate on Date and Time. 
Review comment: I initially wanted scalar built-in functions and aggregate built-in functions to parallel the sections in UDFs, but I now think it's not worthwhile to document scalar built-in functions separately. It makes more sense to document some special categories, such as array functions, date and time functions, and maybe window functions and string functions as well. I also don't want to document too many here, because I need to finish quickly.
[GitHub] [spark] nchammas commented on issue #27928: [SPARK-31167][BUILD] Refactor how we track Python test/build dependencies
nchammas commented on issue #27928: [SPARK-31167][BUILD] Refactor how we track Python test/build dependencies URL: https://github.com/apache/spark/pull/27928#issuecomment-607930617 Following a common pattern in Python projects (e.g. Warehouse, Trio, my own Flintrock), I'm specifying broad development dependencies and then using pip-tools to compile them down to a pinned list of requirements. The broad requirements are for contributors to use locally, and the pinned requirements are for use by CI and our release tooling. I believe this should make everyone happy.
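The workflow described above — broad, human-edited constraints compiled down to a pinned lockfile — can be sketched with shell commands. This is a hedged illustration of the general pip-tools convention; the file names and packages here are examples, not taken from the PR itself:

```shell
# Broad, human-edited development dependencies live in requirements.in
# (pip-tools convention). Contributors install from this loose list.
cat > requirements.in <<'EOF'
# Loose constraints for local development.
flake8
sphinx
EOF

# pip-compile resolves the loose constraints into a fully pinned
# requirements.txt for CI and release tooling. Guarded so the sketch
# still runs on machines where pip-tools is not installed.
if command -v pip-compile >/dev/null 2>&1; then
  pip-compile requirements.in --output-file requirements.txt
else
  echo "pip-tools not installed; skipping compilation"
fi
```

The split keeps intent (what we depend on) separate from resolution (exactly which versions CI uses), so reproducible builds and easy local setup don't fight each other.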
[GitHub] [spark] rdblue commented on issue #28026: [SPARK-31257][SQL] Unify create table syntax (WIP)
rdblue commented on issue #28026: [SPARK-31257][SQL] Unify create table syntax (WIP) URL: https://github.com/apache/spark/pull/28026#issuecomment-607927357

> Sorry if this is answered above but why do we handle v2 APIs here together?

Create statement plans are already converted to v2 because the DataSource create syntax is supported. If the conversion to v2 plans didn't handle the new statement fields, then the new data would be ignored, and that's a correctness error.
[GitHub] [spark] AmplabJenkins commented on issue #28101: [SPARK-31328][SQL] Fix rebasing of overlapped local timestamps during daylight saving time
AmplabJenkins commented on issue #28101: [SPARK-31328][SQL] Fix rebasing of overlapped local timestamps during daylight saving time URL: https://github.com/apache/spark/pull/28101#issuecomment-607917767 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/25428/ Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #28101: [SPARK-31328][SQL] Fix rebasing of overlapped local timestamps during daylight saving time
AmplabJenkins commented on issue #28101: [SPARK-31328][SQL] Fix rebasing of overlapped local timestamps during daylight saving time URL: https://github.com/apache/spark/pull/28101#issuecomment-607917750 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #28101: [SPARK-31328][SQL] Fix rebasing of overlapped local timestamps during daylight saving time
AmplabJenkins removed a comment on issue #28101: [SPARK-31328][SQL] Fix rebasing of overlapped local timestamps during daylight saving time URL: https://github.com/apache/spark/pull/28101#issuecomment-607917750 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #28101: [SPARK-31328][SQL] Fix rebasing of overlapped local timestamps during daylight saving time
AmplabJenkins removed a comment on issue #28101: [SPARK-31328][SQL] Fix rebasing of overlapped local timestamps during daylight saving time URL: https://github.com/apache/spark/pull/28101#issuecomment-607917767 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/25428/ Test PASSed.
[GitHub] [spark] SparkQA commented on issue #28101: [SPARK-31328][SQL] Fix rebasing of overlapped local timestamps during daylight saving time
SparkQA commented on issue #28101: [SPARK-31328][SQL] Fix rebasing of overlapped local timestamps during daylight saving time URL: https://github.com/apache/spark/pull/28101#issuecomment-607916880 **[Test build #120730 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120730/testReport)** for PR 28101 at commit [`1654629`](https://github.com/apache/spark/commit/1654629e6d744fb6207de8d7c91ecd4758b342c5).
[GitHub] [spark] MaxGekk commented on a change in pull request #28101: [SPARK-31328][SQL] Fix rebasing of overlapped local timestamps during daylight saving time
MaxGekk commented on a change in pull request #28101: [SPARK-31328][SQL] Fix rebasing of overlapped local timestamps during daylight saving time URL: https://github.com/apache/spark/pull/28101#discussion_r402401376

## File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/DateTimeUtilsSuite.scala ##

@@ -821,4 +821,20 @@ class DateTimeUtilsSuite extends SparkFunSuite with Matchers with SQLHelper {
       days += 1
     }
   }
+
+  test("rebasing overlapped timestamps during daylight saving time") {

Review comment: done
[GitHub] [spark] MaxGekk commented on a change in pull request #28101: [SPARK-31328][SQL] Fix rebasing of overlapped local timestamps during daylight saving time
MaxGekk commented on a change in pull request #28101: [SPARK-31328][SQL] Fix rebasing of overlapped local timestamps during daylight saving time URL: https://github.com/apache/spark/pull/28101#discussion_r402401225

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala ##

@@ -1030,7 +1033,13 @@ object DateTimeUtils {
       cal.get(Calendar.SECOND),
       (Math.floorMod(micros, MICROS_PER_SECOND) * NANOS_PER_MICROS).toInt)
       .plusDays(cal.get(Calendar.DAY_OF_MONTH) - 1)
-    instantToMicros(localDateTime.atZone(ZoneId.systemDefault).toInstant)
+    val zonedDateTime = localDateTime.atZone(ZoneId.systemDefault)
+    val adjustedZdt = if (cal.get(Calendar.DST_OFFSET) == 0) {

Review comment: added
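The fix above deals with local timestamps that occur twice when clocks fall back. A small, self-contained `java.time` sketch (not the Spark code itself; the zone and date are illustrative) shows the ambiguity and how an explicit offset choice resolves it:

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZonedDateTime;

public class DstOverlapDemo {
    public static void main(String[] args) {
        // 2019-11-03 01:30 happened twice in America/Los_Angeles:
        // once at UTC-7 (PDT) and again at UTC-8 (PST) after clocks fell back.
        LocalDateTime ambiguous = LocalDateTime.of(2019, 11, 3, 1, 30);
        ZoneId la = ZoneId.of("America/Los_Angeles");

        // atZone() alone silently picks one reading; the *AtOverlap methods
        // make the choice explicit, which is what a rebasing routine needs.
        ZonedDateTime earlier = ambiguous.atZone(la).withEarlierOffsetAtOverlap();
        ZonedDateTime later = ambiguous.atZone(la).withLaterOffsetAtOverlap();

        System.out.println(earlier.getOffset()); // -07:00
        System.out.println(later.getOffset());   // -08:00

        // The two readings of the same wall-clock time are one hour apart.
        System.out.println(
            Duration.between(earlier.toInstant(), later.toInstant()).getSeconds());
    }
}
```

Without an explicit choice, converting such a local timestamp to an instant can land an hour off, which is exactly the class of bug the PR's `Calendar.DST_OFFSET` check guards against.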
[GitHub] [spark] gaborgsomogyi commented on a change in pull request #27664: [SPARK-30915][SS] FileStreamSink: Avoid reading the metadata log file when finding the latest batch ID
gaborgsomogyi commented on a change in pull request #27664: [SPARK-30915][SS] FileStreamSink: Avoid reading the metadata log file when finding the latest batch ID URL: https://github.com/apache/spark/pull/27664#discussion_r402369581 ## File path: sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/FileStreamSinkLogSuite.scala ## @@ -240,6 +247,40 @@ class FileStreamSinkLogSuite extends SparkFunSuite with SharedSparkSession { )) } + test("getLatestBatchId") { +withCountOpenLocalFileSystemAsLocalFileSystem { + val scheme = CountOpenLocalFileSystem.scheme + withSQLConf(SQLConf.FILE_SINK_LOG_COMPACT_INTERVAL.key -> "3") { +withTempDir { file => + val sinkLog = new FileStreamSinkLog(FileStreamSinkLog.VERSION, spark, +s"$scheme:///${file.getCanonicalPath}") + for (batchId <- 0 to 2) { +sinkLog.add( + batchId, + Array(newFakeSinkFileStatus("/a/b/" + batchId, FileStreamSinkLog.ADD_ACTION))) + } + + def getCountForOpenOnMetadataFile(batchId: Long): Long = { +val path = sinkLog.batchIdToPath(batchId).toUri.getPath +CountOpenLocalFileSystem.pathToNumOpenCalled + .get(path).map(_.get()).getOrElse(0) + } + + val curCount = getCountForOpenOnMetadataFile(2) + + assert(sinkLog.getLatestBatchId() === Some(2)) Review comment: Nit: s/2/2L/
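The PR under review avoids opening the newest metadata log file just to learn its batch ID, since the ID is already encoded in the file name. A hedged sketch of that idea in plain Java — the naming convention (`<batchId>` with an optional `.compact` suffix) and helper names are assumptions for illustration, not Spark's actual implementation:

```java
import java.util.Arrays;
import java.util.Optional;

public class LatestBatchId {
    // Given metadata log file names like "0", "1", "2.compact", return the
    // highest batch ID by parsing the names only -- no file contents are read.
    static Optional<Long> getLatestBatchId(String[] fileNames) {
        return Arrays.stream(fileNames)
                .map(name -> name.endsWith(".compact")
                        ? name.substring(0, name.length() - ".compact".length())
                        : name)
                .filter(s -> !s.isEmpty() && s.chars().allMatch(Character::isDigit))
                .map(Long::parseLong)
                .max(Long::compare);
    }

    public static void main(String[] args) {
        // Hidden or non-numeric entries are skipped; the max numeric name wins.
        System.out.println(getLatestBatchId(
                new String[]{"0", "1", "2.compact", "3", ".hidden"}));
    }
}
```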
[GitHub] [spark] gaborgsomogyi commented on a change in pull request #27664: [SPARK-30915][SS] FileStreamSink: Avoid reading the metadata log file when finding the latest batch ID
gaborgsomogyi commented on a change in pull request #27664: [SPARK-30915][SS] FileStreamSink: Avoid reading the metadata log file when finding the latest batch ID URL: https://github.com/apache/spark/pull/27664#discussion_r402362153 ## File path: sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/FileStreamSinkLogSuite.scala ## @@ -240,6 +247,40 @@ class FileStreamSinkLogSuite extends SparkFunSuite with SharedSparkSession { )) } + test("getLatestBatchId") { +withCountOpenLocalFileSystemAsLocalFileSystem { + val scheme = CountOpenLocalFileSystem.scheme + withSQLConf(SQLConf.FILE_SINK_LOG_COMPACT_INTERVAL.key -> "3") { +withTempDir { file => + val sinkLog = new FileStreamSinkLog(FileStreamSinkLog.VERSION, spark, +s"$scheme:///${file.getCanonicalPath}") + for (batchId <- 0 to 2) { +sinkLog.add( + batchId, + Array(newFakeSinkFileStatus("/a/b/" + batchId, FileStreamSinkLog.ADD_ACTION))) + } + + def getCountForOpenOnMetadataFile(batchId: Long): Long = { +val path = sinkLog.batchIdToPath(batchId).toUri.getPath +CountOpenLocalFileSystem.pathToNumOpenCalled + .get(path).map(_.get()).getOrElse(0) Review comment: Nit: no linebreak needed.
[GitHub] [spark] gaborgsomogyi commented on a change in pull request #27664: [SPARK-30915][SS] FileStreamSink: Avoid reading the metadata log file when finding the latest batch ID
gaborgsomogyi commented on a change in pull request #27664: [SPARK-30915][SS] FileStreamSink: Avoid reading the metadata log file when finding the latest batch ID URL: https://github.com/apache/spark/pull/27664#discussion_r402388241 ## File path: sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/FileStreamSinkLogSuite.scala ## @@ -240,6 +247,40 @@ class FileStreamSinkLogSuite extends SparkFunSuite with SharedSparkSession { )) } + test("getLatestBatchId") { +withCountOpenLocalFileSystemAsLocalFileSystem { + val scheme = CountOpenLocalFileSystem.scheme + withSQLConf(SQLConf.FILE_SINK_LOG_COMPACT_INTERVAL.key -> "3") { +withTempDir { file => + val sinkLog = new FileStreamSinkLog(FileStreamSinkLog.VERSION, spark, +s"$scheme:///${file.getCanonicalPath}") + for (batchId <- 0 to 2) { +sinkLog.add( + batchId, + Array(newFakeSinkFileStatus("/a/b/" + batchId, FileStreamSinkLog.ADD_ACTION))) + } + + def getCountForOpenOnMetadataFile(batchId: Long): Long = { +val path = sinkLog.batchIdToPath(batchId).toUri.getPath +CountOpenLocalFileSystem.pathToNumOpenCalled + .get(path).map(_.get()).getOrElse(0) + } + + val curCount = getCountForOpenOnMetadataFile(2) + + assert(sinkLog.getLatestBatchId() === Some(2)) + // getLatestBatchId doesn't open the latest metadata log file + assert(getCountForOpenOnMetadataFile(2L) === curCount) Review comment: Maybe worth checking the other batches as well.
[GitHub] [spark] gaborgsomogyi commented on a change in pull request #27664: [SPARK-30915][SS] FileStreamSink: Avoid reading the metadata log file when finding the latest batch ID
gaborgsomogyi commented on a change in pull request #27664: [SPARK-30915][SS] FileStreamSink: Avoid reading the metadata log file when finding the latest batch ID URL: https://github.com/apache/spark/pull/27664#discussion_r402386522 ## File path: sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/FileStreamSinkLogSuite.scala ## @@ -267,4 +308,38 @@ class FileStreamSinkLogSuite extends SparkFunSuite with SharedSparkSession { val log = new FileStreamSinkLog(FileStreamSinkLog.VERSION, spark, input.toString) log.allFiles() } + + private def withCountOpenLocalFileSystemAsLocalFileSystem(body: => Unit): Unit = { +val optionKey = s"fs.${CountOpenLocalFileSystem.scheme}.impl" +val originClassForLocalFileSystem = spark.conf.getOption(optionKey) +try { + spark.conf.set(optionKey, classOf[CountOpenLocalFileSystem].getName) + body +} finally { + originClassForLocalFileSystem match { +case Some(fsClazz) => spark.conf.set(optionKey, fsClazz) +case _ => spark.conf.unset(optionKey) + } +} + } +} + +class CountOpenLocalFileSystem extends RawLocalFileSystem { + import CountOpenLocalFileSystem._ + + override def getUri: URI = { +URI.create(s"$scheme:///") + } + + override def open(f: Path, bufferSize: Int): FSDataInputStream = { +val path = f.toUri.getPath +val curVal = pathToNumOpenCalled.getOrElseUpdate(path, new AtomicLong(0)) +curVal.incrementAndGet() +super.open(f, bufferSize) + } +} + +object CountOpenLocalFileSystem { + val scheme = s"FileStreamSinkLogSuite${math.abs(Random.nextInt)}fs" + val pathToNumOpenCalled = new mutable.HashMap[String, AtomicLong] Review comment: Some reset functionality would be good to make it re-usable. This would also make `curCount` disappear.
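The reset suggested in the review above could look something like the following, sketched here in plain Java against a bare counter map rather than the suite's `CountOpenLocalFileSystem` (the class and method names are hypothetical):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// A minimal per-path open-call counter with the reset hook suggested in the
// review: clearing the map between tests removes the need to snapshot a
// baseline count (the `curCount` variable in the diff above).
public class OpenCallCounter {
    private final ConcurrentHashMap<String, AtomicLong> pathToNumOpenCalled =
            new ConcurrentHashMap<>();

    // Called from the filesystem's open() override.
    public void recordOpen(String path) {
        pathToNumOpenCalled.computeIfAbsent(path, p -> new AtomicLong()).incrementAndGet();
    }

    // 0 for paths that were never opened, so tests need no Option handling.
    public long countFor(String path) {
        AtomicLong c = pathToNumOpenCalled.get(path);
        return c == null ? 0L : c.get();
    }

    // Reset between tests so each test starts from a clean slate.
    public void reset() {
        pathToNumOpenCalled.clear();
    }
}
```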
[GitHub] [spark] gaborgsomogyi commented on a change in pull request #27664: [SPARK-30915][SS] FileStreamSink: Avoid reading the metadata log file when finding the latest batch ID
gaborgsomogyi commented on a change in pull request #27664: [SPARK-30915][SS] FileStreamSink: Avoid reading the metadata log file when finding the latest batch ID URL: https://github.com/apache/spark/pull/27664#discussion_r402362508 ## File path: sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/FileStreamSinkLogSuite.scala ## @@ -240,6 +247,40 @@ class FileStreamSinkLogSuite extends SparkFunSuite with SharedSparkSession { )) } + test("getLatestBatchId") { +withCountOpenLocalFileSystemAsLocalFileSystem { + val scheme = CountOpenLocalFileSystem.scheme + withSQLConf(SQLConf.FILE_SINK_LOG_COMPACT_INTERVAL.key -> "3") { +withTempDir { file => Review comment: If it's a dir maybe we can call it `dir` or `path`.
[GitHub] [spark] gaborgsomogyi commented on a change in pull request #27664: [SPARK-30915][SS] FileStreamSink: Avoid reading the metadata log file when finding the latest batch ID
gaborgsomogyi commented on a change in pull request #27664: [SPARK-30915][SS] FileStreamSink: Avoid reading the metadata log file when finding the latest batch ID URL: https://github.com/apache/spark/pull/27664#discussion_r402369944 ## File path: sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/FileStreamSinkLogSuite.scala ## @@ -240,6 +247,40 @@ class FileStreamSinkLogSuite extends SparkFunSuite with SharedSparkSession { )) } + test("getLatestBatchId") { +withCountOpenLocalFileSystemAsLocalFileSystem { + val scheme = CountOpenLocalFileSystem.scheme + withSQLConf(SQLConf.FILE_SINK_LOG_COMPACT_INTERVAL.key -> "3") { +withTempDir { file => + val sinkLog = new FileStreamSinkLog(FileStreamSinkLog.VERSION, spark, +s"$scheme:///${file.getCanonicalPath}") + for (batchId <- 0 to 2) { +sinkLog.add( + batchId, + Array(newFakeSinkFileStatus("/a/b/" + batchId, FileStreamSinkLog.ADD_ACTION))) + } + + def getCountForOpenOnMetadataFile(batchId: Long): Long = { +val path = sinkLog.batchIdToPath(batchId).toUri.getPath +CountOpenLocalFileSystem.pathToNumOpenCalled + .get(path).map(_.get()).getOrElse(0) + } + + val curCount = getCountForOpenOnMetadataFile(2) Review comment: Nit: s/2/2L/
[GitHub] [spark] tgravescs commented on issue #27207: [SPARK-18886][CORE] Make Locality wait time measure resource under utilization due to delay scheduling.
tgravescs commented on issue #27207: [SPARK-18886][CORE] Make Locality wait time measure resource under utilization due to delay scheduling. URL: https://github.com/apache/spark/pull/27207#issuecomment-607902037 @bmarcott sorry for my delay on this. If you are able to update this, I think it's really close and it would be nice to get it into 3.0.
[GitHub] [spark] tgravescs commented on a change in pull request #27207: [SPARK-18886][CORE] Make Locality wait time measure resource under utilization due to delay scheduling.
tgravescs commented on a change in pull request #27207: [SPARK-18886][CORE] Make Locality wait time measure resource under utilization due to delay scheduling. URL: https://github.com/apache/spark/pull/27207#discussion_r402383913 ## File path: core/src/test/scala/org/apache/spark/scheduler/TaskSchedulerImplSuite.scala ## @@ -196,6 +196,241 @@ class TaskSchedulerImplSuite extends SparkFunSuite with LocalSparkContext with B assert(!failedTaskSet) } + def setupTaskScheduler(clock: ManualClock): TaskSchedulerImpl = { +val conf = new SparkConf() +sc = new SparkContext("local", "TaskSchedulerImplSuite", conf) +val taskScheduler = new TaskSchedulerImpl(sc, + sc.conf.get(config.TASK_MAX_FAILURES), + clock = clock) { + override def createTaskSetManager(taskSet: TaskSet, maxTaskFailures: Int): TaskSetManager = { +new TaskSetManager(this, taskSet, maxTaskFailures, blacklistTrackerOpt, clock) + } + override def shuffleOffers(offers: IndexedSeq[WorkerOffer]): IndexedSeq[WorkerOffer] = { +// Don't shuffle the offers around for this test. Instead, we'll just pass in all +// the permutations we care about directly. +offers + } +} +// Need to initialize a DAGScheduler for the taskScheduler to use for callbacks. +new DAGScheduler(sc, taskScheduler) { + override def taskStarted(task: Task[_], taskInfo: TaskInfo): Unit = {} + + override def executorAdded(execId: String, host: String): Unit = {} +} +taskScheduler.initialize(new FakeSchedulerBackend) +val taskSet = FakeTask.createTaskSet(8, 1, 1, + Seq(TaskLocation("host1", "exec1")), + Seq(TaskLocation("host1", "exec1")), + Seq(TaskLocation("host1", "exec1")), + Seq(TaskLocation("host1", "exec1")), + Seq(TaskLocation("host1", "exec1")), + Seq(TaskLocation("host1", "exec1")), + Seq(TaskLocation("host1", "exec1")), + Seq(TaskLocation("host1", "exec1")) +) + +// Offer resources first so that when the taskset is submitted it can initialize +// with proper locality level. Otherwise, ANY would be the only locality level. 
+// See TaskSetManager.computeValidLocalityLevels() +// This begins the task set as PROCESS_LOCAL locality level +taskScheduler.resourceOffers(IndexedSeq(WorkerOffer("exec1", "host1", 1))) +taskScheduler.submitTasks(taskSet) +taskScheduler + } + + test("SPARK-18886 - partial offers (isAllFreeResources = false) reset timer before " + +"any resources have been rejected") { +val clock = new ManualClock() +// All tasks created here are local to exec1, host1. +// Locality level starts at PROCESS_LOCAL. +val taskScheduler = setupTaskScheduler(clock) +// Locality levels increase at 3000 ms. +val advanceAmount = 3000 + +// Advancing clock increases locality level to NODE_LOCAL. +clock.advance(advanceAmount) + +// If there hasn't yet been any full resource offers, +// partial resource (isAllFreeResources = false) offers reset delay scheduling +// if this and previous offers were accepted. +// This line resets the timer and locality level is reset to PROCESS_LOCAL. +assert(taskScheduler + .resourceOffers( +IndexedSeq(WorkerOffer("exec1", "host1", 1)), +isAllFreeResources = false) + .flatten.length === 1) + +// This NODE_LOCAL task should not be accepted. +assert(taskScheduler + .resourceOffers( +IndexedSeq(WorkerOffer("exec2", "host1", 1)), +isAllFreeResources = false) + .flatten.isEmpty) + } + + test("SPARK-18886 - delay scheduling timer is reset when it accepts all resources offered when" + +"isAllFreeResources = true") { +val clock = new ManualClock() +// All tasks created here are local to exec1, host1. +// Locality level starts at PROCESS_LOCAL. +val taskScheduler = setupTaskScheduler(clock) +// Locality levels increase at 3000 ms. +val advanceAmount = 3000 + +// Advancing clock increases locality level to NODE_LOCAL. +clock.advance(advanceAmount) + +// If there are no rejects on an all resource offer, delay scheduling is reset. +// This line resets the timer and locality level is reset to PROCESS_LOCAL. 
+assert(taskScheduler + .resourceOffers( +IndexedSeq(WorkerOffer("exec1", "host1", 1)), +isAllFreeResources = true) + .flatten.length === 1) + +// This NODE_LOCAL task should not be accepted. +assert(taskScheduler + .resourceOffers( +IndexedSeq(WorkerOffer("exec2", "host1", 1)), +isAllFreeResources = false) + .flatten.isEmpty) + } + + test("SPARK-18886 - partial resource offers (isAllFreeResources = false) reset " + +"time if last full resource offer (isAllResources = true) was accepted as well as any " + +"following partial resource offers") { +val clock = new ManualClock() +// All tasks created
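The tests quoted above drive locality-level escalation with a `ManualClock` and a 3000 ms wait. A toy sketch of the delay-scheduling idea those tests exercise — this is not Spark's `TaskSetManager`; the class, the two-step escalation, and the thresholds are made up for illustration:

```java
// Toy model of delay scheduling: a task set prefers PROCESS_LOCAL placement,
// escalates to less-local levels as a wait timer expires, and resets the
// timer when an offer at the preferred locality is accepted -- the behaviour
// the SPARK-18886 tests above check with a ManualClock.
public class DelayScheduler {
    enum Level { PROCESS_LOCAL, NODE_LOCAL, ANY }

    private final long waitMs;   // e.g. 3000 ms, as in the tests above
    private long lastResetMs;    // manual-clock time of the last accepted offer
    private long nowMs;          // manual-clock "now"

    DelayScheduler(long waitMs) { this.waitMs = waitMs; }

    void advanceClock(long ms) { nowMs += ms; }

    // Each full wait interval without an accepted offer relaxes locality
    // by one level.
    Level currentLevel() {
        long waited = nowMs - lastResetMs;
        if (waited < waitMs) return Level.PROCESS_LOCAL;
        if (waited < 2 * waitMs) return Level.NODE_LOCAL;
        return Level.ANY;
    }

    // An accepted offer at the preferred locality resets the wait timer,
    // dropping the level back to PROCESS_LOCAL.
    void offerAccepted() { lastResetMs = nowMs; }
}
```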
[GitHub] [spark] tgravescs commented on a change in pull request #27207: [SPARK-18886][CORE] Make Locality wait time measure resource under utilization due to delay scheduling.
tgravescs commented on a change in pull request #27207: [SPARK-18886][CORE] Make Locality wait time measure resource under utilization due to delay scheduling. URL: https://github.com/apache/spark/pull/27207#discussion_r402380407 ## File path: core/src/test/scala/org/apache/spark/scheduler/TaskSchedulerImplSuite.scala ## @@ -196,6 +196,241 @@ class TaskSchedulerImplSuite extends SparkFunSuite with LocalSparkContext with B assert(!failedTaskSet) } + def setupTaskScheduler(clock: ManualClock): TaskSchedulerImpl = { +val conf = new SparkConf() +sc = new SparkContext("local", "TaskSchedulerImplSuite", conf) +val taskScheduler = new TaskSchedulerImpl(sc, + sc.conf.get(config.TASK_MAX_FAILURES), + clock = clock) { + override def createTaskSetManager(taskSet: TaskSet, maxTaskFailures: Int): TaskSetManager = { +new TaskSetManager(this, taskSet, maxTaskFailures, blacklistTrackerOpt, clock) + } + override def shuffleOffers(offers: IndexedSeq[WorkerOffer]): IndexedSeq[WorkerOffer] = { +// Don't shuffle the offers around for this test. Instead, we'll just pass in all +// the permutations we care about directly. +offers + } +} +// Need to initialize a DAGScheduler for the taskScheduler to use for callbacks. +new DAGScheduler(sc, taskScheduler) { + override def taskStarted(task: Task[_], taskInfo: TaskInfo): Unit = {} + + override def executorAdded(execId: String, host: String): Unit = {} +} +taskScheduler.initialize(new FakeSchedulerBackend) +val taskSet = FakeTask.createTaskSet(8, 1, 1, + Seq(TaskLocation("host1", "exec1")), + Seq(TaskLocation("host1", "exec1")), + Seq(TaskLocation("host1", "exec1")), + Seq(TaskLocation("host1", "exec1")), + Seq(TaskLocation("host1", "exec1")), + Seq(TaskLocation("host1", "exec1")), + Seq(TaskLocation("host1", "exec1")), + Seq(TaskLocation("host1", "exec1")) +) + +// Offer resources first so that when the taskset is submitted it can initialize +// with proper locality level. Otherwise, ANY would be the only locality level. 
+// See TaskSetManager.computeValidLocalityLevels() +// This begins the task set as PROCESS_LOCAL locality level +taskScheduler.resourceOffers(IndexedSeq(WorkerOffer("exec1", "host1", 1))) +taskScheduler.submitTasks(taskSet) +taskScheduler + } + + test("SPARK-18886 - partial offers (isAllFreeResources = false) reset timer before " + +"any resources have been rejected") { +val clock = new ManualClock() +// All tasks created here are local to exec1, host1. +// Locality level starts at PROCESS_LOCAL. +val taskScheduler = setupTaskScheduler(clock) +// Locality levels increase at 3000 ms. +val advanceAmount = 3000 + +// Advancing clock increases locality level to NODE_LOCAL. +clock.advance(advanceAmount) + +// If there hasn't yet been any full resource offers, +// partial resource (isAllFreeResources = false) offers reset delay scheduling +// if this and previous offers were accepted. +// This line resets the timer and locality level is reset to PROCESS_LOCAL. +assert(taskScheduler + .resourceOffers( +IndexedSeq(WorkerOffer("exec1", "host1", 1)), +isAllFreeResources = false) + .flatten.length === 1) + +// This NODE_LOCAL task should not be accepted. +assert(taskScheduler + .resourceOffers( +IndexedSeq(WorkerOffer("exec2", "host1", 1)), +isAllFreeResources = false) + .flatten.isEmpty) + } + + test("SPARK-18886 - delay scheduling timer is reset when it accepts all resources offered when" + +"isAllFreeResources = true") { Review comment: Nit: need a space before `isAllFreeResources`.
[GitHub] [spark] tgravescs commented on a change in pull request #27207: [SPARK-18886][CORE] Make Locality wait time measure resource under utilization due to delay scheduling.
tgravescs commented on a change in pull request #27207: [SPARK-18886][CORE] Make Locality wait time measure resource under utilization due to delay scheduling. URL: https://github.com/apache/spark/pull/27207#discussion_r402373537 ## File path: core/src/test/scala/org/apache/spark/scheduler/TaskSchedulerImplSuite.scala ## @@ -196,6 +196,241 @@ class TaskSchedulerImplSuite extends SparkFunSuite with LocalSparkContext with B assert(!failedTaskSet) } + def setupTaskScheduler(clock: ManualClock): TaskSchedulerImpl = { Review comment: there is already a `setupScheduler` function; perhaps make this name more specific, e.g. `setupTaskSchedulerForLocalityTests`.
[GitHub] [spark] AmplabJenkins commented on issue #27897: [SPARK-31113][SQL] Add SHOW VIEWS command
AmplabJenkins commented on issue #27897: [SPARK-31113][SQL] Add SHOW VIEWS command URL: https://github.com/apache/spark/pull/27897#issuecomment-607890248 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #27897: [SPARK-31113][SQL] Add SHOW VIEWS command
AmplabJenkins commented on issue #27897: [SPARK-31113][SQL] Add SHOW VIEWS command URL: https://github.com/apache/spark/pull/27897#issuecomment-607890263 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/25427/ Test PASSed.
[GitHub] [spark] Eric5553 commented on a change in pull request #27897: [SPARK-31113][SQL] Add SHOW VIEWS command
Eric5553 commented on a change in pull request #27897: [SPARK-31113][SQL] Add SHOW VIEWS command URL: https://github.com/apache/spark/pull/27897#discussion_r402369650 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/views.scala ## @@ -295,6 +295,41 @@ case class AlterViewAsCommand( } } +/** + * A command for users to get views in the given database. + * If a databaseName is not given, the current database will be used. + * The syntax of using this command in SQL is: + * {{{ + * SHOW VIEWS [(IN|FROM) database_name] [[LIKE] 'identifier_with_wildcards']; + * }}} + */ +case class ShowViewsCommand( +databaseName: Option[String], +tableIdentifierPattern: Option[String]) extends RunnableCommand { + + // The result of SHOW VIEWS has three basic columns: namespace, viewName and isTemporary. + override val output: Seq[Attribute] = Seq( +AttributeReference("namespace", StringType, nullable = false)(), +AttributeReference("viewName", StringType, nullable = false)(), +AttributeReference("isTemporary", BooleanType, nullable = false)()) + + override def run(sparkSession: SparkSession): Seq[Row] = { +val catalog = sparkSession.sessionState.catalog +val db = databaseName.getOrElse(catalog.getCurrentDatabase) + +// Show the information of views. +val views = tableIdentifierPattern.map(catalog.listViews(db, _)) + .getOrElse(catalog.listViews(db, "*")) +views.map { tableIdent => + val namespace = tableIdent.database.getOrElse("") Review comment: Yea, I see. updated in 6a4348ed9e289c6ca3cc557f9a575e1829f53fc0. Thanks
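The `SHOW VIEWS [(IN|FROM) database_name] [[LIKE] 'identifier_with_wildcards']` command discussed above relies on wildcard filtering of catalog names. A rough sketch of that kind of matching in plain Java, assuming Spark-style patterns where `*` matches any characters and `|` separates alternatives (a simplification for illustration, not the exact catalog implementation):

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class ViewPatternFilter {
    // Keep the names matching a Spark-style wildcard pattern: '*' matches any
    // characters and '|' separates alternative patterns, case-insensitively.
    static List<String> filterPattern(List<String> names, String pattern) {
        List<Pattern> regexes = Arrays.stream(pattern.split("\\|"))
                .map(p -> Pattern.compile("(?i)" + p.trim().replace("*", ".*")))
                .collect(Collectors.toList());
        return names.stream()
                .filter(n -> regexes.stream().anyMatch(r -> r.matcher(n).matches()))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> views = Arrays.asList("sales_view", "temp_view1", "audit");
        // Matches anything containing "view", plus the literal name "audit".
        System.out.println(filterPattern(views, "*view*|audit"));
    }
}
```

With no `LIKE` clause the command falls back to the pattern `"*"`, which under this scheme matches every view in the database.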
[GitHub] [spark] SparkQA commented on issue #27897: [SPARK-31113][SQL] Add SHOW VIEWS command
SparkQA commented on issue #27897: [SPARK-31113][SQL] Add SHOW VIEWS command URL: https://github.com/apache/spark/pull/27897#issuecomment-607889396 **[Test build #120729 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120729/testReport)** for PR 27897 at commit [`6a4348e`](https://github.com/apache/spark/commit/6a4348ed9e289c6ca3cc557f9a575e1829f53fc0).
[GitHub] [spark] AmplabJenkins commented on issue #27207: [SPARK-18886][CORE] Make Locality wait time measure resource under utilization due to delay scheduling.
AmplabJenkins commented on issue #27207: [SPARK-18886][CORE] Make Locality wait time measure resource under utilization due to delay scheduling. URL: https://github.com/apache/spark/pull/27207#issuecomment-607885531 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/25426/
[GitHub] [spark] AmplabJenkins commented on issue #27207: [SPARK-18886][CORE] Make Locality wait time measure resource under utilization due to delay scheduling.
AmplabJenkins commented on issue #27207: [SPARK-18886][CORE] Make Locality wait time measure resource under utilization due to delay scheduling. URL: https://github.com/apache/spark/pull/27207#issuecomment-607885520 Merged build finished. Test PASSed.
[GitHub] [spark] SparkQA commented on issue #27207: [SPARK-18886][CORE] Make Locality wait time measure resource under utilization due to delay scheduling.
SparkQA commented on issue #27207: [SPARK-18886][CORE] Make Locality wait time measure resource under utilization due to delay scheduling. URL: https://github.com/apache/spark/pull/27207#issuecomment-607884832 **[Test build #120728 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120728/testReport)** for PR 27207 at commit [`8d35ea1`](https://github.com/apache/spark/commit/8d35ea14edfb8bd7cc276020784c13caf030532a).
[GitHub] [spark] tgravescs commented on a change in pull request #27207: [SPARK-18886][CORE] Make Locality wait time measure resource under utilization due to delay scheduling.
tgravescs commented on a change in pull request #27207: [SPARK-18886][CORE] Make Locality wait time measure resource under utilization due to delay scheduling. URL: https://github.com/apache/spark/pull/27207#discussion_r402320307 ## File path: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala ## @@ -319,20 +336,38 @@ private[spark] class TaskSchedulerImpl( taskSetsByStageIdAndAttempt -= manager.taskSet.stageId } } +resetOnPreviousOffer -= manager.taskSet manager.parent.removeSchedulable(manager) logInfo(s"Removed TaskSet ${manager.taskSet.id}, whose tasks have all completed, from pool" + s" ${manager.parent.name}") } + /** + * Offers resources to a single [[TaskSetManager]] at a given max allowed [[TaskLocality]]. + * + * @param taskSet task set to offer resources to Review comment: this is actually task set manager
[GitHub] [spark] tgravescs commented on issue #27207: [SPARK-18886][CORE] Make Locality wait time measure resource under utilization due to delay scheduling.
tgravescs commented on issue #27207: [SPARK-18886][CORE] Make Locality wait time measure resource under utilization due to delay scheduling. URL: https://github.com/apache/spark/pull/27207#issuecomment-607884003 test this please
[GitHub] [spark] tgravescs commented on a change in pull request #27207: [SPARK-18886][CORE] Make Locality wait time measure resource under utilization due to delay scheduling.
tgravescs commented on a change in pull request #27207: [SPARK-18886][CORE] Make Locality wait time measure resource under utilization due to delay scheduling. URL: https://github.com/apache/spark/pull/27207#discussion_r402322826 ## File path: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala ## @@ -466,12 +503,28 @@ private[spark] class TaskSchedulerImpl( }.sum } + def minTaskLocality( Review comment: make private
[GitHub] [spark] tgravescs commented on a change in pull request #27207: [SPARK-18886][CORE] Make Locality wait time measure resource under utilization due to delay scheduling.
tgravescs commented on a change in pull request #27207: [SPARK-18886][CORE] Make Locality wait time measure resource under utilization due to delay scheduling. URL: https://github.com/apache/spark/pull/27207#discussion_r402360614 ## File path: core/src/test/scala/org/apache/spark/scheduler/TaskSchedulerImplSuite.scala ## @@ -898,18 +1083,17 @@ class TaskSchedulerImplSuite extends SparkFunSuite with LocalSparkContext with B } // Here is the main check of this test -- we have the same offers again, and we schedule it -// successfully. Because the scheduler first tries to schedule with locality in mind, at first -// it won't schedule anything on executor1. But despite that, we don't abort the job. Then the -// scheduler tries for ANY locality, and successfully schedules tasks on executor1. +// successfully. Because the scheduler tries to schedule with locality in mind, at first +// it won't schedule anything on executor1. But despite that, we don't abort the job. val secondTaskAttempts = taskScheduler.resourceOffers(offers).flatten -assert(secondTaskAttempts.size == 2) -secondTaskAttempts.foreach { taskAttempt => assert("executor1" === taskAttempt.executorId) } Review comment: I see we are still on node local locality since we didn't reject any so we have to wait the timeout here so these are rejected at this point.
[GitHub] [spark] AmplabJenkins commented on issue #28101: [SPARK-31328][SQL] Fix rebasing of overlapped local timestamps during daylight saving time
AmplabJenkins commented on issue #28101: [SPARK-31328][SQL] Fix rebasing of overlapped local timestamps during daylight saving time URL: https://github.com/apache/spark/pull/28101#issuecomment-607878852 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/120720/
[GitHub] [spark] AmplabJenkins commented on issue #28101: [SPARK-31328][SQL] Fix rebasing of overlapped local timestamps during daylight saving time
AmplabJenkins commented on issue #28101: [SPARK-31328][SQL] Fix rebasing of overlapped local timestamps during daylight saving time URL: https://github.com/apache/spark/pull/28101#issuecomment-607878845 Merged build finished. Test PASSed.
[GitHub] [spark] SparkQA removed a comment on issue #28101: [SPARK-31328][SQL] Fix rebasing of overlapped local timestamps during daylight saving time
SparkQA removed a comment on issue #28101: [SPARK-31328][SQL] Fix rebasing of overlapped local timestamps during daylight saving time URL: https://github.com/apache/spark/pull/28101#issuecomment-607746808 **[Test build #120720 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120720/testReport)** for PR 28101 at commit [`b761ac1`](https://github.com/apache/spark/commit/b761ac141b3d92f92785a2b7601760a9cec59ed7).
[GitHub] [spark] SparkQA commented on issue #28101: [SPARK-31328][SQL] Fix rebasing of overlapped local timestamps during daylight saving time
SparkQA commented on issue #28101: [SPARK-31328][SQL] Fix rebasing of overlapped local timestamps during daylight saving time URL: https://github.com/apache/spark/pull/28101#issuecomment-607877641 **[Test build #120720 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120720/testReport)** for PR 28101 at commit [`b761ac1`](https://github.com/apache/spark/commit/b761ac141b3d92f92785a2b7601760a9cec59ed7). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] [spark] AmplabJenkins commented on issue #27864: [SPARK-20732][CORE] Decommission cache blocks to other executors when an executor is decommissioned
AmplabJenkins commented on issue #27864: [SPARK-20732][CORE] Decommission cache blocks to other executors when an executor is decommissioned URL: https://github.com/apache/spark/pull/27864#issuecomment-607872448 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/25425/
[GitHub] [spark] AmplabJenkins commented on issue #27864: [SPARK-20732][CORE] Decommission cache blocks to other executors when an executor is decommissioned
AmplabJenkins commented on issue #27864: [SPARK-20732][CORE] Decommission cache blocks to other executors when an executor is decommissioned URL: https://github.com/apache/spark/pull/27864#issuecomment-607872436 Merged build finished. Test PASSed.
[GitHub] [spark] SparkQA commented on issue #27864: [SPARK-20732][CORE] Decommission cache blocks to other executors when an executor is decommissioned
SparkQA commented on issue #27864: [SPARK-20732][CORE] Decommission cache blocks to other executors when an executor is decommissioned URL: https://github.com/apache/spark/pull/27864#issuecomment-607871644 **[Test build #120727 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120727/testReport)** for PR 27864 at commit [`076dd67`](https://github.com/apache/spark/commit/076dd671d9a9ac3c8d5926279f54517b8b39426a).
[GitHub] [spark] prakharjain09 commented on a change in pull request #27864: [SPARK-20732][CORE] Decommission cache blocks to other executors when an executor is decommissioned
prakharjain09 commented on a change in pull request #27864: [SPARK-20732][CORE] Decommission cache blocks to other executors when an executor is decommissioned URL: https://github.com/apache/spark/pull/27864#discussion_r402343206 ## File path: core/src/main/scala/org/apache/spark/storage/BlockManager.scala ## @@ -1761,6 +1776,57 @@ private[spark] class BlockManager( blocksToRemove.size } + def decommissionBlockManager(): Unit = { +if (!blockManagerDecommissioning) { + logInfo("Starting block manager decommissioning process") + blockManagerDecommissioning = true + decommissionManager = Some(new BlockManagerDecommissionManager) + decommissionManager.foreach(_.start()) +} + } + + /** + * Tries to offload all cached RDD blocks from this BlockManager to peer BlockManagers + * Visible for testing + */ + def offloadRddCacheBlocks(): Unit = { +val replicateBlocksInfo = master.getReplicateInfoForRDDBlocks(blockManagerId) + +if (replicateBlocksInfo.nonEmpty) { + logInfo(s"Need to replicate ${replicateBlocksInfo.size} blocks " + +s"for block manager decommissioning") + // Refresh peer list once before starting replication + getPeers(true) +} + +// Maximum number of storage replication failure which replicateBlock can handle +// before giving up for one block Review comment: @dongjoon-hyun Renamed to "spark.storage.decommission.maxReplicationFailuresPerBlock". Any suggestions on 2) ?
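The config discussed in the review, a per-block cap on replication failures, bounds how long decommissioning can stall on a single unreachable block. A minimal sketch of that retry policy, with `BlockOffloader` and its shape entirely hypothetical (the real logic lives in `BlockManager.offloadRddCacheBlocks`):

```scala
// Hypothetical sketch: offload each block, retrying a block until it either
// replicates successfully or exhausts its per-block failure budget, then
// report which blocks made it and which were given up on.
object BlockOffloader {
  def offload(blocks: Seq[String],
              replicate: String => Boolean,
              maxFailuresPerBlock: Int): (Seq[String], Seq[String]) = {
    blocks.partition { block =>
      // At most maxFailuresPerBlock failures are tolerated, i.e.
      // maxFailuresPerBlock + 1 attempts; stop early on the first success.
      Iterator.continually(replicate(block))
        .take(maxFailuresPerBlock + 1)
        .contains(true)
    }
  }
}
```

The cap trades completeness for progress: a block whose peers keep failing is abandoned after the budget is spent rather than blocking the rest of the decommission.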
[GitHub] [spark] SparkQA commented on issue #27724: [SPARK-30973][SQL] ScriptTransformationExec should wait for the termination …
SparkQA commented on issue #27724: [SPARK-30973][SQL] ScriptTransformationExec should wait for the termination … URL: https://github.com/apache/spark/pull/27724#issuecomment-607867530 **[Test build #120726 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120726/testReport)** for PR 27724 at commit [`99010e0`](https://github.com/apache/spark/commit/99010e0a020c7e72cfc0ac89e50b8a5b819a65ba).
[GitHub] [spark] HyukjinKwon commented on issue #28089: [SPARK-30921][PySpark] Predicates on python udf should not be pushdown through Aggregate
HyukjinKwon commented on issue #28089: [SPARK-30921][PySpark] Predicates on python udf should not be pushdown through Aggregate URL: https://github.com/apache/spark/pull/28089#issuecomment-607866353 Will take a look tomorrow
[GitHub] [spark] AmplabJenkins removed a comment on issue #28102: [SPARK-31327][SQL] Write Spark version into Avro file metadata
AmplabJenkins removed a comment on issue #28102: [SPARK-31327][SQL] Write Spark version into Avro file metadata URL: https://github.com/apache/spark/pull/28102#issuecomment-607864523 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/120725/
[GitHub] [spark] cloud-fan commented on a change in pull request #28101: [SPARK-31328][SQL] Fix rebasing of overlapped local timestamps during daylight saving time
cloud-fan commented on a change in pull request #28101: [SPARK-31328][SQL] Fix rebasing of overlapped local timestamps during daylight saving time URL: https://github.com/apache/spark/pull/28101#discussion_r402337523 ## File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/DateTimeUtilsSuite.scala ## @@ -821,4 +821,20 @@ class DateTimeUtilsSuite extends SparkFunSuite with Matchers with SQLHelper { days += 1 } } + + test("rebasing overlapped timestamps during daylight saving time") { Review comment: can we add JIRA ID?
[GitHub] [spark] cloud-fan commented on a change in pull request #28101: [SPARK-31328][SQL] Fix rebasing of overlapped local timestamps during daylight saving time
cloud-fan commented on a change in pull request #28101: [SPARK-31328][SQL] Fix rebasing of overlapped local timestamps during daylight saving time URL: https://github.com/apache/spark/pull/28101#discussion_r402336635 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala ## @@ -1030,7 +1033,13 @@ object DateTimeUtils { cal.get(Calendar.SECOND), (Math.floorMod(micros, MICROS_PER_SECOND) * NANOS_PER_MICROS).toInt) .plusDays(cal.get(Calendar.DAY_OF_MONTH) - 1) -instantToMicros(localDateTime.atZone(ZoneId.systemDefault).toInstant) +val zonedDateTime = localDateTime.atZone(ZoneId.systemDefault) +val adjustedZdt = if (cal.get(Calendar.DST_OFFSET) == 0) { Review comment: can we add some comment here? people may not understand what `cal.get(Calendar.DST_OFFSET) == 0` means
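The ambiguity behind the `cal.get(Calendar.DST_OFFSET) == 0` check can be shown with plain `java.time` (this is a standalone illustration, not the patched Spark code): during a fall-back transition one wall-clock time maps to two instants, and a zero DST offset in the `Calendar` reading corresponds to the later, post-transition instant.

```scala
import java.time._

// During the America/Los_Angeles fall-back on 2019-11-03, the local time
// 01:30 is overlapped: it occurs once at offset -07:00 (still DST) and once
// at -08:00 (standard time). java.time resolves the overlap to the earlier
// offset by default; withLaterOffsetAtOverlap picks the standard-time reading.
object DstOverlap {
  val zone: ZoneId = ZoneId.of("America/Los_Angeles")
  val ambiguous: LocalDateTime = LocalDateTime.of(2019, 11, 3, 1, 30)

  def earlier: ZonedDateTime = ambiguous.atZone(zone).withEarlierOffsetAtOverlap
  def later: ZonedDateTime = ambiguous.atZone(zone).withLaterOffsetAtOverlap
}
```

The two resolutions differ by exactly one hour of physical time, which is the discrepancy the rebasing fix has to disambiguate.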
[GitHub] [spark] SparkQA removed a comment on issue #28102: [SPARK-31327][SQL] Write Spark version into Avro file metadata
SparkQA removed a comment on issue #28102: [SPARK-31327][SQL] Write Spark version into Avro file metadata URL: https://github.com/apache/spark/pull/28102#issuecomment-607858705 **[Test build #120725 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120725/testReport)** for PR 28102 at commit [`5aaf2db`](https://github.com/apache/spark/commit/5aaf2db549e3b6d441ea01fed6bb0440f5eff02d).
[GitHub] [spark] AmplabJenkins commented on issue #28102: [SPARK-31327][SQL] Write Spark version into Avro file metadata
AmplabJenkins commented on issue #28102: [SPARK-31327][SQL] Write Spark version into Avro file metadata URL: https://github.com/apache/spark/pull/28102#issuecomment-607864513 Merged build finished. Test FAILed.