[GitHub] spark issue #19465: [SPARK-21988][SS]Implement StreamingRelation.computeStat...
Github user joseph-torres commented on the issue: https://github.com/apache/spark/pull/19465 LGTM --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19269: [SPARK-22026][SQL][WIP] data source v2 write path
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/19269#discussion_r144097061 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataSourceV2Writer.java --- @@ -0,0 +1,88 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.sources.v2.writer; + +import org.apache.spark.annotation.InterfaceStability; +import org.apache.spark.sql.Row; +import org.apache.spark.sql.SaveMode; +import org.apache.spark.sql.sources.v2.DataSourceV2Options; +import org.apache.spark.sql.sources.v2.WriteSupport; +import org.apache.spark.sql.types.StructType; + +/** + * A data source writer that is returned by + * {@link WriteSupport#createWriter(StructType, SaveMode, DataSourceV2Options)}. + * It can mix in various writing optimization interfaces to speed up the data saving. The actual + * writing logic is delegated to {@link DataWriter}. + * + * The writing procedure is: + * 1. Create a writer factory by {@link #createWriterFactory()}, serialize and send it to all the + * partitions of the input data(RDD). + * 2. For each partition, create the data writer, and write the data of the partition with this + * writer. If all the data are written successfully, call {@link DataWriter#commit()}. If + * exception happens during the writing, call {@link DataWriter#abort()}. + * 3. If all writers are successfully committed, call {@link #commit(WriterCommitMessage[])}. If + * some writers are aborted, or the job failed with an unknown reason, call + * {@link #abort(WriterCommitMessage[])}. + * + * Spark won't retry failed writing jobs, users should do it manually in their Spark applications if + * they want to retry. + * + * Please refer to the document of commit/abort methods for detailed specifications. + * + * Note that, this interface provides a protocol between Spark and data sources for transactional + * data writing, but the transaction here is Spark-level transaction, which may not be the + * underlying storage transaction. For example, Spark successfully writes data to a Cassandra data + * source, but Cassandra may need some more time to reach consistency at storage level. + */ +@InterfaceStability.Evolving +public interface DataSourceV2Writer { + + /** + * Creates a writer factory which will be serialized and sent to executors. + */ + DataWriterFactory createWriterFactory(); + + /** + * Commits this writing job with a list of commit messages. The commit messages are collected from + * successful data writers and are produced by {@link DataWriter#commit()}. If this method + * fails(throw exception), this writing job is considered to be failed, and + * {@link #abort(WriterCommitMessage[])} will be called. The written data should only be visible + * to data source readers if this method successes. + * + * Note that, one partition may have multiple committed data writers because of speculative tasks. + * Spark will pick the first successful one and get its commit message. Implementations should be + * aware of this and handle it correctly, e.g., have a mechanism to make sure only one data writer + * can commit successfully, or have a way to clean up the data of already-committed writers. + */ + void commit(WriterCommitMessage[] messages); + + /** + * Aborts this writing job because some data writers are failed to write the records and aborted, + * or the Spark job fails with some unknown reasons, or {@link #commit(WriterCommitMessage[])} + * fails. If this method fails(throw exception), the underlying data source may have garbage that + * need to be cleaned manually, but these garbage should not be visible to data source readers. + * + * Unless the abortion is triggered by the failure of commit, the given messages should have some --- End diff --
[GitHub] spark pull request #19466: [SPARK-22237] [CORE] Fix spark submit file downlo...
Github user loneknightpy commented on a diff in the pull request: https://github.com/apache/spark/pull/19466#discussion_r144100747 --- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala --- @@ -366,6 +366,16 @@ object SparkSubmit extends CommandLineUtils with Logging { localPyFiles = Option(args.pyFiles).map { downloadFileList(_, targetDir, sparkConf, hadoopConf, secMgr) }.orNull + + if (clusterManager == STANDALONE || clusterManager == LOCAL) { +// Use local files for standalone client mode. +args.primaryResource = localPrimaryResource --- End diff -- Thanks for your input. It seems the issue is caused by a conflict with our internal patch. I will close this PR for now. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19466: [SPARK-22237] [CORE] Fix spark submit file downlo...
Github user loneknightpy closed the pull request at: https://github.com/apache/spark/pull/19466 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19474 **[Test build #82639 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82639/testReport)** for PR 19474 at commit [`0667ac8`](https://github.com/apache/spark/commit/0667ac8bc893c50a37b607dc4713c24db12300e8). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `trait Command extends LeafNode ` * `trait RunnableCommand extends Command ` * `case class ExecutedCommandExec(cmd: RunnableCommand) extends LeafExecNode ` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19451: SPARK-22181 Adds ReplaceExceptWithNotFilter rule
Github user sathiyapk commented on the issue: https://github.com/apache/spark/pull/19451 @gengliangwang Ready for a next review :) > put case ... in a new line Are your sure? I thought according to the coding style, while calling on a partial function if there is only one case, we put the case on the same line as the function invocation. isn't it? PS: `nonFilterChild` is really a perfect name, thanks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18386: [SPARK-21165] [SQL] [2.2] Use executedPlan instea...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18386#discussion_r144097084 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala --- @@ -111,9 +111,18 @@ object FileFormatWriter extends Logging { job.setOutputValueClass(classOf[InternalRow]) FileOutputFormat.setOutputPath(job, new Path(outputSpec.outputPath)) -val allColumns = queryExecution.logical.output +val allColumns = queryExecution.executedPlan.output --- End diff -- Yes. We should always use `analyzed.output` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19474 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19474 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82639/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTas...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18979#discussion_r144094437 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala --- @@ -57,7 +60,14 @@ class BasicWriteTaskStatsTracker(hadoopConf: Configuration) private def getFileSize(filePath: String): Long = { val path = new Path(filePath) val fs = path.getFileSystem(hadoopConf) -fs.getFileStatus(path).getLen() +try { + fs.getFileStatus(path).getLen() +} catch { + case e: FileNotFoundException => +// may arise against eventually consistent object stores +logInfo(s"File $path is not yet visible", e) --- End diff -- Could you update the log message and indicate the size zero might be wrong? For example negative caching in S3 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should not coll...
Github user akopich commented on the issue: https://github.com/apache/spark/pull/18924 @WeichenXu123, yes sure. But can this wait until this PR is merged? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19448: [SPARK-22217] [SQL] ParquetFileFormat to support ...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/19448#discussion_r144088592 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala --- @@ -138,6 +138,10 @@ class ParquetFileFormat conf.setBoolean(ParquetOutputFormat.ENABLE_JOB_SUMMARY, false) } +require(!conf.getBoolean(ParquetOutputFormat.ENABLE_JOB_SUMMARY, false) --- End diff -- `AnalysisException`? Shouldn't this be `SparkException`? By the time this runs, Spark has already analyzed, optimized, and planned the job. Doesn't seem like failing analysis is appropriate. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #30: SPARK-1004. PySpark on YARN
Github user swaapnika-guntaka commented on the issue: https://github.com/apache/spark/pull/30 I see the Java EOF Exception when I run python packaged jar(using JDK 8) using Spark-2.2 I'm trying to run this using the below command. `time bash -x $SPARK_HOME/bin/spark-submit --driver-class-path .:: -v $PYTHONPATH/ >& run.log` ``` Recent failure: Lost task 3.3 in stage 0.0 (TID 36, 10.15.163.25, executor 0): java.io.EOFException at java.io.DataInputStream.readInt(DataInputStream.java:392) at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:166) at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:89) at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:65) at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117) at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:128) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:395) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19474 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82638/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19474 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19448: [SPARK-22217] [SQL] ParquetFileFormat to support ...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19448#discussion_r144086925 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetCommitterSuite.scala --- @@ -0,0 +1,152 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources.parquet + +import java.io.FileNotFoundException + +import org.apache.hadoop.conf.Configuration +import org.apache.hadoop.fs.{FileStatus, Path} +import org.apache.hadoop.mapreduce.{JobContext, TaskAttemptContext} +import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter +import org.apache.parquet.hadoop.{ParquetOutputCommitter, ParquetOutputFormat} + +import org.apache.spark.{LocalSparkContext, SparkFunSuite} +import org.apache.spark.sql.SparkSession +import org.apache.spark.sql.internal.SQLConf +import org.apache.spark.sql.test.SQLTestUtils + +/** + * Test logic related to choice of output commtters + */ +class ParquetCommitterSuite extends SparkFunSuite with SQLTestUtils + with LocalSparkContext { + + private val PARQUET_COMMITTER = classOf[ParquetOutputCommitter].getCanonicalName + + protected var spark: SparkSession = _ + + /** + * Create a new [[SparkSession]] running in local-cluster mode with unsafe and codegen enabled. + */ + override def beforeAll(): Unit = { +super.beforeAll() +spark = SparkSession.builder() + .master("local-cluster[2,1,1024]") + .appName("testing") + .getOrCreate() + } + + override def afterAll(): Unit = { +if (spark != null) { + spark.stop() + spark = null +} +super.afterAll() --- End diff -- ```Scala try { ... } finally { super.afterAll() } ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19474 **[Test build #82638 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82638/testReport)** for PR 19474 at commit [`3b1174f`](https://github.com/apache/spark/commit/3b1174f7e1ed9caae890936ceeb4fb54e58eadcc). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `trait Command extends LeafNode ` * `trait RunnableCommand extends Command ` * `case class ExecutedCommandExec(cmd: RunnableCommand) extends LeafExecNode ` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19468: [SPARK-18278] [Scheduler] Spark on Kubernetes - Basic Sc...
Github user markhamstra commented on the issue: https://github.com/apache/spark/pull/19468 > iron out the kinks A large chunk of the difficulty in identifying and ironing out kinks in such a project is the difficulty of writing adequate tests of the scheduler code. I'd expect test coverage to take roughly the same amount of effort as all of the rest of the scheduler plug-in effort. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19467: [SPARK-22238] Fix plan resolution bug caused by EnsureSt...
Github user brkyvz commented on the issue: https://github.com/apache/spark/pull/19467 cc @tdas --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19448: [SPARK-22217] [SQL] ParquetFileFormat to support ...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19448#discussion_r144082508 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala --- @@ -138,6 +138,10 @@ class ParquetFileFormat conf.setBoolean(ParquetOutputFormat.ENABLE_JOB_SUMMARY, false) } +require(!conf.getBoolean(ParquetOutputFormat.ENABLE_JOB_SUMMARY, false) --- End diff -- We need to issue an `AnalysisException` here. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18460: [SPARK-21247][SQL] Type comparison should respect...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/18460#discussion_r144082145 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala --- @@ -100,6 +101,17 @@ object TypeCoercion { case (_: TimestampType, _: DateType) | (_: DateType, _: TimestampType) => Some(TimestampType) +case (t1 @ StructType(fields1), t2 @ StructType(fields2)) if t1.sameType(t2) => + Some(StructType(fields1.zip(fields2).map { case (f1, f2) => +// Since `t1.sameType(t2)` is true, two StructTypes have the same DataType +// except `name` (in case of `spark.sql.caseSensitive=false`) and `nullable`. +// - Different names: use a lower case name because findTightestCommonType is commutative. +// - Different nullabilities: `nullable` is true iff one of them is nullable. +val name = if (f1.name == f2.name) f1.name else f1.name.toLowerCase(Locale.ROOT) +val dataType = findTightestCommonType(f1.dataType, f2.dataType).get +StructField(name, dataType, nullable = f1.nullable || f2.nullable) --- End diff -- Please see [TypeCoercionSuite.checkWidenType](https://github.com/apache/spark/blob/master/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercionSuite.scala#L130-L142). In order to use the first type name, we need to loosen this test helper function and to break the existing commutative assumption. I'm ok for that if you want. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19429: [SPARK-20055] [Docs] Added documentation for loading csv...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/19429 +1 for more detailed documentation (we should steer away from `inferSchema`) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18805 **[Test build #82644 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82644/testReport)** for PR 18805 at commit [`029a753`](https://github.com/apache/spark/commit/029a753ad4be6881c4e1721eecdfaad0f8b158bd). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/18805 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/18805 The [code](https://github.com/luben/zstd-jni/blob/master/src/main/java/com/github/luben/zstd/util/Native.java) overwrites the original exception message that might shed some light on what's going on... and also ignores some exceptions it shouldn't be ignoring (like errors on close, which may indicate low disk space). Anyway, let's try again to see if it's at least consistent. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18460: [SPARK-21247][SQL] Type comparison should respect...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/18460#discussion_r144078834 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala --- @@ -100,6 +101,17 @@ object TypeCoercion { case (_: TimestampType, _: DateType) | (_: DateType, _: TimestampType) => Some(TimestampType) +case (t1 @ StructType(fields1), t2 @ StructType(fields2)) if t1.sameType(t2) => + Some(StructType(fields1.zip(fields2).map { case (f1, f2) => +// Since `t1.sameType(t2)` is true, two StructTypes have the same DataType +// except `name` (in case of `spark.sql.caseSensitive=false`) and `nullable`. +// - Different names: use a lower case name because findTightestCommonType is commutative. +// - Different nullabilities: `nullable` is true iff one of them is nullable. +val name = if (f1.name == f2.name) f1.name else f1.name.toLowerCase(Locale.ROOT) +val dataType = findTightestCommonType(f1.dataType, f2.dataType).get +StructField(name, dataType, nullable = f1.nullable || f2.nullable) --- End diff -- Sure, right. It's for commutativity. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18460: [SPARK-21247][SQL] Type comparison should respect...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18460#discussion_r144075681 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala --- @@ -100,6 +101,17 @@ object TypeCoercion { case (_: TimestampType, _: DateType) | (_: DateType, _: TimestampType) => Some(TimestampType) +case (t1 @ StructType(fields1), t2 @ StructType(fields2)) if t1.sameType(t2) => + Some(StructType(fields1.zip(fields2).map { case (f1, f2) => +// Since `t1.sameType(t2)` is true, two StructTypes have the same DataType +// except `name` (in case of `spark.sql.caseSensitive=false`) and `nullable`. +// - Different names: use a lower case name because findTightestCommonType is commutative. +// - Different nullabilities: `nullable` is true iff one of them is nullable. +val name = if (f1.name == f2.name) f1.name else f1.name.toLowerCase(Locale.ROOT) +val dataType = findTightestCommonType(f1.dataType, f2.dataType).get +StructField(name, dataType, nullable = f1.nullable || f2.nullable) --- End diff -- ``` val name = if (f1.name == f2.name) f1.name else f1.name.toLowerCase(Locale.ROOT) ``` The above code changes the case, right? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19473: [SPARK-22251][SQL] Metric 'aggregate time' is incorrect ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19473 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19473: [SPARK-22251][SQL] Metric 'aggregate time' is incorrect ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19473 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82637/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19473: [SPARK-22251][SQL] Metric 'aggregate time' is incorrect ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19473 **[Test build #82637 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82637/testReport)** for PR 19473 at commit [`470b54f`](https://github.com/apache/spark/commit/470b54f41ca7da5331d8ace22911a1a722c9c51b). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18460: [SPARK-21247][SQL] Type comparison should respect case-s...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/18460 Please let me know if there is something to do more~ Thank you always, @gatorsmile . --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19471: [SPARK-22245][SQL] partitioned data set should always pu...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/19471 +1 for this change. BTW, wow, there are lots of test case failures: 81 failures. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19448: [SPARK-22217] [SQL] ParquetFileFormat to support arbitra...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19448 **[Test build #82643 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82643/testReport)** for PR 19448 at commit [`d634f9e`](https://github.com/apache/spark/commit/d634f9e3467885759fa02657466a1cdd859ab704). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19466: [SPARK-22237] [CORE] Fix spark submit file download for ...
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/19466 > here with this change all the resources should be fetched from local driver That's a good point. You should download resources just to add them to the driver's classpath, but executors can download them directly from the source. As I requested, a test case would really help in understanding what is broken here. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19472: [WIP][SPARK-22246][SQL] Use MemoryBlock in UnsafeRow, Un...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19472 **[Test build #82642 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82642/testReport)** for PR 19472 at commit [`a814eb3`](https://github.com/apache/spark/commit/a814eb3f08085b09a16f336b36fba8da24e4f34a). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19456: [SPARK] [Scheduler] Configurable default scheduling mode
Github user blyncsy-david-lewis commented on the issue: https://github.com/apache/spark/pull/19456 I have a multiuser application where I use the userId as the name of the scheduling pool so that users are balanced equally by spark and within a user's workload I can set the scheduling mode to FAIR (or whatever I want). It is unreasonable to specify xml for each user and restart spark, so this patch allows me to specify the default configuration used by undefined pools. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18664: [SPARK-21375][PYSPARK][SQL][WIP] Add Date and Timestamp ...
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/18664 Thanks @ueshin , I agree it is better to convert the timezone to Python system local first and then localize to make tz-naive in case the Python system local tz is different that `DateTimeUtils.defaultTimeZone()`. I'll apply your patch. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18664: [SPARK-21375][PYSPARK][SQL][WIP] Add Date and Timestamp ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18664 **[Test build #82641 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82641/testReport)** for PR 18664 at commit [`d1617fd`](https://github.com/apache/spark/commit/d1617fde80697bc3423d661c23997300cbb896ce). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19448: [SPARK-22217] [SQL] ParquetFileFormat to support ...
Github user steveloughran commented on a diff in the pull request: https://github.com/apache/spark/pull/19448#discussion_r144065810 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala --- @@ -138,6 +138,13 @@ class ParquetFileFormat conf.setBoolean(ParquetOutputFormat.ENABLE_JOB_SUMMARY, false) } +if (conf.getBoolean(ParquetOutputFormat.ENABLE_JOB_SUMMARY, false) + && !classOf[ParquetOutputCommitter].isAssignableFrom(committerClass)) { + // output summary is requested, but the class is not a Parquet Committer + throw new RuntimeException(s"Committer $committerClass is not a ParquetOutputCommitter" + +s" and cannot create job summaries.") --- End diff -- aah. in the move to require() everything is going back onto a single line. so now moot --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19472: [WIP][SPARK-22246][SQL] Use MemoryBlock in UnsafeRow, Un...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19472 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19448: [SPARK-22217] [SQL] ParquetFileFormat to support ...
Github user steveloughran commented on a diff in the pull request: https://github.com/apache/spark/pull/19448#discussion_r144065074 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala --- @@ -138,6 +138,13 @@ class ParquetFileFormat conf.setBoolean(ParquetOutputFormat.ENABLE_JOB_SUMMARY, false) } +if (conf.getBoolean(ParquetOutputFormat.ENABLE_JOB_SUMMARY, false) + && !classOf[ParquetOutputCommitter].isAssignableFrom(committerClass)) { + // output summary is requested, but the class is not a Parquet Committer + throw new RuntimeException(s"Committer $committerClass is not a ParquetOutputCommitter" + +s" and cannot create job summaries.") --- End diff -- aah --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19472: [WIP][SPARK-22246][SQL] Use MemoryBlock in UnsafeRow, Un...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19472 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82636/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19472: [WIP][SPARK-22246][SQL] Use MemoryBlock in UnsafeRow, Un...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19472 **[Test build #82636 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82636/testReport)** for PR 19472 at commit [`0738359`](https://github.com/apache/spark/commit/0738359193f4c3a7e7cbeb85d1fc10967f06ff4f). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19448: [SPARK-22217] [SQL] ParquetFileFormat to support ...
Github user steveloughran commented on a diff in the pull request: https://github.com/apache/spark/pull/19448#discussion_r144065041 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetCommitterSuite.scala --- @@ -0,0 +1,149 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources.parquet + +import java.io.FileNotFoundException + +import org.apache.hadoop.conf.Configuration +import org.apache.hadoop.fs.{FileStatus, Path} +import org.apache.hadoop.mapreduce.{JobContext, TaskAttemptContext} +import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter +import org.apache.parquet.hadoop.{ParquetOutputCommitter, ParquetOutputFormat} + +import org.apache.spark.{LocalSparkContext, SparkFunSuite} +import org.apache.spark.sql.SparkSession +import org.apache.spark.sql.internal.SQLConf +import org.apache.spark.sql.test.SQLTestUtils + +/** + * Test logic related to choice of output commtters + */ +class ParquetCommitterSuite extends SparkFunSuite with SQLTestUtils + with LocalSparkContext { + + private val PARQUET_COMMITTER = classOf[ParquetOutputCommitter].getCanonicalName + + protected var spark: SparkSession = _ + + /** + * Create a new [[SparkSession]] running in local-cluster mode with unsafe and codegen enabled. + */ + override def beforeAll(): Unit = { +super.beforeAll() +spark = SparkSession.builder() + .master("local-cluster[2,1,1024]") + .appName("testing") + .getOrCreate() + } + + override def afterAll(): Unit = { +spark.stop() +spark = null --- End diff -- done, + will add a check for spark==null so if a failure happens during setup, the exception doesn't get lost in teardown --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18098: [SPARK-16944][Mesos] Improve data locality when launchin...
Github user PerilousApricot commented on the issue: https://github.com/apache/spark/pull/18098 Is there any documentation for this feature? How would I expose my topology to mesos/spark? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19474 **[Test build #82640 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82640/testReport)** for PR 19474 at commit [`9d4c7a2`](https://github.com/apache/spark/commit/9d4c7a236d9d6e95f1ae355ed8cb07154df5f04e). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19337: [SPARK-22114][ML][MLLIB]add epsilon for LDA
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/19337#discussion_r144060858 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala --- @@ -224,6 +224,24 @@ private[clustering] trait LDAParams extends Params with HasFeaturesCol with HasM /** * For Online optimizer only: [[optimizer]] = "online". * + * A (positive) learning parameter that controls the convergence of variational inference. + * Smaller value will lead to more accuracy model and longer training time. --- End diff -- Thanks, I will update the doc. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19474 For a simple command `Seq(1 -> "a").toDF("i", "j").write.parquet("/tmp/qwe")`, the UI before this PR: https://user-images.githubusercontent.com/3182036/31452520-bc74bb44-aee1-11e7-8721-234925856411.png;> The UI after this PR: https://user-images.githubusercontent.com/3182036/31452534-c6ba5622-aee1-11e7-865d-b2af359d529d.png;> The scan node is no longer visible above the insert node, I'll fix this later. The writer bug is more important and we should fix it ASAP. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19474 cc @gatorsmile @viirya --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19474 **[Test build #82639 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82639/testReport)** for PR 19474 at commit [`0667ac8`](https://github.com/apache/spark/commit/0667ac8bc893c50a37b607dc4713c24db12300e8). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19468: [SPARK-18278] [Scheduler] Spark on Kubernetes - Basic Sc...
Github user foxish commented on the issue: https://github.com/apache/spark/pull/19468 @skonto, there was some discussion about this on the SPIP. We see them as separate and independent issues with the pluggable API being a long term goal. It would involve a working group of people drawn from all cluster manager integration maintainers - because the changes are extensive and require scalability/performance testing in a variety of environments - probably over 2-3 releases before we can iron out the kinks. In the short term however, adding the package here enables supporting the growing number of K8s users of Spark that currently rely on our fork, and the integration testing makes us confident about not impacting the release process or adding complexity to the maintainers workflow. The K8s community will set up all the requisite testing, and ensure health of this package. The integration testing would also carry forward to when we have pluggable APIs in the future. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...
Github user steveloughran commented on the issue: https://github.com/apache/spark/pull/18979 @viirya : the new data writer API will allow for a broader set of stats to be propagated back from workers. When you are working with the object stores, an useful stat to get back is throttle count & retry count as they can be the cause of why things are slow ... and if it is due to throttling, throwing more workers at the job will actually slow things down. They'd be the ones to look at first --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19041: [SPARK-21097][CORE] Add option to recover cached data
Github user brad-kaiser commented on the issue: https://github.com/apache/spark/pull/19041 Thanks @vanzin I will work on these comments. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19473: [SPARK-22251] Metric 'aggregate time' is incorrect when ...
Github user hvanhovell commented on the issue: https://github.com/apache/spark/pull/19473 Could you add [SQL] to the title? That makes it easier for others to scan PRs. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19473: [SPARK-22251] Metric 'aggregate time' is incorrect when ...
Github user hvanhovell commented on the issue: https://github.com/apache/spark/pull/19473 LGTM --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19474 **[Test build #82638 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82638/testReport)** for PR 19474 at commit [`3b1174f`](https://github.com/apache/spark/commit/3b1174f7e1ed9caae890936ceeb4fb54e58eadcc). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19474: [SPARK-22252][SQL] FileFormatWriter should respec...
GitHub user cloud-fan opened a pull request: https://github.com/apache/spark/pull/19474 [SPARK-22252][SQL] FileFormatWriter should respect the input query schema ## What changes were proposed in this pull request? In https://github.com/apache/spark/pull/18064, we allowed `RunnableCommand` to have children in order to fix some UI issues. Then we made `InsertIntoXXX` commands take the input `query` as a child, when we do the actual writing, we just pass the physical plan to the writer(`FileFormatWriter.write`). However this is problematic. In Spark SQL, optimizer and planner are allowed to change the schema names a little bit. e.g. `ColumnPruning` rule will remove no-op `Project`s, like `Project("A", Scan("a"))`, and thus change the output schema from "" to ``. When it comes to writing, especially for self-description data format like parquet, we may write the wrong schema to the file and cause null values at the read path. Fortunately, in https://github.com/apache/spark/pull/18450 , we decided to allow nested execution and one query can map to multiple executions in the UI. This releases the major restriction in #18604 , and now we don't have to take the input `query` as child of `InsertIntoXXX` commands. So the fix is simple, this PR partially revert #18064 and make `InsertIntoXXX` commands leaf nodes again. ## How was this patch tested? new regression test You can merge this pull request into a Git repository by running: $ git pull https://github.com/cloud-fan/spark bug Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19474.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19474 commit 3b1174f7e1ed9caae890936ceeb4fb54e58eadcc Author: Wenchen FanDate: 2017-10-11T14:30:38Z FileFormatWriter should respect the input query schema --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19077: [SPARK-21860][core]Improve memory reuse for heap ...
Github user jiangxb1987 commented on a diff in the pull request: https://github.com/apache/spark/pull/19077#discussion_r144038007 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala --- @@ -116,9 +116,10 @@ private [sql] object GenArrayData { s"final ArrayData $arrayDataName = new $genericArrayClass($arrayName);", arrayDataName) } else { + val numBytes = elementType.defaultSize * numElements val unsafeArraySizeInBytes = UnsafeArrayData.calculateHeaderPortionInBytes(numElements) + - ByteArrayMethods.roundNumberOfBytesToNearestWord(elementType.defaultSize * numElements) +ByteArrayMethods.roundNumberOfBytesToNearestWord(numBytes).toInt --- End diff -- We should really inline that. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19077: [SPARK-21860][core]Improve memory reuse for heap ...
Github user jiangxb1987 commented on a diff in the pull request: https://github.com/apache/spark/pull/19077#discussion_r144037194 --- Diff: common/unsafe/src/main/java/org/apache/spark/unsafe/memory/MemoryBlock.java --- @@ -48,6 +49,15 @@ public long size() { } /** + * Reset the size of the memory block. + */ + public void resetSize(long len) { +assert (ByteArrayMethods.roundNumberOfBytesToNearestWord(length) == --- End diff -- Also leave some message if the check fails. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19077: [SPARK-21860][core]Improve memory reuse for heap ...
Github user jiangxb1987 commented on a diff in the pull request: https://github.com/apache/spark/pull/19077#discussion_r144037771 --- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/codegen/UnsafeArrayWriter.java --- @@ -57,7 +57,7 @@ public void initialize(BufferHolder holder, int numElements, int elementSize) { // Grows the global buffer ahead for header and fixed size data. int fixedPartInBytes = - ByteArrayMethods.roundNumberOfBytesToNearestWord(elementSize * numElements); + (int)ByteArrayMethods.roundNumberOfBytesToNearestWord(elementSize * numElements); --- End diff -- nit: extra space after `(int) `, also please update the other similar changes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19077: [SPARK-21860][core]Improve memory reuse for heap ...
Github user jiangxb1987 commented on a diff in the pull request: https://github.com/apache/spark/pull/19077#discussion_r144037069 --- Diff: common/unsafe/src/main/java/org/apache/spark/unsafe/memory/MemoryBlock.java --- @@ -48,6 +49,15 @@ public long size() { } /** + * Reset the size of the memory block. + */ + public void resetSize(long len) { +assert (ByteArrayMethods.roundNumberOfBytesToNearestWord(length) == --- End diff -- We'd better use `require` instead of `assert`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18386: [SPARK-21165] [SQL] [2.2] Use executedPlan instea...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18386#discussion_r144037286 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala --- @@ -111,9 +111,18 @@ object FileFormatWriter extends Logging { job.setOutputValueClass(classOf[InternalRow]) FileOutputFormat.setOutputPath(job, new Path(outputSpec.outputPath)) -val allColumns = queryExecution.logical.output +val allColumns = queryExecution.executedPlan.output --- End diff -- This is problematic. The physical plan may have different schema from logical plan(schema name may be different), and the writer should respect the logical schema as that what users expects. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19456: [SPARK] [Scheduler] Configurable default scheduling mode
Github user jiangxb1987 commented on the issue: https://github.com/apache/spark/pull/19456 Could you elaborate on the scenario that you should need to make these settings configurable? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19464: [SPARK-22233] [core] Allow user to filter out emp...
Github user jiangxb1987 commented on a diff in the pull request: https://github.com/apache/spark/pull/19464#discussion_r144030461 --- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala --- @@ -196,7 +196,10 @@ class HadoopRDD[K, V]( // add the credentials here as this can be called before SparkContext initialized SparkHadoopUtil.get.addCredentials(jobConf) val inputFormat = getInputFormat(jobConf) -val inputSplits = inputFormat.getSplits(jobConf, minPartitions) +var inputSplits = inputFormat.getSplits(jobConf, minPartitions) --- End diff -- How about: ``` val inputSplits = if (..) { inputFormat.getSplits(jobConf, minPartitions).filter(_.getLength > 0) } else { inputFormat.getSplits(jobConf, minPartitions) } ``` We should alway try to not use `var`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19316: [SPARK-22097][CORE]Request an accurate memory aft...
Github user jiangxb1987 commented on a diff in the pull request: https://github.com/apache/spark/pull/19316#discussion_r144034339 --- Diff: core/src/main/scala/org/apache/spark/storage/memory/MemoryStore.scala --- @@ -388,7 +388,13 @@ private[spark] class MemoryStore( // perform one final call to attempt to allocate additional memory if necessary. if (keepUnrolling) { serializationStream.close() - reserveAdditionalMemoryIfNecessary() + if (bbos.size > unrollMemoryUsedByThisBlock) { +val amountToRequest = bbos.size - unrollMemoryUsedByThisBlock --- End diff -- Why not just fix that in `reserveAdditionalMemoryIfNecessary` ? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19464: [SPARK-22233] [core] Allow user to filter out emp...
Github user jiangxb1987 commented on a diff in the pull request: https://github.com/apache/spark/pull/19464#discussion_r144029852 --- Diff: docs/configuration.md --- @@ -1211,6 +1211,14 @@ Apart from these, the following properties are also available, and may be useful data may need to be rewritten to pre-existing output directories during checkpoint recovery. +spark.hadoop.filterOutEmptySplit --- End diff -- We should add the config to `internal/config`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19473: [SPARK-22251] Metric 'aggregate time' is incorrect when ...
Github user ala commented on the issue: https://github.com/apache/spark/pull/19473 @hvanhovell --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19464: [SPARK-22233] [core] Allow user to filter out emp...
Github user jiangxb1987 commented on a diff in the pull request: https://github.com/apache/spark/pull/19464#discussion_r144031728 --- Diff: core/src/test/scala/org/apache/spark/FileSuite.scala --- @@ -510,4 +510,16 @@ class FileSuite extends SparkFunSuite with LocalSparkContext { } } + test("spark.hadoop.filterOutEmptySplit") { +val sf = new SparkConf() + sf.setAppName("test").setMaster("local").set("spark.hadoop.filterOutEmptySplit", "true") +sc = new SparkContext(sf) +val emptyRDD = sc.parallelize(Array.empty[Tuple2[String, String]], 1) +emptyRDD.saveAsHadoopFile[TextOutputFormat[String, String]](tempDir.getPath + "/output") +assert(new File(tempDir.getPath + "/output/part-0").exists() === true) + +val hadoopRDD = sc.textFile(tempDir.getPath + "/output/part-0") --- End diff -- We should also add the following test cases: 1. Ensure that if no split is empty, we don't lose any splits; 2. Ensure that if part of the splits are empty, we remove the splits correctly. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19464: [SPARK-22233] [core] Allow user to filter out emp...
Github user jiangxb1987 commented on a diff in the pull request: https://github.com/apache/spark/pull/19464#discussion_r144030646 --- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala --- @@ -196,7 +196,10 @@ class HadoopRDD[K, V]( // add the credentials here as this can be called before SparkContext initialized SparkHadoopUtil.get.addCredentials(jobConf) val inputFormat = getInputFormat(jobConf) -val inputSplits = inputFormat.getSplits(jobConf, minPartitions) +var inputSplits = inputFormat.getSplits(jobConf, minPartitions) +if (sparkContext.getConf.getBoolean("spark.hadoop.filterOutEmptySplit", false)) { + inputSplits = inputSplits.filter(_.getLength>0) --- End diff -- nit: extra space around operator. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19464: [SPARK-22233] [core] Allow user to filter out emp...
Github user jiangxb1987 commented on a diff in the pull request: https://github.com/apache/spark/pull/19464#discussion_r144031167 --- Diff: core/src/test/scala/org/apache/spark/FileSuite.scala --- @@ -510,4 +510,16 @@ class FileSuite extends SparkFunSuite with LocalSparkContext { } } + test("spark.hadoop.filterOutEmptySplit") { +val sf = new SparkConf() + sf.setAppName("test").setMaster("local").set("spark.hadoop.filterOutEmptySplit", "true") +sc = new SparkContext(sf) +val emptyRDD = sc.parallelize(Array.empty[Tuple2[String, String]], 1) +emptyRDD.saveAsHadoopFile[TextOutputFormat[String, String]](tempDir.getPath + "/output") +assert(new File(tempDir.getPath + "/output/part-0").exists() === true) + +val hadoopRDD = sc.textFile(tempDir.getPath + "/output/part-0") +assert(hadoopRDD.partitions.length === 0) --- End diff -- You should recycle the resources you required in the test case. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19250: [SPARK-12297] Table timezone correction for Timestamps
Github user zivanfi commented on the issue: https://github.com/apache/spark/pull/19250 @attilajeges has just found a problem with the behavior specified in the requirements: * Partitions of a table can use different file formats. * As a result, a single table can have data files of different file formats at the same time. * Timestamps are already handled differently in these formats (this was our original problem to begin with). * As a result, **no uniform adjustment can fix timestamps for all file formats of the same table at the same time.** We can still solve the issue using a file-format-specific table property though. @rxin, I would like to ask you opinion of such a solution: * It **is** implemented in the analyzer, as you asked. * It **is** writer-agnostic, as you asked. * It **is not** file-format-agnostic, but Parquet-specific instead for the time being. Would you find such a soltion be acceptable, given that a file-format-agnostic fix seems infeasible at this point? Thanks, Zoltan --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19437: [SPARK-22131][MESOS] Mesos driver secrets
Github user susanxhuynh commented on the issue: https://github.com/apache/spark/pull/19437 @vanzin Would you mind reviewing this PR? A followup to ArtRand's secrets PR. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19473: [SPARK-22251] Metric 'aggregate time' is incorrect when ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19473 **[Test build #82637 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82637/testReport)** for PR 19473 at commit [`470b54f`](https://github.com/apache/spark/commit/470b54f41ca7da5331d8ace22911a1a722c9c51b). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19437: [SPARK-22131][MESOS] Mesos driver secrets
Github user susanxhuynh commented on the issue: https://github.com/apache/spark/pull/19437 @skonto I haven't tested with TLS; I'll work on that in the next couple of days. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19473: [SPARK-22251] Metric 'aggregate time' is incorrec...
GitHub user ala opened a pull request: https://github.com/apache/spark/pull/19473 [SPARK-22251] Metric 'aggregate time' is incorrect when codegen is off ## What changes were proposed in this pull request? Adding the code for setting 'aggregate time' metric to non-codegen path in HashAggregateExec. ## How was this patch tested? Tested manually. You can merge this pull request into a Git repository by running: $ git pull https://github.com/ala/spark fix-agg-time Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19473.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19473 commit 470b54f41ca7da5331d8ace22911a1a722c9c51b Author: Ala LuszczakDate: 2017-10-11T13:59:35Z Fix 'aggregate time' when codegen is off. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18931: [SPARK-21717][SQL] Decouple consume functions of ...
Github user a10y commented on a diff in the pull request: https://github.com/apache/spark/pull/18931#discussion_r144025706 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala --- @@ -175,6 +175,25 @@ trait CodegenSupport extends SparkPlan { } /** + * In Java, a method descriptor is valid only if it represents method parameters with a total + * length of 255 or less. `this` contributes one unit and a parameter of type long or double + * contributes two units. Besides, for nullable parameters, we also need to pass a boolean + * for the null status. + */ + private def isValidParamLength(ctx: CodegenContext): Boolean = { +var paramLength = 1 // for `this` parameter. +output.foreach { attr => --- End diff -- (nit: This could be written as a `foldLeft` and then you can eliminate the `var`) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/18979 I don't have strong opinion against this. Incorrect size is an issue but I can't think a better solution for now... --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19472: [WIP][SPARK-22246][SQL] Use MemoryBlock in UnsafeRow, Un...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19472 **[Test build #82636 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82636/testReport)** for PR 19472 at commit [`0738359`](https://github.com/apache/spark/commit/0738359193f4c3a7e7cbeb85d1fc10967f06ff4f). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should not coll...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/18924 @akopich LGTM. and do you have time to create a PR to resolve random seed not working issue mentioned by @hhbyyh ? Thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19472: [WIP][SPARK-22246][SQL] Use MemoryBlock in UnsafeRow, Un...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19472 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82634/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19472: [WIP][SPARK-22246][SQL] Use MemoryBlock in UnsafeRow, Un...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19472 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19472: [WIP][SPARK-22246][SQL] Use MemoryBlock in UnsafeRow, Un...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19472 **[Test build #82634 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82634/testReport)** for PR 19472 at commit [`e51fb6a`](https://github.com/apache/spark/commit/e51fb6ae246a660ec2b0bdd1f6a53a0987d4a063). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `public abstract class MemoryBlock implements Serializable ` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19448: [SPARK-22217] [SQL] ParquetFileFormat to support ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19448#discussion_r143996060 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala --- @@ -138,6 +138,13 @@ class ParquetFileFormat conf.setBoolean(ParquetOutputFormat.ENABLE_JOB_SUMMARY, false) } +if (conf.getBoolean(ParquetOutputFormat.ENABLE_JOB_SUMMARY, false) + && !classOf[ParquetOutputCommitter].isAssignableFrom(committerClass)) { + // output summary is requested, but the class is not a Parquet Committer + throw new RuntimeException(s"Committer $committerClass is not a ParquetOutputCommitter" + +s" and cannot create job summaries.") --- End diff -- Oh, I mean .. s in `s" .. "`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18029: [SPARK-20168] [DStream] Add changes to use kinesis fetch...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18029 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82635/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18029: [SPARK-20168] [DStream] Add changes to use kinesis fetch...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18029 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18029: [SPARK-20168] [DStream] Add changes to use kinesis fetch...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18029 **[Test build #82635 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82635/testReport)** for PR 18029 at commit [`72703a0`](https://github.com/apache/spark/commit/72703a072e34407b52d555bab98a435414d2ed25). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19448: [SPARK-22217] [SQL] ParquetFileFormat to support ...
Github user steveloughran commented on a diff in the pull request: https://github.com/apache/spark/pull/19448#discussion_r143992362 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetCommitterSuite.scala --- @@ -0,0 +1,149 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources.parquet + +import java.io.FileNotFoundException + +import org.apache.hadoop.conf.Configuration +import org.apache.hadoop.fs.{FileStatus, Path} +import org.apache.hadoop.mapreduce.{JobContext, TaskAttemptContext} +import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter +import org.apache.parquet.hadoop.{ParquetOutputCommitter, ParquetOutputFormat} + +import org.apache.spark.{LocalSparkContext, SparkFunSuite} +import org.apache.spark.sql.SparkSession +import org.apache.spark.sql.internal.SQLConf +import org.apache.spark.sql.test.SQLTestUtils + +/** + * Test logic related to choice of output commtters + */ +class ParquetCommitterSuite extends SparkFunSuite with SQLTestUtils + with LocalSparkContext { + + private val PARQUET_COMMITTER = classOf[ParquetOutputCommitter].getCanonicalName + + protected var spark: SparkSession = _ + + /** + * Create a new [[SparkSession]] running in local-cluster mode with unsafe and codegen enabled. + */ + override def beforeAll(): Unit = { +super.beforeAll() +spark = SparkSession.builder() + .master("local-cluster[2,1,1024]") + .appName("testing") + .getOrCreate() + } + + override def afterAll(): Unit = { +spark.stop() +spark = null + } + + test("alternative output committer, merge schema") { +intercept[RuntimeException] { + val stat = writeDataFrame(MarkingFileOutput.COMMITTER, true, true) + logError(s"Created marker file $stat") +} + } + + test("alternative output committer, no merge schema") { +writeDataFrame(MarkingFileOutput.COMMITTER, false, true) --- End diff -- OK --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19448: [SPARK-22217] [SQL] ParquetFileFormat to support ...
Github user steveloughran commented on a diff in the pull request: https://github.com/apache/spark/pull/19448#discussion_r143992319 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala --- @@ -138,6 +138,13 @@ class ParquetFileFormat conf.setBoolean(ParquetOutputFormat.ENABLE_JOB_SUMMARY, false) } +if (conf.getBoolean(ParquetOutputFormat.ENABLE_JOB_SUMMARY, false) + && !classOf[ParquetOutputCommitter].isAssignableFrom(committerClass)) { + // output summary is requested, but the class is not a Parquet Committer + throw new RuntimeException(s"Committer $committerClass is not a ParquetOutputCommitter" + +s" and cannot create job summaries.") --- End diff -- Depends on the policy about "what to do if it's not a parquet committer *and* the option for job summaries is set. It could just mean "you don't get summaries", which worksforme :). May want to log at info though? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19448: [SPARK-22217] [SQL] ParquetFileFormat to support ...
Github user steveloughran commented on a diff in the pull request: https://github.com/apache/spark/pull/19448#discussion_r143992018 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala --- @@ -138,6 +138,13 @@ class ParquetFileFormat conf.setBoolean(ParquetOutputFormat.ENABLE_JOB_SUMMARY, false) } +if (conf.getBoolean(ParquetOutputFormat.ENABLE_JOB_SUMMARY, false) + && !classOf[ParquetOutputCommitter].isAssignableFrom(committerClass)) { + // output summary is requested, but the class is not a Parquet Committer + throw new RuntimeException(s"Committer $committerClass is not a ParquetOutputCommitter" + --- End diff -- will do --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18029: [SPARK-20168] [DStream] Add changes to use kinesis fetch...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18029 **[Test build #82635 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82635/testReport)** for PR 18029 at commit [`72703a0`](https://github.com/apache/spark/commit/72703a072e34407b52d555bab98a435414d2ed25). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19464: [SPARK-22233] [core] Allow user to filter out emp...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/19464#discussion_r143984294 --- Diff: core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala --- @@ -122,7 +122,10 @@ class NewHadoopRDD[K, V]( case _ => } val jobContext = new JobContextImpl(_conf, jobId) -val rawSplits = inputFormat.getSplits(jobContext).toArray +var rawSplits = inputFormat.getSplits(jobContext).toArray(Array.empty[InputSplit]) +if (sparkContext.getConf.getBoolean("spark.hadoop.filterOutEmptySplit", false)) { + rawSplits = rawSplits.filter(_.getLength>0) --- End diff -- Space around operator. You should filter before making an array. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19464: [SPARK-22233] [core] Allow user to filter out empty spli...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/19464 Interesting. On the one hand I don't like adding yet another flag that changes behavior, when the user often can't meaningfully decide to set it. There is probably no value in processing an empty partition, sure. Then again it does change behavior slightly, and I wonder if that impacts assumptions that apps rely on somehow. If there's no reason to expect downside, we could do this in Spark 3.x, or make the change now but yes introduce a flag as a safety valve to go back to old behavior, leaving the default to true. But first are there any known impacts to skipping the empty partitions? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19471: [SPARK-22245][SQL] partitioned data set should always pu...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/19471 Fair enough to me. To check this change reasonable, we might be able to send a dev/user list email to social feedbacks. I saw marmbrus doing so when adding the json API; https://github.com/apache/spark/pull/15274#issuecomment-250092074 http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-SQL-JSON-Column-Support-td19132.html If we have no response or positive feedbacks, we could quickly/safely drop the support. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17819 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17819 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82632/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17819 **[Test build #82632 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82632/testReport)** for PR 17819 at commit [`bb19708`](https://github.com/apache/spark/commit/bb19708dcd359c434c0fac779125f949541f9b8c). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19471: [SPARK-22245][SQL] partitioned data set should always pu...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19471 waiting for more feedbacks before moving forward :) Another thing I wanna point out: for `sql("create table t using parquet options(skipHiveMetadata=true) location '/tmp/t'")`, it works in Spark 2.0, and the created table has a schema that the partition column is at the beginning. In Spark 2.1, it also works, and `DESC TABLE` also shows the table schema has partition column at the beginning. However, if you query the table, the output schema has partition column at the end. It's been a long time since Spark 2.1 was released and no one reports this behavior change. It seems this is really a corner case and makes me feel we should not compilcate our code too much for it. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19471: [SPARK-22245][SQL] partitioned data set should always pu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19471 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82633/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19471: [SPARK-22245][SQL] partitioned data set should always pu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19471 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19471: [SPARK-22245][SQL] partitioned data set should always pu...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/19471 Does this change affect some other tests for the overlapped cases like [DataStreamReaderWriterSuite](https://github.com/apache/spark/blob/655f6f86f84ff5241d1d20766e1ef83bb32ca5e0/sql/core/src/test/scala/org/apache/spark/sql/streaming/test/DataStreamReaderWriterSuite.scala#L550) and `OrcPartitionDiscoverySuite`? Since we already have some amount of these tests in multiple places, (I know you've already considered this aspect though) I'm a little worried about if this change in minor releases makes users confused. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org