[GitHub] spark issue #22683: [SPARK-25696] The storage memory displayed on spark Appl...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/22683 In this case the change is simpler to understand in prose, I think; "100 KB" becomes "97.6 KiB", etc.
[GitHub] spark pull request #23275: [SPARK-26323][SQL] Scala UDF should still check i...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/23275#discussion_r240234573

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala ---
@@ -47,25 +47,13 @@ case class ScalaUDF(
     function: AnyRef,
     dataType: DataType,
     children: Seq[Expression],
-    inputsNullSafe: Seq[Boolean],
-    inputTypes: Seq[DataType] = Nil,
+    @transient inputsNullSafe: Seq[Boolean],
+    @transient inputTypes: Seq[AbstractDataType] = Nil,
     udfName: Option[String] = None,
     nullable: Boolean = true,
     udfDeterministic: Boolean = true)
   extends Expression with ImplicitCastInputTypes with NonSQLExpression with UserDefinedExpression {

-  // The constructor for SPARK 2.1 and 2.2
--- End diff --

I'm OK removing it even without a formal deprecation; these versions are EOL
[GitHub] spark pull request #23275: [SPARK-26323][SQL] Scala UDF should still check i...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/23275#discussion_r240234371

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala ---
@@ -47,25 +47,13 @@ case class ScalaUDF(
     function: AnyRef,
     dataType: DataType,
     children: Seq[Expression],
-    inputsNullSafe: Seq[Boolean],
+    @transient inputsNullSafe: Seq[Boolean],
--- End diff --

What was the need for this one? does this object get caught in a closure somewhere?
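[Editor's note] For context on why `@transient` matters here, a minimal, self-contained sketch — `DemoUdf` is illustrative, not Spark's `ScalaUDF`: any non-transient field of an object captured in a serialized closure is written out with it, while `@transient` fields are skipped and come back null.

```scala
import java.io._

// 'inputTypes' is marked @transient, so Java serialization skips it.
case class DemoUdf(name: String, @transient inputTypes: Seq[String])

object TransientDemo {
  def main(args: Array[String]): Unit = {
    val out = new ByteArrayOutputStream()
    val oos = new ObjectOutputStream(out)
    oos.writeObject(DemoUdf("f", Seq("int", "string")))
    oos.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(out.toByteArray))
    val copy = in.readObject().asInstanceOf[DemoUdf]
    println(copy.name)       // f
    println(copy.inputTypes) // null: the transient field was not serialized
  }
}
```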
[GitHub] spark issue #23260: [SPARK-26311][YARN] New feature: custom log URL for stdo...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/23260 Ok, got it. @vanzin or @squito or others would be better able to evaluate.
[GitHub] spark issue #23072: [SPARK-19827][R]spark.ml R API for PIC
Github user srowen commented on the issue: https://github.com/apache/spark/pull/23072 @dongjoon-hyun @felixcheung how about now?
[GitHub] spark issue #23260: [SPARK-26311][YARN] New feature: custom log URL for stdo...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/23260 If you're on YARN, this feels like something you would manage via YARN and its cluster management options. Is there a specific use case here, that this has to happen in Spark?
[GitHub] spark issue #23263: [SPARK-23674][ML] Adds Spark ML Events
Github user srowen commented on the issue: https://github.com/apache/spark/pull/23263

My first impression is that it's a big change, which is reason for caution here. Visualizing a workflow is nice, but Spark's Pipelines are typically pretty straightforward and linear. I could imagine producing a nicer visualization than what you get from reading the Spark UI, although of course we already have some degree of history and data there.

These are just the hooks, right? someone would have to implement something to use these events. I see the value in the API to some degree, but with no concrete implementation, does it add anything for Spark users out of the box?

It seems like the history this generates would belong in the history server, although that already has a pretty particular purpose, storing granular history of events in Spark. Is that what someone would likely do? or would someone likely have to run Atlas to use this? If that's a good example of the use case, and Atlas is really about lineage and governance, is that the thrust of this change, to help with something to do with model lineage and reproducibility?

It's good that the API changes little, though it does change a bit. I think I mostly have questions right now.
[GitHub] spark issue #23241: [SPARK-26283][CORE] Enable reading from open frames of z...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/23241 Merged to master
[GitHub] spark issue #22683: [SPARK-25696] The storage memory displayed on spark Appl...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/22683 Fortunately the syntax is "100m", which has always meant "100 * 1024 * 1024" or "100 MiB"
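[Editor's note] A small illustrative sketch of the distinction (names hypothetical, not Spark's parser): "100m" parsed with the binary interpretation, rendered back in both unit systems.

```scala
object UnitDemo {
  // "100m" is interpreted as 100 * 1024 * 1024 bytes, i.e. binary units.
  def parseMegabytes(s: String): Long = {
    require(s.endsWith("m"), "expected a size like 100m")
    s.dropRight(1).toLong * 1024L * 1024L
  }

  def main(args: Array[String]): Unit = {
    val bytes = parseMegabytes("100m")
    println(bytes)                               // 104857600
    println(s"${bytes / (1024.0 * 1024.0)} MiB") // 100.0 MiB (binary units)
    println(f"${bytes / 1e6}%.1f MB")            // 104.9 MB (decimal units)
  }
}
```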
[GitHub] spark pull request #23264: [SPARK-26266][BUILD] Update to Scala 2.12.8 (bran...
Github user srowen closed the pull request at: https://github.com/apache/spark/pull/23264
[GitHub] spark pull request #23264: Update to Scala 2.12.8
GitHub user srowen opened a pull request: https://github.com/apache/spark/pull/23264

Update to Scala 2.12.8

## What changes were proposed in this pull request?

Back-port of https://github.com/apache/spark/pull/23218 ; updates Scala 2.12 build to 2.12.8

## How was this patch tested?

Existing tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/srowen/spark SPARK-26266.2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/23264.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #23264

commit 4427e9f183f5f0aae7e32643508b8a3b1c9bf234
Author: Sean Owen
Date: 2018-12-08T12:09:30Z

Update to Scala 2.12.8
[GitHub] spark issue #23218: [SPARK-26266][BUILD] Update to Scala 2.12.8
Github user srowen commented on the issue: https://github.com/apache/spark/pull/23218 Merged to master. I'll open a separate PR for branch-2.4
[GitHub] spark pull request #23218: [SPARK-26266][BUILD] Update to Scala 2.12.8
Github user srowen closed the pull request at: https://github.com/apache/spark/pull/23218
[GitHub] spark pull request #23225: [SPARK-26287][CORE]Don't need to create an empty ...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/23225#discussion_r239990036

--- Diff: core/src/main/java/org/apache/spark/shuffle/sort/ShuffleExternalSorter.java ---
@@ -161,6 +161,10 @@ private void writeSortedFile(boolean isLastFile) {
     final ShuffleInMemorySorter.ShuffleSorterIterator sortedRecords =
       inMemSorter.getSortedIterator();

+    // If there are no sorted records, so we don't need to create an empty spill file.
+    if (!sortedRecords.hasNext()) {
+      return;
+    }
--- End diff --

If you're going to short-circuit, why not do it at the start of the function and save the rest of the work done above?
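[Editor's note] The suggestion in miniature — a hedged sketch with hypothetical helper names, not the real `ShuffleExternalSorter`:

```scala
import java.io.File

case class Record(key: Int, value: String)

object SpillDemo {
  // Hypothetical helpers, for illustration only.
  private def createSpillFile(): File = File.createTempFile("spill", ".bin")
  private def append(file: File, r: Record): Unit = ()

  // The review point: check the cheap exit condition before paying for any
  // setup work, rather than after it has already been done.
  def writeSortedFile(records: Iterator[Record]): Option[File] = {
    if (!records.hasNext) {
      return None // no records: skip file creation entirely
    }
    val file = createSpillFile()
    records.foreach(append(file, _))
    Some(file)
  }
}
```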
[GitHub] spark issue #23256: [SPARK-24207][R] follow-up PR for SPARK-24207 to fix cod...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/23256 CC @felixcheung
[GitHub] spark issue #23160: [SPARK-26196][SPARK-26281][WebUI] Total tasks title in t...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/23160 Merged to master. @shahidki31 does this need to go in branch 2.4, 2.3? I tried back-porting it, but it looks like a lot of the affected code didn't exist in 2.4. If the fix can or should also be back-ported and you're willing, you're welcome to open another PR against 2.4.
[GitHub] spark issue #23247: [SPARK-26294][CORE]Delete Unnecessary If statement
Github user srowen commented on the issue: https://github.com/apache/spark/pull/23247 Merged to master
[GitHub] spark issue #17673: [SPARK-20372] [ML] Word2Vec Continuous Bag of Words mode...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/17673 @ngopal this one can't be merged as-is and looks like it was abandoned. Would you like to take over this PR and update it per the reviews? I'd review that. I think CBOW could be useful in MLlib.
[GitHub] spark issue #23218: [SPARK-26266][BUILD] Update to Scala 2.12.8
Github user srowen commented on the issue: https://github.com/apache/spark/pull/23218 I'm not sure. I can't find any other reference to this crash and 2.12.8. It could be something only Spark happens to trigger, or could be specific to this JVM + platform but not Spark or Scala. We could drop a release note at least recommending the latest version of Java 8 (and 11) with Spark 2.4 / 3.0
[GitHub] spark issue #23241: [SPARK-26283][CORE] Enable reading from open frames of z...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/23241 Sorry about the run-around. I'm OK being conservative here as you were originally, too.
[GitHub] spark issue #23246: [SPARK-26292][CORE]Assert statement of currentPage may b...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/23246 From the description, it's not clear that this change is right. Please review https://spark.apache.org/contributing.html This one should be closed.
[GitHub] spark issue #23247: [SPARK-26294][CORE]Delete Unnecessary If statement
Github user srowen commented on the issue: https://github.com/apache/spark/pull/23247 These aren't worth the time it takes us to review them and merge them, honestly. A little cleanup can be OK if it makes an appreciable difference in speed or readability, and if you can find many instances of the same issue.
[GitHub] spark pull request #23241: [SPARK-26283][CORE] Enable reading from open fram...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/23241#discussion_r239590339

--- Diff: core/src/main/scala/org/apache/spark/scheduler/ReplayListenerBus.scala ---
@@ -118,8 +118,6 @@ private[spark] class ReplayListenerBus extends SparkListenerBus with Logging {
     case e: HaltReplayException =>
       // Just stop replay.
     case _: EOFException if maybeTruncated =>
-    case _: IOException if maybeTruncated =>
--- End diff --

OK, this could be OK, if this was really added only to address what you are fixing here.
[GitHub] spark pull request #23241: [SPARK-26283][CORE] Enable reading from open fram...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/23241#discussion_r239532748

--- Diff: core/src/main/scala/org/apache/spark/scheduler/ReplayListenerBus.scala ---
@@ -118,10 +118,12 @@ private[spark] class ReplayListenerBus extends SparkListenerBus with Logging {
     case e: HaltReplayException =>
       // Just stop replay.
     case _: EOFException if maybeTruncated =>
-    case _: IOException if maybeTruncated =>
-      logWarning(s"Failed to read Spark event log: $sourceName")
     case ioe: IOException =>
-      throw ioe
+      if (maybeTruncated) {
--- End diff --

Oh I see. Actually, what about just removing the second case? it's simpler to just let it throw.
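[Editor's note] What "just removing the second case" would look like — a simplified sketch, not the actual `ReplayListenerBus` code:

```scala
import java.io.EOFException

object ReplayDemo {
  private def processLine(line: String): Unit = () // stand-in for real parsing

  // Drop the IOException case entirely: an unexpected IOException then
  // propagates to the caller, which is the simpler behavior suggested above.
  def replay(lines: Iterator[String], maybeTruncated: Boolean): Unit = {
    try {
      lines.foreach(processLine)
    } catch {
      case _: EOFException if maybeTruncated => // expected for in-progress logs
    }
  }
}
```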
[GitHub] spark pull request #23241: [SPARK-26283][CORE] Enable reading from open fram...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/23241#discussion_r239525888

--- Diff: core/src/main/scala/org/apache/spark/scheduler/ReplayListenerBus.scala ---
@@ -118,10 +118,12 @@ private[spark] class ReplayListenerBus extends SparkListenerBus with Logging {
     case e: HaltReplayException =>
       // Just stop replay.
     case _: EOFException if maybeTruncated =>
-    case _: IOException if maybeTruncated =>
-      logWarning(s"Failed to read Spark event log: $sourceName")
     case ioe: IOException =>
-      throw ioe
+      if (maybeTruncated) {
--- End diff --

I think this was already the behavior? if it doesn't match the 'if' it would just throw anyway
[GitHub] spark issue #23202: [SPARK-26248][SQL] Infer date type from CSV
Github user srowen commented on the issue: https://github.com/apache/spark/pull/23202 I'd defer to @HyukjinKwon ; looks OK in broad strokes but he would know much more about the CSV parsing.
[GitHub] spark pull request #23241: [SPARK-26283][CORE] Enable reading from open fram...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/23241#discussion_r239509724

--- Diff: core/src/main/scala/org/apache/spark/io/CompressionCodec.scala ---
@@ -197,4 +201,8 @@ class ZStdCompressionCodec(conf: SparkConf) extends CompressionCodec {
     // avoid overhead excessive of JNI call while trying to uncompress small amount of data.
     new BufferedInputStream(new ZstdInputStream(s), bufferSize)
   }
+
+  override def zstdEventLogCompressedInputStream(s: InputStream): InputStream = {
+    new BufferedInputStream(new ZstdInputStream(s).setContinuous(true), bufferSize)
--- End diff --

That's what I'm wondering about. Is it actually desirable to not fail on a partial frame? I'm not sure. We *shouldn't* encounter it elsewhere. This changes a developer API, but may not even be a breaking change as there is a default implementation. We can take breaking changes in Spark 3 though. I think I agree with your approach here in the end.
[GitHub] spark pull request #22683: [SPARK-25696] The storage memory displayed on spa...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22683#discussion_r239480795

--- Diff: core/src/main/scala/org/apache/spark/serializer/KryoSerializer.scala ---
@@ -62,14 +62,14 @@ class KryoSerializer(conf: SparkConf)
     if (bufferSizeKb >= ByteUnit.GiB.toKiB(2)) {
       throw new IllegalArgumentException("spark.kryoserializer.buffer must be less than " +
-        s"2048 mb, got: + ${ByteUnit.KiB.toMiB(bufferSizeKb)} mb.")
+        s"2048 mib, got: + ${ByteUnit.KiB.toMiB(bufferSizeKb)} mib.")
--- End diff --

Nit: mib -> MiB. (I know it was 'mb' before but that's not really the abbreviation for 1 million bytes)
[GitHub] spark pull request #22683: [SPARK-25696] The storage memory displayed on spa...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22683#discussion_r239481508

--- Diff: docs/sql-programming-guide.md ---
@@ -4,10 +4,15 @@ displayTitle: Spark SQL, DataFrames and Datasets Guide
 title: Spark SQL and DataFrames
 ---

+* This will become a table of contents (this text will be scraped).
--- End diff --

The change to this file looks unrelated. Could you revert it?
[GitHub] spark pull request #22683: [SPARK-25696] The storage memory displayed on spa...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22683#discussion_r239482033

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/NullExpressionsSuite.scala ---
@@ -160,7 +160,7 @@ class NullExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper {
     checkEvaluation(AtLeastNNonNulls(4, nullOnly), false, EmptyRow)
   }

-  test("Coalesce should not throw 64kb exception") {
+  test("Coalesce should not throw 64kib exception") {
--- End diff --

Nit: 64 KiB
[GitHub] spark pull request #23241: [SPARK-26283][CORE] Enable reading from open fram...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/23241#discussion_r239476570

--- Diff: core/src/main/scala/org/apache/spark/io/CompressionCodec.scala ---
@@ -197,4 +201,8 @@ class ZStdCompressionCodec(conf: SparkConf) extends CompressionCodec {
     // avoid overhead excessive of JNI call while trying to uncompress small amount of data.
     new BufferedInputStream(new ZstdInputStream(s), bufferSize)
   }
+
+  override def zstdEventLogCompressedInputStream(s: InputStream): InputStream = {
+    new BufferedInputStream(new ZstdInputStream(s).setContinuous(true), bufferSize)
--- End diff --

BTW it seems like 'continuous' changes behavior very little: https://github.com/luben/zstd-jni/blob/master/src/main/java/com/github/luben/zstd/ZstdInputStream.java#L147 I agree with your concern to keep the change minimal. I'm trying to think if this would break anything if everything were read as 'continuous'. It wouldn't fail fast in some case?
[GitHub] spark pull request #23241: [SPARK-26283][CORE] Enable reading from open fram...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/23241#discussion_r239476672

--- Diff: core/src/main/scala/org/apache/spark/scheduler/ReplayListenerBus.scala ---
@@ -118,8 +118,6 @@ private[spark] class ReplayListenerBus extends SparkListenerBus with Logging {
     case e: HaltReplayException =>
       // Just stop replay.
     case _: EOFException if maybeTruncated =>
-    case _: IOException if maybeTruncated =>
--- End diff --

Can this still happen for non-zstd compression though?
[GitHub] spark pull request #22759: [MINOR][SQL][DOC] Correct parquet nullability doc...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22759#discussion_r239474962

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/test/DataFrameReaderWriterSuite.scala ---
@@ -542,6 +551,35 @@ class DataFrameReaderWriterSuite extends QueryTest with SharedSQLContext with Be
   }
 }

+  test("parquet - column nullability -- write only") {
+    val schema = StructType(
+      StructField("cl1", IntegerType, nullable = false) ::
--- End diff --

Nit: could we indent these at the same level?
[GitHub] spark pull request #22759: [MINOR][SQL][DOC] Correct parquet nullability doc...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22759#discussion_r239475332

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/test/DataFrameReaderWriterSuite.scala ---
@@ -542,6 +551,35 @@ class DataFrameReaderWriterSuite extends QueryTest with SharedSQLContext with Be
   }
 }

+  test("parquet - column nullability -- write only") {
+    val schema = StructType(
+      StructField("cl1", IntegerType, nullable = false) ::
+      StructField("cl2", IntegerType, nullable = true) :: Nil)
+    val row = Row(3, 4)
+    val df = spark.createDataFrame(sparkContext.parallelize(row :: Nil), schema)
+
+    withTempPath { dir =>
+      val path = dir.getAbsolutePath
+      df.write.mode("overwrite").parquet(path)
+      val file = SpecificParquetRecordReaderBase.listDirectory(dir).get(0)
+
+      val hadoopInputFile = HadoopInputFile.fromPath(new Path(file), new Configuration())
+      val f = ParquetFileReader.open(hadoopInputFile)
+      val parquetSchema = f.getFileMetaData.getSchema.getColumns.asScala
+        .map(_.getPrimitiveType)
+      f.close
+
+      // the write keeps nullable info from the schema
+      val expectedParquetSchema: Seq[PrimitiveType] = Seq(
--- End diff --

Also really doesn't matter, but you can simplify the code by omitting types like this, etc.
[GitHub] spark pull request #22759: [MINOR][SQL][DOC] Correct parquet nullability doc...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22759#discussion_r239475203

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/test/DataFrameReaderWriterSuite.scala ---
@@ -542,6 +551,35 @@ class DataFrameReaderWriterSuite extends QueryTest with SharedSQLContext with Be
   }
 }

+  test("parquet - column nullability -- write only") {
+    val schema = StructType(
+      StructField("cl1", IntegerType, nullable = false) ::
+      StructField("cl2", IntegerType, nullable = true) :: Nil)
+    val row = Row(3, 4)
+    val df = spark.createDataFrame(sparkContext.parallelize(row :: Nil), schema)
+
+    withTempPath { dir =>
+      val path = dir.getAbsolutePath
+      df.write.mode("overwrite").parquet(path)
+      val file = SpecificParquetRecordReaderBase.listDirectory(dir).get(0)
+
+      val hadoopInputFile = HadoopInputFile.fromPath(new Path(file), new Configuration())
+      val f = ParquetFileReader.open(hadoopInputFile)
+      val parquetSchema = f.getFileMetaData.getSchema.getColumns.asScala
+        .map(_.getPrimitiveType)
+      f.close
+
+      // the write keeps nullable info from the schema
+      val expectedParquetSchema: Seq[PrimitiveType] = Seq(
+        new PrimitiveType(Repetition.REQUIRED, PrimitiveTypeName.INT32, "cl1"),
+        new PrimitiveType(Repetition.OPTIONAL, PrimitiveTypeName.INT32, "cl2")
+      )
+
+      assert (expectedParquetSchema == parquetSchema)
--- End diff --

Nit: I think ideally you use the `===` test operator, so that failures generate a better message
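[Editor's note] A minimal illustration of the nit, assuming a 2018-era ScalaTest `FunSuite`; per the review comment, `===` reports both operands on failure rather than a bare "assertion failed".

```scala
import org.scalatest.FunSuite

class SchemaSuite extends FunSuite {
  test("schemas match") {
    val expected = Seq("INT32 cl1", "INT32 cl2")
    val actual = Seq("INT32 cl1")
    // On failure, === reports both sides, e.g.
    //   List(INT32 cl1, INT32 cl2) did not equal List(INT32 cl1)
    assert(expected === actual)
  }
}
```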
[GitHub] spark pull request #23241: [SPARK-26283][CORE]When zstd compression enabled,...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/23241#discussion_r239218209

--- Diff: core/src/main/scala/org/apache/spark/io/CompressionCodec.scala ---
@@ -197,4 +201,8 @@ class ZStdCompressionCodec(conf: SparkConf) extends CompressionCodec {
     // avoid overhead excessive of JNI call while trying to uncompress small amount of data.
     new BufferedInputStream(new ZstdInputStream(s), bufferSize)
   }
+
+  override def zstdEventLogCompressedInputStream(s: InputStream): InputStream = {
+    new BufferedInputStream(new ZstdInputStream(s).setContinuous(true), bufferSize)
--- End diff --

Yeah, so this new "partial file" method can call the existing method by default, and do something different for zstd. Then this one call site can ask for the 'partial file' stream. Some comments about the difference here would be helpful.
[GitHub] spark pull request #23228: [MINOR][DOC]The condition description of serializ...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/23228#discussion_r239215561

--- Diff: core/src/main/scala/org/apache/spark/shuffle/sort/SortShuffleManager.scala ---
@@ -33,10 +33,10 @@ import org.apache.spark.shuffle._
  * Sort-based shuffle has two different write paths for producing its map output files:
  *
  *  - Serialized sorting: used when all three of the following conditions hold:
- *    1. The shuffle dependency specifies no aggregation or output ordering.
+ *    1. The shuffle dependency specifies no map-side combine.
--- End diff --

Does this sound right @JoshRosen ?
[GitHub] spark pull request #23241: [SPARK-26283][CORE]When zstd compression enabled,...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/23241#discussion_r239208072

--- Diff: core/src/main/scala/org/apache/spark/io/CompressionCodec.scala ---
@@ -197,4 +201,8 @@ class ZStdCompressionCodec(conf: SparkConf) extends CompressionCodec {
     // avoid overhead excessive of JNI call while trying to uncompress small amount of data.
     new BufferedInputStream(new ZstdInputStream(s), bufferSize)
   }
+
+  override def zstdEventLogCompressedInputStream(s: InputStream): InputStream = {
+    new BufferedInputStream(new ZstdInputStream(s).setContinuous(true), bufferSize)
--- End diff --

What happens if you set continuous for everything? would it work in all cases? It kind of looks like zstd always uses this in the code path below anyway. I think that if we introduce a new method we might try to make it a little more general, like: compressedInputStreamForPartialFile or something. It would be good to avoid the isInstanceOf below.
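[Editor's note] The API shape being suggested, sketched with the hypothetical name from the comment above (the method Spark ultimately added may differ): a default implementation on the trait means only the zstd codec needs to override it, and no existing codec breaks.

```scala
import java.io.InputStream

// Toy stand-in for Spark's CompressionCodec trait, for illustration only.
trait DemoCompressionCodec {
  def compressedInputStream(s: InputStream): InputStream

  // Default: behave exactly like the normal stream; overriding is optional,
  // so adding this method is not a breaking change for existing codecs.
  def compressedInputStreamForPartialFile(s: InputStream): InputStream =
    compressedInputStream(s)
}
```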
[GitHub] spark issue #23231: [SPARK-26273][ML] Add OneHotEncoderEstimator as alias to...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/23231 I'm not seeing it in the migration guide, maybe I'm missing it. In any event, I don't think we need to keep this for 3.0.
[GitHub] spark issue #23229: [MINOR][CORE] Modify some field name because it may be c...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/23229 Agree, this isn't worthwhile.
[GitHub] spark pull request #23196: [SPARK-26243][SQL] Use java.time API for parsing ...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/23196#discussion_r239068840

--- Diff: sql/hive/compatibility/src/test/scala/org/apache/spark/sql/hive/execution/HiveCompatibilitySuite.scala ---
@@ -49,8 +49,8 @@ class HiveCompatibilitySuite extends HiveQueryFileTest with BeforeAndAfter {
   override def beforeAll() {
     super.beforeAll()
     TestHive.setCacheTables(true)
-    // Timezone is fixed to America/Los_Angeles for those timezone sensitive tests (timestamp_*)
-    TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))
+    // Timezone is fixed to GMT for those timezone sensitive tests (timestamp_*)
--- End diff --

I think consistency is indeed a problem, but why disable the new parser, rather than make this consistent? I haven't looked into whether there's a good reason they behave differently but suspect not.
[GitHub] spark issue #23218: [SPARK-26266][BUILD] Update to Scala 2.12.8
Github user srowen commented on the issue: https://github.com/apache/spark/pull/23218 Ah OK, so all of them were a JVM crash. It would probably be a good idea to update the JVM on all the workers as _60 is over 3 years old. It's probably not as simple as it sounds but WDYT @shaneknapp ?
[GitHub] spark issue #23218: [SPARK-26266][BUILD] Update to Scala 2.12.8
Github user srowen commented on the issue: https://github.com/apache/spark/pull/23218 Hm, one failure was due to a JVM crash, but it fails twice consistently, with sbt just exiting with status 134. No other failures are logged. Not sure what to make of that!
[GitHub] spark issue #23216: [SPARK-26264][CORE]It is better to add @transient to fie...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/23216 I think just leave it. The `@transient` on `ShuffleMapTask`'s `locs` is just superfluous here, not sure it's worth changing.
[GitHub] spark issue #23216: [SPARK-26264][CORE]It is better to add @transient to fie...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/23216 Are you sure it's even a field in the class? it looks like it's only used to define this:

```
@transient private[this] val preferredLocs: Seq[TaskLocation] = {
  if (locs == null) Nil else locs.toSet.toSeq
}
```

I'd expect Scala would not generate a field. Indeed the thing it is used to make is transient.
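[Editor's note] The point in miniature — an illustrative class, not the real `ShuffleMapTask`: a constructor parameter consumed only by a val initializer never becomes a field, so there is nothing for `@transient` to apply to.

```scala
// 'locs' is used only in the constructor, so scalac generates no field for
// it; the derived val is the thing that must be (and is) transient.
class DemoTask(locs: Seq[String]) extends Serializable {
  @transient private[this] val preferredLocs: Seq[String] =
    if (locs == null) Nil else locs.toSet.toSeq

  def hasPreferredLocs: Boolean = preferredLocs.nonEmpty
}
```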
[GitHub] spark pull request #23159: [SPARK-26191][SQL] Control truncation of Spark pl...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/23159#discussion_r238869530

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -1777,7 +1777,7 @@ class Analyzer(
     case p if p.expressions.exists(hasGenerator) =>
       throw new AnalysisException("Generators are not supported outside the SELECT clause, but " +
-        "got: " + p.simpleString)
+        "got: " + p.simpleString((SQLConf.get.maxToStringFields)))
--- End diff --

Nit: are there extra parens here?
[GitHub] spark issue #23159: [SPARK-26191][SQL] Control truncation of Spark plans via...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/23159 Rather than change every single call to this method, if this should generally be the value of the argument, then why not make it the default value or something?
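[Editor's note] A sketch of the alternative (stand-in names; `maxToStringFields` mirrors the diff context, the rest is hypothetical):

```scala
object DefaultArgDemo {
  // Stand-in for SQLConf.get.maxToStringFields, for illustration only.
  def maxToStringFields: Int = 25

  // Before: every call site must pass the config value explicitly.
  def simpleString(maxFields: Int): String = s"truncated at $maxFields fields"

  // After: the config value is the default, so call sites stay unchanged.
  def simpleStringDefaulted(maxFields: Int = maxToStringFields): String =
    s"truncated at $maxFields fields"

  def main(args: Array[String]): Unit = {
    println(simpleString(maxToStringFields)) // explicit at each call
    println(simpleStringDefaulted())         // default picked up from config
  }
}
```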
[GitHub] spark issue #23219: [SPARK-26266][BUILD] Update to Scala 2.12.8
Github user srowen commented on the issue: https://github.com/apache/spark/pull/23219 @wangyum I already opened https://github.com/apache/spark/pull/23218
[GitHub] spark issue #22759: [MINOR][SQL][DOC] Correct parquet nullability documentat...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/22759 Ping @dima-asana to rebase or close
[GitHub] spark issue #21363: [SPARK-19228][SQL] Migrate on Java 8 time from FastDateF...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/21363 @MaxGekk now that your change is merged, can this proceed, @xuanyuanking ? or is it obsolete?
[GitHub] spark issue #22997: SPARK-25999: make-distribution.sh failure with --r and -...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/22997 Yeah, we can't make this change for the reasons above.
[GitHub] spark issue #22887: [SPARK-25880][CORE] user set's hadoop conf should not ov...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/22887 @gjhkael can you clarify further what the undesirable behavior is, and what behavior you are looking for?
[GitHub] spark issue #23098: [WIP][SPARK-26132][BUILD][CORE] Remove support for Scala...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/23098 Note I'm holding on to this PR for a while as I understand it might be disruptive to downstream builds to remove 2.11 support just now. Will look at merging it in a few weeks. Right now it's an FYI.
[GitHub] spark issue #23150: [SPARK-26178][SQL] Use java.time API for parsing timesta...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/23150 Merged to master
[GitHub] spark issue #23218: [SPARK-26266][BUILD] Update to Scala 2.12.8
Github user srowen commented on the issue: https://github.com/apache/spark/pull/23218 Hm, looks like the genjavadoc plugin is published for individual Scala releases and doesn't exist yet for 2.12.8: https://mvnrepository.com/artifact/com.typesafe.genjavadoc/genjavadoc-plugin . I'll look at whether we can leave it at 2.12.7 or whether to expect it will be released soon.
[GitHub] spark issue #23170: [SPARK-24423][FOLLOW-UP][SQL] Fix error example
Github user srowen commented on the issue: https://github.com/apache/spark/pull/23170 Merged to master/2.4
[GitHub] spark issue #22600: [SPARK-25578][BUILD] Update to Scala 2.12.7
Github user srowen commented on the issue: https://github.com/apache/spark/pull/22600 @wangyum sounds good. I opened https://github.com/apache/spark/pull/23218
[GitHub] spark pull request #23218: [SPARK-26266][BUILD] Update to Scala 2.12.8
GitHub user srowen opened a pull request: https://github.com/apache/spark/pull/23218

[SPARK-26266][BUILD] Update to Scala 2.12.8

## What changes were proposed in this pull request?

Update to Scala 2.12.8

## How was this patch tested?

Existing tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/srowen/spark SPARK-26266

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/23218.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #23218

commit b667d37e9ee2d8cdce459806925cdc0fe725b7bf
Author: Sean Owen
Date: 2018-12-04T13:53:21Z

Update to Scala 2.12.8
[GitHub] spark issue #23182: Config change followup to [SPARK-26177] Automated format...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/23182 Merged to master
[GitHub] spark issue #23189: [SPARK-26235][Core] Change log level for ClassNotFoundEx...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/23189 Merged to master
[GitHub] spark issue #18784: [SPARK-21559][Mesos] remove mesos fine-grained mode
Github user srowen commented on the issue: https://github.com/apache/spark/pull/18784 @skonto do you want to proceed with this?
[GitHub] spark issue #22683: [SPARK-25696] The storage memory displayed on spark Appl...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/22683 Add it to this PR; the changes logically go together.
[GitHub] spark issue #23205: [SPARK-26253][WebUI] : Task Summary Metrics Table on Sta...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/23205 Merging as a follow up to https://github.com/apache/spark/pull/21688
[GitHub] spark issue #23164: [SPARK-26198][SQL] Fix Metadata serialize null values th...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/23164 OK merged to 2.4/2.3
[GitHub] spark pull request #23088: [SPARK-26119][CORE][WEBUI]Task summary table shou...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/23088#discussion_r238110710

--- Diff: core/src/main/scala/org/apache/spark/status/AppStatusStore.scala ---
@@ -148,11 +148,20 @@ private[spark] class AppStatusStore(
     // cheaper for disk stores (avoids deserialization).
     val count = {
       Utils.tryWithResource(
-        store.view(classOf[TaskDataWrapper])
-          .parent(stageKey)
-          .index(TaskIndexNames.EXEC_RUN_TIME)
-          .first(0L)
-          .closeableIterator()
+        if (store.isInstanceOf[LevelDB]) {
--- End diff --

Does this code path need to be different for disk vs memory? this part seemed like it could work efficiently either way.
[GitHub] spark pull request #23088: [SPARK-26119][CORE][WEBUI]Task summary table shou...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/23088#discussion_r238110723

--- Diff: core/src/main/scala/org/apache/spark/status/AppStatusStore.scala ---
@@ -221,29 +230,49 @@
     // stabilize once the stage finishes. It's also slow, especially with disk stores.
     val indices = quantiles.map { q => math.min((q * count).toLong, count - 1) }

+    // TODO Summary metrics needs to display all the successful tasks' metrics (SPARK-26119).
--- End diff --

It's not ideal but it's a reasonable solution. Are you OK with it @vanzin ?
[GitHub] spark pull request #22683: [SPARK-25696] The storage memory displayed on spa...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22683#discussion_r238102080

--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -1164,17 +1164,17 @@ private[spark] object Utils extends Logging {
     } else {
       val (value, unit) = {
         if (size >= 2 * EB) {
-          (BigDecimal(size) / EB, "EB")
+          (BigDecimal(size) / EB, "EiB")
--- End diff --

For full consistency, how about modifying the values like EB and PB above?
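[Editor's note] What full consistency would look like, as a standalone sketch (not the real `Utils.bytesToString`): every binary unit gets its IEC name.

```scala
object IecUnitsDemo {
  private val units = Seq("B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB")

  // Walk up the binary-unit ladder and render with the IEC abbreviation.
  def bytesToString(size: BigInt): String = {
    var value = BigDecimal(size)
    var i = 0
    while (value >= 1024 && i < units.length - 1) {
      value /= 1024
      i += 1
    }
    f"${value.toDouble}%.1f ${units(i)}"
  }

  def main(args: Array[String]): Unit = {
    println(bytesToString(BigInt(100L) * 1024 * 1024)) // 100.0 MiB
    println(bytesToString(BigInt(100L) * 1000 * 1000)) // 95.4 MiB
  }
}
```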
[GitHub] spark issue #23164: [SPARK-26198][SQL] Fix Metadata serialize null values th...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/23164 I just mean, is this a bug that comes up otherwise in Spark? should this be back-ported or is it just supporting the new change you reference? I can merge to master at least.
[GitHub] spark issue #23150: [SPARK-26178][SQL] Use java.time API for parsing timesta...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/23150 It makes sense that parsing depends on a timezone, though that's set as an option in the parser typically. The tests should generally test "GMT" for this reason. If there's a default code path for when no timezone is specified, then I'd use the test harness mechanisms for temporarily changing the system timezone to GMT (which then automatically changes back). Your changes look OK here and they pass right? is there another test you were unable to add?
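[Editor's note] The "temporarily change the system timezone" mechanism described here is, in essence, a loan pattern; the helper below is an illustrative stand-in, not Spark's actual test-harness API.

```scala
import java.util.TimeZone

object TimeZoneDemo {
  // Switch the JVM default timezone for the body only, and always restore
  // the original afterwards, even if the body throws.
  def withDefaultTimeZone[T](tz: TimeZone)(body: => T): T = {
    val original = TimeZone.getDefault
    TimeZone.setDefault(tz)
    try body finally TimeZone.setDefault(original)
  }

  def main(args: Array[String]): Unit = {
    withDefaultTimeZone(TimeZone.getTimeZone("GMT")) {
      println(TimeZone.getDefault.getID) // GMT, inside the block only
    }
    println(TimeZone.getDefault.getID)   // restored afterwards
  }
}
```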
[GitHub] spark pull request #23173: [SPARK-26208][SQL] add headers to empty csv files...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/23173#discussion_r238073218

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala ---
@@ -171,15 +171,21 @@ private[csv] class CsvOutputWriter(

   private var univocityGenerator: Option[UnivocityGenerator] = None

-  override def write(row: InternalRow): Unit = {
-    val gen = univocityGenerator.getOrElse {
-      val charset = Charset.forName(params.charset)
-      val os = CodecStreams.createOutputStreamWriter(context, new Path(path), charset)
-      val newGen = new UnivocityGenerator(dataSchema, os, params)
-      univocityGenerator = Some(newGen)
-      newGen
-    }
+  if (params.headerFlag) {
+    val gen = getGen()
+    gen.writeHeaders()
+  }
+
+  private def getGen(): UnivocityGenerator = univocityGenerator.getOrElse {
+    val charset = Charset.forName(params.charset)
+    val os = CodecStreams.createOutputStreamWriter(context, new Path(path), charset)
+    val newGen = new UnivocityGenerator(dataSchema, os, params)
+    univocityGenerator = Some(newGen)
+    newGen
+  }
+
+  override def write(row: InternalRow): Unit = {
+    val gen = getGen()
--- End diff --

Yeah we have two different approaches, both of which are fine IMHO. I think it's reasonable to clean that up in a follow-up if desired. WDYT @HyukjinKwon ?
[GitHub] spark issue #23162: [MINOR][DOC] Correct some document description errors
Github user srowen commented on the issue: https://github.com/apache/spark/pull/23162 Merged to master
[GitHub] spark pull request #23178: [SPARK-26216][SQL] Do not use case class as publi...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/23178#discussion_r238062995

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala ---
@@ -38,114 +38,106 @@ import org.apache.spark.sql.types.DataType
  * @since 1.3.0
  */
 @Stable
--- End diff --

I'd go ahead and leave the Since version. The API is essentially unchanged, though there are some marginal breaking compile time changes. But same is true of many things we are changing in 3.0. I've tagged the JIRA with `release-notes` and will add a blurb about the change.
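[Editor's note] For background on why a case class is risky as a public API, a toy example (the class below is illustrative, not the real `UserDefinedFunction`): the generated `apply`/`unapply`/`copy` become binary API, so adding a field later is a breaking change, whereas a plain class with an explicit companion can evolve compatibly.

```scala
// A plain class with a private constructor keeps the evolution surface small:
// no compiler-generated copy/unapply freezes the field list into the API.
class UserFunction private (val name: String, val deterministic: Boolean) {
  // Explicit "copy"-style methods can be added compatibly later.
  def withDeterminism(d: Boolean): UserFunction = new UserFunction(name, d)
}

object UserFunction {
  def apply(name: String): UserFunction =
    new UserFunction(name, deterministic = true)
}
```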
[GitHub] spark issue #23177: [SPARK-26212][Build][test-maven] Upgrade maven version t...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/23177 Merging to master. I've been using 3.6.0 on the command line for a while and it's fine. Note that if you use your local mvn in IntelliJ, it seems to have some incompatibility with the current latest 2018.13.1 release. It's no big deal, falling back to its internal 3.3.9 version works fine.
[GitHub] spark issue #23185: [MINOR][Docs] Fix typos
Github user srowen commented on the issue: https://github.com/apache/spark/pull/23185 Merged to master
[GitHub] spark pull request #23146: [SPARK-26173] [MLlib] Prior regularization for Lo...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/23146#discussion_r237888294

--- Diff: mllib/src/test/scala/org/apache/spark/ml/optim/loss/DifferentiableRegularizationSuite.scala ---
@@ -16,9 +16,13 @@
  */
 package org.apache.spark.ml.optim.loss

+import org.scalactic.{Equality, TolerantNumerics}
--- End diff --

We have ~== etc for approximate comparison
[GitHub] spark pull request #23146: [SPARK-26173] [MLlib] Prior regularization for Lo...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/23146#discussion_r237888645

--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -250,6 +250,66 @@ private[classification] trait LogisticRegressionParams extends ProbabilisticClas
     isSet(lowerBoundsOnIntercepts) || isSet(upperBoundsOnIntercepts)
   }

+  /**
+   * The prior multivariate mean (coefficients) for Maximum A Posteriori (MAP) optimization.
+   * Default is none.
+   *
+   * @group expertParam */
+  @Since("2.4.0")
+  val priorMean: DoubleArrayParam = new DoubleArrayParam(this, "priorMean",
+  "The prior mean used for Prior regularization.")
--- End diff --

These need continuation indents
[GitHub] spark pull request #23146: [SPARK-26173] [MLlib] Prior regularization for Lo...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/23146#discussion_r237888349

--- Diff: mllib/src/test/scala/org/apache/spark/ml/optim/loss/DifferentiableRegularizationSuite.scala ---
@@ -16,9 +16,13 @@
  */
 package org.apache.spark.ml.optim.loss

+import org.scalactic.{Equality, TolerantNumerics}
+
 import org.apache.spark.SparkFunSuite
 import org.apache.spark.ml.linalg.{BLAS, Vectors}
+
--- End diff --

Nit: remove these lines
[GitHub] spark pull request #23146: [SPARK-26173] [MLlib] Prior regularization for Lo...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/23146#discussion_r237888585

--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -250,6 +250,66 @@ private[classification] trait LogisticRegressionParams extends ProbabilisticClas
     isSet(lowerBoundsOnIntercepts) || isSet(upperBoundsOnIntercepts)
   }

+  /**
+   * The prior multivariate mean (coefficients) for Maximum A Posteriori (MAP) optimization.
+   * Default is none.
+   *
+   * @group expertParam */
+  @Since("2.4.0")
--- End diff --

This would have to be 3.0.0
[GitHub] spark pull request #23151: [SPARK-26180][CORE][TEST] Reuse withTempDir funct...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/23151#discussion_r237886340

--- Diff: core/src/test/scala/org/apache/spark/SparkFunSuite.scala ---
@@ -105,5 +105,16 @@ abstract class SparkFunSuite
     logInfo(s"\n\n= FINISHED $shortSuiteName: '$testName' =\n")
   }
 }
-
+  /**
+   * Creates a temporary directory, which is then passed to `f` and will be deleted after `f`
+   * returns.
+   *
+   * @todo Probably this method should be moved to a more general place
+   */
+  protected def withCreateTempDir(f: File => Unit): Unit = {
+    val dir = Utils.createTempDir()
--- End diff --

Yes shouldn't be necessary here.
[GitHub] spark pull request #23151: [SPARK-26180][CORE][TEST] Reuse withTempDir funct...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/23151#discussion_r237886193

--- Diff: core/src/test/scala/org/apache/spark/SparkFunSuite.scala ---
@@ -105,5 +105,16 @@ abstract class SparkFunSuite
     logInfo(s"\n\n= FINISHED $shortSuiteName: '$testName' =\n")
   }
 }
-
+  /**
+   * Creates a temporary directory, which is then passed to `f` and will be deleted after `f`
+   * returns.
+   *
+   * @todo Probably this method should be moved to a more general place
+   */
+  protected def withCreateTempDir(f: File => Unit): Unit = {
--- End diff --

Yes, it seems like we should be able to use an override. The subclass that needs to inject an additional method call in the block can call the super method with a lambda that calls the user-supplied block, then this other method. It's probably worth whatever surgery is needed to make this clean and reduce duplication. We already have a lot of "create temp thing" methods all over.
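[Editor's note] The loan pattern under discussion, reduced to a self-contained sketch (not Spark's `Utils.createTempDir`); a subclass needing an extra post-check can override and wrap the block as suggested above.

```scala
import java.io.File
import java.nio.file.Files

object TempDirDemo {
  private def deleteRecursively(f: File): Unit = {
    Option(f.listFiles()).foreach(_.foreach(deleteRecursively))
    f.delete()
  }

  // Create the temp dir, lend it to the test body, and guarantee cleanup
  // even if the body throws.
  def withTempDir(f: File => Unit): Unit = {
    val dir = Files.createTempDirectory("spark-test").toFile
    try f(dir) finally deleteRecursively(dir)
  }
}
```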
[GitHub] spark pull request #23173: [SPARK-26208][SQL] add headers to empty csv files...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/23173#discussion_r237883918

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala ---
@@ -171,15 +171,21 @@ private[csv] class CsvOutputWriter(

   private var univocityGenerator: Option[UnivocityGenerator] = None

-  override def write(row: InternalRow): Unit = {
-    val gen = univocityGenerator.getOrElse {
-      val charset = Charset.forName(params.charset)
-      val os = CodecStreams.createOutputStreamWriter(context, new Path(path), charset)
-      val newGen = new UnivocityGenerator(dataSchema, os, params)
-      univocityGenerator = Some(newGen)
-      newGen
-    }
+  private def getOrCreateGen(): UnivocityGenerator = univocityGenerator.getOrElse {
+    val charset = Charset.forName(params.charset)
+    val os = CodecStreams.createOutputStreamWriter(context, new Path(path), charset)
+    val newGen = new UnivocityGenerator(dataSchema, os, params)
+    univocityGenerator = Some(newGen)
+    newGen
+  }
+
+  if (params.headerFlag) {
--- End diff --

My only nit here is that this is part of the constructor, but lives between two methods, which is a little less clear. Maybe move it just after the member declarations? Also you could inline this to things like `getOrCreateGen().writeHeaders()`, but doesn't matter.
[GitHub] spark pull request #23173: [SPARK-26208][SQL] add headers to empty csv files...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/23173#discussion_r237884151

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala ---
@@ -171,15 +171,21 @@ private[csv] class CsvOutputWriter(

   private var univocityGenerator: Option[UnivocityGenerator] = None

-  override def write(row: InternalRow): Unit = {
-    val gen = univocityGenerator.getOrElse {
-      val charset = Charset.forName(params.charset)
-      val os = CodecStreams.createOutputStreamWriter(context, new Path(path), charset)
-      val newGen = new UnivocityGenerator(dataSchema, os, params)
-      univocityGenerator = Some(newGen)
-      newGen
-    }
+  private def getOrCreateGen(): UnivocityGenerator = univocityGenerator.getOrElse {
--- End diff --

This is a really small thing, I don't feel strongly about, but what about just `getGen()`? the caller doesn't care whether it's created.
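[Editor's note] The lazy-initialization pattern being reviewed, in miniature (a toy `CsvWriter`, not the real `CsvOutputWriter`): cache the generator on first use so that opening the output is deferred, while the constructor can still write the header eagerly.

```scala
class CsvWriter(headerFlag: Boolean) {
  private var gen: Option[StringBuilder] = None

  // Cache the generator on first use; whichever call needs it first pays
  // the (here, trivial) creation cost.
  private def getGen(): StringBuilder = gen.getOrElse {
    val newGen = new StringBuilder
    gen = Some(newGen)
    newGen
  }

  // Constructor statement: write the header eagerly if requested. Placing it
  // right after the member declarations keeps the constructor logic visible,
  // per the review comment above.
  if (headerFlag) getGen().append("col1,col2\n")

  def write(row: String): Unit = getGen().append(row).append('\n')

  override def toString: String = gen.map(_.toString).getOrElse("")
}
```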
[GitHub] spark pull request #23177: [SPARK-26212][Build][test-maven] Upgrade maven ve...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/23177#discussion_r237881557

--- Diff: pom.xml ---
@@ -114,7 +114,7 @@
     1.8
     ${java.version}
     ${java.version}
-    3.5.4
+    3.6.0
--- End diff --

That wouldn't hurt, though I suspect the project continues to build with Maven 3.5.4 for a long while.
[GitHub] spark pull request #23162: [MINOR][DOC] Correct some document description er...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/23162#discussion_r237880921

--- Diff: core/src/main/scala/org/apache/spark/internal/config/package.scala ---
@@ -513,7 +513,7 @@ package object config {
         "is written in unsafe shuffle writer. In KiB unless otherwise specified.")
       .bytesConf(ByteUnit.KiB)
       .checkValue(v => v > 0 && v <= ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH / 1024,
-        s"The buffer size must be greater than 0 and less than" +
+        s"The buffer size must be positive and not greater than" +
--- End diff --

Yeah, I could go either way on those. I wouldn't mind standardizing on "less than or equal to", sure. @10110346 would you mind taking one more pass accordingly?
[GitHub] spark issue #23148: [SPARK-26177] Automated formatting for Scala code
Github user srowen commented on the issue: https://github.com/apache/spark/pull/23148 Yeah we'd need a new PR. If you collect a few good improvements just open a follow up. I was testing by just making a dummy change in a few files and seeing what it did. It's OK as-is, even.
[GitHub] spark issue #23177: [SPARK-26212][Build][test-maven] Upgrade maven version t...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/23177 Ah, that's the second time I've forgotten this. Yes looks good to me.
[GitHub] spark issue #23126: [SPARK-26158] [MLLIB] fix covariance accuracy problem fo...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/23126 Merged to master
[GitHub] spark issue #23148: [SPARK-26177] Automated formatting for Scala code
Github user srowen commented on the issue: https://github.com/apache/spark/pull/23148

I played with this a little locally, and yeah, it does reformat entire files that are in the diff, and most of what it does is fix stuff we probably wouldn't ask for in a PR review. For example, is it possible to disable its preference for putting closing braces and parens on the next line? As in the comment at https://github.com/apache/spark/pull/23148#issuecomment-442243410 . And maybe make it less aggressive about putting args to a method each on their own line. There isn't a mode that would somehow just reformat lines that are being changed, BTW?

This is already useful in that we can just ask people to run dev/scalafmt (I'll update the developer guides), as the output style also looks just fine. I won't try to have this automatically add the formatting to the build.
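For readers who haven't seen scalafmt output, here is a contrived before/after of the two default behaviors being questioned; this is an illustration of the style, not Spark code:

```scala
// Before formatting (hand-written): closing paren stays on the last arg line.
//   def process(input: Seq[String],
//               limit: Int): Seq[String] = input.take(limit)

// After scalafmt's default style: each argument on its own line, with the
// closing paren "dangling" on the next line -- the preferences asked about.
def process(
    input: Seq[String],
    limit: Int
): Seq[String] = input.take(limit)
```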
[GitHub] spark issue #23145: [MINOR][Docs][WIP] Fix Typos
Github user srowen commented on the issue: https://github.com/apache/spark/pull/23145

I think this is fine to merge; this is a good batch of grammar fixes.
[GitHub] spark pull request #23145: [MINOR][Docs][WIP] Fix Typos
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/23145#discussion_r237564429

--- Diff: docs/structured-streaming-programming-guide.md ---
@@ -1634,7 +1634,7 @@ returned through `Dataset.writeStream()`. You will have to specify one or more o

 - *Query name:* Optionally, specify a unique name of the query for identification.

-- *Trigger interval:* Optionally, specify the trigger interval. If it is not specified, the system will check for availability of new data as soon as the previous processing has completed. If a trigger time is missed because the previous processing has not completed, then the system will trigger processing immediately.
+- *Trigger interval:* Optionally, specify the trigger interval. If it is not specified, the system will check for availability of new data as soon as the previous processing has been completed. If a trigger time is missed because the previous processing has not been completed, then the system will trigger processing immediately.
--- End diff --

"has completed" is actually correct, but the passive voice here is grammatical too.
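For readers less familiar with the doc passage being edited, this is what setting or omitting the trigger interval looks like in the Structured Streaming API; the socket source and console sink below are placeholders for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().appName("trigger-demo").getOrCreate()
val lines = spark.readStream.format("socket")
  .option("host", "localhost").option("port", "9999").load()

// With an explicit trigger interval: a micro-batch starts every 10 seconds;
// if a batch overruns its slot, the next one starts immediately afterward.
val query = lines.writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("10 seconds"))
  .start()

// Without .trigger(...), the default described in the guide applies: check
// for new data as soon as the previous micro-batch has completed.
query.awaitTermination()
```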
[GitHub] spark pull request #23162: [MINOR][DOC] Correct some document description er...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/23162#discussion_r237563687

--- Diff: core/src/main/scala/org/apache/spark/internal/config/package.scala ---
@@ -503,7 +503,7 @@ package object config {
       "made in creating intermediate shuffle files.")
     .bytesConf(ByteUnit.KiB)
     .checkValue(v => v > 0 && v <= ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH / 1024,
-      s"The file buffer size must be greater than 0 and less than" +
+      s"The file buffer size must be greater than 0 and not greater than" +
--- End diff --

You can say "positive" instead of "greater than 0" if you like, as above.
[GitHub] spark pull request #23162: [MINOR][DOC] Correct some document description er...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/23162#discussion_r237563435

--- Diff: core/src/main/scala/org/apache/spark/internal/config/package.scala ---
@@ -430,8 +430,8 @@ package object config {
     .doc("The chunk size in bytes during writing out the bytes of ChunkedByteBuffer.")
     .bytesConf(ByteUnit.BYTE)
     .checkValue(_ <= ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH,
-      "The chunk size during writing out the bytes of" +
-        " ChunkedByteBuffer should not larger than Int.MaxValue - 15.")
+      "The chunk size during writing out the bytes of ChunkedByteBuffer should" +
+        s" not larger than ${ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH}.")
--- End diff --

"not larger than" => "be at most", or => "not be greater than"
[GitHub] spark issue #23148: [SPARK-26177] Automated formatting for Scala code
Github user srowen commented on the issue: https://github.com/apache/spark/pull/23148

Ah, I see. I can add that call in a follow-up to enable it and see how we like it.
[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/23052

Merged to master
[GitHub] spark pull request #23173: [SPARK-26208][SQL] add headers to empty csv files...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/23173#discussion_r237561245

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/OutputWriter.scala ---
@@ -57,6 +57,9 @@ abstract class OutputWriterFactory extends Serializable {
  * executor side. This instance is used to persist rows to this single output file.
  */
 abstract class OutputWriter {
+  /** Initializes before writing any rows. Invoked on the executor side. */
+  def init(): Unit
--- End diff --

Rather than make subclasses implement this as a no-op, just provide that no-op impl here?
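A sketch of the alternative being suggested: give `init()` an empty default body in the base class, so only writers that actually need initialization override it. The other members are abbreviated from the class in the diff, not the full Spark definition:

```scala
import org.apache.spark.sql.catalyst.InternalRow

// Sketch of the suggestion: a default no-op init() in the abstract class.
abstract class OutputWriter {
  /** Initializes before writing any rows. Invoked on the executor side.
   *  Default is a no-op, so subclasses override only when needed. */
  def init(): Unit = {}

  /** Persists a row to this single output file. */
  def write(row: InternalRow): Unit

  /** Closes the writer after all rows are written. */
  def close(): Unit
}
```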
[GitHub] spark pull request #23151: [SPARK-26180][CORE][TEST] Add a withCreateTempDir...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/23151#discussion_r237559695

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala ---
@@ -66,6 +66,20 @@ private[sql] trait SQLTestUtils extends SparkFunSuite with SQLTestUtilsBase with
     }
   }

+  /**
+   * Creates a temporary directory, which is then passed to `f` and will be deleted after `f`
+   * returns.
+   */
+  protected override def withTempDir(f: File => Unit): Unit = {
+    val dir = Utils.createTempDir().getCanonicalFile
+    try f(dir) finally {
--- End diff --

Why not call the super method with a function that calls f, then waitForTasksToFinish()?
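Roughly what the review is proposing, assuming `waitForTasksToFinish()` is the helper available in this trait as the comment implies:

```scala
import java.io.File

private[sql] trait SQLTestUtils extends SparkFunSuite with SQLTestUtilsBase {
  // Suggested shape: delegate the temp-dir lifecycle (create/delete) to the
  // parent implementation and add only the extra wait before cleanup.
  protected override def withTempDir(f: File => Unit): Unit = {
    super.withTempDir { dir =>
      f(dir)
      waitForTasksToFinish() // let outstanding tasks finish before deletion
    }
  }
}
```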
[GitHub] spark pull request #23151: [SPARK-26180][CORE][TEST] Add a withCreateTempDir...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/23151#discussion_r237559321

--- Diff: core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala ---
@@ -494,13 +494,12 @@ class SparkSubmitSuite
   }

   test("launch simple application with spark-submit with redaction") {
-    val testDir = Utils.createTempDir()
-    testDir.deleteOnExit()
-    val testDirPath = new Path(testDir.getAbsolutePath())
     val unusedJar = TestUtils.createJarWithClasses(Seq.empty)
     val fileSystem = Utils.getHadoopFileSystem("/",
       SparkHadoopUtil.get.newConfiguration(new SparkConf()))
-    try {
+    withTempDir { testDir =>
+      testDir.deleteOnExit()
--- End diff --

Although I think this is redundant for temp dirs, you can put this in the Utils.createTempDir method and take it out in places like this.
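A sketch of where that registration could move. The signature and helper below are approximated from how `Utils.createTempDir()` is called in the diff; treat both as assumptions, not the actual Spark method (and note the comment's caveat that this is likely redundant, since temp dirs are already cleaned up on shutdown):

```scala
import java.io.File

// Illustrative only: register deleteOnExit() once inside the factory so
// individual call sites no longer need their own testDir.deleteOnExit().
def createTempDir(
    root: String = System.getProperty("java.io.tmpdir"),
    namePrefix: String = "spark"): File = {
  val dir = createDirectory(root, namePrefix) // assumed helper creating a unique dir
  dir.deleteOnExit() // registered once here, removed from callers
  dir
}
```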
[GitHub] spark issue #23177: [SPARK-26212][Build][test-maven] Upgrade maven version t...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/23177

That's fine, but we also need to update build/mvn to download and use the same version.
[GitHub] spark pull request #23126: [SPARK-26158] [MLLIB] fix covariance accuracy pro...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/23126#discussion_r237556042

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala ---
@@ -128,6 +128,82 @@ class RowMatrix @Since("1.0.0") (
     RowMatrix.triuToFull(n, GU.data)
   }

+  private def computeDenseVectorCovariance(mean: Vector, n: Int, m: Long): Matrix = {
+
+    val bc = rows.context.broadcast(mean)
+
+    // Computes n*(n+1)/2, avoiding overflow in the multiplication.
+    // This succeeds when n <= 65535, which is checked above
+    val nt = if (n % 2 == 0) ((n / 2) * (n + 1)) else (n * ((n + 1) / 2))
+
+    val MU = rows.treeAggregate(new BDV[Double](nt))(
+      seqOp = (U, v) => {
+
+        val n = v.size
+        val na = Array.ofDim[Double](n)
+        val means = bc.value
+        if (v.isInstanceOf[DenseVector]) {
+          v.foreachActive{(index, value) =>
--- End diff --

Yeah, because it hasn't subtracted the mean from one of the values in the Spark vector. That's the general issue with centering a sparse vector: it becomes dense!
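To illustrate the point about centering (the numbers here are made up): once you subtract a nonzero mean, every implicit zero in a sparse vector becomes nonzero, so the centered result is effectively dense.

```scala
import org.apache.spark.mllib.linalg.Vectors

// A mostly-zero vector: only indices 1 and 3 are stored.
val v = Vectors.sparse(5, Array(1, 3), Array(2.0, 4.0))
val mean = 1.2 // made-up column mean

// Subtracting the mean touches every slot, including the implicit zeros:
val centered = v.toArray.map(_ - mean)
// centered = Array(-1.2, 0.8, -1.2, 2.8, -1.2) -- no zeros left, i.e. dense
```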
[GitHub] spark pull request #23126: [SPARK-26158] [MLLIB] fix covariance accuracy pro...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/23126#discussion_r237541497

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala ---
@@ -128,6 +128,82 @@ class RowMatrix @Since("1.0.0") (
     RowMatrix.triuToFull(n, GU.data)
   }

+  private def computeDenseVectorCovariance(mean: Vector, n: Int, m: Long): Matrix = {
+
+    val bc = rows.context.broadcast(mean)
+
+    // Computes n*(n+1)/2, avoiding overflow in the multiplication.
+    // This succeeds when n <= 65535, which is checked above
+    val nt = if (n % 2 == 0) ((n / 2) * (n + 1)) else (n * ((n + 1) / 2))
+
+    val MU = rows.treeAggregate(new BDV[Double](nt))(
+      seqOp = (U, v) => {
+
+        val n = v.size
+        val na = Array.ofDim[Double](n)
+        val means = bc.value
+        if (v.isInstanceOf[DenseVector]) {
+          v.foreachActive{(index, value) =>
--- End diff --

But isn't it incorrect to not subtract the mean from 0 elements in a sparse vector anyway?