[GitHub] spark issue #21931: [SPARK-24978][SQL]Add spark.sql.fast.hash.aggregate.row....
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/21931 LGTM, cc @cloud-fan @hvanhovell --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21859: [SPARK-24900][SQL]Speed up sort when the dataset is smal...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/21859 retest this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21087: [SPARK-23997][SQL] Configurable maximum number of...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/21087#discussion_r211080067 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala --- @@ -164,9 +165,12 @@ case class BucketSpec( numBuckets: Int, bucketColumnNames: Seq[String], sortColumnNames: Seq[String]) { - if (numBuckets <= 0 || numBuckets >= 10) { + def conf: SQLConf = SQLConf.get + + if (numBuckets <= 0 || numBuckets > conf.bucketingMaxBuckets) { --- End diff -- Since the condition is changed from `>` to `>=`, there is inconsistent between the condition and the error message. If this condition is true, the message is like `... but less than or equal to bucketing.maxBuckets ...`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21860: [SPARK-24901][SQL]Merge the codegen of RegularHashMap an...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/21860 cc @hvanhovell --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20637: [SPARK-23466][SQL] Remove redundant null checks in gener...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/20637 cc @ueshin @cloud-fan --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20637: [SPARK-23466][SQL] Remove redundant null checks in gener...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/20637 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22135: [SPARK-25093][SQL] Avoid recompiling regexp for comments...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/22135 SGTM, but can we address the similar issues at once? Even under `src/main/...`, we can see this pattern at several places. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22125: [DOCS] Fix cloud-integration.md Typo
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/22125 @KraFusion Sorry, I overlooked another PR. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22048: [SPARK-25108][SQL] Fix the show method to display the wi...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/22048 Thank you for creating a JIRA entry and for putting the result. The test case is not available yet. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22126: [SPARK-23938][SQL][FOLLOW-UP][TEST] Nullabilities of val...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/22126 LGTM --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22125: [DOCS] Fix cloud-integration.md Typo
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/22125 Thanks, would it possible to address similar issues? For example, in `configurations.md`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21537: [SPARK-24505][SQL] Convert strings in codegen to blocks:...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/21537 @gatorsmile Thank you for your reply. Could you elaborate on your suggestion? >A general suggestion. To avoid introducing the regressions, how about implementing a new one without changing the existing one? We can easily switch to the new one when it becomes stable. Does it mean to work in a particular branch or to work in a fork repository until its implementation has been completed ? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21537: [SPARK-24505][SQL] Convert strings in codegen to blocks:...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/21537 For 2. and 3., it is harder to say my opinion in the comment. Let me say short comments at first. For 2., if I remember correctly, @viirya once wrote the API document in a JIRA entry. it would be good to update and add some thoughts about design as a first step. I understand that it is hard to keep the up-to-date design document, in particular, during the open-source development. This is because we have a lot of excellent comments in a PR. For 3., at the early implementation of SQL codegen (i.e. use `s""` to represent code), I thought there are two problems. 1. lack of type of an expression (in other words, `ExprCode` did not have the type of `value`) 2. lack of a structure of statements Now, we meet a problem that it is hard to rewrite a method argument due to problem 1. In my understanding, the current effort led by @viirya is trying to resolve problem 1. It is hard to rewrite a set of statements due to Problem 2. To resolve problem 2, we need more effort. In my opinion, we are addressing them step by step. Of course, it would be happy for me to co-lead a discussion of the IR design for 2. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21537: [SPARK-24505][SQL] Convert strings in codegen to blocks:...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/21537 Thank for involving me in an important thread. I was busy this morning in Japan. I think there are three topics in the thread. 1. Merge or revert this PR 2. Design document 3. IR design For 1., in short, my opinion is likely to revert this PR from the view like a release manager. As we know, it is a time to cut a release branch. This PR partially introduce a new representation. If there would be a bug in code generation at Spark 2.4, it may introduce a complication of debugging and fixing. As @mgaido91 pointed out, #20043 and #21026 have been merged. I think that they are a kind of refactoring (e.g. change the representation of literal, class, and so on). Thus, these two PRs can be there. However, this PR introduces a mixture of representation `s""` and `code""`. WDYT? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21912: [SPARK-24962][SQL] Refactor CodeGenerator.createUnsafeAr...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/21912 cc @ueshin --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22110: [SPARK-25122][SQL] Deduplication of supports equa...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/22110#discussion_r210282476 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/AbstractDataType.scala --- @@ -115,6 +115,8 @@ protected[sql] abstract class AtomicType extends DataType { private[sql] type InternalType private[sql] val tag: TypeTag[InternalType] private[sql] val ordering: Ordering[InternalType] + + private[spark] override def supportsEquals: Boolean = true --- End diff -- +1 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22103: [SPARK-25113][SQL] Add logging to CodeGenerator w...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/22103#discussion_r210152479 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala --- @@ -1385,9 +1386,15 @@ object CodeGenerator extends Logging { try { val cf = new ClassFile(new ByteArrayInputStream(classBytes)) val stats = cf.methodInfos.asScala.flatMap { method => - method.getAttributes().filter(_.getClass.getName == codeAttr.getName).map { a => + method.getAttributes().filter(_.getClass eq codeAttr).map { a => --- End diff -- Yeah, I know The current comparison is more strict. Although the previous comparison was only for name, the current comparison is for a pair of class loader and name. I worried whether the strictness may change behavior. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22105: [SPARK-25115] [Core] Eliminate extra memory copy ...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/22105#discussion_r210056504 --- Diff: common/network-common/src/main/java/org/apache/spark/network/protocol/MessageWithHeader.java --- @@ -140,8 +140,24 @@ private int copyByteBuf(ByteBuf buf, WritableByteChannel target) throws IOExcept // SPARK-24578: cap the sub-region's size of returned nio buffer to improve the performance // for the case that the passed-in buffer has too many components. int length = Math.min(buf.readableBytes(), NIO_BUFFER_LIMIT); -ByteBuffer buffer = buf.nioBuffer(buf.readerIndex(), length); -int written = target.write(buffer); +// If the ByteBuf holds more then one ByteBuffer we should better call nioBuffers(...) +// to eliminate extra memory copies. +int written = 0; +if (buf.nioBufferCount() == 1) { + ByteBuffer buffer = buf.nioBuffer(buf.readerIndex(), length); + written = target.write(buffer); +} else { + ByteBuffer[] buffers = buf.nioBuffers(buf.readerIndex(), length); + for (ByteBuffer buffer: buffers) { +int remaining = buffer.remaining(); +int w = target.write(buffer); +written += w; --- End diff -- Can we guarantee `written` does not overflow while we accumulate `int` values? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22101: [SPARK-25114][Core] Fix RecordBinaryComparator when subt...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/22101 LGTM --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22103: [SPARK-25113][SQL] Add logging to CodeGenerator w...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/22103#discussion_r209995159 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala --- @@ -1385,9 +1386,15 @@ object CodeGenerator extends Logging { try { val cf = new ClassFile(new ByteArrayInputStream(classBytes)) val stats = cf.methodInfos.asScala.flatMap { method => - method.getAttributes().filter(_.getClass.getName == codeAttr.getName).map { a => + method.getAttributes().filter(_.getClass eq codeAttr).map { a => --- End diff -- Why do we need change this condition? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22103: [SPARK-25113][SQL] Add logging to CodeGenerator w...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/22103#discussion_r209993118 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala --- @@ -1385,9 +1386,15 @@ object CodeGenerator extends Logging { try { val cf = new ClassFile(new ByteArrayInputStream(classBytes)) val stats = cf.methodInfos.asScala.flatMap { method => - method.getAttributes().filter(_.getClass.getName == codeAttr.getName).map { a => + method.getAttributes().filter(_.getClass eq codeAttr).map { a => val byteCodeSize = codeAttrField.get(a).asInstanceOf[Array[Byte]].length CodegenMetrics.METRIC_GENERATED_METHOD_BYTECODE_SIZE.update(byteCodeSize) + +if (byteCodeSize > DEFAULT_JVM_HUGE_METHOD_LIMIT) { + logInfo("Generated method too long to be JIT compiled: " + --- End diff -- nit: `Generated method is too long ...`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19222: [SPARK-10399][SPARK-23879][CORE][SQL] Introduce multiple...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/19222 ping @rednaxelafx --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21912: [SPARK-24962][SQL] Refactor CodeGenerator.createU...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/21912#discussion_r209878827 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala --- @@ -735,70 +735,98 @@ class CodegenContext { } /** - * Generates code creating a [[UnsafeArrayData]]. + * Generates code creating a [[UnsafeArrayData]] or [[GenericArrayData]] based on + * given parameters. * * @param arrayName name of the array to create + * @param elementType data type of the elements in source array * @param numElements code representing the number of elements the array should contain - * @param elementType data type of the elements in the array * @param additionalErrorMessage string to include in the error message + * @param elementSize optional value which shows the size of an element of the allocated + *[[UnsafeArrayData]] or [[GenericArrayData]] + * + * @return code representing the allocation of [[ArrayData]] + * code representing a setter of an assignment for the generated array */ - def createUnsafeArray( + def createArrayData( --- End diff -- I think that it would be good to have a utility method to generate this call. WDYT? On the other hand, I agree that we have a method `setArrayElementFunc` or `setArrayElement`. Thus, we can split this method into two methods that have only one return value. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21912: [SPARK-24962][SQL] Refactor CodeGenerator.createU...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/21912#discussion_r209877426 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayData.scala --- @@ -34,6 +36,37 @@ object ArrayData { case a: Array[Double] => UnsafeArrayData.fromPrimitiveArray(a) case other => new GenericArrayData(other) } + + + /** + * Allocate [[UnsafeArrayData]] or [[GenericArrayData]] based on given parameters. + * + * @param elementSize a size of an element in bytes + * @param numElements the number of elements the array should contain + * @param isPrimitiveType whether the type of an element is primitive type + * @param additionalErrorMessage string to include in the error message + */ + def allocateArrayData( + elementSize: Int, + numElements : Long, + isPrimitiveType: Boolean, + additionalErrorMessage: String) : ArrayData = { +val arraySize = UnsafeArrayData.calculateSizeOfUnderlyingByteArray(numElements, elementSize) +if (isPrimitiveType && !UnsafeArrayData.shouldUseGenericArrayData(elementSize, numElements)) { --- End diff -- When `UnsafeArrayData` can be used, `GenericArrayData` is also used. However, if the element size is large, `GenericArrayData` should be used. But, `UnsafeArrayData` cannot be used. Thus, I think that it would be good to use the current name `shouldUseGenericArrayData`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22101: [SPARK-23207][Core][FOLLOWUP] Fix RecordBinaryCom...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/22101#discussion_r209863964 --- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/execution/RecordBinaryComparator.java --- @@ -42,16 +42,16 @@ public int compare( while ((leftOff + i) % 8 != 0 && i < leftLen) { res = (Platform.getByte(leftObj, leftOff + i) & 0xff) - (Platform.getByte(rightObj, rightOff + i) & 0xff); -if (res != 0) return res; +if (res != 0) return (int) res; i += 1; } } // for architectures that support unaligned accesses, chew it up 8 bytes at a time if (Platform.unaligned() || (((leftOff + i) % 8 == 0) && ((rightOff + i) % 8 == 0))) { while (i <= leftLen - 8) { -res = (int) ((Platform.getLong(leftObj, leftOff + i) - -Platform.getLong(rightObj, rightOff + i)) % Integer.MAX_VALUE); -if (res != 0) return res; +res = Platform.getLong(leftObj, leftOff + i) - +Platform.getLong(rightObj, rightOff + i); +if (res != 0) return res > 0 ? 1 : -1; --- End diff -- +1 for no subtraction. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22101: [SPARK-23207][Core][FOLLOWUP] Fix RecordBinaryCom...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/22101#discussion_r209855589 --- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/execution/RecordBinaryComparator.java --- @@ -60,7 +60,7 @@ public int compare( while (i < leftLen) { res = (Platform.getByte(leftObj, leftOff + i) & 0xff) - (Platform.getByte(rightObj, rightOff + i) & 0xff); - if (res != 0) return res; + if (res != 0) return (int) res; --- End diff -- ditto --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22101: [SPARK-23207][Core][FOLLOWUP] Fix RecordBinaryCom...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/22101#discussion_r209855528 --- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/execution/RecordBinaryComparator.java --- @@ -42,16 +42,16 @@ public int compare( while ((leftOff + i) % 8 != 0 && i < leftLen) { res = (Platform.getByte(leftObj, leftOff + i) & 0xff) - (Platform.getByte(rightObj, rightOff + i) & 0xff); -if (res != 0) return res; +if (res != 0) return (int) res; --- End diff -- How about the following change to minimize and localize the change? ``` int res = (Platform.getByte(leftObj, leftOff + i) & 0xff) - (Platform.getByte(rightObj, rightOff + i) & 0xff); if (res != 0) return res; ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22101: [SPARK-23207][Core][FOLLOWUP] Fix RecordBinaryCom...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/22101#discussion_r209855135 --- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/execution/RecordBinaryComparator.java --- @@ -42,16 +42,16 @@ public int compare( while ((leftOff + i) % 8 != 0 && i < leftLen) { res = (Platform.getByte(leftObj, leftOff + i) & 0xff) - (Platform.getByte(rightObj, rightOff + i) & 0xff); -if (res != 0) return res; +if (res != 0) return (int) res; i += 1; } } // for architectures that support unaligned accesses, chew it up 8 bytes at a time if (Platform.unaligned() || (((leftOff + i) % 8 == 0) && ((rightOff + i) % 8 == 0))) { while (i <= leftLen - 8) { -res = (int) ((Platform.getLong(leftObj, leftOff + i) - -Platform.getLong(rightObj, rightOff + i)) % Integer.MAX_VALUE); -if (res != 0) return res; +res = Platform.getLong(leftObj, leftOff + i) - +Platform.getLong(rightObj, rightOff + i); +if (res != 0) return res > 0 ? 1 : -1; --- End diff -- How about the following change to minimize and localize the change? ``` long res = Platform.getLong(leftObj, leftOff + i) - Platform.getLong(rightObj, rightOff + i); if (res != 0) return res > 0 ? 1 : -1; ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22001: [SPARK-24819][CORE] Fail fast when no enough slots to la...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/22001 Just curious. It is very interesting to me since the recent three tries consistently cause a timeout failure at the same test. https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94687 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94705 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94716 In addition, other PRs look successful without timeout. ``` [info] - abort the job if total size of results is too large (1 second, 122 milliseconds) Exception in thread "task-result-getter-3" java.lang.Error: java.lang.InterruptedException at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1148) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.InterruptedException at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:998) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304) at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:206) at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:222) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227) at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:220) at org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:115) at org.apache.spark.storage.BlockManager.getRemoteBytes(BlockManager.scala:701) at org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:82) at org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$1.apply(TaskResultGetter.scala:63) at org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$1.apply(TaskResultGetter.scala:63) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1932) at org.apache.spark.scheduler.TaskResultGetter$$anon$3.run(TaskResultGetter.scala:62) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ... 2 more ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21912: [SPARK-24962][SQL] Refactor CodeGenerator.createUnsafeAr...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/21912 cc @ueshin @cloud-fan --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22048: Fix the show method to display the wide character alignm...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/22048 @xuejianbest Could you please create a JIRA entry and add test cases to this PR? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20637: [SPARK-23466][SQL] Remove redundant null checks i...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/20637#discussion_r209692496 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala --- @@ -43,25 +43,29 @@ object GenerateUnsafeProjection extends CodeGenerator[Seq[Expression], UnsafePro case _ => false } - // TODO: if the nullability of field is correct, we can use it to save null check. private def writeStructToBuffer( ctx: CodegenContext, input: String, index: String, - fieldTypes: Seq[DataType], + fieldTypeAndNullables: Seq[(DataType, Boolean)], --- End diff -- @cloud-fan I found this case class `case class Schema(dataType: DataType, nullable: Boolean)` at two places. 1. `ScalaReflection.Schema` 1. `SchemaConverters.SchemaType` Do we use one of them? Or do we define `org.apache.spark.sql.types.Schema`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21912: [SPARK-24962][SQL] Refactor CodeGenerator.createU...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/21912#discussion_r209643599 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala --- @@ -735,70 +735,100 @@ class CodegenContext { } /** - * Generates code creating a [[UnsafeArrayData]]. + * Generates code creating a [[UnsafeArrayData]] or [[GenericArrayData]] based on + * given parameters. * * @param arrayName name of the array to create + * @param elementType data type of the elements in source array * @param numElements code representing the number of elements the array should contain - * @param elementType data type of the elements in the array * @param additionalErrorMessage string to include in the error message + * @param elementSize optional value which shows the size of an element of the allocated + *[[UnsafeArrayData]] or [[GenericArrayData]] + * + * @return code representing the allocation of [[ArrayData]] + * code representing a setter of an assignment for the generated array */ - def createUnsafeArray( + def createArrayData( arrayName: String, - numElements: String, elementType: DataType, - additionalErrorMessage: String): String = { -val arraySize = freshName("size") -val arrayBytes = freshName("arrayBytes") + numElements: String, + additionalErrorMessage: String, + elementSize: Option[Int] = None): (String, String) = { --- End diff -- Yeah, they are exclusive. However, when `elementType` is not used, the integer value is required. `MapEntries` stores `UnsafeRow` instead of primitive data. [This](https://github.com/apache/spark/blob/c15630014c7ef850fa21c1247fa5edd2b1bac81b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L405) requires non-fixed size unlike `elementType.defaultSize`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22053: [SPARK-25069][CORE]Using UnsafeAlignedOffset to make the...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/22053 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22083: [SQL][Test][Minor] Add missing codes to ParquetCo...
GitHub user kiszk opened a pull request: https://github.com/apache/spark/pull/22083 [SQL][Test][Minor] Add missing codes to ParquetCompressionCodecPrecedenceSuite ## What changes were proposed in this pull request? This PR adds codes to ``"Test `spark.sql.parquet.compression.codec` config"` in `ParquetCompressionCodecPrecedenceSuite`. ## How was this patch tested? Existing UTs You can merge this pull request into a Git repository by running: $ git pull https://github.com/kiszk/spark ParquetCompressionCodecPrecedenceSuite Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22083.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22083 commit 974224b495fc8155b9a6a73673ea478ccdb164d6 Author: Kazuaki Ishizaki Date: 2018-08-12T07:08:42Z update --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22082: [SPARK-24420][Build][FOLLOW-UP] Upgrade ASM6 APIs
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/22082 LGTM --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22007: [SPARK-25033] Bump Apache commons.{httpclient, httpcore}
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/22007 LGTM --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22007: [SPARK-25033] Bump Apache commons.{httpclient, httpcore}
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/22007 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20637: [SPARK-23466][SQL] Remove redundant null checks i...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/20637#discussion_r209416053 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala --- @@ -43,25 +43,29 @@ object GenerateUnsafeProjection extends CodeGenerator[Seq[Expression], UnsafePro case _ => false } - // TODO: if the nullability of field is correct, we can use it to save null check. private def writeStructToBuffer( ctx: CodegenContext, input: String, index: String, - fieldTypes: Seq[DataType], + fieldTypeAndNullables: Seq[(DataType, Boolean)], --- End diff -- @cloud-fan What name of the case class do you suggest? `DataTypeNullable`, or others? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22068: [MINOR][DOC]Add missing compression codec .
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/22068 Thanks. BTW, I found another instance in test, not in doc. Do we address this in this PR? Or, do we address in another PR? @HyukjinKwon WDYT ? ``` class ParquetCompressionCodecPrecedenceSuite extends ParquetTest with SharedSQLContext { test("Test `spark.sql.parquet.compression.codec` config") { Seq("NONE", "UNCOMPRESSED", "SNAPPY", "GZIP", "LZO").foreach { c => withSQLConf(SQLConf.PARQUET_COMPRESSION.key -> c) { val expected = if (c == "NONE") "UNCOMPRESSED" else c val option = new ParquetOptions(Map.empty[String, String], spark.sessionState.conf) assert(option.compressionCodecClassName == expected) } } } ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22016: Fix typos
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/22016 #22070 addresses more typo. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22070: Fix typos detected by github.com/client9/misspell
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/22070 LGTM --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22067: [SPARK-25084][SQL] distribute by on multiple columns may...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/22067 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22069: [MINOR][DOC] Fix Java example code in Column's comments
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/22069 Do we need to update the following similar examples, too? Column.scala ``` * {{{ * // Example: encoding gender string column into integer. * * // Scala: * people.select(when(people("gender") === "male", 0) * .when(people("gender") === "female", 1) * .otherwise(2)) * * // Java: * people.select(when(col("gender").equalTo("male"), 0) * .when(col("gender").equalTo("female"), 1) * .otherwise(2)) * }}} ``` functions.scala ``` * {{{ * // Example: encoding gender string column into integer. * * // Scala: * people.select(when(people("gender") === "male", 0) * .when(people("gender") === "female", 1) * .otherwise(2)) * * // Java: * people.select(when(col("gender").equalTo("male"), 0) * .when(col("gender").equalTo("female"), 1) * .otherwise(2)) * }}} ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22068: [MINOR][DOC]Add missing compression codec .
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/22068 Would it be better to update a comment in `DataFrameWriter.scala`, too? ``` * `compression` (default is the value specified in `spark.sql.parquet.compression.codec`): * compression codec to use when saving to file. This can be one of the known case-insensitive * shorten names(`none`, `snappy`, `gzip`, and `lzo`). This will override * `spark.sql.parquet.compression.codec`. ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22053: [SPARK-25069][CORE]Using UnsafeAlignedOffset to make the...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/22053 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20637: [SPARK-23466][SQL] Remove redundant null checks i...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/20637#discussion_r209180525 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala --- @@ -43,25 +43,29 @@ object GenerateUnsafeProjection extends CodeGenerator[Seq[Expression], UnsafePro case _ => false } - // TODO: if the nullability of field is correct, we can use it to save null check. private def writeStructToBuffer( ctx: CodegenContext, input: String, index: String, - fieldTypes: Seq[DataType], + fieldTypeAndNullables: Seq[(DataType, Boolean)], --- End diff -- I think that it would be good since it is used at `JavaTypeInference` and `higherOrderFunctions`. cc @ueshin --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20637: [SPARK-23466][SQL] Remove redundant null checks i...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/20637#discussion_r209178573 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala --- @@ -170,6 +174,23 @@ object GenerateUnsafeProjection extends CodeGenerator[Seq[Expression], UnsafePro val element = CodeGenerator.getValue(tmpInput, et, index) +val primitiveTypeName = if (CodeGenerator.isPrimitiveType(jt)) { --- End diff -- good catch --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22053: [SPARK-25069][CORE]Using UnsafeAlignedOffset to make the...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/22053 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22053: [SPARK-25069][CORE]Using UnsafeAlignedOffset to make the...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/22053 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21994: [SPARK-24529][Build][test-maven][follow-up] Add s...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/21994#discussion_r209147728 --- Diff: pom.xml --- @@ -2609,6 +2609,28 @@ + +com.github.spotbugs +spotbugs-maven-plugin --- End diff -- let me check the elapsed time on my environment. +1 for holding on for now. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22053: [SPARK-25069][CORE]Using UnsafeAlignedOffset to make the...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/22053 I think that this is not a data correctness issue. This may cause unexpected program abort due to hardware memory access error. BTW, it would be good to backport it to increase stability. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20637: [SPARK-23466][SQL] Remove redundant null checks in gener...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/20637 The failure of `org.apache.spark.sql.catalyst.expressions.JsonExpressionsSuite.from_json missing fields` is due to passing `null` while the schema has `nullable=false`. This inconsistency is agreed in the discussion at [SPARK-23173](https://issues.apache.org/jira/browse/SPARK-23173). `Assume that each field in schema passed to from_json is nullable, and ignore the nullability information set in the passed schema.` When `spark.sql.fromJsonForceNullableSchema=false`, I think that a test is invalid to pass `nullable=false` in the corresponding schema to the missing field. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22053: [SPARK-25069][CORE]Using UnsafeAlignedOffset to make the...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/22053 cc @hvanhovell --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22059: [SPARK-25036][SQL] Avoid discarding unmoored doc comment...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/22059 cc @srowen @ueshin @HyukjinKwon --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22053: [SPARK-25069][CORE]Using UnsafeAlignedOffset to make the...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/22053 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22059: [SPARK-25036][SQL] Avoid discarding unmoored doc comment...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/22059 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22053: [SPARK-25069][CORE]Using UnsafeAlignedOffset to make the...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/22053 cc @cloud-fan --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22058: [SPARK-25036][SQL][FOLLOW-UP] Avoid match may not be exh...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/22058 I think that this is the last one with the following command. But, I would like to confirm this with @ueshin. `build/sbt -Pscala-2.12 -Phadoop-2.6 -Pkubernetes -Phive -Pmesos -Phive-thriftserver -Pflume -Pkinesis-asl -Pyarn package` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22053: [SPARK-25069][CORE]Using UnsafeAlignedOffset to make the...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/22053 Good catch LGTM with a comment: Would it be better to update comments regarding `4 bytes` with `4 or 8 bytes` in `UnsafeExternalSorter.java`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22055: [MINOR][BUILD] Update Jetty to 9.3.24.v20180605
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/22055 LGTM These changes are not huge. It looks non-risky. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22059: [SPARK-25036][SQL] Avoid discarding unmoored doc ...
GitHub user kiszk opened a pull request: https://github.com/apache/spark/pull/22059 [SPARK-25036][SQL] Avoid discarding unmoored doc comment in Scala-2.12. ## What changes were proposed in this pull request? This PR avoid the following compilation error using sbt in Scala-2.12. ``` [error] [warn] /home/ishizaki/Spark/PR/scala212/spark/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala:410: discarding unmoored doc comment [error] [warn] /** [error] [warn] [error] [warn] /home/ishizaki/Spark/PR/scala212/spark/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala:441: discarding unmoored doc comment [error] [warn] /** [error] [warn] ... [error] [warn] /home/ishizaki/Spark/PR/scala212/spark/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala:440: discarding unmoored doc comment [error] [warn] /** [error] [warn] ``` ## How was this patch tested? Existing UTs You can merge this pull request into a Git repository by running: $ git pull https://github.com/kiszk/spark SPARK-25036d Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22059.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22059 commit 70ec11e5b85100b0b10257f7c0846b63af1ba494 Author: Kazuaki Ishizaki Date: 2018-08-09T17:11:30Z initial commit --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22058: [SPARK-25036][SQL][FOLLOW-UP] Avoid match may not be exh...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/22058 cc @ueshin @srowen @HyukjinKwon --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22058: [SPARK-25036][SQL][FOLLOW-UP] Avoid match may not...
GitHub user kiszk opened a pull request: https://github.com/apache/spark/pull/22058 [SPARK-25036][SQL][FOLLOW-UP] Avoid match may not be exhaustive in Scala-2.12. ## What changes were proposed in this pull request? This is a follow-up pr of #22014 and #22039 We still have some more compilation errors in mllib with scala-2.12 with sbt: ``` [error] [warn] /home/ishizaki/Spark/PR/scala212/spark/mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala:116: match may not be exhaustive. [error] It would fail on the following inputs: ("silhouette", _), (_, "cosine"), (_, "squaredEuclidean"), (_, String()), (_, _) [error] [warn] ($(metricName), $(distanceMeasure)) match { [error] [warn] ``` ## How was this patch tested? Existing UTs You can merge this pull request into a Git repository by running: $ git pull https://github.com/kiszk/spark SPARK-25036c Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22058.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22058 commit 6a0ee38c53d8a53d219bfec8cad9953bc9571e0c Author: Kazuaki Ishizaki Date: 2018-08-09T16:29:40Z initial commit commit 5655d83bd2c9d6ca872c371bff421f69409b6d0b Author: Kazuaki Ishizaki Date: 2018-08-09T16:45:23Z make the change one-liner --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20184: [SPARK-22987][Core] UnsafeExternalSorter cases OO...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/20184#discussion_r208993136 --- Diff: core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillReader.java --- @@ -116,13 +138,18 @@ public void loadNext() throws IOException { if (taskContext != null) { taskContext.killTaskIfInterrupted(); } -recordLength = din.readInt(); -keyPrefix = din.readLong(); -if (recordLength > arr.length) { - arr = new byte[recordLength]; +// check if the reader is closed to prevent reopen the in and din. +if (!hasNext()) { + throw new IndexOutOfBoundsException("Can not load next item when UnsafeSorterSpillReader is closed."); +} +recordLength = getDin().readInt(); +keyPrefix = getDin().readLong(); +int arrLength = Math.max(1024 * 1024, recordLength); +if (arrLength > arr.length) { + arr = new byte[arrLength]; baseObject = arr; } -ByteStreams.readFully(in, arr, 0, recordLength); +ByteStreams.readFully(getIn(), arr, 0, recordLength); --- End diff -- Is it fine if `recordLength` is greater than `1024 * 1024`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22001: [SPARK-24819][CORE] Fail fast when no enough slot...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/22001#discussion_r208950122 --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala --- @@ -929,6 +955,28 @@ class DAGScheduler( // HadoopRDD whose underlying HDFS files have been deleted. finalStage = createResultStage(finalRDD, func, partitions, jobId, callSite) } catch { + case e: Exception if e.getMessage == + DAGScheduler.ERROR_MESSAGE_BARRIER_REQUIRE_MORE_SLOTS_THAN_CURRENT_TOTAL_NUMBER => +logWarning("The job requires to run a barrier stage that requires more slots than the " + + "total number of slots in the cluster currently.") +jobIdToNumTasksCheckFailures.putIfAbsent(jobId, 0) +val numCheckFailures = jobIdToNumTasksCheckFailures.get(jobId) + 1 --- End diff -- Is it OK while this increment is not atomic? In the following scenario, the value may not be correct 1. We assume `jobIdToNumTasksCheckFailures(jobId) = 1` 1. Thread A executes L963, then `numCheckFailures = 2` 1. Thread B executes L963, then `numCheckFailures = 2` 1. Thread B executes L964 and L965, then `jobIdToNumTasksCheckFailures(jobId)` has 2. 1. Thread A executes L964 and L965, then `jobIdToNumTasksCheckFailures(jobId)` has 2. Since two threads detected failure, we expect `3`. But, it is `2`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22001: [SPARK-24819][CORE] Fail fast when no enough slot...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/22001#discussion_r208947201 --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala --- @@ -203,6 +203,17 @@ class DAGScheduler( sc.getConf.getInt("spark.stage.maxConsecutiveAttempts", DAGScheduler.DEFAULT_MAX_CONSECUTIVE_STAGE_ATTEMPTS) + /** + * Number of max concurrent tasks check failures for each job. + */ + private[scheduler] val jobIdToNumTasksCheckFailures = new ConcurrentHashMap[Int, Int] + + /** + * Time in seconds to wait between a max concurrent tasks check failure and the next check. --- End diff -- nit: `a max` -> `max`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22001: [SPARK-24819][CORE] Fail fast when no enough slot...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/22001#discussion_r208946523 --- Diff: core/src/main/scala/org/apache/spark/internal/config/package.scala --- @@ -577,4 +577,17 @@ package object config { .timeConf(TimeUnit.SECONDS) .checkValue(v => v > 0, "The value should be a positive time value.") .createWithDefaultString("365d") + + private[spark] val BARRIER_MAX_CONCURRENT_TASKS_CHECK_INTERVAL = + ConfigBuilder("spark.scheduler.barrier.maxConcurrentTasksCheck.interval") + .doc("Time in seconds to wait between a max concurrent tasks check failure and the next " + --- End diff -- nit: `a max` -> `max`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22001: [SPARK-24819][CORE] Fail fast when no enough slot...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/22001#discussion_r208945843 --- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala --- @@ -1602,6 +1602,15 @@ class SparkContext(config: SparkConf) extends Logging { } } + /** + * Get the max number of tasks that can be concurrent launched currently. --- End diff -- How about like this? ``` * Get the max number of tasks that can be concurrently launched when the method is called. * Note that please don't cache the value returned by this method, because the number can be * changed due to adding/removing executors. ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21859: [SPARK-24900][SQL]Speed up sort when the dataset is smal...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/21859 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22055: [MINOR][BUILD] Update Jetty to 9.3.24.v20180605
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/22055 Release notes [9.3.21](https://github.com/eclipse/jetty.project/releases/tag/jetty-9.3.21.v20170918) [9.3.22](https://github.com/eclipse/jetty.project/releases/tag/jetty-9.3.22.v20171030) [9.3.23](https://github.com/eclipse/jetty.project/releases/tag/jetty-9.3.23.v20180228) [9.3.24](https://github.com/eclipse/jetty.project/releases/tag/jetty-9.3.24.v20180605) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21505: [SPARK-24457][SQL] Improving performance of stringToTime...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/21505 gentle ping @ssonker --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22044: [SPARK-23912][SQL][Followup] Refactor ArrayDistin...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/22044#discussion_r208862917 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala --- @@ -3410,6 +3410,28 @@ case class ArrayDistinct(child: Expression) case _ => false } + @transient protected lazy val canUseSpecializedHashSet = elementType match { --- End diff -- We can do so. To minimize the changes due to remaining time for cutting, I would like to do this in another PR #21912. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20636: [SPARK-23415][SQL][TEST] Make behavior of BufferHolderSp...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/20636 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22044: [SPARK-23912][SQL][Followup] Refactor ArrayDistinct
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/22044 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21993: [SPARK-24983][Catalyst] Add configuration for max...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/21993#discussion_r208822706 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala --- @@ -631,19 +631,26 @@ object ColumnPruning extends Rule[LogicalPlan] { object CollapseProject extends Rule[LogicalPlan] { def apply(plan: LogicalPlan): LogicalPlan = plan transformUp { -case p1 @ Project(_, p2: Project) => - if (haveCommonNonDeterministicOutput(p1.projectList, p2.projectList)) { -p1 - } else { -p2.copy(projectList = buildCleanedProjectList(p1.projectList, p2.projectList)) - } +case p1@Project(_, p2: Project) => --- End diff -- nit: Do we need to change this line? We can keep this line as is. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22048: Fix the show method to display the wide character alignm...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/22048 I created [a PR](https://github.com/apache/spark/pull/16086) to solve the same problem. Can this PR handle [East Asian Width](http://www.unicode.org/Public/UCD/latest/ucd/EastAsianWidth.txt) correctly? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22048: Fix the show method to display the wide character alignm...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/22048 Since this change does not look minor, could you please create a JIRA entry? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20636: [SPARK-23415][SQL][TEST] Make behavior of BufferHolderSp...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/20636 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22044: [SPARK-23912][SQL][Followup] Refactor ArrayDistinct
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/22044 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21859: [SPARK-24900][SQL]Speed up sort when the dataset is smal...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/21859 Good point. [These failures](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94431/testReport/) may show that it affects other places. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20636: [SPARK-23415][SQL][TEST] Make behavior of BufferHolderSp...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/20636 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22040: [Minor][Doc] Fix typo
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/22040 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22044: [SPARK-23912][SQL][Followup] Refactor ArrayDistinct
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/22044 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22007: [SPARK-25033] Bump Apache commons.{httpclient, httpcore}
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/22007 This change looks non-risky to me. cc @swoen @HyukjinKwon --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22044: [SPARK-23912][SQL][Followup] Refactor ArrayDistin...
GitHub user kiszk opened a pull request: https://github.com/apache/spark/pull/22044 [SPARK-23912][SQL][Followup] Refactor ArrayDistinct ## What changes were proposed in this pull request? This PR simplified code generation for `ArrayDistinct`. #21966 enabled code generation only if the type can be specialized by the hash set. This PR follows this strategy. Optimization of null handling will be implemented in #21912. ## How was this patch tested? Existing UTs You can merge this pull request into a Git repository by running: $ git pull https://github.com/kiszk/spark SPARK-23912-follow Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22044.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22044 commit 9d0cb100a85dc77569e1a62b83e0461ee2c33ddb Author: Kazuaki Ishizaki Date: 2018-08-08T18:43:51Z stop code generation when elementTypeSupportEquals is false --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20637: [SPARK-23466][SQL] Remove redundant null checks in gener...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/20637 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20637: [SPARK-23466][SQL] Remove redundant null checks i...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/20637#discussion_r208677733 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala --- @@ -142,7 +143,7 @@ object GenerateUnsafeProjection extends CodeGenerator[Seq[Expression], UnsafePro case _ => s"$rowWriter.write($index, ${input.value});" } -if (input.isNull == "false") { +if (input.isNull == "false" || !nullable) { --- End diff -- good catch, thanks --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20637: [SPARK-23466][SQL] Remove redundant null checks i...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/20637#discussion_r208677695 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala --- @@ -308,10 +319,10 @@ object GenerateUnsafeProjection extends CodeGenerator[Seq[Expression], UnsafePro expressions: Seq[Expression], useSubexprElimination: Boolean = false): ExprCode = { val exprEvals = ctx.generateExpressions(expressions, useSubexprElimination) -val exprTypes = expressions.map(_.dataType) +val exprTypeAndNullables = expressions.map(e => (e.dataType, e.nullable)) -val numVarLenFields = exprTypes.count { - case dt if UnsafeRow.isFixedLength(dt) => false +val numVarLenFields = exprTypeAndNullables.count { + case (dt, _) if UnsafeRow.isFixedLength(dt) => false --- End diff -- sure --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22039: [SPARK-25036][SQL][FOLLOW-UP] Avoid match may not...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/22039#discussion_r208666588 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/SymmetricHashJoinStateManager.scala --- @@ -273,6 +273,9 @@ class SymmetricHashJoinStateManager( s.copy(desc = newDesc(desc)) -> value case (s @ StateStoreCustomTimingMetric(_, desc), value) => s.copy(desc = newDesc(desc)) -> value +case (s, _) => + throw new IllegalArgumentException( + s"Unknown state store custom metric is found at metrics: $s") --- End diff -- nit: 2 more spaces? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22039: [SPARK-25036][SQL][FOLLOW-UP] Avoid match may not...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/22039#discussion_r208666245 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala --- @@ -394,6 +394,9 @@ class MicroBatchExecution( case (src: Source, off) => src.commit(off) case (reader: MicroBatchReader, off) => reader.commit(reader.deserializeOffset(off.json)) + case (src, _) => +throw new IllegalArgumentException( + s"Unknows source is found at constructNextBatch: $src") --- End diff -- nit: `Unknows` -> `Unknown` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22041: [SPARK-25058][SQL] Use Block.isEmpty/nonEmpty to check w...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/22041 LGTM --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22040: [Minor][Doc] Fix typo
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/22040 cc @HyukjinKwon --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22040: [Minor][Doc] Fix typo
GitHub user kiszk opened a pull request: https://github.com/apache/spark/pull/22040 [Minor][Doc] Fix typo ## What changes were proposed in this pull request? This PR fixes typo regarding `auxiliary verb + verb[s]`. ## How was this patch tested? N/A You can merge this pull request into a Git repository by running: $ git pull https://github.com/kiszk/spark spellcheck1 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22040.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22040 commit f9f5feb6a82db36dd0bf6b2f2a18ded0a7d22d39 Author: Kazuaki Ishizaki Date: 2018-08-08T17:05:47Z fix typos commit aa4e530986a8d64cb76c0715ca8576f19c8ab8f9 Author: Kazuaki Ishizaki Date: 2018-08-08T17:09:35Z fix typo --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21956: [MINOR][DOCS] Fix grammatical error in SortShuffleManage...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/21956 @HyukjinKwon sure, I will open a PR. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22007: [SPARK-25033] Bump Apache commons.{httpclient, httpcore}
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/22007 @fokko Do we need to update files under `dev/deps`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21912: [SPARK-24962][SQL] Refactor CodeGenerator.createUnsafeAr...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/21912 cc @ueshin --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22035: [SPARK-23911][SQL][FOLLOW-UP] Fix examples of aggregate ...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/22035 LGTM --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21860: [SPARK-24901][SQL]Merge the codegen of RegularHashMap an...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/21860 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21931: [SPARK-24978][SQL]Add spark.sql.fast.hash.aggrega...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/21931#discussion_r208561981 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/AggregateBenchmark.scala --- @@ -366,6 +366,43 @@ class AggregateBenchmark extends BenchmarkBase { */ } + ignore("capacity for fast hash aggregate") { +val N = 20 << 20 +val M = 1 << 19 + +val benchmark = new Benchmark("Aggregate w multiple keys", N) +sparkSession.range(N) + .selectExpr( +"id", +s"(id % $M) as k1", +s"cast(id % $M as int) as k2", +s"cast(id % $M as double) as k3", +s"cast(id % $M as float) as k4") .createOrReplaceTempView("test") + +def f(): Unit = sparkSession.sql("select k1, k2, k3, k4, sum(k1), sum(k2), sum(k3), " + + "sum(k4) from test group by k1, k2, k3, k4").collect() + +benchmark.addCase(s"fasthash = default") { iter => + sparkSession.conf.set("spark.sql.codegen.aggregate.map.row.capacitybit", "16") + f() +} + +benchmark.addCase(s"fasthash = config") { iter => + sparkSession.conf.set("spark.sql.codegen.aggregate.map.row.capacitybit", "20") + f() +} + +benchmark.run() + +/* +Java HotSpot(TM) 64-Bit Server VM 1.8.0_60-b27 on Windows 7 6.1 +Intel64 Family 6 Model 94 Stepping 3, GenuineIntel +Aggregate w multiple keys: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative --- End diff -- nit: we need to reduce # of characters per line up to 100. IIUC, the number is more than 100. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org