[GitHub] spark pull request #23217: [SPARK-25829][SQL][FOLLOWUP] Refactor MapConcat i...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/23217
[GitHub] spark pull request #23217: [SPARK-25829][SQL][FOLLOWUP] Refactor MapConcat i...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/23217#discussion_r238700354

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayBasedMapBuilder.scala ---
@@ -47,13 +48,17 @@ class ArrayBasedMapBuilder(keyType: DataType, valueType: DataType) extends Seria
   private lazy val keyGetter = InternalRow.getAccessor(keyType)
   private lazy val valueGetter = InternalRow.getAccessor(valueType)

-  def put(key: Any, value: Any): Unit = {
+  def put(key: Any, value: Any, withSizeCheck: Boolean = false): Unit = {
     if (key == null) {
       throw new RuntimeException("Cannot use null as map key.")
     }
     val index = keyToIndex.getOrDefault(key, -1)
     if (index == -1) {
+      if (withSizeCheck && size >= ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH) {
--- End diff --

ok, let me remove it then, thanks.
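With the flag removed, the size check runs for every new key. A minimal sketch of what `put` could look like after this change (the storage bookkeeping, i.e. the `keys`/`values` buffers and the `keyToIndex` update, is reconstructed from the diff context above, not quoted from the merged patch):

    def put(key: Any, value: Any): Unit = {
      if (key == null) {
        throw new RuntimeException("Cannot use null as map key.")
      }
      val index = keyToIndex.getOrDefault(key, -1)
      if (index == -1) {
        // The guard now runs unconditionally: a map crossing
        // MAX_ROUNDED_ARRAY_LENGTH could not be materialized as ArrayData anyway.
        if (size >= ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH) {
          throw new RuntimeException(s"Unsuccessful attempt to build maps with $size elements " +
            s"due to exceeding the map size limit ${ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH}.")
        }
        keyToIndex.put(key, values.length)  // remember the slot for this new key
        keys.append(key)                    // assumed growable buffers backing the builder
        values.append(value)
      } else {
        values(index) = value               // duplicate key: last write wins
      }
    }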
[GitHub] spark pull request #23217: [SPARK-25829][SQL][FOLLOWUP] Refactor MapConcat i...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/23217#discussion_r238698103

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayBasedMapBuilder.scala ---
@@ -47,13 +48,17 @@ class ArrayBasedMapBuilder(keyType: DataType, valueType: DataType) extends Seria
   private lazy val keyGetter = InternalRow.getAccessor(keyType)
   private lazy val valueGetter = InternalRow.getAccessor(valueType)

-  def put(key: Any, value: Any): Unit = {
+  def put(key: Any, value: Any, withSizeCheck: Boolean = false): Unit = {
     if (key == null) {
       throw new RuntimeException("Cannot use null as map key.")
     }
     val index = keyToIndex.getOrDefault(key, -1)
     if (index == -1) {
+      if (withSizeCheck && size >= ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH) {
--- End diff --

hmmm, I'd like to avoid premature optimization. Actually, how much perf can this save? This code block is already doing some heavy work.
[GitHub] spark pull request #23217: [SPARK-25829][SQL][FOLLOWUP] Refactor MapConcat i...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/23217#discussion_r238690465

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayBasedMapBuilder.scala ---
@@ -47,13 +48,17 @@ class ArrayBasedMapBuilder(keyType: DataType, valueType: DataType) extends Seria
   private lazy val keyGetter = InternalRow.getAccessor(keyType)
   private lazy val valueGetter = InternalRow.getAccessor(valueType)

-  def put(key: Any, value: Any): Unit = {
+  def put(key: Any, value: Any, withSizeCheck: Boolean = false): Unit = {
     if (key == null) {
       throw new RuntimeException("Cannot use null as map key.")
     }
     val index = keyToIndex.getOrDefault(key, -1)
     if (index == -1) {
+      if (withSizeCheck && size >= ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH) {
--- End diff --

this flag is just for perf reasons: we can skip the check in some conditions, and I didn't want to introduce perf overhead where it isn't needed. If we remove the flag, we do the comparison for each item, even when it is not needed.
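To make the trade-off concrete, here is a hypothetical pair of call sites under the flag-based design (names like `builder`, `keyArray`, `valueArray` are illustrative, not taken from the patch):

    // Concatenating maps can push the builder past the limit, so opt in to the check:
    var i = 0
    while (i < keyArray.numElements()) {
      builder.put(keyArray.get(i, keyType), valueArray.get(i, valueType), withSizeCheck = true)
      i += 1
    }

    // Rebuilding a single map that already fit in memory cannot grow past the
    // limit, so the default (withSizeCheck = false) skips the per-element comparison:
    builder.put(key, value)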
[GitHub] spark pull request #23217: [SPARK-25829][SQL][FOLLOWUP] Refactor MapConcat i...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/23217#discussion_r238651534

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayBasedMapBuilder.scala ---
@@ -47,13 +48,17 @@ class ArrayBasedMapBuilder(keyType: DataType, valueType: DataType) extends Seria
   private lazy val keyGetter = InternalRow.getAccessor(keyType)
   private lazy val valueGetter = InternalRow.getAccessor(valueType)

-  def put(key: Any, value: Any): Unit = {
+  def put(key: Any, value: Any, withSizeCheck: Boolean = false): Unit = {
     if (key == null) {
       throw new RuntimeException("Cannot use null as map key.")
     }
     val index = keyToIndex.getOrDefault(key, -1)
     if (index == -1) {
+      if (withSizeCheck && size >= ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH) {
+        throw new RuntimeException(s"Unsuccessful attempt to concat maps with $size elements " +
--- End diff --

nit: `concat` -> `build`
[GitHub] spark pull request #23217: [SPARK-25829][SQL][FOLLOWUP] Refactor MapConcat i...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/23217#discussion_r238651421

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayBasedMapBuilder.scala ---
@@ -47,13 +48,17 @@ class ArrayBasedMapBuilder(keyType: DataType, valueType: DataType) extends Seria
   private lazy val keyGetter = InternalRow.getAccessor(keyType)
   private lazy val valueGetter = InternalRow.getAccessor(valueType)

-  def put(key: Any, value: Any): Unit = {
+  def put(key: Any, value: Any, withSizeCheck: Boolean = false): Unit = {
     if (key == null) {
       throw new RuntimeException("Cannot use null as map key.")
     }
     val index = keyToIndex.getOrDefault(key, -1)
     if (index == -1) {
+      if (withSizeCheck && size >= ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH) {
--- End diff --

I think we should always check the size. Such a big map is very likely to cause problems.
[GitHub] spark pull request #23217: [SPARK-25829][SQL][FOLLOWUP] Refactor MapConcat i...
GitHub user mgaido91 opened a pull request:

    https://github.com/apache/spark/pull/23217

[SPARK-25829][SQL][FOLLOWUP] Refactor MapConcat in order to check properly the limit size

## What changes were proposed in this pull request?

The PR starts from [this comment](https://github.com/apache/spark/pull/23124#discussion_r236112390) on the main PR and aims at:
- simplifying the code for `MapConcat`;
- being more precise in checking the size limit.

## How was this patch tested?

existing tests

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mgaido91/spark SPARK-25829_followup

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/23217.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #23217

commit 54f0f31aaa14de7c44c336580c7ed18e8ffb4b54
Author: Marco Gaido
Date: 2018-12-04T12:35:09Z

    [SPARK-25829][SQL][FOLLOWUP] Refactor MapConcat in order to check properly the limit size
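For context, a rough sketch of the direction of the refactor: every entry of every input map is routed through a single `ArrayBasedMapBuilder`, so key deduplication and the size-limit check live in one place instead of being reimplemented inside `MapConcat`. The `eval` below is a simplified illustration assuming a `mapBuilder: ArrayBasedMapBuilder` field and the builder's `putAll`/`build` methods; it is not the merged code:

    // Simplified, illustrative eval for MapConcat.
    override def eval(input: InternalRow): Any = {
      val maps = children.map(_.eval(input).asInstanceOf[MapData])
      if (maps.contains(null)) {
        // map_concat returns null if any input map is null
        null
      } else {
        // The builder deduplicates keys and enforces the size limit as it goes.
        maps.foreach(m => mapBuilder.putAll(m.keyArray(), m.valueArray()))
        mapBuilder.build()
      }
    }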