[GitHub] spark pull request #19266: [SPARK-22033][CORE] BufferHolder, other size chec...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/19266#discussion_r143421894

--- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/codegen/BufferHolder.java ---
```
@@ -35,6 +35,11 @@
  * if the fields of row are all fixed-length, as the size of result row is also fixed.
  */
 public class BufferHolder {
+
+  // Some JVMs can't allocate arrays of length Integer.MAX_VALUE; actual max is somewhat
+  // smaller. Be conservative and lower the cap a little.
+  private static final int ARRAY_MAX = Integer.MAX_VALUE - 8;
```
--- End diff ---

@gatorsmile have a look at the JIRA for some detail; you can see a similar limit in the JDK at, for example, http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/tip/src/share/classes/java/util/ArrayList.java#l229

You are right; I think around line 54 this needs to be something straightforward like:

```
long totalSize = initialSize + bitsetWidthInBytes + 8L * row.numFields();
if (totalSize > ARRAY_MAX) {
  ...error...
}
this.buffer = new byte[(int) totalSize];
```

Yes, I agree with your new JIRA @liufengdb, though I think we'll need to go the other way, to `Integer.MAX_VALUE - 15`, where the value must be divisible by 8.

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
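The overflow-safe allocation pattern sketched in the comment above can be fleshed out as follows. This is a hypothetical standalone illustration, not the actual BufferHolder code: the class name `SizeCheckSketch` and the parameters `initialSize`, `bitsetWidthInBytes`, and `numFields` stand in for the values BufferHolder derives from the row. The key point is to compute the size in `long` arithmetic first, so the check itself cannot overflow, and only then cast to `int`.

```java
public class SizeCheckSketch {
    // Conservative cap mirroring JDK ArrayList's MAX_ARRAY_SIZE:
    // some JVMs reserve a few header words, so Integer.MAX_VALUE-length
    // arrays may fail to allocate.
    private static final int ARRAY_MAX = Integer.MAX_VALUE - 8;

    // Hypothetical helper: compute the total buffer size as a long, reject
    // anything past the cap, then allocate with a safe int cast.
    static byte[] allocate(int initialSize, int bitsetWidthInBytes, int numFields) {
        long totalSize = initialSize + (long) bitsetWidthInBytes + 8L * numFields;
        if (totalSize > ARRAY_MAX) {
            throw new UnsupportedOperationException(
                "Cannot allocate a buffer larger than " + ARRAY_MAX
                    + " bytes; requested " + totalSize);
        }
        return new byte[(int) totalSize];
    }

    public static void main(String[] args) {
        // 64 bytes initial + 8 bytes bitset + 8 bytes per field for 4 fields
        System.out.println(allocate(64, 8, 4).length); // 104
    }
}
```

Without the `long` promotion, a request near the limit could wrap negative, pass a naive `< Integer.MAX_VALUE` check, and fail later with a confusing `NegativeArraySizeException` instead of a clear error.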
Github user liufengdb commented on a diff in the pull request: https://github.com/apache/spark/pull/19266#discussion_r143346155

--- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/codegen/BufferHolder.java ---
```
@@ -35,6 +35,11 @@
  * if the fields of row are all fixed-length, as the size of result row is also fixed.
  */
 public class BufferHolder {
+
+  // Some JVMs can't allocate arrays of length Integer.MAX_VALUE; actual max is somewhat
+  // smaller. Be conservative and lower the cap a little.
+  private static final int ARRAY_MAX = Integer.MAX_VALUE - 8;
```
--- End diff ---

@srowen I think we can use `Integer.MAX_VALUE - 7` instead of `Integer.MAX_VALUE - 8` to make the size align with words; otherwise, this check will fail: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java#L170. This is the reason all the size inputs to these methods are rounded; see, for example, https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/codegen/UnsafeRowWriter.java#L216.
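The alignment concern raised here can be illustrated with a small standalone sketch. The `roundToWord` helper below is a stand-in for the round-to-8-bytes rounding the unsafe row writers perform, not the actual Spark code: because sizes are rounded up to the next multiple of 8, a cap that is not itself word-aligned can be exceeded by a rounded size that passed the pre-rounding check.

```java
public class WordAlignSketch {
    // Stand-in for a round-to-word helper: round numBytes up to a multiple of 8.
    static long roundToWord(long numBytes) {
        long remainder = numBytes & 0x7; // numBytes % 8 for non-negative values
        return remainder == 0 ? numBytes : numBytes + (8 - remainder);
    }

    public static void main(String[] args) {
        long cap = Integer.MAX_VALUE - 8;  // 2147483639, NOT a multiple of 8
        long rounded = roundToWord(cap);   // rounds up past the cap itself
        System.out.println(rounded > cap); // true: a size at the cap fails after rounding

        // The word-aligned alternatives discussed in this thread:
        System.out.println((Integer.MAX_VALUE - 7) % 8);  // 0
        System.out.println((Integer.MAX_VALUE - 15) % 8); // 0
    }
}
```

This is why the thread converges on a cap that is itself divisible by 8: either `Integer.MAX_VALUE - 7` (larger, but above the conservative `- 8` limit) or `Integer.MAX_VALUE - 15` (smaller and safely below it).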
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19266#discussion_r143343968

--- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/codegen/BufferHolder.java ---
```
@@ -35,6 +35,11 @@
  * if the fields of row are all fixed-length, as the size of result row is also fixed.
  */
 public class BufferHolder {
+
+  // Some JVMs can't allocate arrays of length Integer.MAX_VALUE; actual max is somewhat
+  // smaller. Be conservative and lower the cap a little.
+  private static final int ARRAY_MAX = Integer.MAX_VALUE - 8;
```
--- End diff ---

Why not also fix line 54?
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19266#discussion_r143343926

--- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/codegen/BufferHolder.java ---
```
@@ -35,6 +35,11 @@
  * if the fields of row are all fixed-length, as the size of result row is also fixed.
  */
 public class BufferHolder {
+
+  // Some JVMs can't allocate arrays of length Integer.MAX_VALUE; actual max is somewhat
+  // smaller. Be conservative and lower the cap a little.
+  private static final int ARRAY_MAX = Integer.MAX_VALUE - 8;
```
--- End diff ---

Just curious: how was the value `-8` arrived at? Also cc @liufengdb
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19266
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/19266#discussion_r139656129

--- Diff: common/unsafe/src/main/java/org/apache/spark/unsafe/array/LongArray.java ---
```
@@ -39,7 +39,7 @@
   private final long length;

   public LongArray(MemoryBlock memory) {
-    assert memory.size() < (long) Integer.MAX_VALUE * 8: "Array size > 4 billion elements";
+    assert memory.size() < (long) Integer.MAX_VALUE * 8: "Array size > 2.1 billion elements";
```
--- End diff ---

I wasn't sure whether the JVM array size limit applied here, because this represents a non-native array. I still wanted to fix the comment as I saw it. If an array exists, its length is valid, so I didn't think that part of MemoryBlock represented an issue.
Github user buryat commented on a diff in the pull request: https://github.com/apache/spark/pull/19266#discussion_r139616346

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
```
@@ -344,7 +344,7 @@
     val newSentences = sentences.repartition(numPartitions).cache()
     val initRandom = new XORShiftRandom(seed)

-    if (vocabSize.toLong * vectorSize >= Int.MaxValue) {
+    if (vocabSize.toLong * vectorSize >= Int.MaxValue - 8) {
```
--- End diff ---

I think this should be just `>`
Github user buryat commented on a diff in the pull request: https://github.com/apache/spark/pull/19266#discussion_r139616411

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala ---
```
@@ -304,8 +304,8 @@
       s"Int.MaxValue. Currently numRows: ${numRows()}")
     require(numCols() < Int.MaxValue, "The number of columns of this matrix should be less than " +
       s"Int.MaxValue. Currently numCols: ${numCols()}")
-    require(numRows() * numCols() < Int.MaxValue, "The length of the values array must be " +
-      s"less than Int.MaxValue. Currently numRows * numCols: ${numRows() * numCols()}")
+    require(numRows() * numCols() < Int.MaxValue - 8, "The length of the values array must be " +
```
--- End diff ---

`<=`
Github user buryat commented on a diff in the pull request: https://github.com/apache/spark/pull/19266#discussion_r139616165

--- Diff: core/src/main/scala/org/apache/spark/util/collection/CompactBuffer.scala ---
```
@@ -126,22 +126,20 @@
   /** Increase our size to newSize and grow the backing array if needed. */
   private def growToSize(newSize: Int): Unit = {
-    if (newSize < 0) {
-      throw new UnsupportedOperationException("Can't grow buffer past Int.MaxValue elements")
+    val arrayMax = Int.MaxValue - 8
+    if (newSize < 0 || newSize - 2 > arrayMax) {
+      throw new UnsupportedOperationException(s"Can't grow buffer past $arrayMax elements")
     }
     val capacity = if (otherElements != null) otherElements.length + 2 else 2
     if (newSize > capacity) {
-      var newArrayLen = 8
+      var newArrayLen = 8L
       while (newSize - 2 > newArrayLen) {
```
--- End diff ---

Ah, I see that it's reserved. It wasn't clear to me why `- 2`; it reads like a magic number, and I see it in other places too.
Github user buryat commented on a diff in the pull request: https://github.com/apache/spark/pull/19266#discussion_r139615624

--- Diff: common/unsafe/src/main/java/org/apache/spark/unsafe/array/LongArray.java ---
```
@@ -39,7 +39,7 @@
   private final long length;

   public LongArray(MemoryBlock memory) {
-    assert memory.size() < (long) Integer.MAX_VALUE * 8: "Array size > 4 billion elements";
+    assert memory.size() < (long) Integer.MAX_VALUE * 8: "Array size > 2.1 billion elements";
```
--- End diff ---

Maybe also add the exact number in the error message instead of `2.1 billion`
Github user buryat commented on a diff in the pull request: https://github.com/apache/spark/pull/19266#discussion_r139615418

--- Diff: core/src/main/scala/org/apache/spark/util/collection/PartitionedPairBuffer.scala ---
```
@@ -96,5 +96,5 @@
 }

 private object PartitionedPairBuffer {
-  val MAXIMUM_CAPACITY = Int.MaxValue / 2 // 2 ^ 30 - 1
+  val MAXIMUM_CAPACITY = (Int.MaxValue - 8) / 2
```
--- End diff ---

Maybe worth adding the explicit `Int` type annotation, even though it's already an `Int`. Also, the comment at line 28 should be changed to `1073741819`, i.e. `(Int.MaxValue - 8) / 2`.
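For reference, the exact value of the proposed capacity cap is easy to verify with a line of integer arithmetic. This is a trivial standalone snippet for the discussion above, not Spark code:

```java
public class CapacitySketch {
    public static void main(String[] args) {
        // The proposed MAXIMUM_CAPACITY: (Int.MaxValue - 8) / 2 in integer division.
        int maximumCapacity = (Integer.MAX_VALUE - 8) / 2;
        System.out.println(maximumCapacity); // 1073741819
    }
}
```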
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/19266#discussion_r139582421

--- Diff: common/unsafe/src/main/java/org/apache/spark/unsafe/array/LongArray.java ---
```
@@ -39,7 +39,7 @@
   private final long length;

   public LongArray(MemoryBlock memory) {
-    assert memory.size() < (long) Integer.MAX_VALUE * 8: "Array size > 4 billion elements";
+    assert memory.size() < (long) Integer.MAX_VALUE * 8: "Array size > 2.1 billion elements";
```
--- End diff ---

Do we need the same assert below? https://github.com/apache/spark/blob/c66d64b3df9d9ffba0b16a62015680f6f876fc68/common/unsafe/src/main/java/org/apache/spark/unsafe/memory/MemoryBlock.java#L53
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/19266#discussion_r139580567

--- Diff: common/unsafe/src/main/java/org/apache/spark/unsafe/array/LongArray.java ---
```
@@ -39,7 +39,7 @@
   private final long length;

   public LongArray(MemoryBlock memory) {
-    assert memory.size() < (long) Integer.MAX_VALUE * 8: "Array size > 4 billion elements";
+    assert memory.size() < (long) Integer.MAX_VALUE * 8: "Array size > 2.1 billion elements";
```
--- End diff ---

ISTM `memory.size() < (long) Integer.MAX_VALUE * 8` is correct? See: https://github.com/apache/spark/blob/c66d64b3df9d9ffba0b16a62015680f6f876fc68/common/unsafe/src/main/java/org/apache/spark/unsafe/memory/MemoryBlock.java#L54
Github user buryat commented on a diff in the pull request: https://github.com/apache/spark/pull/19266#discussion_r139492024

--- Diff: core/src/main/scala/org/apache/spark/util/collection/CompactBuffer.scala ---
```
@@ -126,22 +126,20 @@
   /** Increase our size to newSize and grow the backing array if needed. */
   private def growToSize(newSize: Int): Unit = {
-    if (newSize < 0) {
-      throw new UnsupportedOperationException("Can't grow buffer past Int.MaxValue elements")
+    val arrayMax = Int.MaxValue - 8
+    if (newSize < 0 || newSize - 2 > arrayMax) {
+      throw new UnsupportedOperationException(s"Can't grow buffer past $arrayMax elements")
     }
     val capacity = if (otherElements != null) otherElements.length + 2 else 2
     if (newSize > capacity) {
-      var newArrayLen = 8
+      var newArrayLen = 8L
      while (newSize - 2 > newArrayLen) {
```
--- End diff ---

I think we can remove the `- 2` now, since `newArrayLen` is a `Long`.
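The capped-doubling pattern under discussion can be sketched as follows. This is a hedged illustration in Java rather than the Scala CompactBuffer code, and the helper name `newLength` is made up; the point it demonstrates is that doubling a `long` candidate length cannot wrap to a negative `int`, so the cap can be applied cleanly at the end.

```java
public class GrowthSketch {
    // Conservative JVM array-length cap, as in the diff above.
    private static final int ARRAY_MAX = Integer.MAX_VALUE - 8;

    // Hypothetical helper: grow capacity by doubling until it covers `needed`,
    // using long arithmetic so the doubling never overflows, then clamp.
    static int newLength(int needed, int currentCapacity) {
        if (needed < 0 || needed > ARRAY_MAX) {
            throw new UnsupportedOperationException(
                "Can't grow buffer past " + ARRAY_MAX + " elements");
        }
        long newLen = Math.max(currentCapacity, 8);
        while (needed > newLen) {
            newLen *= 2; // safe in a long: cannot wrap negative at these magnitudes
        }
        return (int) Math.min(newLen, ARRAY_MAX);
    }

    public static void main(String[] args) {
        System.out.println(newLength(100, 8));      // 128
        System.out.println(newLength(ARRAY_MAX, 8) == ARRAY_MAX); // true: clamped
    }
}
```

With an `int` candidate, `newArrayLen *= 2` near `2^30` would overflow to a negative value and the loop condition would misbehave, which is exactly the failure mode the `8L` change in the diff avoids.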
Github user buryat commented on a diff in the pull request: https://github.com/apache/spark/pull/19266#discussion_r139489658

--- Diff: core/src/main/java/org/apache/spark/unsafe/map/HashMapGrowthStrategy.java ---
```
@@ -30,11 +30,15 @@
   HashMapGrowthStrategy DOUBLING = new Doubling();

   class Doubling implements HashMapGrowthStrategy {
+
+    private static final int ARRAY_MAX = Integer.MAX_VALUE - 8;
```
--- End diff ---

Maybe worth adding a comment explaining why this value is chosen as the max, like here: http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/tip/src/share/classes/java/util/ArrayList.java#l223
Github user buryat commented on a diff in the pull request: https://github.com/apache/spark/pull/19266#discussion_r139489491

--- Diff: common/unsafe/src/main/java/org/apache/spark/unsafe/array/LongArray.java ---
```
@@ -39,7 +39,7 @@
   private final long length;

   public LongArray(MemoryBlock memory) {
-    assert memory.size() < (long) Integer.MAX_VALUE * 8: "Array size > 4 billion elements";
+    assert memory.size() < (long) Integer.MAX_VALUE * 8: "Array size > 2.1 billion elements";
```
--- End diff ---

`assert memory.size() < (long) (Integer.MAX_VALUE - 8) * 8`
GitHub user srowen opened a pull request: https://github.com/apache/spark/pull/19266

[SPARK-22033][CORE] BufferHolder, other size checks should account for the specific VM array size limitations

## What changes were proposed in this pull request?

Try to avoid allocating an array bigger than Integer.MAX_VALUE - 8, which is the actual max size on some JVMs, in several places.

## How was this patch tested?

Existing tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/srowen/spark SPARK-22033

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19266.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #19266

commit 8dbfa30848b49a9fe4038c327222ddc9fd9a0ec0
Author: Sean Owen
Date: 2017-09-18T14:59:40Z

    Try to avoid allocating an array bigger than Integer.MAX_VALUE - 8, which is the actual max size on some JVMs, in several places