[GitHub] spark pull request #19266: [SPARK-22033][CORE] BufferHolder, other size chec...

2017-10-09 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/19266#discussion_r143421894
  
--- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/codegen/BufferHolder.java ---
@@ -35,6 +35,11 @@
  * if the fields of row are all fixed-length, as the size of result row is also fixed.
  */
 public class BufferHolder {
+
+  // Some JVMs can't allocate arrays of length Integer.MAX_VALUE; actual max is somewhat
+  // smaller. Be conservative and lower the cap a little.
+  private static final int ARRAY_MAX = Integer.MAX_VALUE - 8;
--- End diff --

@gatorsmile have a look at the JIRA for some detail; you can see a similar 
limit in the JDK, for example: 
http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/tip/src/share/classes/java/util/ArrayList.java#l229

You are right; I think the code around line 54 needs to be something straightforward 
like:
```
long totalSize = (long) initialSize + bitsetWidthInBytes + 8L * row.numFields();
if (totalSize > ARRAY_MAX) { ...error... }
this.buffer = new byte[(int) totalSize];
```

Yes, I agree with your new JIRA @liufengdb, though I think we'll need to go the 
other way, to `Integer.MAX_VALUE - 15`, since the value must be divisible by 8.
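For concreteness, here is a minimal, runnable sketch of the check described above. The names `initialSize`, `bitsetWidthInBytes`, and `numFields` mirror BufferHolder's fields, but this is an illustration, not the actual Spark code:

```java
// Illustrative sketch only; mirrors the shape of the suggested BufferHolder check.
public class GrowCheckSketch {
  // Some JVMs can't allocate arrays of length Integer.MAX_VALUE; stay a bit below it.
  static final int ARRAY_MAX = Integer.MAX_VALUE - 8;

  // Compute the requested size in a long so large inputs can't wrap to a
  // negative int before the bounds check runs.
  static long totalSize(int initialSize, int bitsetWidthInBytes, int numFields) {
    return (long) initialSize + bitsetWidthInBytes + 8L * numFields;
  }

  static byte[] allocate(int initialSize, int bitsetWidthInBytes, int numFields) {
    long total = totalSize(initialSize, bitsetWidthInBytes, numFields);
    if (total > ARRAY_MAX) {
      throw new UnsupportedOperationException(
          "Cannot allocate a buffer of " + total + " bytes; max is " + ARRAY_MAX);
    }
    return new byte[(int) total];  // safe cast: total <= ARRAY_MAX
  }

  public static void main(String[] args) {
    // 64 + 8 + 8 * 2 = 88 bytes
    if (allocate(64, 8, 2).length != 88) throw new AssertionError();
    // With long arithmetic, an oversized request stays positive and detectable.
    if (totalSize(Integer.MAX_VALUE, 8, 1) <= 0) throw new AssertionError();
  }
}
```

The point of the cast to `(long)` on the first operand is that the whole sum is then computed in 64 bits; without it, `initialSize + bitsetWidthInBytes` is evaluated as an int and can silently overflow before the check.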



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org




2017-10-07 Thread liufengdb
Github user liufengdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/19266#discussion_r143346155
  
--- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/codegen/BufferHolder.java ---
@@ -35,6 +35,11 @@
  * if the fields of row are all fixed-length, as the size of result row is also fixed.
  */
 public class BufferHolder {
+
+  // Some JVMs can't allocate arrays of length Integer.MAX_VALUE; actual max is somewhat
+  // smaller. Be conservative and lower the cap a little.
+  private static final int ARRAY_MAX = Integer.MAX_VALUE - 8;
--- End diff --

@srowen I think we can use `Integer.MAX_VALUE - 7` instead of 
`Integer.MAX_VALUE - 8` to make the size align with words; otherwise, this 
check will fail: 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java#L170.
  

This is the reason why all the size inputs to the methods are rounded; for example: 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/codegen/UnsafeRowWriter.java#L216.
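To make the two constraints concrete, here is an illustrative sketch (not Spark code): rounding a size up to the next word, the way the linked writer code rounds its inputs, and the cap that results when the cap itself must be divisible by 8 while staying at or below `Integer.MAX_VALUE - 8`:

```java
public class WordAlignedCapSketch {
  // Round a byte count up to the next multiple of 8 (word-aligned), the way
  // size inputs in the code under review are rounded. Illustrative only.
  static long roundToWord(long numBytes) {
    return (numBytes + 7) / 8 * 8;
  }

  // Largest int that is both <= Integer.MAX_VALUE - 8 and divisible by 8.
  static int wordAlignedMax() {
    return (Integer.MAX_VALUE - 8) & ~7;  // equals Integer.MAX_VALUE - 15
  }

  public static void main(String[] args) {
    if (roundToWord(9) != 16) throw new AssertionError();
    if (roundToWord(16) != 16) throw new AssertionError();
    if (wordAlignedMax() != Integer.MAX_VALUE - 15) throw new AssertionError();
    if (wordAlignedMax() % 8 != 0) throw new AssertionError();
  }
}
```

This is why a word-aligned cap cannot simply be `Integer.MAX_VALUE - 7`: that value is divisible by 8 but exceeds the conservative `Integer.MAX_VALUE - 8` limit, so the next multiple of 8 downward, `Integer.MAX_VALUE - 15`, is the safe choice.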


---





2017-10-07 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19266#discussion_r143343968
  
--- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/codegen/BufferHolder.java ---
@@ -35,6 +35,11 @@
  * if the fields of row are all fixed-length, as the size of result row is also fixed.
  */
 public class BufferHolder {
+
+  // Some JVMs can't allocate arrays of length Integer.MAX_VALUE; actual max is somewhat
+  // smaller. Be conservative and lower the cap a little.
+  private static final int ARRAY_MAX = Integer.MAX_VALUE - 8;
--- End diff --

Why not also fix line 54?


---





2017-10-07 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19266#discussion_r143343926
  
--- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/codegen/BufferHolder.java ---
@@ -35,6 +35,11 @@
  * if the fields of row are all fixed-length, as the size of result row is also fixed.
  */
 public class BufferHolder {
+
+  // Some JVMs can't allocate arrays of length Integer.MAX_VALUE; actual max is somewhat
+  // smaller. Be conservative and lower the cap a little.
+  private static final int ARRAY_MAX = Integer.MAX_VALUE - 8;
--- End diff --

Just curious, how did you get this value `-8`?

Also cc @liufengdb 


---





2017-09-23 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19266


---





2017-09-20 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/19266#discussion_r139656129
  
--- Diff: common/unsafe/src/main/java/org/apache/spark/unsafe/array/LongArray.java ---
@@ -39,7 +39,7 @@
   private final long length;
 
   public LongArray(MemoryBlock memory) {
-    assert memory.size() < (long) Integer.MAX_VALUE * 8: "Array size > 4 billion elements";
+    assert memory.size() < (long) Integer.MAX_VALUE * 8: "Array size > 2.1 billion elements";
--- End diff --

I wasn't sure whether the JVM array-size limit applied here, because this 
represents a non-native array. Still, I wanted to fix the comment as I saw it. If 
an array exists, its length is valid, so I didn't think that part of MemoryBlock 
represented an issue.


---





2017-09-19 Thread buryat
Github user buryat commented on a diff in the pull request:

https://github.com/apache/spark/pull/19266#discussion_r139616346
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -344,7 +344,7 @@ class Word2Vec extends Serializable with Logging {
     val newSentences = sentences.repartition(numPartitions).cache()
     val initRandom = new XORShiftRandom(seed)
 
-    if (vocabSize.toLong * vectorSize >= Int.MaxValue) {
+    if (vocabSize.toLong * vectorSize >= Int.MaxValue - 8) {
--- End diff --

I think this should be just `>`
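The boundary point can be illustrated with a small sketch (hypothetical names, not the Spark code): if `ARRAY_MAX` elements are themselves allocatable, the guard should reject only sizes strictly greater than it.

```java
public class BoundarySketch {
  static final int ARRAY_MAX = Integer.MAX_VALUE - 8;

  // A size equal to ARRAY_MAX is still allocatable, so reject only strictly
  // larger requests; '>' rather than '>='.
  static boolean tooLarge(long requested) {
    return requested > ARRAY_MAX;
  }

  public static void main(String[] args) {
    if (tooLarge(ARRAY_MAX)) throw new AssertionError();       // at the cap: allowed
    if (!tooLarge(ARRAY_MAX + 1L)) throw new AssertionError(); // past the cap: rejected
  }
}
```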


---





2017-09-19 Thread buryat
Github user buryat commented on a diff in the pull request:

https://github.com/apache/spark/pull/19266#discussion_r139616411
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala ---
@@ -304,8 +304,8 @@ class BlockMatrix @Since("1.3.0") (
       s"Int.MaxValue. Currently numRows: ${numRows()}")
     require(numCols() < Int.MaxValue, "The number of columns of this matrix should be less than " +
       s"Int.MaxValue. Currently numCols: ${numCols()}")
-    require(numRows() * numCols() < Int.MaxValue, "The length of the values array must be " +
-      s"less than Int.MaxValue. Currently numRows * numCols: ${numRows() * numCols()}")
+    require(numRows() * numCols() < Int.MaxValue - 8, "The length of the values array must be " +
--- End diff --

`<=`


---





2017-09-19 Thread buryat
Github user buryat commented on a diff in the pull request:

https://github.com/apache/spark/pull/19266#discussion_r139616165
  
--- Diff: core/src/main/scala/org/apache/spark/util/collection/CompactBuffer.scala ---
@@ -126,22 +126,20 @@ private[spark] class CompactBuffer[T: ClassTag] extends Seq[T] with Serializable
 
   /** Increase our size to newSize and grow the backing array if needed. */
   private def growToSize(newSize: Int): Unit = {
-    if (newSize < 0) {
-      throw new UnsupportedOperationException("Can't grow buffer past Int.MaxValue elements")
+    val arrayMax = Int.MaxValue - 8
+    if (newSize < 0 || newSize - 2 > arrayMax) {
+      throw new UnsupportedOperationException(s"Can't grow buffer past $arrayMax elements")
     }
     val capacity = if (otherElements != null) otherElements.length + 2 else 2
     if (newSize > capacity) {
-      var newArrayLen = 8
+      var newArrayLen = 8L
       while (newSize - 2 > newArrayLen) {

Ah, I see that it's reserved; it wasn't clear to me why `- 2`. It seems like a 
magic number to me, and I see it in other places too.
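For reference, a hedged sketch of the doubling loop under discussion (not the actual CompactBuffer code): the `- 2` reflects that CompactBuffer keeps its first two elements in fields outside the backing array, and widening the loop variable to `long` keeps the repeated doubling from wrapping to a negative int.

```java
public class DoublingSketch {
  static final int ARRAY_MAX = Integer.MAX_VALUE - 8;

  // newSize counts all elements; two of them live outside the backing array,
  // so the array itself only needs newSize - 2 slots.
  static int newArrayLength(int newSize) {
    long newArrayLen = 8L;               // long: doubling can't overflow to a negative int
    while (newSize - 2 > newArrayLen) {
      newArrayLen *= 2;
    }
    return (int) Math.min(newArrayLen, ARRAY_MAX);  // clamp to the JVM array cap
  }

  public static void main(String[] args) {
    if (newArrayLength(10) != 8) throw new AssertionError();
    if (newArrayLength(11) != 16) throw new AssertionError();
    if (newArrayLength(Integer.MAX_VALUE) != ARRAY_MAX) throw new AssertionError();
  }
}
```

With an `int` loop variable, a `newSize` near `Int.MaxValue` would double `newArrayLen` past 2^30 into overflow; as a `long` it simply reaches 2^31 and the clamp brings it back to `ARRAY_MAX`.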


---





2017-09-19 Thread buryat
Github user buryat commented on a diff in the pull request:

https://github.com/apache/spark/pull/19266#discussion_r139615624
  
--- Diff: common/unsafe/src/main/java/org/apache/spark/unsafe/array/LongArray.java ---
@@ -39,7 +39,7 @@
   private final long length;
 
   public LongArray(MemoryBlock memory) {
-    assert memory.size() < (long) Integer.MAX_VALUE * 8: "Array size > 4 billion elements";
+    assert memory.size() < (long) Integer.MAX_VALUE * 8: "Array size > 2.1 billion elements";
--- End diff --

Maybe also add the exact number in the error message instead of `2.1 billion`.


---





2017-09-19 Thread buryat
Github user buryat commented on a diff in the pull request:

https://github.com/apache/spark/pull/19266#discussion_r139615418
  
--- Diff: core/src/main/scala/org/apache/spark/util/collection/PartitionedPairBuffer.scala ---
@@ -96,5 +96,5 @@ private[spark] class PartitionedPairBuffer[K, V](initialCapacity: Int = 64)
 }
 
 private object PartitionedPairBuffer {
-  val MAXIMUM_CAPACITY = Int.MaxValue / 2 // 2 ^ 30 - 1
+  val MAXIMUM_CAPACITY = (Int.MaxValue - 8) / 2
--- End diff --

Maybe worth adding the `Int` type even though it's already an `Int`.

Also, the comment at line 28 should be changed to `1073741819`, i.e. 
`(Int.MaxValue - 8) / 2`.
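A quick check of that value (illustrative only, not Spark code):

```java
public class CapacitySketch {
  // New maximum: (Int.MaxValue - 8) / 2, which is 1073741819; no longer 2^30 - 1.
  static final int MAXIMUM_CAPACITY = (Integer.MAX_VALUE - 8) / 2;

  public static void main(String[] args) {
    if (MAXIMUM_CAPACITY != 1073741819) throw new AssertionError();
    if (MAXIMUM_CAPACITY == (1 << 30) - 1) throw new AssertionError(); // old comment's value
  }
}
```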


---





2017-09-18 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/19266#discussion_r139582421
  
--- Diff: common/unsafe/src/main/java/org/apache/spark/unsafe/array/LongArray.java ---
@@ -39,7 +39,7 @@
   private final long length;
 
   public LongArray(MemoryBlock memory) {
-    assert memory.size() < (long) Integer.MAX_VALUE * 8: "Array size > 4 billion elements";
+    assert memory.size() < (long) Integer.MAX_VALUE * 8: "Array size > 2.1 billion elements";
--- End diff --

Do we need the same assert below? 
https://github.com/apache/spark/blob/c66d64b3df9d9ffba0b16a62015680f6f876fc68/common/unsafe/src/main/java/org/apache/spark/unsafe/memory/MemoryBlock.java#L53


---





2017-09-18 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/19266#discussion_r139580567
  
--- Diff: common/unsafe/src/main/java/org/apache/spark/unsafe/array/LongArray.java ---
@@ -39,7 +39,7 @@
   private final long length;
 
   public LongArray(MemoryBlock memory) {
-    assert memory.size() < (long) Integer.MAX_VALUE * 8: "Array size > 4 billion elements";
+    assert memory.size() < (long) Integer.MAX_VALUE * 8: "Array size > 2.1 billion elements";
--- End diff --

ISTM `memory.size() < (long) Integer.MAX_VALUE * 8` is correct? See: 
https://github.com/apache/spark/blob/c66d64b3df9d9ffba0b16a62015680f6f876fc68/common/unsafe/src/main/java/org/apache/spark/unsafe/memory/MemoryBlock.java#L54


---





2017-09-18 Thread buryat
Github user buryat commented on a diff in the pull request:

https://github.com/apache/spark/pull/19266#discussion_r139492024
  
--- Diff: core/src/main/scala/org/apache/spark/util/collection/CompactBuffer.scala ---
@@ -126,22 +126,20 @@ private[spark] class CompactBuffer[T: ClassTag] extends Seq[T] with Serializable
 
   /** Increase our size to newSize and grow the backing array if needed. */
   private def growToSize(newSize: Int): Unit = {
-    if (newSize < 0) {
-      throw new UnsupportedOperationException("Can't grow buffer past Int.MaxValue elements")
+    val arrayMax = Int.MaxValue - 8
+    if (newSize < 0 || newSize - 2 > arrayMax) {
+      throw new UnsupportedOperationException(s"Can't grow buffer past $arrayMax elements")
     }
     val capacity = if (otherElements != null) otherElements.length + 2 else 2
    if (newSize > capacity) {
-      var newArrayLen = 8
+      var newArrayLen = 8L
       while (newSize - 2 > newArrayLen) {
--- End diff --

I think we can remove `- 2` now since `newArrayLen` is a Long.


---





2017-09-18 Thread buryat
Github user buryat commented on a diff in the pull request:

https://github.com/apache/spark/pull/19266#discussion_r139489658
  
--- Diff: core/src/main/java/org/apache/spark/unsafe/map/HashMapGrowthStrategy.java ---
@@ -30,11 +30,15 @@
   HashMapGrowthStrategy DOUBLING = new Doubling();
 
   class Doubling implements HashMapGrowthStrategy {
+
+    private static final int ARRAY_MAX = Integer.MAX_VALUE - 8;
--- End diff --

Maybe worth adding a comment explaining why this value is chosen as the max, 
like here: 
http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/tip/src/share/classes/java/util/ArrayList.java#l223


---





2017-09-18 Thread buryat
Github user buryat commented on a diff in the pull request:

https://github.com/apache/spark/pull/19266#discussion_r139489491
  
--- Diff: common/unsafe/src/main/java/org/apache/spark/unsafe/array/LongArray.java ---
@@ -39,7 +39,7 @@
   private final long length;
 
   public LongArray(MemoryBlock memory) {
-    assert memory.size() < (long) Integer.MAX_VALUE * 8: "Array size > 4 billion elements";
+    assert memory.size() < (long) Integer.MAX_VALUE * 8: "Array size > 2.1 billion elements";
--- End diff --

`assert memory.size() < (long) (Integer.MAX_VALUE - 8) * 8`
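Spelled out as a runnable sketch (hypothetical names; the real assert lives in LongArray's constructor): a block of 8-byte words stays under the conservative array cap only if its byte size is below `(Integer.MAX_VALUE - 8) * 8`, computed in long arithmetic so the multiplication doesn't overflow.

```java
public class LongArrayBoundSketch {
  // Byte-size cap for a block of 8-byte elements whose element count must stay
  // below the conservative JVM array limit of Integer.MAX_VALUE - 8.
  static final long MAX_BYTES = (long) (Integer.MAX_VALUE - 8) * 8;

  static boolean fits(long memorySizeInBytes) {
    return memorySizeInBytes < MAX_BYTES;
  }

  public static void main(String[] args) {
    if (!fits(8L * (Integer.MAX_VALUE - 9))) throw new AssertionError();
    if (fits(8L * (Integer.MAX_VALUE - 8))) throw new AssertionError();
  }
}
```

Note the cast is `(long) (Integer.MAX_VALUE - 8) * 8`, not `(long) Integer.MAX_VALUE * 8 - 8`: the subtraction must happen on the element count before scaling to bytes.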


---





2017-09-18 Thread srowen
GitHub user srowen opened a pull request:

https://github.com/apache/spark/pull/19266

[SPARK-22033][CORE] BufferHolder, other size checks should account for the specific VM array size limitations

## What changes were proposed in this pull request?

Try to avoid allocating an array bigger than Integer.MAX_VALUE - 8, which 
is the actual max size on some JVMs, in several places.

## How was this patch tested?

Existing tests

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/srowen/spark SPARK-22033

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19266.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19266


commit 8dbfa30848b49a9fe4038c327222ddc9fd9a0ec0
Author: Sean Owen 
Date:   2017-09-18T14:59:40Z

Try to avoid allocating an array bigger than Integer.MAX_VALUE - 8, which is the actual max size on some JVMs, in several places




---
