PARQUET-580: Switch int[] initialization in IntList to be lazy

While importing a dataset with a lot of columns (a few thousand) that weren't 
being used, we noticed that we ended up allocating a lot of unnecessary int 
arrays (each 64K in size). The heap footprint of all those int[]s turned out to 
be around 2GB (and resulted in some jobs OOMing). This is unnecessary for 
columns that might not be used. The changes in this PR initialize the int[] 
only when it is used for the first time.

I'm also wondering whether 64K is the right size to start with. A potential 
improvement would be to allocate these int[]s in IntList in a way that slowly 
ramps up their size. Rather than creating arrays of size 64K at a time (which is 
potentially wasteful if there are only a few hundred bytes), we could create, 
say, a 4K int[], then an 8K int[] when it fills up, and so on until we reach 64K 
(at which point the behavior is the same as the current implementation). If 
this sounds like a reasonable idea, I can update this PR to do that as well. I 
wasn't sure if there was some historical context around the current size.
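
To make the ramp-up idea concrete, here is a rough sketch of what it could look 
like. It is not part of this PR and not the parquet-mr IntList: the class name 
RampingIntList, the constants, and the helper methods allocatedInts() and get() 
are all hypothetical, and the 4K starting size is just the example value from 
the paragraph above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only, not the parquet-mr IntList. Slabs are allocated
// lazily and double in size from INITIAL_SLAB_SIZE until they reach MAX_SLAB_SIZE.
class RampingIntList {
  static final int INITIAL_SLAB_SIZE = 4 * 1024; // assumed starting size
  static final int MAX_SLAB_SIZE = 64 * 1024;    // the 64K slab size discussed above

  private final List<int[]> slabs = new ArrayList<int[]>();
  private int[] currentSlab;   // stays null until the first add()
  private int currentSlabPos;
  private int count;           // slab sizes vary, so the element count is tracked explicitly

  void add(int i) {
    if (currentSlab == null) {
      // First value: allocate the small initial slab.
      currentSlab = new int[INITIAL_SLAB_SIZE];
    } else if (currentSlabPos == currentSlab.length) {
      // Current slab is full: retire it and allocate one twice as large, capped at the max.
      slabs.add(currentSlab);
      currentSlab = new int[Math.min(currentSlab.length * 2, MAX_SLAB_SIZE)];
      currentSlabPos = 0;
    }
    currentSlab[currentSlabPos++] = i;
    count++;
  }

  int size() {
    return count;
  }

  // Total number of int elements currently allocated across all slabs.
  int allocatedInts() {
    int total = (currentSlab == null) ? 0 : currentSlab.length;
    for (int[] slab : slabs) {
      total += slab.length;
    }
    return total;
  }

  // Random access, walking the full slabs first and then the current one.
  int get(int index) {
    if (index < 0 || index >= count) {
      throw new IndexOutOfBoundsException("index " + index + ", size " + count);
    }
    for (int[] slab : slabs) {
      if (index < slab.length) {
        return slab[index];
      }
      index -= slab.length;
    }
    return currentSlab[index];
  }
}
```

One consequence worth noting: once slabs differ in size, the element count can 
no longer be derived as SLAB_SIZE * slabs.size() + currentSlabPos the way 
iterator() does today, so the sketch keeps an explicit count, and an iterator 
would need to use each slab's own length.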

Author: Piyush Narang <pnar...@twitter.com>

Closes #339 from piyushnarang/master and squashes the following commits:

3ecc577 [Piyush Narang] Remove redundant IntList ctor
f7dfd5f [Piyush Narang] Switch int[] initialization in IntList to be lazy


Project: http://git-wip-us.apache.org/repos/asf/parquet-mr/repo
Commit: http://git-wip-us.apache.org/repos/asf/parquet-mr/commit/de17deab
Tree: http://git-wip-us.apache.org/repos/asf/parquet-mr/tree/de17deab
Diff: http://git-wip-us.apache.org/repos/asf/parquet-mr/diff/de17deab

Branch: refs/heads/parquet-1.8.x
Commit: de17deab9c52991d8eba493511caa7728bf40cf1
Parents: 1c60a56
Author: Piyush Narang <pnar...@twitter.com>
Authored: Sat Apr 16 17:25:31 2016 -0700
Committer: Ryan Blue <b...@apache.org>
Committed: Mon Jan 9 16:54:53 2017 -0800

----------------------------------------------------------------------
 .../column/values/dictionary/IntList.java       | 21 +++++++++++---------
 1 file changed, 12 insertions(+), 9 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/parquet-mr/blob/de17deab/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/IntList.java
----------------------------------------------------------------------
diff --git a/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/IntList.java b/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/IntList.java
index 3201072..8e6228a 100644
--- a/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/IntList.java
+++ b/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/IntList.java
@@ -58,7 +58,7 @@ public class IntList {
     }
 
     /**
-     * @return wether there is a next value
+     * @return whether there is a next value
      */
     public boolean hasNext() {
       return current < count;
@@ -76,16 +76,12 @@ public class IntList {
   }
 
   private List<int[]> slabs = new ArrayList<int[]>();
+
+  // Lazy initialize currentSlab only when needed to save on memory in cases where items might
+  // not be added
   private int[] currentSlab;
   private int currentSlabPos;
 
-  /**
-   * construct an empty list
-   */
-  public IntList() {
-    initSlab();
-  }
-
   private void initSlab() {
     currentSlab = new int[SLAB_SIZE];
     currentSlabPos = 0;
@@ -95,10 +91,13 @@ public class IntList {
    * @param i value to append to the end of the list
    */
   public void add(int i) {
-    if (currentSlabPos == currentSlab.length) {
+    if (currentSlab == null) {
+      initSlab();
+    } else if (currentSlabPos == currentSlab.length) {
       slabs.add(currentSlab);
       initSlab();
     }
+
     currentSlab[currentSlabPos] = i;
     ++ currentSlabPos;
   }
@@ -108,6 +107,10 @@ public class IntList {
    * @return an IntIterator on the content
    */
   public IntIterator iterator() {
+    if (currentSlab == null) {
+      initSlab();
+    }
+
     int[][] itSlabs = slabs.toArray(new int[slabs.size() + 1][]);
     itSlabs[slabs.size()] = currentSlab;
     return new IntIterator(itSlabs, SLAB_SIZE * slabs.size() + currentSlabPos);
