[GitHub] spark pull request #14726: [SPARK-16862] Configurable buffer size in `UnsafeSorterSpillReader`

2016-08-23 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/14726





[GitHub] spark pull request #14726: [SPARK-16862] Configurable buffer size in `UnsafeSorterSpillReader`

2016-08-23 Thread tejasapatil
Github user tejasapatil commented on a diff in the pull request:

https://github.com/apache/spark/pull/14726#discussion_r75960177
  
--- Diff: core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillReader.java ---
@@ -22,15 +22,21 @@
 import com.google.common.io.ByteStreams;
 import com.google.common.io.Closeables;
 
+import org.apache.spark.SparkEnv;
 import org.apache.spark.serializer.SerializerManager;
 import org.apache.spark.storage.BlockId;
 import org.apache.spark.unsafe.Platform;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
 
 /**
  * Reads spill files written by {@link UnsafeSorterSpillWriter} (see that class for a description
  * of the file format).
  */
 public final class UnsafeSorterSpillReader extends UnsafeSorterIterator implements Closeable {
+  private static final Logger logger = LoggerFactory.getLogger(UnsafeSorterSpillReader.class);
+  private static final int DEFAULT_BUFFER_SIZE_BYTES = 1024 * 1024; // 1 MB
--- End diff --

@rxin : Ok. I will do that change separately.





[GitHub] spark pull request #14726: [SPARK-16862] Configurable buffer size in `UnsafeSorterSpillReader`

2016-08-20 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/14726#discussion_r75574332
  
--- Diff: core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillReader.java ---
@@ -22,15 +22,21 @@
 import com.google.common.io.ByteStreams;
 import com.google.common.io.Closeables;
 
+import org.apache.spark.SparkEnv;
 import org.apache.spark.serializer.SerializerManager;
 import org.apache.spark.storage.BlockId;
 import org.apache.spark.unsafe.Platform;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
 
 /**
  * Reads spill files written by {@link UnsafeSorterSpillWriter} (see that class for a description
  * of the file format).
  */
 public final class UnsafeSorterSpillReader extends UnsafeSorterIterator implements Closeable {
+  private static final Logger logger = LoggerFactory.getLogger(UnsafeSorterSpillReader.class);
+  private static final int DEFAULT_BUFFER_SIZE_BYTES = 1024 * 1024; // 1 MB
--- End diff --

Yes, hierarchical merging sounds good. Actually, we might even want to manage the buffers' memory more explicitly (just ask the memory manager for `spark.unsafe.sorter.spill.reader.buffer.size` bytes for every spill file we open).
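For illustration, a minimal sketch of that idea — the `MemoryBudget` interface below is hypothetical, invented for this example as a stand-in for whatever the task memory manager actually exposes:

```java
// Hypothetical sketch: account for the read buffer with the memory manager
// before opening each spill file, and release it on close. MemoryBudget is
// an invented stand-in, not Spark's real memory-manager API.
interface MemoryBudget {
  long acquire(long requestedBytes); // returns the number of bytes granted
  void release(long bytes);
}

final class AccountedSpillBuffer implements AutoCloseable {
  private final MemoryBudget budget;
  private final byte[] buffer;

  AccountedSpillBuffer(MemoryBudget budget, long requestedBytes) {
    // Ask the memory manager first; use a smaller buffer if the full
    // request (e.g. spark.unsafe.sorter.spill.reader.buffer.size) is not
    // granted, instead of allocating unaccounted memory.
    long granted = budget.acquire(requestedBytes);
    this.budget = budget;
    this.buffer = new byte[(int) granted];
  }

  byte[] buffer() {
    return buffer;
  }

  @Override
  public void close() {
    // Return the bytes to the memory manager when the spill reader closes.
    budget.release(buffer.length);
  }
}
```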






[GitHub] spark pull request #14726: [SPARK-16862] Configurable buffer size in `UnsafeSorterSpillReader`

2016-08-19 Thread tejasapatil
Github user tejasapatil commented on a diff in the pull request:

https://github.com/apache/spark/pull/14726#discussion_r75573049
  
--- Diff: core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillReader.java ---
@@ -22,15 +22,21 @@
 import com.google.common.io.ByteStreams;
 import com.google.common.io.Closeables;
 
+import org.apache.spark.SparkEnv;
 import org.apache.spark.serializer.SerializerManager;
 import org.apache.spark.storage.BlockId;
 import org.apache.spark.unsafe.Platform;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
 
 /**
  * Reads spill files written by {@link UnsafeSorterSpillWriter} (see that class for a description
  * of the file format).
  */
 public final class UnsafeSorterSpillReader extends UnsafeSorterIterator implements Closeable {
+  private static final Logger logger = LoggerFactory.getLogger(UnsafeSorterSpillReader.class);
+  private static final int DEFAULT_BUFFER_SIZE_BYTES = 1024 * 1024; // 1 MB
--- End diff --

@rxin : In response to [0], I have changed the default to 1 MB. In my experiments, 1 MB gave good performance, and we use it as the default for all our prod jobs.

One concern / proposal:

With this change, `UnsafeSorterSpillReader` would consume more memory than before, since the read buffer grows from 8 KB to 1 MB. The overall per-object footprint of `UnsafeSorterSpillReader` grows from 2.5 MB to 3.6 MB (I profiled to get these numbers; see [1]). For jobs which spill a lot, many of these spill readers get created (in the screenshot, there were 400+). The current merging approach opens all the spill files at the same time and merges them all at once using a priority queue. Having lots of these objects in memory can lead to OOMs, because there is no accounting for the buffers allocated inside `UnsafeSorterSpillReader` (even without this change, Snappy already had its own buffers for compressed and uncompressed data). From the disk's point of view, having lots of files open at the same time also leads to random seeks and won't play well with the OS's read cache. One might say that users should tune the job so that it spills less, but that may not be obvious to people who do not understand the system internals. Also, for pipelines the data changes every day, and one setting may not work every time.
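(For scale, using the numbers above: at ~3.6 MB per reader, the 400+ readers in that profile alone hold roughly 400 × 3.6 MB ≈ 1.4 GB of buffers that the memory manager never sees.)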

Should we add some kind of hierarchical merging, wherein spill files are iteratively merged in batches? It could be turned on when there are, say, more than 100 spill files to merge. AFAIK, Hadoop does this; a rough sketch follows the links below.

[0] : https://github.com/apache/spark/pull/14475#discussion_r75440822
[1] : https://postimg.org/image/cs5zr6lyx/
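
To make the proposal concrete, a rough sketch of batched merging — `mergeOnce` is a placeholder for the existing priority-queue merge, and all names here are hypothetical:

```java
// Rough sketch of hierarchical merging: merge spill files in batches of
// `fanIn` so that only `fanIn` readers (and their buffers) are open at a
// time, iterating until a single file remains.
import java.io.File;
import java.util.ArrayList;
import java.util.List;

final class HierarchicalMerge {
  static File merge(List<File> spills, int fanIn) {
    if (spills.isEmpty()) {
      throw new IllegalArgumentException("no spill files to merge");
    }
    List<File> current = new ArrayList<>(spills);
    while (current.size() > 1) {
      List<File> next = new ArrayList<>();
      for (int i = 0; i < current.size(); i += fanIn) {
        // Each pass opens at most `fanIn` spill files at once.
        List<File> batch = current.subList(i, Math.min(i + fanIn, current.size()));
        next.add(mergeOnce(batch));
      }
      current = next;
    }
    return current.get(0);
  }

  // Placeholder: in the real code this would be the priority-queue merge
  // over a batch of UnsafeSorterSpillReaders, spilled back to disk.
  private static File mergeOnce(List<File> batch) {
    throw new UnsupportedOperationException("stand-in for the real merge");
  }
}
```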





[GitHub] spark pull request #14726: [SPARK-16862] Configurable buffer size in `UnsafeSorterSpillReader`

2016-08-19 Thread tejasapatil
GitHub user tejasapatil opened a pull request:

https://github.com/apache/spark/pull/14726

[SPARK-16862] Configurable buffer size in `UnsafeSorterSpillReader`

## What changes were proposed in this pull request?

Jira: https://issues.apache.org/jira/browse/SPARK-16862

The `BufferedInputStream` used in `UnsafeSorterSpillReader` reads data off disk with the default 8 KB buffer. This PR makes that buffer size configurable to improve disk reads; a sketch of the idea follows below. I set the default to 1 MB, as that value gave improved performance in my tests.
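
A minimal sketch (not the exact patch) of the configurable buffer, using the config key discussed in the review; the 8 KB / 16 MB clamping bounds are illustrative assumptions:

```java
// Sketch: read the buffer size from SparkConf and pass it to
// BufferedInputStream. The config key is the one discussed in this thread;
// the clamping bounds are assumptions for illustration.
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.spark.SparkConf;

final class SpillReadBuffer {
  private static final int DEFAULT_BUFFER_SIZE_BYTES = 1024 * 1024; // 1 MB

  static BufferedInputStream openSpillFile(SparkConf conf, String path) throws IOException {
    long bufferSizeBytes = conf.getSizeAsBytes(
        "spark.unsafe.sorter.spill.reader.buffer.size",
        Long.toString(DEFAULT_BUFFER_SIZE_BYTES));
    // Guard against nonsensical settings.
    if (bufferSizeBytes < 8 * 1024 || bufferSizeBytes > 16 * 1024 * 1024) {
      bufferSizeBytes = DEFAULT_BUFFER_SIZE_BYTES;
    }
    return new BufferedInputStream(new FileInputStream(path), (int) bufferSizeBytes);
  }
}
```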

## How was this patch tested?

I am relying on the existing unit tests.

## Performance

After deploying this change to prod and setting the config to 1 MB, there was a 12% reduction in CPU time and a 19.5% reduction in CPU reservation time.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/tejasapatil/spark spill_buffer_2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14726.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14726


commit c4f37b6c8d3f1a8a565b1f215f55a501edece778
Author: Tejas Patil 
Date:   2016-08-20T05:06:03Z

[SPARK-16862] Configurable buffer size in `UnsafeSorterSpillReader`



