[jira] [Commented] (SPARK-19525) Enable Compression of Spark Streaming Checkpoints

2017-02-17, Aaditya Ramesh (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-19525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15872698#comment-15872698 ]

Aaditya Ramesh commented on SPARK-19525:
----------------------------------------

We are suggesting compressing only right before we write the checkpoint, not in 
memory. This is not happening right now: we just serialize the elements in the 
partition one by one and add them to the serialization stream, as seen in 
{{ReliableCheckpointRDD.writePartitionToCheckpointFile}}:

{code}
val fileOutputStream = if (blockSize < 0) {
  fs.create(tempOutputPath, false, bufferSize)
} else {
  // This is mainly for testing purpose
  fs.create(tempOutputPath, false, bufferSize,
    fs.getDefaultReplication(fs.getWorkingDirectory), blockSize)
}
val serializer = env.serializer.newInstance()
val serializeStream = serializer.serializeStream(fileOutputStream)
Utils.tryWithSafeFinally {
  serializeStream.writeAll(iterator)
} {
  serializeStream.close()
}
{code}

As you can see, we don't do any compression after the serialization step. In 
our patch, we just wrap the serialization stream in a CompressionCodec output 
stream, and do the corresponding wrapping on the read path.
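
A minimal sketch of what that could look like (this is not the actual patch; it assumes Spark's internal {{org.apache.spark.io.CompressionCodec}} factory and reuses the names from the snippet above):

{code}
import org.apache.spark.io.CompressionCodec

// Write path: pick the codec from the Spark conf (e.g. snappy) and compress
// after serialization, just before the bytes are written to the DFS.
val codec = CompressionCodec.createCodec(env.conf)
val serializeStream =
  serializer.serializeStream(codec.compressedOutputStream(fileOutputStream))
Utils.tryWithSafeFinally {
  serializeStream.writeAll(iterator)
} {
  serializeStream.close()
}

// Read path (in ReliableCheckpointRDD.readCheckpointFile): mirror the wrapping
// so the deserializer reads through the codec's decompressing input stream.
val fileInputStream = fs.open(path, bufferSize)
val deserializeStream =
  serializer.deserializeStream(codec.compressedInputStream(fileInputStream))
{code}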

> Enable Compression of Spark Streaming Checkpoints
> --------------------------------------------------
>
>                 Key: SPARK-19525
>                 URL: https://issues.apache.org/jira/browse/SPARK-19525
>             Project: Spark
>          Issue Type: Improvement
>          Components: Structured Streaming
>    Affects Versions: 2.1.0
>            Reporter: Aaditya Ramesh
>
> In our testing, compressing partitions while writing them to checkpoints on 
> HDFS using snappy helped performance significantly while also reducing the 
> variability of the checkpointing operation. In our tests, checkpointing time 
> was reduced by 3X, and variability was reduced by 2X for data sets of 
> compressed size approximately 1 GB.






[jira] [Commented] (SPARK-19525) Enable Compression of Spark Streaming Checkpoints

2017-02-17, Shixiong Zhu (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-19525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15872652#comment-15872652 ]

Shixiong Zhu commented on SPARK-19525:
--------------------------------------

Hmm, Spark should support compression for data in RDDs. Which code path did you 
find that is not compressing the data?




[jira] [Commented] (SPARK-19525) Enable Compression of Spark Streaming Checkpoints

2017-02-17, Aaditya Ramesh (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-19525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15872441#comment-15872441 ]

Aaditya Ramesh commented on SPARK-19525:
----------------------------------------

[~zsxwing] Actually, we are compressing the data in the RDDs, not the streaming 
metadata. We compress all records in a partition together and write them to our 
DFS. In our case, the snappy-compressed size of each RDD partition is around 18 
MB, with 84 partitions, for a total of 1.5 GB per RDD.
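
For illustration, a hypothetical sketch of that grouping (not our actual patch), using the snappy-java library that Spark already bundles. Compressing a partition's serialized records through a single snappy stream lets the codec exploit redundancy across records, instead of compressing each record on its own:

{code}
import java.io.ByteArrayOutputStream
import org.xerial.snappy.SnappyOutputStream

// Compress all serialized records of one partition as a single snappy stream.
def compressPartition(records: Iterator[Array[Byte]]): Array[Byte] = {
  val buffer = new ByteArrayOutputStream()
  val snappyOut = new SnappyOutputStream(buffer)
  records.foreach(rec => snappyOut.write(rec))
  snappyOut.close() // flushes the final snappy block
  buffer.toByteArray
}
{code}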




[jira] [Commented] (SPARK-19525) Enable Compression of Spark Streaming Checkpoints

2017-02-09, Shixiong Zhu (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-19525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15860236#comment-15860236 ]

Shixiong Zhu commented on SPARK-19525:
--------------------------------------

[~rameshaaditya117] Sounds like a good idea. I thought the metadata checkpoint 
should be very small. How large are your checkpoint files?




[jira] [Commented] (SPARK-19525) Enable Compression of Spark Streaming Checkpoints

2017-02-08, Aaditya Ramesh (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-19525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15858810#comment-15858810 ]

Aaditya Ramesh commented on SPARK-19525:
----------------------------------------

We have a patch that works for an older version. I am currently trying to port 
it to Spark 2.1.0. Is this okay?
