Re: [ compress in-memory column storage used in sparksql cache table ]
Yeah, two of the reasons why the built-in in-memory columnar storage doesn't achieve a compression ratio comparable to Parquet's are:

1. The in-memory columnar representation doesn't handle nested types, so array/map/struct values are not compressed.
2. Parquet may apply more than one compression method to a single column, for example dictionary encoding combined with RLE (a toy sketch follows below the quote).

Cheng

On 9/2/15 3:58 PM, Nitin Goyal wrote:
> I think Spark SQL's in-memory columnar cache already does compression. Check out the classes under
> https://github.com/apache/spark/tree/master/sql/core/src/main/scala/org/apache/spark/sql/columnar/compression
> although the compression ratio is not as good as Parquet's.
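To make the dictionary + RLE combination concrete, here is a minimal standalone Scala sketch. It only illustrates the general technique, not Spark's or Parquet's actual implementation: each distinct value is mapped to a small integer code, and runs of identical codes are then collapsed into (code, run-length) pairs.

object DictThenRle {
  // Dictionary encoding: map each distinct value to a small integer code.
  def dictionaryEncode[T](column: Seq[T]): (Vector[T], Seq[Int]) = {
    val dict = column.distinct.toVector
    val index = dict.zipWithIndex.toMap
    (dict, column.map(index))
  }

  // Run-length encoding: collapse runs of identical codes into (code, length) pairs.
  def runLengthEncode(codes: Seq[Int]): List[(Int, Int)] =
    codes.foldLeft(List.empty[(Int, Int)]) {
      case ((code, n) :: rest, c) if code == c => (code, n + 1) :: rest
      case (acc, c)                            => (c, 1) :: acc
    }.reverse

  def main(args: Array[String]): Unit = {
    val column = Seq("US", "US", "US", "UK", "UK", "US")
    val (dict, codes) = dictionaryEncode(column)
    println(s"dictionary: $dict")                     // Vector(US, UK)
    println(s"dict+RLE:   ${runLengthEncode(codes)}") // List((0,3), (1,2), (0,1))
  }
}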
Re: [ compress in-memory column storage used in sparksql cache table ]
I think Spark SQL's in-memory columnar cache already does compression. Check out the classes under
https://github.com/apache/spark/tree/master/sql/core/src/main/scala/org/apache/spark/sql/columnar/compression
although the compression ratio is not as good as Parquet's.

Thanks
-Nitin
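For reference, this behavior is controlled by the conf spark.sql.inMemoryColumnarStorage.compressed. A minimal sketch against the Spark 1.x API (the table name and data here are made up for illustration):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object CachedTableCompression {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("cache-compression").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Compression of the in-memory columnar cache is controlled by this conf
    // (it defaults to true in recent releases).
    sqlContext.setConf("spark.sql.inMemoryColumnarStorage.compressed", "true")

    // A small table with heavily repeated string values, which dictionary
    // encoding and RLE handle well.
    sc.parallelize(1 to 100000)
      .map(i => (i, s"country_${i % 5}"))
      .toDF("id", "country")
      .registerTempTable("people")

    // cacheTable materializes the table in the compressed in-memory columnar
    // format on first use.
    sqlContext.cacheTable("people")
    sqlContext.sql("SELECT country, COUNT(*) FROM people GROUP BY country").show()
  }
}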
[ compress in-memory column storage used in sparksql cache table ]
Hi, I have an idea; can someone give me some advice? I want to compress the data in the in-memory column storage used by cache table in Spark, so that cached tables use less memory. I will guard this feature behind a conf, so anyone who wants it can set that conf to true. For the compression algorithm, I want to use dictionary encoding. Do you think this method is worth a try?
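Roughly what I have in mind, as a hypothetical sketch (the conf key and all names below are made up for illustration, not existing Spark settings): when the conf is true, the column is stored as one dictionary of distinct values plus small integer codes per row.

object ProposedDictionaryCompression {
  // Hypothetical conf key, for illustration only -- not an actual Spark setting.
  val ConfKey = "spark.sql.inMemoryColumnarStorage.dictionary.enabled"

  final case class EncodedColumn(dictionary: Array[String], codes: Array[Int])

  // Store each distinct string once; rows hold only integer codes.
  def encode(column: Array[String]): EncodedColumn = {
    val dict = column.distinct
    val index = dict.zipWithIndex.toMap
    EncodedColumn(dict, column.map(index))
  }

  def main(args: Array[String]): Unit = {
    val conf = Map(ConfKey -> "true") // stand-in for the proposed conf lookup
    val column = Array.tabulate(1000)(i => s"country_${i % 5}")
    if (conf.getOrElse(ConfKey, "false").toBoolean) {
      val encoded = encode(column)
      // 1000 repeated strings shrink to 5 dictionary entries + 1000 ints.
      println(s"dictionary size: ${encoded.dictionary.length}, codes: ${encoded.codes.length}")
    }
  }
}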