Re: Cache sparkSql data without uncompressing it in memory

2014-11-13 Thread Sadhan Sood
Thanks Cheng. Just one more question: does that mean we still need
enough memory in the cluster to uncompress the data before it can be
compressed again, or does it just read the raw data as is?

On Wed, Nov 12, 2014 at 10:05 PM, Cheng Lian <lian.cs@gmail.com> wrote:

  Currently there’s no way to cache the compressed sequence file directly.
 Spark SQL uses an in-memory columnar format while caching table rows, so we
 must read all the raw data and convert it into columnar format. However,
 you can enable in-memory columnar compression by setting
 spark.sql.inMemoryColumnarStorage.compressed to true. This property is
 already set to true by default in the master branch and branch-1.2.
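
A minimal sketch of what enabling this looks like with a Spark 1.2-era
|HiveContext| (the |sc| handle and table name below are hypothetical, not
from the thread):

    import org.apache.spark.sql.hive.HiveContext

    // Assumes an existing SparkContext `sc`; "my_table" stands in for a
    // hypothetical Hive table backed by compressed sequence files.
    val sqlContext = new HiveContext(sc)

    // Compress the in-memory columnar buffers (already the default in
    // master and branch-1.2).
    sqlContext.setConf("spark.sql.inMemoryColumnarStorage.compressed", "true")

    // Rows are read from the sequence files, converted to columnar batches,
    // and re-compressed as the cache is built.
    sqlContext.cacheTable("my_table")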

 On 11/13/14 7:16 AM, Sadhan Sood wrote:

   We noticed that data from our Hive tables, which is stored in compressed
 sequence file format, gets uncompressed in memory when cached. Is there a
 way to turn this off and cache the compressed data as is?



Re: Cache sparkSql data without uncompressing it in memory

2014-11-13 Thread Cheng Lian
No, the columnar buffer is built in small batches; the batch size is
controlled by the |spark.sql.inMemoryColumnarStorage.batchSize|
property. The default value in master and branch-1.2 is 10,000
rows per batch.
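
So only one batch of raw rows per partition needs to be held uncompressed
at a time while the cache is built. A sketch, reusing the hypothetical
|sqlContext| and table from the sketch above:

    // Build columnar batches 10,000 rows at a time (the branch-1.2 default);
    // each batch is compressed as soon as it is full.
    sqlContext.setConf("spark.sql.inMemoryColumnarStorage.batchSize", "10000")
    sqlContext.cacheTable("my_table")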


On 11/14/14 1:27 AM, Sadhan Sood wrote:

Thanks Cheng. Just one more question: does that mean we still
need enough memory in the cluster to uncompress the data before it can
be compressed again, or does it just read the raw data as is?


On Wed, Nov 12, 2014 at 10:05 PM, Cheng Lian <lian.cs@gmail.com> wrote:


Currently there’s no way to cache the compressed sequence file
directly. Spark SQL uses an in-memory columnar format while caching
table rows, so we must read all the raw data and convert it into
columnar format. However, you can enable in-memory columnar
compression by setting
|spark.sql.inMemoryColumnarStorage.compressed| to |true|. This
property is already set to true by default in the master branch and
branch-1.2.

On 11/13/14 7:16 AM, Sadhan Sood wrote:


We noticed that data from our Hive tables, which is stored in
compressed sequence file format, gets uncompressed in memory when
cached. Is there a way to turn this off and cache the compressed
data as is?





Cache sparkSql data without uncompressing it in memory

2014-11-12 Thread Sadhan Sood
We noticed that data from our Hive tables, which is stored in compressed
sequence file format, gets uncompressed in memory when cached. Is there a
way to turn this off and cache the compressed data as is?