When my 22 MB Parquet test file ended up taking 3 GB when cached in memory, I looked more closely at how in-memory column compression works in 1.4.0. The test dataset is 1,000 columns * 800,000 rows: mostly empty floating-point columns, plus a few dense long columns.
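For concreteness, the setup is roughly along these lines (column names, counts and the data generation are illustrative only, not my actual job):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types._

    // A handful of dense LongType columns plus ~1,000 mostly-empty DoubleType columns.
    val denseFields  = (0 until 5).map(i => StructField(s"id_$i", LongType, nullable = false))
    val sparseFields = (0 until 995).map(i => StructField(s"f_$i", DoubleType, nullable = true))
    val schema = StructType(denseFields ++ sparseFields)

    val rows = sc.parallelize(0L until 800000L).map { i =>
      // nearly every floating-point cell is null
      Row.fromSeq(Seq.fill(5)(i) ++ Seq.fill(995)(null))
    }

    val df = sqlContext.createDataFrame(rows, schema)
    df.cache().count()   // cached size shows up under the Storage tab in the UI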
I was surprised to see that none of the concrete "org.apache.spark.sql.columnar.compression.CompressionScheme" implementations supports floating-point types, so conversion falls back to the no-op "PassThrough" implementation. In addition, the way "org.apache.spark.sql.columnar.NullableColumnBuilder" encodes null values (four bytes for each of them) seems heavily biased against sparse data; see the back-of-envelope numbers below. It would be interesting to know whether sparse floating-point datasets were neglected for a reason other than some obscure historical accident. Is there anything on the Tungsten roadmap that would allow, for example, https://drill.apache.org/docs/value-vectors/ -style efficiency for this kind of data?
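The back-of-envelope estimate of just the null bookkeeping, assuming (slightly pessimistically) that every floating-point cell in every one of the ~1,000 sparse columns is null:

    // Rough numbers only -- treats all ~1,000 columns as fully null float columns.
    val rows       = 800000L
    val sparseCols = 1000L
    val bytesPerNull = 4L   // NullableColumnBuilder records each null's position as an Int
    val nullOverhead = rows * sparseCols * bytesPerNull
    // ~= 3.2e9 bytes, i.e. roughly the 3 GB I'm seeing, before any actual values are stored.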
