When my 22 MB Parquet test file ended up taking 3 GB when cached in memory, I looked more closely at how in-memory column compression works in 1.4.0. The test dataset is 1,000 columns * 800,000 rows: mostly empty floating-point columns, plus a few dense long columns.
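For concreteness, the setup is roughly along these lines (column names, counts and the data generation are illustrative only, not my actual job):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types._

    // A handful of dense LongType columns plus ~1,000 mostly-empty DoubleType columns.
    val denseFields  = (0 until 5).map(i => StructField(s"id_$i", LongType, nullable = false))
    val sparseFields = (0 until 995).map(i => StructField(s"f_$i", DoubleType, nullable = true))
    val schema = StructType(denseFields ++ sparseFields)

    val rows = sc.parallelize(0L until 800000L).map { i =>
      // nearly every floating-point cell is null
      Row.fromSeq(Seq.fill(5)(i) ++ Seq.fill(995)(null))
    }

    val df = sqlContext.createDataFrame(rows, schema)
    df.cache().count()   // cached size shows up under the Storage tab in the UI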
I was surprised to see that none of the concrete "org.apache.spark.sql.columnar.compression.CompressionScheme" implementations supports floating-point types, so conversion falls back to the no-op "PassThrough" implementation. In addition, the way "org.apache.spark.sql.columnar.NullableColumnBuilder" encodes null values (four bytes for each of them) seems heavily biased against sparse data; see the back-of-envelope numbers below. It would be interesting to know whether sparse floating-point datasets were neglected for a reason other than some obscure historical accident. Is there anything on the Tungsten roadmap that would allow, for example, https://drill.apache.org/docs/value-vectors/ -style efficiency for this kind of data?
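The back-of-envelope estimate of just the null bookkeeping, assuming (slightly pessimistically) that every floating-point cell in every one of the ~1,000 sparse columns is null:

    // Rough numbers only -- treats all ~1,000 columns as fully null float columns.
    val rows       = 800000L
    val sparseCols = 1000L
    val bytesPerNull = 4L   // NullableColumnBuilder records each null's position as an Int
    val nullOverhead = rows * sparseCols * bytesPerNull
    // ~= 3.2e9 bytes, i.e. roughly the 3 GB I'm seeing, before any actual values are stored.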
