Hi,

We are running some experiments with Spark on ORC, Parquet, and plain
CSV files, and we are observing some interesting effects.

The dataset we are initially looking at is smallish, ~100 MB as CSV,
which we then encode into Parquet and ORC.

When we run Spark SQL aggregate queries against the columnar formats we
get a big speedup over CSV: consistently 2-4x, and close to 10x in the
best cases.

General count queries, on the other hand, are slower.
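
For concreteness, the benchmark is roughly of this shape (a sketch, not
our exact code; paths and column names are placeholders):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("format-bench").getOrCreate()

  // The same ~100 MB dataset encoded three ways (hypothetical paths).
  val csv     = spark.read.option("header", "true")
                          .option("inferSchema", "true")
                          .csv("data/events.csv")
  val parquet = spark.read.parquet("data/events.parquet")
  val orc     = spark.read.orc("data/events.orc")

  // Tiny timing helper.
  def time[T](label: String)(f: => T): T = {
    val t0 = System.nanoTime()
    val r = f
    println(s"$label: ${(System.nanoTime() - t0) / 1e9} s")
    r
  }

  Seq("csv" -> csv, "parquet" -> parquet, "orc" -> orc).foreach {
    case (name, df) =>
      // Aggregate: the columnar readers only need to decode the two
      // columns touched, so they can skip most of the file.
      time(s"$name agg")   { df.groupBy("category").avg("value").collect() }
      time(s"$name count") { df.count() }
  }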

When we run MLlib random forests, we get a very unusual performance
result.

The CSV run takes about 40 seconds, the Parquet run about 25, and the
ORC run about 60.
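
The MLlib side is, schematically (again a sketch; the feature and label
column names are made up, and it reuses the time helper and DataFrames
above):

  import org.apache.spark.ml.Pipeline
  import org.apache.spark.ml.classification.RandomForestClassifier
  import org.apache.spark.ml.feature.VectorAssembler

  // Hypothetical feature/label columns on the same dataset.
  val assembler = new VectorAssembler()
    .setInputCols(Array("f1", "f2", "f3"))
    .setOutputCol("features")

  val rf = new RandomForestClassifier()
    .setLabelCol("label")
    .setFeaturesCol("features")
    .setNumTrees(50)

  val pipeline = new Pipeline().setStages(Array(assembler, rf))

  // Unlike the aggregates, training reads every row and every feature
  // column, so there is little for the columnar readers to prune or
  // skip, and the full decode cost is paid.
  val model = time("rf-train") { pipeline.fit(parquet) }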

I have a few intuitions for why performance on the aggregate queries is
so good (sub-indexing via the internal statistics both formats keep per
row group / stripe), but I am not quite clear on the random forest
numbers.
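
For reference, the pushdown half of that intuition is easy to inspect,
since the physical plan shows which predicates are handed down to the
Parquet/ORC reader (column name is a placeholder again):

  // The scan node in the explain() output lists the predicates the
  // reader can evaluate against its statistics (PushedFilters).
  parquet.filter("value > 100").explain()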

Is the ORC decoding algorithm, or its data retrieval path, inefficient
for these kinds of ML jobs?

This is for a performance study, so any insight would be highly
appreciated.
