Hi,

We are running some experiments with Spark on ORC, Parquet, and plain CSV files, and we are observing some interesting effects.
The dataset we are initially looking into is small, roughly 100 MB as CSV, which we encode into Parquet and ORC. When we run Spark SQL aggregate queries against the columnar formats, we see a dramatic speedup: close to 10x in some cases, and consistently 2-4x. Plain count queries, however, are slower.

When we run MLlib random forests, the results are more puzzling: the CSV run takes about 40 seconds, Parquet about 25, and ORC about 60. I have a few intuitions for why the aggregate queries perform so well (sub-indexing and internal row/column-group statistics), but I am not clear on the random forest results. Is ORC's decoding algorithm or data retrieval inefficient for these kinds of ML jobs? This is for a performance study, so any insight would be highly appreciated.
