Hi,

We are running some experiments with Spark on ORC, Parquet, and plain CSV files, and we are observing some interesting effects.
The dataset we are initially looking into is small, roughly 100 MB as CSV, which we encode into Parquet and ORC. When we run Spark SQL aggregate queries against the columnar formats, we see a dramatic speedup: close to 10x in some cases, and consistently 2-4x. Plain count queries, however, are slower.

When we run MLlib random forests, the results are more puzzling: the CSV run takes about 40 seconds, Parquet about 25, and ORC about 60. I have a few intuitions for why the aggregate queries perform so well (sub-indexing and internal row/column-group statistics), but I am not clear on the random forest results. Is ORC's decoding algorithm or data retrieval inefficient for these kinds of ML jobs? This is for a performance study, so any insight would be highly appreciated.
