Edmon,
   I'd love to help figure out what is going on. A couple of questions:

* What file system are you reading from? HDFS? one of the S3-based ones?
local?
* Would it be possible to send me ([email protected]) the file's metadata
from orcfiledump? (A sample command is below the list.)
* Do you know if MLlib is having the reader seek? At 100 MB, it should just
read the file into memory.
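
For the metadata, something along these lines should work (the path is
illustrative):

   hive --orcfiledump /path/to/file.orc

That prints the stripe layout, the encodings, and the column statistics.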

Thanks,
   Owen

On Thu, Apr 14, 2016 at 9:50 PM, Edmon Begoli <[email protected]> wrote:

> Hi,
>
> We are running some experiments with Spark on ORC, Parquet, and plain CSV
> files, and we are observing some interesting effects.
>
> The dataset we are initially looking at is smallish (~100 MB as CSV), and
> we encode it into Parquet and ORC.
>
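> (For reference, the encoding step is roughly the following, run from
> spark-shell with the spark-csv package; the paths are illustrative, and
> sqlContext is a HiveContext, which the ORC writer needs:)
>
>   // Read the ~100 MB CSV with headers and inferred column types.
>   val df = sqlContext.read
>     .format("com.databricks.spark.csv")
>     .option("header", "true")
>     .option("inferSchema", "true")
>     .load("/data/input.csv")
>
>   // Write the same data back out as Parquet and ORC.
>   df.write.format("parquet").save("/data/input.parquet")
>   df.write.format("orc").save("/data/input.orc")
>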
> When we run Spark SQL aggregate queries, we get a dramatic speedup:
> close to 10x in some cases, and consistently 2-4x.
>
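> (The aggregate queries are simple GROUP BYs along these lines; the table
> and column names are placeholders:)
>
>   df.registerTempTable("records")
>   sqlContext
>     .sql("SELECT category, COUNT(*), AVG(amount) FROM records GROUP BY category")
>     .collect()
>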
> General count queries are slower.
>
> When we run MLlib random trees, we get very unusual performance results.
>
> The CSV run takes about 40 seconds, Parquet about 25, and ORC about 60.
>
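> (The MLlib run uses the RDD-based RandomForest API, roughly as below; the
> parameter values are illustrative, and building points, an
> RDD[LabeledPoint], from the DataFrame is elided:)
>
>   import org.apache.spark.mllib.tree.RandomForest
>
>   val model = RandomForest.trainClassifier(
>     points,
>     numClasses = 2,
>     categoricalFeaturesInfo = Map[Int, Int](),
>     numTrees = 100,
>     featureSubsetStrategy = "auto",
>     impurity = "gini",
>     maxDepth = 5,
>     maxBins = 32)
>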
> I have a few intuitions for why performance on the aggregate queries is so
> good (sub-indexing and the internal statistics for row/column groups), but
> I am not quite clear on the random forest performance.
>
> Is ORC's decoding algorithm or data retrieval inefficient for these kinds
> of ML jobs?
>
> This is for a performance study, so any insight would be highly
> appreciated.
>
