Owen, I am on travel, but I will try to send that over as soon as I am back.
It should not be a problem.

Edmon

On Mon, Apr 18, 2016 at 1:12 PM, Owen O'Malley <[email protected]> wrote:
> Edmon,
>   I'd love to help figure out what is going on. A couple of questions:
>
> * What file system are you reading from? HDFS? One of the S3-based ones?
>   Local?
> * Would it be possible to send me ([email protected]) the file's
>   metadata from orcfiledump?
> * Do you know if MLlib is having the reader seek? At 100 MB, it should
>   just read the file into memory.
>
> Thanks,
>    Owen
>
> On Thu, Apr 14, 2016 at 9:50 PM, Edmon Begoli <[email protected]> wrote:
>
>> Hi,
>>
>> We are running some experiments with Spark and ORC, Parquet, and plain
>> CSV files, and we are observing some interesting effects.
>>
>> The dataset we are initially looking at is smallish, ~100 MB (CSV), and
>> we encode it into Parquet and ORC.
>>
>> When we run Spark SQL aggregate queries, we get an insane performance
>> speedup: close to 10x, and consistently 2-4x.
>>
>> General count queries are slower.
>>
>> When we run MLlib random trees, we get a very unusual performance result.
>>
>> The CSV run takes about 40 seconds, Parquet about 25, and ORC about 60.
>>
>> I have a few intuitions for why the performance on the aggregate queries
>> is so good (sub-indexing and row/column group internal statistics), but I
>> am not quite clear on the performance of the random forest.
>>
>> Is the ORC decoding algorithm or data retrieval inefficient for these
>> kinds of ML jobs?
>>
>> This is for a performance study, so any insight would be highly
>> appreciated.
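The experiment Edmon describes boils down to timing the same aggregate query over the same dataset stored in each format, and reporting a stable number per format. A minimal sketch of that timing harness, in pure Python with a stand-in CSV aggregate so it runs without a Spark install (in the actual study each loader would be a Spark read of CSV, Parquet, or ORC; every name below is illustrative, not from the thread):

```python
import csv
import statistics
import tempfile
import time
from pathlib import Path

# In the real benchmark the "query" would be a Spark SQL aggregate over a
# DataFrame read from CSV, Parquet, or ORC. Here a small local CSV and a
# plain-Python sum stand in, so the harness itself is runnable as-is.

def make_sample_csv(path: Path, rows: int = 1000) -> None:
    """Write a small numeric CSV file to aggregate over."""
    with path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "value"])
        for i in range(rows):
            writer.writerow([i, i % 7])

def aggregate_csv(path: Path) -> float:
    """The 'query': sum of the value column (stand-in for a SQL aggregate)."""
    with path.open(newline="") as f:
        reader = csv.DictReader(f)
        return sum(float(row["value"]) for row in reader)

def timed_runs(fn, path: Path, repeats: int = 5) -> float:
    """Median wall-clock time over several runs, to smooth out noise."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(path)
        times.append(time.perf_counter() - start)
    return statistics.median(times)

with tempfile.TemporaryDirectory() as d:
    p = Path(d) / "sample.csv"
    make_sample_csv(p)
    total = aggregate_csv(p)       # 1000 rows of i % 7 sum to 2997.0
    t = timed_runs(aggregate_csv, p)
    print(f"sum(value) = {total}, median time = {t:.6f}s")
```

Taking the median of several repeats (rather than a single run) matters especially for a small ~100 MB dataset, where JVM warm-up and OS page-cache effects can dominate a first read and skew a per-format comparison.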
