Hi Edmon,

First and foremost, would you mind explaining the exact setup you have in
your infrastructure? Performance is a complex subject with many moving
parts that you might not be aware of, and the end result is the sum of
those parts. One problem is that you are comparing different systems that
use different code paths for execution. Anyway, let's assume that
everything else is the same and the only difference is indeed the file
format you picked.

ORC has a footer[1] and an index[2] that speed up the queries you
mentioned. This might be one reason for the performance characteristics
you are seeing.

If you let me know about your setup I could re-create your test here, but
I would advise you to greatly increase the test data size. In my
experience ORC shines when you have huge amounts of data, in the few
terabytes to petabytes range, and you also have high repetition (user IDs,
hashes, etc.) within the stripes. You can obviously use it for other
things; it just might not be worth it. I have very limited knowledge about
Parquet, so hopefully somebody can chime in and add some content about
that.
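
To make the comparison more telling, something like the sketch below could
generate a bigger, more repetitive dataset (Spark/Scala; the 100M row
count, column names, and output paths are made-up examples, and sc is the
spark-shell SparkContext). Low-cardinality columns are where ORC's
run-length and dictionary encodings pay off:

    // Sketch only: ~100M rows with highly repetitive columns.
    import org.apache.spark.sql.hive.HiveContext

    val sqlContext = new HiveContext(sc)
    import sqlContext.implicits._

    val big = sc.parallelize(1L to 100000000L, 200)
      .map(i => (i % 100000L,        // user_id: only 100k distinct values
                 s"event_${i % 50}", // event_type: 50 distinct strings
                 i))                 // value: unique per row
      .toDF("user_id", "event_type", "value")

    // Writing ORC needs Hive support in Spark 1.x.
    big.write.orc("/tmp/bench_orc")
    big.write.parquet("/tmp/bench_parquet")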

1.

The file footer contains a list of stripes in the file, the number of rows
per stripe, and each column's data type. It also contains column-level
aggregates: count, min, max, and sum.

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC#LanguageManualORC-FileStructure

2.

Furthermore, ORC files include lightweight indexes that include the
minimum and maximum values for each column in each set of 10,000 rows and
the entire file. Using pushdown filters from Hive, the file reader can
skip entire sets of rows that aren't important for this query.

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC#LanguageManualORC-Introduction
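
To see that index at work from Spark, the filter has to be pushed down to
the ORC reader. A minimal sketch (Spark 1.x; the path, column name, and
filter value are placeholders carried over from the sketch above):

    // Sketch only: enable ORC predicate pushdown so a selective filter
    // can skip whole 10,000-row groups via the min/max index.
    import org.apache.spark.sql.hive.HiveContext

    val sqlContext = new HiveContext(sc)
    sqlContext.setConf("spark.sql.orc.filterPushdown", "true")

    val df = sqlContext.read.orc("/tmp/bench_orc")

    // Row groups whose [min, max] for user_id excludes 12345 are skipped.
    df.filter(df("user_id") === 12345L).count()

A full-table count has no filter to push down, so the index cannot help
there.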

Best regards,
Istvan

-- 
*Istvan Szukacs*
CTO

+31647081521
[email protected]
https://www.streambrightdata.com/


On Fri, Apr 15, 2016 at 3:50 AM, Edmon Begoli <[email protected]> wrote:

> Hi,
>
> We are running some experiments with Spark and ORC, Parquet and plain CSV
> files and we are observing some interesting effects.
>
> The dataset we are initially looking into is smallish - ~100 MB (CSV) and
> we encode it into Parquet and ORC.
>
> When we run Spark SQL aggregate queries we get an insane performance
> speedup: sometimes close to 10x, and consistently 2-4x.
>
> General count queries are slower.
>
> When we run MLlib random trees, we get a very unusual performance result.
>
> The CSV run takes about 40 seconds, Parquet about 25, and ORC about 60.
>
> I have a few intuitions for why performance on the aggregate queries is
> so good (sub-indexing and internal statistics for row/column groups), but
> I am not quite clear on the performance of random forest.
>
> Is ORC's decoding algorithm or data retrieval inefficient for these kinds
> of ML jobs?
>
> This is for a performance study, so any insight would be highly
> appreciated.
>



