Hi Edmon,

First and foremost, would you mind explaining the exact setup you have in your infrastructure? Performance is a very complex subject and it has many moving parts that you might not be aware of, though you can think of it as the sum of its parts. One problem is that you are comparing different systems that use different code paths for execution. Anyway, let's assume that everything is the same and that the only difference is indeed the file format you picked.
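To keep the comparison honest, I would write the same dataset once per format and then time the exact same query against each copy. A minimal sketch of what I mean, assuming spark-shell on Spark 1.6 built with Hive support plus the spark-csv package; the paths and the user_id column are made up:

// read the source CSV once (spark-csv package needed on 1.6)
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/tmp/data.csv")

// same data, written in both columnar formats
// (sqlContext must be a HiveContext for ORC on 1.6)
df.write.format("parquet").save("/tmp/data_parquet")
df.write.format("orc").save("/tmp/data_orc")

// crude wall-clock timer
def time[A](label: String)(f: => A): A = {
  val t0 = System.nanoTime()
  val r = f
  println(label + ": " + (System.nanoTime() - t0) / 1e6 + " ms")
  r
}

// identical aggregate against each copy; the final count() forces execution
time("parquet") {
  sqlContext.read.format("parquet").load("/tmp/data_parquet")
    .groupBy("user_id").count().count()
}
time("orc") {
  sqlContext.read.format("orc").load("/tmp/data_orc")
    .groupBy("user_id").count().count()
}

Run each of these a few times and throw away the first measurement, otherwise you are timing JVM warm-up and OS page cache effects rather than the formats.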
ORC has a footer [1] and also an index [2] to speed up the queries you mentioned. This might be one reason why you see the performance characteristics you do (more on how to check this in the P.S. below the quoted message). If you let me know about your setup I could re-create your test here, but I would advise you to greatly increase the test data size. In my experience ORC shines when you have huge amounts of data, in the few-terabytes-to-petabytes range, and you also have high repetition (user IDs, hashes, etc.) within the stripes. You can obviously use it for other things; it just might not be worth it. I have very limited knowledge about Parquet; hopefully somebody can chime in and add some content about that.

1. The file footer contains a list of stripes in the file, the number of rows per stripe, and each column's data type. It also contains the column-level aggregates count, min, max, and sum.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC#LanguageManualORC-FileStructure

2. Furthermore, ORC files include lightweight indexes that record the minimum and maximum values for each column in each set of 10,000 rows and in the entire file. Using pushdown filters from Hive, the file reader can skip entire sets of rows that aren't important for this query.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC#LanguageManualORC-Introduction

Best regards,
Istvan

--
*Istvan Szukacs*
CTO
+31647081521
[email protected]
https://www.streambrightdata.com/

On Fri, Apr 15, 2016 at 3:50 AM, Edmon Begoli <[email protected]> wrote:

> Hi,
>
> We are running some experiments with Spark and ORC, Parquet and plain CSV
> files, and we are observing some interesting effects.
>
> The dataset we are initially looking into is smallish, ~100 MB (CSV), and
> we encode it into Parquet and ORC.
>
> When we run Spark SQL aggregate queries we get an insane performance
> speedup, close to 10x, and consistently 2-4x.
>
> General count queries are slower.
>
> When we run MLlib random trees, we get a very unusual performance result.
>
> The CSV run takes about 40 seconds, Parquet about 25, and ORC 60 seconds.
>
> I have a few intuitions for why performance on the aggregate queries is so
> good (sub-indexing and row/column group internal statistics), but I am not
> quite clear on the random forest performance.
>
> Is the ORC decoding algorithm or data retrieval inefficient for these
> kinds of ML jobs?
>
> This is for a performance study, so any insight would be highly
> appreciated.
> --
> the sun shines for all
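P.S. One more thing worth checking on the Spark side: the row-group index from [2] only helps if Spark actually pushes the filter down to the ORC reader, and as of 1.6 that is off by default. A small sketch, again with a made-up path and column:

// enable ORC predicate pushdown (spark.sql.orc.filterPushdown defaults to
// false in Spark 1.6)
sqlContext.setConf("spark.sql.orc.filterPushdown", "true")

// a selective filter lets the reader skip every 10,000-row group whose
// min/max range for user_id cannot contain the value
val hits = sqlContext.read.format("orc").load("/tmp/data_orc")
  .filter("user_id = 12345")
  .count()
println("matching rows: " + hits)

Note that this only pays off for selective filters; a full-table aggregate still has to touch every stripe. It also helps to sort the data by the filtered column before writing, so the min/max ranges per row group stay tight.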
