Reading ORC Files from S3

David Rosenstrauch Mon, 28 Sep 2015 13:45:18 -0700

A colleague of mine posted to this list a few months ago about some difficulties we were experiencing reading from ORC files stored on Amazon S3. What we were finding was that a set of ORC files that we built performed well on HDFS, but showed extremely poor performance when stored on S3. I've been continuing my colleague's work, and have tried various and sundry fixes and tweaks to try to get the performance to improve, but so far to no avail. I was hoping perhaps someone on the list here might be able to shed some light as to why we're having these problems and/or have some suggestions on how we might be able to work around them.


A bit more detail about our issues:

We have 2 datasets that we've built which are stored as ORC files. The first set is a series of records, sorted by record ID. The second set is an inverted index into the first set, where each record contains a search key value followed by a record ID. (The 2nd dataset is sorted by search key value.) The first dataset contains ~4000 files, totaling 500GB (i.e., ~120MB per file); the second also contains ~4000 files, but totaling nearly 2TB (~230MB per file).

What I'm finding is that queries against the first dataset (the records) complete in a fairly reasonable amount of time, but queries against the index dataset are taking a very long time. This is completely contrary to what I would expect, as the index dataset should be better able to take advantage of the efficiencies built into the ORC data storage, and so should be able to be queried faster. (I.e., theoretically ORC should be able to skip reading large portions of the index files by jumping directly to the index records that match the supplied search criteria, or at least to a stripe close to them.) But this is proving not to be the case.
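For illustration, the kind of skipping described above can be sketched in Python. This is a simplified model, not ORC's actual reader code: the StripeStats class, the stripes_to_read function, and all key values and offsets are hypothetical. It assumes each stripe carries min/max statistics for the search key column, and that a reader with predicate pushdown only fetches stripes whose range could contain the requested key.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StripeStats:
    """Min/max of the search-key column for one stripe (hypothetical values)."""
    offset: int   # byte offset of the stripe within the file
    min_key: str
    max_key: str


def stripes_to_read(stripes: List[StripeStats], key: str) -> List[StripeStats]:
    """Keep only the stripes whose [min_key, max_key] range could contain key."""
    return [s for s in stripes if s.min_key <= key <= s.max_key]


# A file sorted by search key, so the stripe key ranges do not overlap.
stripes = [
    StripeStats(offset=0, min_key="aaa", max_key="fzz"),
    StripeStats(offset=64 * 1024 * 1024, min_key="gaa", max_key="mzz"),
    StripeStats(offset=128 * 1024 * 1024, min_key="naa", max_key="zzz"),
]

# Only the middle stripe can hold "hello"; the other two would be skipped.
selected = stripes_to_read(stripes, "hello")
print([s.offset for s in selected])
```

Because the index dataset is sorted by search key, an equality lookup should in principle touch only one stripe per file (or none), which is why the slow index queries are so surprising.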

All of the ORC files are generated using a custom map/reduce job with OrcNewOutputFormat (using Hive 0.13.1 jars) and are being queried via Hive queries (using Hive 1.1.0). The files are initially written to HDFS, and then pushed to S3 (using distcp). But my queries are all being done directly against the files stored on S3. (I.e., a Hive external table with a LOCATION pointing to S3.)

I've tried various tweaks to the ORC file generation process - larger number of small files, smaller number of large files, stripe sizes varying from 64MB to 256MB, etc. But nothing seems to make any difference. Queries against the index dataset take a very long time no matter what I try - as in 4x-5x longer than querying the records dataset.

One other thing that I'm finding particularly strange is that enabling predicate pushdown seems to have no effect here - and sometimes even makes things worse. When I set "hive.optimize.index.filter=true" I can see that the predicate pushdown is taking effect via output in the Hadoop job logs. But it doesn't seem like the predicate pushdown is able to make the query run any faster when the data is held on S3.

ORC isn't giving me much clue as to the cause for the delays either. When I look in the Hadoop job task logs, I see a message about the S3NativeFileSystem opening one of my ORC files ... and then 6-7 minutes pass before I see the next log message about Hive starting to process the records in the file.

One other thing I've noticed is that I don't seem to be the only one experiencing this issue. Googling on this topic turned up a few other people with a similar problem, most notably the blog post at http://bitmusings.tumblr.com/post/56081787247/orc-files-in-hive where the author wound up finding the performance so bad that he switched from using S3 native storage format to using the S3 block storage format in order to work around these issues.

So .... anyone have any ideas as to what might be causing this issue and/or how to work around it? Is ORC simply unable to work efficiently against data stored on S3n? (I.e., due to network round-trips taking too long.) Any help anyone could offer would be greatly appreciated! This is proving to be a blocker issue for my project, and if I can't find a solution I'm likely going to wind up having to scrap the idea of using ORC to store the index.
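The round-trip concern can be put into a rough back-of-the-envelope model, sketched here in Python. The function name and every number below are assumptions chosen purely for illustration, not measurements from our cluster or from S3; the model also assumes requests are issued one after another.

```python
def estimated_read_seconds(num_requests, latency_s, bytes_read, throughput_bps):
    """Time for sequential reads: per-request latency plus transfer time."""
    return num_requests * latency_s + bytes_read / throughput_bps


# Hypothetical numbers: 0.1 s per request, 50 MB/s transfer rate.
# One large sequential read of 200 MB ...
one_big = estimated_read_seconds(1, 0.1, 200 * 10**6, 50 * 10**6)
# ... versus 2000 small seek-and-read requests totaling only 20 MB.
many_small = estimated_read_seconds(2000, 0.1, 20 * 10**6, 50 * 10**6)
print(one_big, many_small)
```

Under these made-up numbers the many-small-reads case is dominated by request latency even though it transfers a tenth of the data, which is the kind of effect that could cancel out the benefit of skipping stripes.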


Thanks!

Best,

DR
