OK, well that was easy. Figured out my issue and managed to get ORC working over s3a. And got a huge speed-up over s3n! (On the order of 10x!)

So yeah, I'm game for testing some new code when/if you're feeling motivated to work on this. Feel free to email me off-list and we can get into the details.

Best,

DR

On 09/28/2015 10:43 PM, David Rosenstrauch wrote:
Super helpful response - thanks so much!  At least I know I'm not crazy
now!  (And shouldn't spend any more time on tweaks trying to get this to
work on s3n.)

Let me try to start testing this using out-of-the-box s3a protocol.  (I
haven't been able to get that to work at all yet - keep getting "Unable
to load AWS credentials from any provider in the chain" errors.)  Once
I'm able to get that far I'd be up for trying to test some new code. (As
long as it doesn't wind up taking too much time.)
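
(In case it helps anyone else hitting the same error - what I'm trying
is a minimal sketch along these lines, assuming the stock
fs.s3a.access.key / fs.s3a.secret.key properties; the bucket and keys
below are placeholders:)

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class S3aCredsSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Set the keys directly so the provider chain isn't consulted;
        // in practice these usually belong in core-site.xml.
        conf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY");  // placeholder
        conf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY");  // placeholder
        FileSystem fs = FileSystem.get(new URI("s3a://my-bucket/"), conf);
        System.out.println("connected to " + fs.getUri());
      }
    }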

Will report back soon.

Thanks again!

DR

On 09/28/2015 06:14 PM, Gopal Vijayaraghavan wrote:
…avail. I was hoping perhaps someone on the list here might be able
to shed some light on why we're having these problems and/or have some
suggestions on how we might work around them.
...
  (I.e., theoretically ORC should be able to skip reading large
portions of the index files by jumping directly to the index records
that match the supplied search criteria, or at least to a stripe close
to them.) But this is proving not to be the case.

Not theoretically. ORC does that and that's the issue.
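
For the skeptical: the stripe directory ORC seeks against is right
there in the file footer. A minimal sketch (Hive 1.x ORC API; the path
is hypothetical) that prints the offsets a reader can jump straight to:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hive.ql.io.orc.OrcFile;
    import org.apache.hadoop.hive.ql.io.orc.Reader;
    import org.apache.hadoop.hive.ql.io.orc.StripeInformation;

    public class StripeOffsets {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("s3a://my-bucket/tbl/part-00000.orc");
        Reader reader = OrcFile.createReader(path,
            OrcFile.readerOptions(conf));
        // The footer lists every stripe's offset and length - this is
        // what lets ORC seek directly to stripes matching a predicate.
        for (StripeInformation stripe : reader.getStripes()) {
          System.out.println("stripe @ " + stripe.getOffset()
              + " len=" + stripe.getLength()
              + " rows=" + stripe.getNumberOfRows());
        }
      }
    }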

S3n is badly broken for a columnar format & even S3A is missing a couple
of features which are essential to get read performance over HTTP.

Here's one example - every seek() disconnects & re-establishes an SSL
connection in S3 (fixing that is a ~2x perf increase for S3a).

https://issues.apache.org/jira/browse/HADOOP-12444
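
To make the access pattern concrete, here's a minimal sketch of the
seek-then-read hops a columnar reader issues (the path, offsets and
sizes are hypothetical). On a broken FS impl, every seek() below tears
down and re-establishes the SSL connection:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SeekPattern {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("s3a://my-bucket/tbl/part-00000.orc");
        FileSystem fs = path.getFileSystem(conf);
        byte[] chunk = new byte[64 * 1024];
        try (FSDataInputStream in = fs.open(path)) {
          // Hop between two column chunks the way a columnar reader
          // does - two forward seeks, each followed by a bounded read.
          in.seek(1_000_000L);
          in.readFully(chunk);
          in.seek(9_000_000L);
          in.readFully(chunk);
        }
      }
    }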


In another scenario we found that a readFully(colOffset,... colSize) will
open an unbounded reader in S3n instead of reading the fixed chunk off
HTTP.

https://issues.apache.org/jira/browse/HADOOP-11867
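
For contrast, a bounded read should turn into a single ranged GET on
the wire. A sketch with plain HttpURLConnection (bucket, key and
offsets are hypothetical) - S3n instead issues the open-ended form,
Range: bytes=<offset>-, and throws the tail of the response away:

    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class RangedGet {
      public static void main(String[] args) throws Exception {
        long offset = 1_000_000L;
        int size = 64 * 1024;  // one column chunk
        URL url = new URL(
            "https://my-bucket.s3.amazonaws.com/tbl/part-00000.orc");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // Ask for exactly the chunk's bytes; S3 replies 206 Partial
        // Content and the connection stays reusable for keep-alive.
        conn.setRequestProperty("Range",
            "bytes=" + offset + "-" + (offset + size - 1));
        byte[] chunk = new byte[size];
        try (InputStream in = conn.getInputStream()) {
          int read = 0;
          while (read < size) {
            int n = in.read(chunk, read, size - read);
            if (n < 0) break;
            read += n;
          }
        }
      }
    }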


The lack of this means that even the short-lived keep-alive gets turned
off by the S3 impl when doing a forward-seek read pattern, because each
abandoned read is a recv-buffer-dropping disconnect, not a completed
request.

The Amazon proprietary S3 drivers are not subject to these problems, so
they work with ORC very well. It's the open source S3 filesystem impls
which are holding us back.

Is ORC simply unable to work efficiently against data stored on S3n?
(I.e., due to network round-trips taking too long.)

S3n is unable to handle any columnar format efficiently - it fires an
HTTP GET for each seek, marked till the end of the file. Any format
which requires forward seeks or bounded readers is going to die via TCP
window & round-trip thrashing.


I know what's needed for s3a to work well with columnar readers
(Parquet/ORC/RCFile included) and how to future-proof it so that it
will work fine when HTTP/2 arrives.

If you're interested in being a guinea pig for S3a fixes, it is
currently sitting on my back burner (I'm not a Hadoop committer) - the
FS fixes are about two weeks' worth of work for a single motivated dev.

Cheers,
Gopal



