On 12 Apr 2016, at 22:05, Martin Eden <martineden...@gmail.com> wrote:

Hi everyone,

Running on EMR 4.3 with Spark 1.6.0 and the provided S3N native driver, I manage
to process approx. 1 TB of strings inside gzipped Parquet in about 50 mins on a
20-node cluster (8 cores, 60 GB RAM). That's about 17 MB/s per node.
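(For the arithmetic: roughly 10^6 MB / ~3,000 s / 20 nodes ≈ 17 MB/s per node.)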

This seems suboptimal.

The processing is very basic: simple field extraction from the strings and a
groupBy.

Watching Aaron's talk from the Spark EU Summit:
https://youtu.be/GzG9RTRTFck?t=863

it seems I am hitting the same issues with suboptimal S3 throughput he mentions 
there.

I tried different numbers of files for the input data set (more smaller files
vs. fewer larger files) combined with various settings for fs.s3n.block.size,
thinking that might help if each mapper streams larger chunks. It didn't! It
actually seems that many small files give better performance than fewer larger
ones (of course with an oversubscribed number of tasks/threads).
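
A minimal sketch of the kind of thing I mean, in case it helps (Spark 1.6 API;
the bucket, field name, block size and partition count below are placeholders,
not our real job):

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.SQLContext

  val sc = new SparkContext(new SparkConf().setAppName("s3n-read-test"))
  val sqlContext = new SQLContext(sc)

  // Hadoop-level setting for the S3N driver; 128 MB here is illustrative only.
  sc.hadoopConfiguration.set("fs.s3n.block.size", (128L * 1024 * 1024).toString)

  // Oversubscribe tasks relative to the 20 * 8 cores to hide S3 latency.
  val df = sqlContext.read.parquet("s3n://my-bucket/events/")
  val counts = df.repartition(20 * 8 * 4).groupBy("someField").count()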

Similarly to what Aaron mentions with oversubscribed tasks/threads, we also
become CPU bound (reaching 100% CPU utilisation).


Has anyone seen similar behaviour? How can we optimise this?

Are the improvements mentioned in Aaron's talk now part of the S3N or S3A
driver, or are they only available under DataBricksCloud? How can we benefit
from those improvements?

Thanks,
Martin

P.S. Have not tried S3a.


s3n is, deliberately, getting absolutely no maintenance except for critical bug
fixes. Every time something minor was done there (like bumping the jets3t
version), something else broke.

S3a is getting the work, and is the one you have to be using.
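
Switching is mostly a matter of the URL scheme plus credentials. A rough sketch
(the fs.s3a.* key names are the standard ones; the bucket and key values are
placeholders, and on EMR you may be able to rely on IAM roles instead):

  // Same Spark job, pointed at s3a:// instead of s3n://.
  sc.hadoopConfiguration.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
  sc.hadoopConfiguration.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")
  val df = sqlContext.read.parquet("s3a://my-bucket/events/")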

It's also where the performance analysis and enhancement work is going on:

https://issues.apache.org/jira/browse/HADOOP-11694

A Spark-related patch that's gone in is
https://issues.apache.org/jira/browse/HADOOP-12810: a significant optimisation
of the time to enumerate the list of files in a partition; that could be the
root cause of the problem you mention. Another recent change is HADOOP-12444,
"lazy seek", where you can call seek() as much as you like, but there's no
attempt to open or close-and-reopen an HTTPS connection until the next read.
That really boosts performance for code doing iostream.readFully(position, ...),
because the way readFully() was implemented, it was doing:

long oldPos = getPos();
seek(position);
read(...);
seek(oldPos);

Two seeks, even if the reads were of sequential blocks in the file... this was
hurting ORC reads, where there's a lot of absolute reading and skipping of
sections of the file.
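
A rough illustration of the access pattern that benefits (not the actual ORC
reader, just the positioned-read shape; bucket and path are placeholders):

  import java.net.URI
  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.{FileSystem, Path}

  val conf = new Configuration()
  val fs = FileSystem.get(new URI("s3a://my-bucket/"), conf)
  val in = fs.open(new Path("s3a://my-bucket/data/part-00000.orc"))

  val buf = new Array[Byte](1024 * 1024)
  // Absolute positioned read; with lazy seek the stream only records the new
  // offset, so skipping around the file no longer forces a close-and-reopen
  // of the HTTPS connection for every readFully() call.
  in.readFully(10L * 1024 * 1024, buf)
  in.close()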

There's also some ongoing work at the spark layer to do more in parallel: 
https://github.com/apache/spark/pull/11242

If you look at a lot of the work, it's coming from people (there, Netflix)
trying to understand why things are slow and fixing them. Anything other people
can do to help here is welcome.

A key first step, even if you don't want to contribute any code back to the OSS
projects, is: test other people's work.

Anyone who can test the SPARK-9926 patch against their datasets should apply
that PR and test to see if it shows (a) a speedup, (b) no change, or (c) a
dramatic slowdown in performance.

Similarly, the s3a performance work going into Hadoop 2.8+ is something you can
test today, either by grabbing the branch and building Spark against it, or
even going one step further and testing those patches before they get in.

It's during development that performance, functionality, and reliability
problems can be fixed overnight: get testing with your code, so you know things
will work when releases ship. This is particularly important for object stores,
because none of the projects' Jenkins builds test against S3. If you really
care about S3 performance, and want to make sure ongoing development is
improving your life, you need to get involved.

-Steve



