On 12 Apr 2016, at 22:05, Martin Eden <martineden...@gmail.com> wrote:
Hi everyone,

Running on EMR 4.3 with Spark 1.6.0 and the provided s3n native driver, I manage to process approx 1 TB of strings inside gzipped parquet in about 50 mins on a 20 node cluster (8 cores, 60 GB RAM each). That's about 17 MB/s per node, which seems suboptimal. The processing is very basic: simple field extraction from the strings and a groupBy.

Watching Aaron's talk from the Spark EU Summit (https://youtu.be/GzG9RTRTFck?t=863), it seems I am hitting the same suboptimal S3 throughput issues he mentions there.

I tried different numbers of files for the input data set (more smaller files vs fewer larger files) combined with various settings for fs.s3n.block.size, thinking that might help if each mapper streams larger chunks. It didn't! It actually seems that many small files give better performance than fewer larger ones (of course with an oversubscribed number of tasks/threads). As Aaron also mentions, with oversubscribed tasks/threads we become CPU bound (100% CPU utilisation).

Has anyone seen similar behaviour? How can we optimise this? Are the improvements mentioned in Aaron's talk now part of the s3n or s3a driver, or are they only available in Databricks Cloud? How can we benefit from those improvements?

Thanks,
Martin

P.S. Have not tried s3a.

s3n is getting, deliberately, absolutely no maintenance except for critical bug fixes. Every time something minor was done there (like bumping up a jets3t version), something else broke. S3a is getting the work, and is the one you have to be using. It's also where the performance analysis and enhancement is going on: https://issues.apache.org/jira/browse/HADOOP-11694

A Spark-related patch that's gone in is https://issues.apache.org/jira/browse/HADOOP-12810, a significant optimisation of the time to enumerate lists of files in a partition; that could be the root cause of the problem you mention.

Otherwise, a recent change is HADOOP-12444, "lazy seek", where you can call seek() as much as you like, but there's no attempt to open or close+reopen an HTTPS connection until the next read. That really boosts performance of code doing iostream.readFully(position, ...), because the way readFully() was implemented it was doing

    long pos = getPos();
    seek(newpos);
    read(...);
    seek(oldpos);

Two seeks, even when reading sequential blocks in the file. This was hurting ORC reads, where there's a lot of absolute reading and skipping of sections of the file. (There's a rough illustration of that positioned-read call pattern sketched below.)

There's also some ongoing work at the Spark layer to do more in parallel: https://github.com/apache/spark/pull/11242

If you look at a lot of the work, it's coming from people (there, Netflix) trying to understand why things are slow, and fixing them. Anything other people can do to help here is welcome.

A key first step, even if you don't want to contribute any code back to the OSS projects, is: test other people's work. Anyone who can test the SPARK-9926 patch against their datasets should apply that PR and see if it shows (a) a speedup, (b) no change, or (c) a dramatic slowdown in performance. Similarly, the s3a performance work going into Hadoop 2.8+ is something you can test today, both by grabbing the branch and building Spark against it, or even going one step further and testing those patches before they get in. It's during development that performance, functionality and reliability problems can be fixed overnight. Get testing with your code, so you know things will work when releases ship. (There's a minimal read-and-time sketch below that you could adapt for that.)
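To make the readFully() point concrete, here's a rough illustration (not from the original discussion; bucket and file names are made up, and it assumes the s3a connector is on the classpath) of the positioned-read call whose seek overhead lazy seek removes:

    // Scala sketch against the Hadoop FileSystem API; illustration only.
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FSDataInputStream, FileSystem, Path}

    val fs = FileSystem.get(new java.net.URI("s3a://my-bucket/"), new Configuration())
    val in: FSDataInputStream = fs.open(new Path("s3a://my-bucket/data/part-00000.orc"))
    val buf = new Array[Byte](64 * 1024)
    // Each positioned read like this used to trigger getPos/seek/read/seek
    // internally; on s3a that meant closing and reopening the HTTPS
    // connection even for effectively sequential reads.
    in.readFully(1024L, buf, 0, buf.length)
    in.close()

And a minimal Spark 1.6 sketch of the kind of before/after test that helps: read the same parquet data through s3a:// and time it, once on a stock build and once with the patches applied. The bucket/path and the use of environment variables for credentials are just placeholders, and it assumes the hadoop-aws jar and a matching AWS SDK are on the classpath:

    // Scala, Spark 1.6; illustration only, adjust paths and credentials.
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object S3AReadTest {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("s3a-read-test"))
        val sqlContext = new SQLContext(sc)

        // s3a has its own config namespace; fs.s3n.* settings don't apply to it.
        sc.hadoopConfiguration.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
        sc.hadoopConfiguration.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))

        // Same data, same job; only the URL scheme changes: s3n:// -> s3a://
        val start = System.nanoTime()
        val rows = sqlContext.read.parquet("s3a://my-bucket/strings-parquet/").count()
        val secs = (System.nanoTime() - start) / 1e9
        println(s"read $rows rows in $secs seconds")

        sc.stop()
      }
    }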
That kind of testing is particularly important against object stores, because none of the Jenkins builds of the projects test against S3. If you really care about S3 performance, and want to make sure ongoing development is improving your life, you need to get involved.

-Steve