i believe this is a known issue for using spark/hive with files on s3, this
huge delay on driver side is caused by partition listing and split
computation, and it is more like a issue by hive, since you are using
thrift server, the sql queries are running in HiveContext.

qubole made some optimizations for this:
https://www.qubole.com/blog/product/optimizing-s3-bulk-listings-for-performant-hive-queries/
, but it is not open sourced.

we, at zalando, are using spark on aws as well, we made some changes to
optimize insert performance for s3 files,  you can find the changes here:
https://github.com/apache/spark/compare/branch-1.6...zalando:branch-1.6-zalando

hope this can give you/spark devs some ideas to fix this listing issue.


2016-02-16 19:59 GMT+01:00 Dukek, Dillon <dillon.du...@t-mobile.com>:

> Hello,
>
>
>
> I have been working on a project that allows a BI tool to query roughly 25
> TB of application event data from 2015 using the thrift server and Spark
> SQL. In general the jobs that are submitted have a step that submit many
> tasks in the order of hundreds of thousands and is equal to the number of
> files that need to be processed from s3. Once the step actually starts
> running tasks all of the nodes begin helping and executing tasks and
> finishes in a reasonable amount of time. However, after the logs say that
> the step was submitted with ~200,000 tasks there is a delay. I believe that
> the delay is caused by a shuffle step that happens after the step before
> that maps all of the files that we are going to process based on the date
> range specified in the query, but I’m not sure as I’m fairly new to Spark.
> I can see that that the driver node is processing during this time, but all
> of the other nodes are inactive. I was wondering if there is a way shorten
> the delay? I’m using Spark 1.6 on EMR  with yarn as the master, and ganglia
> to monitor the nodes. I’m giving 15GB to the driver and application master,
> and 3GB per executor with one core each.
>
>
>
>
>
> Thanks,
>
>
>
> Dillon Dukek
>
> Software Engineer,
>
> Product Realization
>
> Data Products & Intelligence
>
> *•**T**•••Mobile•*
>
> Email: dillon.du...@t-mobile.com
>
>
>

Reply via email to