Re: Spark querying parquet data partitioned in S3

Steve Loughran Wed, 05 Jul 2017 04:36:59 -0700

> On 29 Jun 2017, at 17:44, fran <francisco.bl...@hivehome.com> wrote:
> 
> We have got data stored in S3 partitioned by several columns. Let's say
> following this hierarchy:
> s3://bucket/data/column1=X/column2=Y/parquet-files
> 
> We run a Spark job in a EMR cluster (1 master,3 slaves) and realised the
> following:
> 
> A) - When we declare the initial dataframe to be the whole dataset (val df =
> sqlContext.read.parquet("s3://bucket/data/) then the driver splits the job
> into several tasks (259) that are performed by the executors and we believe
> the driver gets back the parquet metadata.
> 
> Question: The above takes about 25 minutes for our dataset, we believe it
> should be a lazy query (as we are not performing any actions) however it
> looks like something is happening, all the executors are reading from S3. We
> have tried mergeData=false and setting the schema explicitly via
> .schema(someSchema). Is there any way to speed this up?
> 
> B) - When we declare the initial dataframe to be scoped by the first column
> (val df = sqlContext.read.parquet("s3://bucket/data/column1=X) then it seems
> that all the work (getting the parquet metadata) is done by the driver and
> there is no job submitted to Spark. 
> 
> Question: Why does (A) send the work to executors but (B) does not?
> 
> The above is for EMR 5.5.0, Hadoop 2.7.3 and Spark 2.1.0.
> 
>


Split calculation can be very slow against object stores, especially if the 
directory structure is deep: the treewalking done here is pretty inefficient 
against the object store.

Then there's the schema merge, which looks at the tail of every file, so has to 
do a seek() against all of them. That is something which it parallelises around 
the cluster, before your job actually gets scheduled. 

Turning that off with spark.sql.parquet.mergeSchema = false should make it go 
away, but clearly not.

Aa call to jstack against the driver will show where it is at: you'll probably 
have to start from there


I know if you are using EMR you are stuck using Amazon's own s3 ciients; if you 
were on Apache's own artifacts you could move up to Hadoop 2.8 and set the 
spark.hadoop.fs.s3a.experimental.fadvise=random option for high speed random 
access. You can also turn off job summary creation in Spark


---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: Spark querying parquet data partitioned in S3

Reply via email to