> On 29 Jun 2017, at 17:44, fran <francisco.bl...@hivehome.com> wrote: > > We have got data stored in S3 partitioned by several columns. Let's say > following this hierarchy: > s3://bucket/data/column1=X/column2=Y/parquet-files > > We run a Spark job in a EMR cluster (1 master,3 slaves) and realised the > following: > > A) - When we declare the initial dataframe to be the whole dataset (val df = > sqlContext.read.parquet("s3://bucket/data/) then the driver splits the job > into several tasks (259) that are performed by the executors and we believe > the driver gets back the parquet metadata. > > Question: The above takes about 25 minutes for our dataset, we believe it > should be a lazy query (as we are not performing any actions) however it > looks like something is happening, all the executors are reading from S3. We > have tried mergeData=false and setting the schema explicitly via > .schema(someSchema). Is there any way to speed this up? > > B) - When we declare the initial dataframe to be scoped by the first column > (val df = sqlContext.read.parquet("s3://bucket/data/column1=X) then it seems > that all the work (getting the parquet metadata) is done by the driver and > there is no job submitted to Spark. > > Question: Why does (A) send the work to executors but (B) does not? > > The above is for EMR 5.5.0, Hadoop 2.7.3 and Spark 2.1.0. > >
Split calculation can be very slow against object stores, especially if the directory structure is deep: the treewalking done here is pretty inefficient against the object store. Then there's the schema merge, which looks at the tail of every file, so has to do a seek() against all of them. That is something which it parallelises around the cluster, before your job actually gets scheduled. Turning that off with spark.sql.parquet.mergeSchema = false should make it go away, but clearly not. Aa call to jstack against the driver will show where it is at: you'll probably have to start from there I know if you are using EMR you are stuck using Amazon's own s3 ciients; if you were on Apache's own artifacts you could move up to Hadoop 2.8 and set the spark.hadoop.fs.s3a.experimental.fadvise=random option for high speed random access. You can also turn off job summary creation in Spark --------------------------------------------------------------------- To unsubscribe e-mail: user-unsubscr...@spark.apache.org