Could you share the event timeline and DAG for the time-consuming stages
from the Spark UI?
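
In the meantime, one thing stands out in the snippet below: repartition()
is a transformation that returns a new DataFrame, so merge_df.repartition(300)
on its own line has no effect. A quick sketch of what I suspect was intended
(assuming the same PySpark/SQLContext API as in your code; only the
reassignment and the print are new):

# repartition returns a new DataFrame, so the result has to be reassigned
merge_df = df1.unionAll(df2).repartition(300)

merge_df.registerTempTable("union_table")
sqlContext.cacheTable("union_table")
print(sqlContext.sql("select count(*) from union_table").collect())

Also, if you want a rough skew check outside the UI, counting rows per input
partition is cheap (a sketch, assuming a Spark version that has
pyspark.sql.functions.spark_partition_id):

from pyspark.sql.functions import spark_partition_id

# row count per input partition of df1, largest first
df1.groupBy(spark_partition_id().alias("pid")).count() \
   .orderBy("count", ascending=False).show(20)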

On Thu, Mar 23, 2017 at 4:30 AM, Matt Deaver <mattrdea...@gmail.com> wrote:

> For various reasons, our data set is partitioned in Spark by customer id
> and saved to S3. When trying to read this data, however, the larger
> partitions make it difficult to parallelize jobs. For example, out of a
> couple thousand companies, some have <10 MB of data while others have
> more than 10 GB. This is the code I'm using in a Zeppelin notebook, and
> it takes a very long time to read in (2+ hours on a ~200 GB dataset from
> S3):
>
> df1 = sqlContext.read.parquet("s3a://[bucket1]/[prefix1]/")
> df2 = sqlContext.read.parquet("s3a://[bucket2]/[prefix2]/")
>
> # generate a bunch of derived columns here for df1
> df1 = df1.withColumn('derivedcol1', df1.source_col)
>
>
> # limit output columns for later union
> df1 = df1.select(
>                 [limited set of columns]
>                 )
>
> # generate a bunch of derived columns here for df2
> df2 = df2.withColumn('derivedcol1', df2.source_col)
>
> # limit output columns for later union
> df2 = df2.select(
>                 [limited set of columns]
>                 )
>
> print(df1.rdd.getNumPartitions())
> print(df2.rdd.getNumPartitions())
>
> merge_df = df1.unionAll(df2)
> merge_df.repartition(300)
>
> merge_df.registerTempTable("union_table")
> sqlContext.cacheTable("union_table")
> sqlContext.sql("select count(*) from union_table").collect()
>
> Any suggestions on making this faster?
>
