For various reasons, our data set is partitioned in Spark by customer id
and saved to S3. When trying to read this data back, however, the larger
partitions make it difficult to parallelize jobs: out of a couple thousand
companies, some have less than 10 MB of data while others have more than 10 GB.
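To get a sense of the skew before Spark is even involved, a rough sketch like
this sums the object sizes under each customer partition prefix (the bucket and
prefix names are placeholders, it assumes boto3 is available, and the
customer_id= directory layout is an assumption about how Spark wrote the data):

import boto3

# Sum object sizes under each customer_id=... prefix (placeholder bucket/prefix).
s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

sizes = {}
for page in paginator.paginate(Bucket="[bucket1]", Prefix="[prefix1]/"):
    for obj in page.get("Contents", []):
        # keys look like [prefix1]/customer_id=XYZ/part-*.parquet;
        # adjust the index below to match the actual prefix depth
        partition = obj["Key"].split("/")[1]
        sizes[partition] = sizes.get(partition, 0) + obj["Size"]

# largest partitions first
for partition, total in sorted(sizes.items(), key=lambda kv: -kv[1])[:20]:
    print(partition, round(total / 1e6, 1), "MB")
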
This is the code I'm using in a Zeppelin notebook; reading in a ~200 GB
dataset from S3 this way takes 2+ hours:

df1 = sqlContext.read.parquet("s3a://[bucket1]/[prefix1]/")
df2 = sqlContext.read.parquet("s3a://[bucket2]/[prefix2]/")

# generate a bunch of derived columns here for df1
df1 = df1.withColumn('derivedcol1', df1.source_col)


# limit output columns for later union
df1 = df1.select([limited set of columns])

# generate a bunch of derived columns here for df2
df2 = df2.withColumn('derivedcol1', df2.source_col)

# limit output columns for later union
df2 = df2.select([limited set of columns])

print(df1.rdd.getNumPartitions())
print(df2.rdd.getNumPartitions())

merge_df = df1.unionAll(df2)
# repartition returns a new DataFrame rather than modifying in place
merge_df = merge_df.repartition(300)

merge_df.registerTempTable("union_table")
sqlContext.cacheTable("union_table")
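# Spark is lazy, so the count below is what actually scans the data:
# it reads everything from S3 and populates the cache.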
sqlContext.sql("select count(*) from union_table").collect()

Any suggestions on making this faster?
