Your sample code first selects the distinct zip codes, and then saves the
rows for each distinct zip code into a separate parquet file.
So I think you can simply partition your data by using the
`DataFrameWriter.partitionBy` API, e.g.,
df.repartition(col("zip_code")).write.partitionBy("zip_code").parquet(...)
-
Hi,
import spark.implicits._   // needed for .as[String]

val df = spark.read.parquet(...)        // input path elided in the original
df.createOrReplaceTempView("df")
val zip = df.select("zip_code").distinct().as[String].rdd

def comp(zipcode: String): Unit = {
  val zipval = s"SELECT * FROM df WHERE zip_code = '$zipcode'"
  val data = spark.sql(zipval)
  data.write.parquet(...)               // output path elided in the original
}
Hi,
You can't invoke RDD actions or transformations inside another
transformation; they must be invoked from the driver.
If I understand your purpose correctly, you can partition your data (i.e.,
use `partitionBy`) when writing out to parquet files.
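To illustrate the driver-side point, a hedged sketch (paths are
hypothetical; assumes the `df` and imports from your snippet above):

import org.apache.spark.sql.functions.col

// Not allowed: comp() calls spark.sql and a write, so invoking it from
// inside an RDD transformation (e.g. zip.map(comp)) fails, because the
// SparkSession is not available on the executors.

// Allowed: collect the distinct zip codes back to the driver, then
// launch one write job per value from the driver.
val zipcodes = df.select("zip_code").distinct().as[String].collect()
zipcodes.foreach { z =>
  df.filter(col("zip_code") === z)
    .write
    .parquet(s"/path/to/output/zip_code=$z")   // hypothetical output path
}

That said, `partitionBy` produces the same layout in a single job, so it
is usually the better choice.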
-
Liang-Chi Hsieh | @viirya
Spark Technology Center