Creating Spark buckets that Presto / Athena / Hive can leverage
Hi there!

I am trying to optimize joins on data created by Spark, so I'd like to bucket the data to avoid shuffling. Every day I write immutable partitions to a local HDFS and then copy them to S3. Is there a combination of `bucketBy` options and DDL that I can use so that Presto/Athena JOINs leverage the special layout of the data? e.g.:

1. `CREATE EXTERNAL TABLE ...` (on Presto/Athena)
2. `df.write.bucketBy(...).partitionBy(...)` (in Spark)
3. copy this data to S3 with `s3-dist-cp`
4. `MSCK REPAIR TABLE` (on Presto/Athena)

Daniel
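As background on why this is tricky (my illustration, not part of the question above): a bucketed write assigns each row to bucket `hash(key) mod numBuckets`, and a join can only skip the shuffle when reader and writer agree on that hash. Spark's native bucketing hashes with Murmur3 while Hive-style bucketing uses a different scheme, so buckets written by one engine are not, in general, usable as buckets by the other. A minimal sketch of the assignment, using `hashCode` purely for illustration:

```scala
// Illustration only: bucket assignment is hash(key) mod numBuckets.
// Spark's real implementation uses Murmur3; Hive uses its own hash,
// which is why a layout written by one engine may not be recognized
// as bucketed by the other.
def bucketFor(key: String, numBuckets: Int): Int = {
  val h = key.hashCode % numBuckets
  if (h < 0) h + numBuckets else h // keep the result in [0, numBuckets)
}

// Equal keys always land in the same bucket, so matching bucket files
// can be joined pairwise without a shuffle:
println(bucketFor("user-42", 8) == bucketFor("user-42", 8))
```

The sketch only shows the invariant the join relies on; whether Presto/Athena can exploit a Spark-written layout depends on the engines agreeing on the hash and file naming.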
Blog post: DataFrame.transform -- Spark function composition
Hi everyone!

I just published a blog post on how custom Spark Scala transformations can be rearranged to compose better and be used with `.transform`: https://medium.com/@dmateusp/dataframe-transform-spark-function-composition-eb8ec296c108

The discussions in this group seem to be largely about issues and performance; this post only concerns itself with code readability, so hopefully it's not off-topic! I welcome any feedback I can get :)

Daniel
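For readers unfamiliar with the pattern the post covers, here is a toy sketch (my own illustration, not code from the post): `DataFrame.transform` takes a `DataFrame => DataFrame` function, so custom transformations written in that shape chain cleanly and can also be pre-composed with `andThen`. Using a stand-in type instead of a real Spark session:

```scala
// Toy stand-in for DataFrame, just to show the .transform chaining shape
// without needing a Spark session.
case class Frame(rows: List[Map[String, String]]) {
  def transform(t: Frame => Frame): Frame = t(this)
}

// Custom transformations written as Frame => Frame compose naturally.
val withGreeting: Frame => Frame =
  f => Frame(f.rows.map(_ + ("greeting" -> "hello")))

val withFarewell: Frame => Frame =
  f => Frame(f.rows.map(_ + ("farewell" -> "bye")))

val df = Frame(List(Map("name" -> "Daniel")))

// Chained step by step, as on a real DataFrame:
val out = df.transform(withGreeting).transform(withFarewell)

// Or pre-composed into a single function with andThen:
val pipeline: Frame => Frame = withGreeting.andThen(withFarewell)
assert(df.transform(pipeline) == out)
```

The same shape carries over to real `DataFrame => DataFrame` functions; only the stand-in `Frame` type here is invented for the sketch.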
[SPARK-SQL] Reading JSON column as a DataFrame and keeping partitioning information
I've been trying to figure this one out for some time now. I have JSON documents representing Products that arrive physically partitioned by brand, and I would like to create a DataFrame from the JSON while also keeping the partitioning information (brand):

```
case class Product(brand: String, value: String)

val df = spark.createDataFrame(Seq(Product("something", """{"a": "b", "c": "d"}""")))
df.write.partitionBy("brand").mode("overwrite").json("/tmp/products5/")

val df2 = spark.read.json("/tmp/products5/")
df2.show
/*
+--------------------+---------+
|               value|    brand|
+--------------------+---------+
|{"a": "b", "c": "d"}|something|
+--------------------+---------+
*/

// This is simple and effective, but it gets rid of the brand!
spark.read.json(df2.select("value").as[String]).show
/*
+---+---+
|  a|  c|
+---+---+
|  b|  d|
+---+---+
*/
```

Ideally I'd like something similar to spark.read.json that would keep the partitioning values and merge them with the rest of the DataFrame. The end result I'm after:

```
/*
+---+---+---------+
|  a|  c|    brand|
+---+---+---------+
|  b|  d|something|
+---+---+---------+
*/
```

Best regards,
Daniel Mateus Pires
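One possible direction (my sketch, not from the original post): Spark encodes partition values in the directory layout (`.../brand=something/part-...`), so the partition value can be recovered from each row's source file, e.g. via Spark's `input_file_name()` function plus a small path parser, and then attached alongside the parsed JSON columns. The parsing step itself needs no Spark session:

```scala
// Sketch (plain Scala): recover a Hive-style partition value from a file
// path such as "/tmp/products5/brand=something/part-00000.json".
// In a Spark job, the path string would come from input_file_name().
def partitionValue(path: String, column: String): Option[String] = {
  val pattern = (java.util.regex.Pattern.quote(column) + "=([^/]+)").r
  pattern.findFirstMatchIn(path).map(_.group(1))
}

val p = "/tmp/products5/brand=something/part-00000.json"
println(partitionValue(p, "brand")) // Some(something)
println(partitionValue(p, "color")) // None
```

Another route worth exploring (not shown here) is parsing the `value` column in place with `from_json` and a known schema, which keeps the original row, and therefore the `brand` column, intact.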