Creating Spark buckets that Presto / Athena / Hive can leverage

2019-06-15 Thread Daniel Mateus Pires
Hi there!

I am trying to optimize joins on data created by Spark, so I'd like to
bucket the data to avoid shuffling.

I am writing to immutable partitions every day by writing data to a local
HDFS and then copying this data to S3, is there a combination of bucketBy
options and DDL that I can use so that Presto/Athena JOINs leverage the
special layout of the data?

e.g.
CREATE EXTERNAL TABLE ... (in Presto/Athena)
df.write.bucketBy(...).partitionBy(...)... (in Spark)
then copy this data to S3 with s3-dist-cp
then MSCK REPAIR TABLE (in Presto/Athena)
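Concretely, the write side I have in mind looks something like the sketch below (the table and column names are made up for illustration). Two caveats I'm aware of: bucketBy only works with saveAsTable (a metastore-backed table), not with a plain path-based save, and as far as I know Spark's bucket hashing and file layout do not follow the Hive bucketing spec, so a table declared with CLUSTERED BY on the Presto/Athena side may not actually be able to exploit Spark-written buckets:

```scala
// Hypothetical write side: partition by ingestion date, bucket on the join key.
// NOTE: .bucketBy requires .saveAsTable; calling .save(path) instead throws.
df.write
  .partitionBy("dt")
  .bucketBy(8, "product_id")
  .sortBy("product_id")
  .format("parquet")
  .saveAsTable("products_bucketed")
```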

Daniel


Blog post: DataFrame.transform -- Spark function composition

2019-06-05 Thread Daniel Mateus Pires
Hi everyone!

I just published this blog post on how Spark Scala custom transformations
can be restructured so that they compose better and work with .transform:


https://medium.com/@dmateusp/dataframe-transform-spark-function-composition-eb8ec296c108
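As a taste of the idea: custom transformations written as DataFrame => DataFrame functions chain cleanly through .transform instead of being nested inside-out. A minimal sketch (ordersDf and the column names are hypothetical):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Parameterized transformation: configuration arguments first, DataFrame last,
// so partial application yields a DataFrame => DataFrame
def withDiscount(rate: Double)(df: DataFrame): DataFrame =
  df.withColumn("discounted", col("price") * (1 - rate))

def withoutNegatives(df: DataFrame): DataFrame =
  df.filter(col("price") >= 0)

// Reads top-to-bottom instead of withDiscount(0.1)(withoutNegatives(ordersDf))
val result = ordersDf
  .transform(withoutNegatives)
  .transform(withDiscount(0.1))
```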

I found the discussions in this group to be largely about issues and
performance; this post concerns itself only with code readability, so
hopefully it is not off-topic!

I welcome any feedback I can get :)

Daniel


[SPARK-SQL] Reading JSON column as a DataFrame and keeping partitioning information

2018-07-20 Thread Daniel Mateus Pires
I've been trying to figure this one out for some time now. I have JSONs 
representing Products that arrive (physically) partitioned by Brand, and I would 
like to create a DataFrame from the JSON while also keeping the partitioning 
information (Brand).

```
case class Product(brand: String, value: String)
val df = spark.createDataFrame(Seq(Product("something", """{"a": "b", "c": "d"}""")))
df.write.partitionBy("brand").mode("overwrite").json("/tmp/products5/")
val df2 = spark.read.json("/tmp/products5/")

df2.show
/*
+--------------------+---------+
|               value|    brand|
+--------------------+---------+
|{"a": "b", "c": "d"}|something|
+--------------------+---------+
*/


// This is simple and effective but it gets rid of the brand!
spark.read.json(df2.select("value").as[String]).show
/*
+---+---+
|  a|  c|
+---+---+
|  b|  d|
+---+---+
*/
```

Ideally I'd like something similar to spark.read.json that would keep the 
partitioning values and merge them with the rest of the DataFrame.

End result I would like:
```
/*
+---+---+---------+
|  a|  c|    brand|
+---+---+---------+
|  b|  d|something|
+---+---+---------+
*/
```
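One approach, assuming the JSON schema is known up front: parse the value column with from_json and carry brand alongside it, then flatten. The schema and column names below match the example above; for a real dataset you would declare (or infer once) the full Product schema:

```scala
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Schema of the JSON payload inside the "value" column
val schema = StructType(Seq(
  StructField("a", StringType),
  StructField("c", StringType)
))

// Parse the JSON column while keeping the partition column, then flatten
df2.select(from_json(col("value"), schema).as("parsed"), col("brand"))
   .select("parsed.*", "brand")
   .show
```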

Best regards,
Daniel Mateus Pires