Re: Extracting Input and Output Partitions in Spark

2024-02-13 Thread Daniel Saha
This would be helpful for a few use cases. For context, my team works in the security space, and customers access data through a wrapper around Spark SQL connected to the Hive metastore. 1. When snapshot (non-partitioned) tables are queried, it's not clear when the underlying snapshot was last updated. hav…

Re: Extracting Input and Output Partitions in Spark

2024-02-12 Thread Aditya Sohoni
Sharing an example since a few people asked me off-list: We have stored the partition details in the read/write nodes of the physical plan. So they can be accessed via the plan, e.g. plan.getInputPartitions or plan.getOutputPartitions, which internally loop through the nodes in the plan and collec…
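The collection described above can be sketched roughly as follows. This is a simplified, Spark-free model for illustration only: PlanNode, HiveTableScan, InsertIntoHive, Project, and the two method names are assumptions standing in for the real physical-plan classes, not Spark's actual API.

```scala
// Simplified stand-in for a physical plan tree (not Spark's real classes).
sealed trait PlanNode { def children: Seq[PlanNode] }
case class HiveTableScan(table: String, partitions: Seq[String]) extends PlanNode {
  def children: Seq[PlanNode] = Nil
}
case class InsertIntoHive(table: String, partitions: Seq[String],
                          children: Seq[PlanNode]) extends PlanNode
case class Project(children: Seq[PlanNode]) extends PlanNode

object PartitionExtractor {
  // Depth-first walk over every node in the plan.
  private def allNodes(node: PlanNode): Seq[PlanNode] =
    node +: node.children.flatMap(allNodes)

  // Input partitions are gathered from the read (scan) nodes.
  def getInputPartitions(plan: PlanNode): Seq[String] =
    allNodes(plan).collect { case s: HiveTableScan => s.partitions }.flatten

  // Output partitions are gathered from the write (insert) nodes.
  def getOutputPartitions(plan: PlanNode): Seq[String] =
    allNodes(plan).collect { case w: InsertIntoHive => w.partitions }.flatten
}
```

For example, a plan that inserts into db.daily_out while scanning two partitions of db.events would report both ds partitions as inputs and ds=2024-02-12 as the output:

```scala
val plan = InsertIntoHive("db.daily_out", Seq("ds=2024-02-12"),
  Seq(Project(Seq(HiveTableScan("db.events", Seq("ds=2024-02-11", "ds=2024-02-12"))))))
PartitionExtractor.getInputPartitions(plan)  // Seq("ds=2024-02-11", "ds=2024-02-12")
PartitionExtractor.getOutputPartitions(plan) // Seq("ds=2024-02-12")
```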

Extracting Input and Output Partitions in Spark

2024-01-30 Thread Aditya Sohoni
Hello Spark Devs! We are from Uber's Spark team. Our ETL jobs use Spark to read from and write to Hive datasets stored in HDFS. The freshness of the partition written to depends on the freshness of the data in the input partition(s). We monitor this freshness score, so that partitions in our criti…
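The dependency stated above amounts to: an output partition can be no fresher than its stalest input partition. A minimal sketch of such a score, where the function names and the lag-based alerting rule are illustrative assumptions rather than Uber's actual implementation:

```scala
// An output partition is only as fresh as its stalest input partition,
// so the effective freshness is the minimum last-updated time over all inputs.
def effectiveFreshnessMillis(inputLastUpdated: Map[String, Long]): Long =
  inputLastUpdated.values.min

// Flag the output as stale when its effective freshness lags "now" by more
// than an SLA threshold (hypothetical alerting rule for illustration).
def isStale(inputLastUpdated: Map[String, Long],
            nowMillis: Long, maxLagMillis: Long): Boolean =
  nowMillis - effectiveFreshnessMillis(inputLastUpdated) > maxLagMillis
```

With partition-level input/output information available from the plan, this score can be computed per written partition rather than per table, which is what makes partition extraction useful for freshness monitoring.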