[Python] How to know which partitions dataset.write_dataset will affect when writing?

Ira Saktor Thu, 25 Mar 2021 04:41:26 -0700

Hello,

I am trying to overwrite partitions when writing a table to HDFS using
pyarrow. What is the recommended way to figure out which directories I
should clear before writing the dataset?


My current approach is to convert the pyarrow.Table to a pandas DataFrame,
group by the partitioning columns, and derive the affected directories from
the resulting groups. However, I'd like to avoid the conversion to pandas if
possible. Since pyarrow works out where to write the data quite quickly, I
hope I can somehow reuse the way it determines the target paths.

Thank you!

Best regards,

Ira
