Hello, I am trying to overwrite partitions when writing a table to HDFS using pyarrow. What is the recommended way to figure out which directories I should clear before writing the dataset?
My current approach is to convert the pyarrow.Table to a pandas DataFrame, group by the partitioning columns, and derive the affected directories from the resulting groups. However, I'd like to avoid the conversion to pandas if possible. Since pyarrow works out where to write the data quite quickly, I hope I can somehow reuse the way it determines the target paths.

Thank you!

Best regards,
Ira
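
P.S. To make the question concrete, here is a rough sketch of the pandas-free direction I have in mind. It is not pyarrow's own path logic, just my guess at it: it assumes hive-style directory names (`key=value`), a pyarrow version where `Table.group_by(...).aggregate([])` returns the distinct key combinations, and it ignores null values and any escaping pyarrow may apply to partition values. The helper names `partition_dirs` and `affected_dirs` are made up for illustration.

```python
import posixpath


def partition_dirs(base, keys, rows):
    """Build hive-style directories (base/key1=v1/key2=v2) for each distinct row.

    `rows` is an iterable of tuples holding one value per partition key.
    """
    dirs = set()
    for row in rows:
        segments = [f"{key}={value}" for key, value in zip(keys, row)]
        dirs.add(posixpath.join(base, *segments))
    return sorted(dirs)


def affected_dirs(table, keys, base):
    """Return the directories a partitioned write of `table` would touch.

    Assumption: `table` is a pyarrow.Table and grouping with an empty
    aggregation list yields one row per distinct combination of `keys`.
    """
    distinct = table.group_by(keys).aggregate([])
    rows = zip(*(distinct.column(key).to_pylist() for key in keys))
    return partition_dirs(base, keys, rows)
```

For a table partitioned by `year` and `month`, I would call `affected_dirs(table, ["year", "month"], "/data/events")` (the base path is just an example) and clear the returned directories before writing.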
