If your processing task inherently processes input data by month, you
may want to "manually" partition the output data by month as well as
by day, that is, save it under a path that includes the given month,
e.g. "dataset.parquet/month=01". You can then use the overwrite save
mode for each month partition separately. Hope this helps.
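
Something along these lines might work, purely as an illustrative
sketch (the helper name, the month format, and the assumption that the
month can be derived from your "day" column are mine, not from your
code):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("monthly-overwrite").getOrCreate()

    def write_month(df, month, base_path="dataset.parquet"):
        """Overwrite only the subdirectory for one month, leaving other months intact."""
        month_df = df.filter(F.date_format(F.col("day"), "yyyy-MM") == month)
        (month_df.write
            .mode("overwrite")      # only this month's folder is replaced
            .partitionBy("day")     # still gives day=... subfolders inside it
            .parquet("{}/month={}".format(base_path, month)))

    # Re-running the job for the same month is then idempotent, e.g.:
    # write_month(data_frame, "2017-01")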

-- 
Pavel Knoblokh

On Fri, Sep 29, 2017 at 5:31 PM, peay <p...@protonmail.com> wrote:
> Hello,
>
> I am trying to use
> data_frame.write.partitionBy("day").save("dataset.parquet") to write a
> dataset while splitting by day.
>
> I would like to run a Spark job to process, e.g., a month:
> dataset.parquet/day=2017-01-01/...
> ...
>
> and then run another Spark job to add another month using the same folder
> structure, getting me
> dataset.parquet/day=2017-01-01/
> ...
> dataset.parquet/day=2017-02-01/
> ...
>
> However:
> - with save mode "overwrite", when I process the second month, all of
> dataset.parquet/ gets removed and I lose whatever was already computed for
> the previous month.
> - with save mode "append", I can't get idempotence: if I run the job to
> process a given month twice, I'll get duplicate data in all the subfolders
> for that month.
>
> Is there a way to do "append" in terms of the subfolders from partitionBy,
> but "overwrite" within each such partition? Any help would be appreciated.
>
> Thanks!




