Hi,
let's assume I have a dataset that, depending on the input data and
the filter operations applied to it, can be empty. I want to write the
dataset to disk, but files should only be created if the dataset is not
empty. If the dataset is empty I don't want any files.
The default way, dataset.write(...), always creates as many files as
the configured parallelism of the operator, so for an empty dataset
all of those files would be empty as well. I thought about doing
something like:
    if (dataset.count() > 0) {
        dataset.write(...);
    }
but I don't think that's the way to go, because dataset.count() triggers
an execution of the (sub)program, so the plan up to that point would be
computed once for the count and again for the write.
Is there a simple way to avoid creating empty files for empty
datasets?
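To make the behaviour I am after more concrete, here is a minimal
plain-Java sketch of a writer that creates its file only when the first
record arrives. LazyFileWriter and writeRecord are hypothetical names of
mine, not Flink classes or methods, and whether something like this can
be plugged into Flink as a custom output format is only an assumption on
my side:

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper (not part of Flink): the file is created lazily,
// so a writer that never receives a record leaves nothing on disk.
class LazyFileWriter implements AutoCloseable {
    private final Path path;
    private BufferedWriter writer; // stays null until the first record

    LazyFileWriter(Path path) {
        this.path = path;
    }

    void writeRecord(String record) throws IOException {
        if (writer == null) {
            // First record: only now is the file created.
            writer = Files.newBufferedWriter(path, StandardCharsets.UTF_8);
        }
        writer.write(record);
        writer.newLine();
    }

    @Override
    public void close() throws IOException {
        // Nothing to close (and no file) if no record was ever written.
        if (writer != null) {
            writer.close();
        }
    }
}
```

With one such writer per parallel task, a task that sees no records
would produce no file, and no extra count() pass would be needed.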
Regards,
Lars