Hi Mani,
BoundedReadFromUnboundedSource was originally intended to be used in
batch pipelines. In batch, runners typically do not perform
checkpointing. In case of failures, they re-run the entire pipeline.
Keep in mind that, even with checkpointing, reading for a finite time in
the processing time domain from an unbounded source rarely gives
consistent results across runs.
However, ignoring the checkpoint looks problematic. We may want to fail
during checkpointing to prevent violating correctness (e.g. exactly-once
semantics).
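To illustrate the "fail during checkpointing" idea, here is a minimal sketch in plain Java. It does not use the Beam API: NonResumableReader and checkpointRejected are hypothetical names made up for this example, and the assumption is simply that a reader which cannot restore its state throws when asked for a checkpoint, instead of handing back a mark that would later be ignored.

```java
public class FailOnCheckpointSketch {

    /** Hypothetical reader that cannot restore state, so it refuses to checkpoint. */
    public static final class NonResumableReader {
        private int recordsRead = 0;

        /** Pretends to read a single record. */
        public void readOne() {
            recordsRead++;
        }

        public int recordsRead() {
            return recordsRead;
        }

        /** Fails loudly rather than returning a mark that would be dropped on restore. */
        public Object getCheckpointMark() {
            throw new UnsupportedOperationException(
                "Checkpointing is not supported; " + recordsRead
                    + " record(s) would be re-read after a restore");
        }
    }

    /** Returns true if the reader rejected the checkpoint request. */
    public static boolean checkpointRejected(NonResumableReader reader) {
        try {
            reader.getCheckpointMark();
            return false;
        } catch (UnsupportedOperationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        NonResumableReader reader = new NonResumableReader();
        reader.readOne();
        System.out.println(checkpointRejected(reader)); // prints "true"
    }
}
```

Whether throwing is the right reaction, and where in the real code it would go, is open for discussion; the sketch only shows the intended effect.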
-Max
On 21.07.20 11:36, Sunny, Mani Kolbe wrote:
Observed on v2.22.0
When withMaxReadTime() is used, Beam creates
a BoundedReadFromUnboundedSource [1]. The ReadFn<T> class
in BoundedReadFromUnboundedSource is responsible for reading
records from the source. As you can see, this class doesn't check whether a
recoverable checkpoint exists. Instead, it always creates the reader with
checkpointMark set to null [2].
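A minimal sketch of why this matters, in plain Java rather than the Beam API. OffsetMark, readBatch and runWithRestart are hypothetical names made up for this illustration, and the "source" is just an in-memory list. The assumption is that a null mark means "start from the beginning", so a second attempt that ignores the saved mark re-reads the same records.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NullCheckpointSketch {

    /** Hypothetical stand-in for a checkpoint mark: the next offset to read. */
    public static final class OffsetMark {
        final int nextOffset;

        public OffsetMark(int nextOffset) {
            this.nextOffset = nextOffset;
        }
    }

    /** Reads up to maxRecords from the log; a null mark means "start at offset 0". */
    public static List<String> readBatch(List<String> log, OffsetMark mark, int maxRecords) {
        int start = (mark == null) ? 0 : mark.nextOffset;
        int end = Math.min(log.size(), start + maxRecords);
        return new ArrayList<>(log.subList(start, end));
    }

    /** Simulates two attempts; the second either restores the mark or ignores it. */
    public static List<String> runWithRestart(List<String> log, boolean restoreCheckpoint) {
        List<String> emitted = new ArrayList<>();

        // First attempt: no checkpoint exists yet, so start from the beginning.
        List<String> first = readBatch(log, null, 2);
        emitted.addAll(first);
        OffsetMark saved = new OffsetMark(first.size());

        // Second attempt after a simulated failure and restore.
        OffsetMark restored = restoreCheckpoint ? saved : null;
        emitted.addAll(readBatch(log, restored, 2));
        return emitted;
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList("a", "b", "c", "d");
        System.out.println(runWithRestart(log, false)); // [a, b, a, b]
        System.out.println(runWithRestart(log, true));  // [a, b, c, d]
    }
}
```

With the mark ignored, "a" and "b" are emitted twice and "c" and "d" are never reached in this toy run.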
Is this the desired behavior? More importantly, do you guys recommend using
withMaxReadTime() in a production setup? Or is this more for demo use cases?
I have created a JIRA issue for this (BEAM-104934
<https://issues.apache.org/jira/browse/BEAM-104934>) and am looking to work
on a patch. But I would like confirmation on the above first.
Regards,
Mani
Reference:
[1] https://github.com/apache/beam/blob/v2.22.0/sdks/java/core/src/main/java/org/apache/beam/sdk/io/Read.java#L205
[2] https://github.com/apache/beam/blob/v2.22.0/sdks/java/core/src/main/java/org/apache/beam/sdk/io/BoundedReadFromUnboundedSource.java#L193