Hi Mani,

BoundedReadFromUnboundedSource was originally intended to be used in batch pipelines. In batch, runners typically do not perform checkpointing. In case of failures, they re-run the entire pipeline.

Keep in mind that, even with checkpointing, reading for a finite time in the processing time domain from an unbounded source rarely gives consistent results across runs.

However, ignoring the checkpoint looks problematic. We may want to fail during checkpointing to prevent violating correctness (e.g. exactly-once semantics).

-Max

On 21.07.20 11:36, Sunny, Mani Kolbe wrote:
Observed on v2.22.0

When withMaxReadTime() is used, Beam creates a BoundedReadFromUnboundedSource [1]. The ReadFn<T> class in BoundedReadFromUnboundedSource is responsible for reading records from the source. This class does not check whether a recoverable checkpoint exists. Instead, it always creates the Reader with checkpointMark set to null [2].
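
To make the concern concrete, here is a minimal, self-contained sketch of what happens when a reader is always created with a null checkpoint mark versus being created from a saved one. It does not use the Beam API: OffsetMark, CountingReader and readBatch are hypothetical stand-ins invented for illustration, and the "source" is just the sequence 0, 1, 2, ...

```java
import java.util.ArrayList;
import java.util.List;

public class CheckpointSketch {

    /** Hypothetical stand-in for a checkpoint mark: the next offset to read. */
    static final class OffsetMark {
        final int nextOffset;

        OffsetMark(int nextOffset) {
            this.nextOffset = nextOffset;
        }
    }

    /** Hypothetical reader over the endless sequence 0, 1, 2, ... */
    static final class CountingReader {
        private int next;

        CountingReader(OffsetMark mark) {
            // A null mark means "start from the beginning", which is the
            // situation described above when checkpointMark is always null.
            this.next = (mark == null) ? 0 : mark.nextOffset;
        }

        int read() {
            return next++;
        }

        OffsetMark checkpoint() {
            return new OffsetMark(next);
        }
    }

    /** Reads a bounded number of records from an otherwise unbounded reader. */
    static List<Integer> readBatch(CountingReader reader, int maxRecords) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < maxRecords; i++) {
            out.add(reader.read());
        }
        return out;
    }

    public static void main(String[] args) {
        CountingReader first = new CountingReader(null);
        List<Integer> firstBatch = readBatch(first, 3);
        OffsetMark mark = first.checkpoint();

        // Ignoring the saved mark re-reads the same records.
        List<Integer> ignoring = readBatch(new CountingReader(null), 3);
        // Honoring the saved mark continues where the first read stopped.
        List<Integer> resuming = readBatch(new CountingReader(mark), 3);

        System.out.println("first:    " + firstBatch);
        System.out.println("ignoring: " + ignoring);
        System.out.println("resuming: " + resuming);
    }
}
```

In this toy model, the reader that ignores the mark returns the same three records again, while the one that honors it continues from offset 3. Whether the real ReadFn should resume, or fail when a checkpoint is present, is the open question here.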

Is this the desired behavior? More importantly, do you guys recommend using withMaxReadTime() in a production setup? Or is it more for demo use cases?

I have created a Jira issue for this (BEAM-104934 <https://issues.apache.org/jira/browse/BEAM-104934>) and am looking to work on a patch. But I would like confirmation on the above first.

Regards,

Mani

Reference:

[1] https://github.com/apache/beam/blob/v2.22.0/sdks/java/core/src/main/java/org/apache/beam/sdk/io/Read.java#L205

[2] https://github.com/apache/beam/blob/v2.22.0/sdks/java/core/src/main/java/org/apache/beam/sdk/io/BoundedReadFromUnboundedSource.java#L193
