[
https://issues.apache.org/jira/browse/BEAM-9354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17125154#comment-17125154
]
Tianzi Cai commented on BEAM-9354:
--
Any update?
> How long does PubSubIO message deduplication last?
> --
>
> Key: BEAM-9354
> URL: https://issues.apache.org/jira/browse/BEAM-9354
> Project: Beam
> Issue Type: Improvement
> Components: io-java-gcp
> Reporter: Tianzi Cai
> Assignee: Reuven Lax
> Priority: P2
> Labels: gcp, pubsubio, stale-assigned
>
> GCP documentation heavily
> [promotes|https://cloud.google.com/dataflow/docs/concepts/streaming-with-cloud-pubsub]
> Beam's PubsubIO for Pub/Sub message deduplication. Yet nothing in the
> documentation, including the [source
> code|https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubIO.java],
> tells users how long this deduplication is supposed to last.
> In
> [`PubsubIO.java`|https://github.com/apache/beam/blob/a24bc3bae54f089b93bd66a118bd4bf09dbc9254/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubIO.java#L842-L853]:
> {code:java}
> /**
>  * When reading from Cloud Pub/Sub where unique record identifiers are provided as Pub/Sub
>  * message attributes, specifies the name of the attribute containing the unique identifier. The
>  * value of the attribute can be any string that uniquely identifies this record.
>  *
>  * Pub/Sub cannot guarantee that no duplicate data will be delivered on the Pub/Sub stream.
>  * If {@code idAttribute} is not provided, Beam cannot guarantee that no duplicate data will be
>  * delivered, and deduplication of the stream will be strictly best effort.
>  */
> public Read withIdAttribute(String idAttribute) {
>   return toBuilder().setIdAttribute(idAttribute).build();
> }
> {code}
> This is not enough for users to know whether a second message, published with
> the same custom idAttribute value as a first message published `x` minutes
> earlier, would be deduplicated by the Dataflow runner. Better documentation
> would help. I imagine many users will wonder about this and may even ask how
> to configure this period, but that will probably need a separate ticket.
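
To make the scenario in the description concrete, here is a minimal sketch of
time-bounded deduplication keyed on a message id. It is NOT Beam's or
Dataflow's implementation; the class name WindowedDeduplicator, the accept
method, and the retention parameter are all hypothetical, and the sketch
assumes messages arrive in publish-time order. It only illustrates why the
length of the window decides whether a repeat published `x` minutes later is
dropped.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical illustration only: remembers when each id was last emitted and
 * drops repeats that arrive within a fixed retention period.
 */
final class WindowedDeduplicator {
    // Assumed deduplication window; the real value is what the issue asks about.
    private final Duration retention;
    // Publish time of the most recently emitted message for each id.
    private final Map<String, Instant> lastEmitted = new HashMap<>();

    WindowedDeduplicator(Duration retention) {
        this.retention = retention;
    }

    /** Returns true if the message should be emitted, false if it is dropped as a duplicate. */
    boolean accept(String id, Instant publishTime) {
        Instant previous = lastEmitted.get(id);
        if (previous != null
                && Duration.between(previous, publishTime).compareTo(retention) < 0) {
            return false; // same id seen less than `retention` ago
        }
        lastEmitted.put(id, publishTime);
        return true;
    }
}
```

With an arbitrary retention of 10 minutes, a repeat of the same id 5 minutes
after the first message is dropped, while a repeat 15 minutes later is emitted
again. Whether the runner behaves like this, and with what period, is exactly
what the documentation does not say.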
--
This message was sent by Atlassian Jira
(v8.3.4#803005)