[jira] [Commented] (BEAM-9354) How long does PubSubIO message deduplication last?

2020-06-03 Thread Tianzi Cai (Jira)


[ https://issues.apache.org/jira/browse/BEAM-9354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17125154#comment-17125154 ]

Tianzi Cai commented on BEAM-9354:
--

Any update?

> How long does PubSubIO message deduplication last?
> --
>
> Key: BEAM-9354
> URL: https://issues.apache.org/jira/browse/BEAM-9354
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-gcp
>Reporter: Tianzi Cai
>Assignee: Reuven Lax
>Priority: P2
>  Labels: gcp, pubsubio, stale-assigned
>
> GCP documentation heavily 
> [promotes|https://cloud.google.com/dataflow/docs/concepts/streaming-with-cloud-pubsub]
>  Beam's PubSubIO for Pub/Sub message deduplication. Yet nothing in the 
> documentation, including the [source 
> code|https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubIO.java],
>  tells users how long this deduplication is supposed to last. 
> In 
> [`PubsubIO.java`|https://github.com/apache/beam/blob/a24bc3bae54f089b93bd66a118bd4bf09dbc9254/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubIO.java#L842-L853]:
> {code:java}
> /**
>  * When reading from Cloud Pub/Sub where unique record identifiers are provided as Pub/Sub
>  * message attributes, specifies the name of the attribute containing the unique identifier. The
>  * value of the attribute can be any string that uniquely identifies this record.
>  *
>  * Pub/Sub cannot guarantee that no duplicate data will be delivered on the Pub/Sub stream.
>  * If {@code idAttribute} is not provided, Beam cannot guarantee that no duplicate data will be
>  * delivered, and deduplication of the stream will be strictly best effort.
>  */
> public Read withIdAttribute(String idAttribute) {
>   return toBuilder().setIdAttribute(idAttribute).build();
> }
> {code}
> This comment does not tell users whether a second message, published with 
> the same custom idAttribute value as a first message published `x` minutes 
> earlier, would be deduplicated by the Dataflow runner. 
> Better documentation will help. I imagine a lot of users will wonder about 
> this and may even ask how to configure this period, but that will probably 
> need a separate ticket.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (BEAM-9354) How long does PubSubIO message deduplication last?

2020-02-21 Thread Tianzi Cai (Jira)
Tianzi Cai created BEAM-9354:


 Summary: How long does PubSubIO message deduplication last?
 Key: BEAM-9354
 URL: https://issues.apache.org/jira/browse/BEAM-9354
 Project: Beam
  Issue Type: Improvement
  Components: io-java-gcp
Reporter: Tianzi Cai


GCP documentation heavily 
[promotes|https://cloud.google.com/dataflow/docs/concepts/streaming-with-cloud-pubsub]
 Beam's PubSubIO for Pub/Sub message deduplication. Yet nothing in the 
documentation, including the [source 
code|https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubIO.java],
 tells users how long this deduplication is supposed to last. 

In 
[`PubsubIO.java`|https://github.com/apache/beam/blob/a24bc3bae54f089b93bd66a118bd4bf09dbc9254/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubIO.java#L842-L853]:
{code:java}
/**
 * When reading from Cloud Pub/Sub where unique record identifiers are provided as Pub/Sub
 * message attributes, specifies the name of the attribute containing the unique identifier. The
 * value of the attribute can be any string that uniquely identifies this record.
 *
 * Pub/Sub cannot guarantee that no duplicate data will be delivered on the Pub/Sub stream.
 * If {@code idAttribute} is not provided, Beam cannot guarantee that no duplicate data will be
 * delivered, and deduplication of the stream will be strictly best effort.
 */
public Read withIdAttribute(String idAttribute) {
  return toBuilder().setIdAttribute(idAttribute).build();
}
{code}
This comment does not tell users whether a second message, published with 
the same custom idAttribute value as a first message published `x` minutes 
earlier, would be deduplicated by the Dataflow runner. 

Better documentation will help. I imagine a lot of users will wonder about this 
and may even ask how to configure this period, but that will probably need a 
separate ticket.
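
To make the question concrete, here is a minimal sketch of what time-bounded 
deduplication by id could look like. It is not Beam's or Dataflow's 
implementation: the class `DedupWindowSketch`, its `accept` method, and the 
retention period are hypothetical names and values chosen for illustration 
only. The sketch assumes a duplicate is a message whose id was already 
accepted less than one retention period earlier.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

/** Illustration only: drops messages whose id was accepted within a retention period. */
final class DedupWindowSketch {
  // Hypothetical retention period. The value a real runner uses is exactly
  // what this ticket asks to have documented.
  private final Duration retention;

  // Time at which each id was last accepted (not merely last seen).
  private final Map<String, Instant> lastAccepted = new HashMap<>();

  DedupWindowSketch(Duration retention) {
    this.retention = retention;
  }

  /** Returns true if the message is emitted, false if it is dropped as a duplicate. */
  boolean accept(String id, Instant publishTime) {
    Instant previous = lastAccepted.get(id);
    boolean duplicate =
        previous != null
            && Duration.between(previous, publishTime).compareTo(retention) < 0;
    if (!duplicate) {
      // Start a new retention period for this id.
      lastAccepted.put(id, publishTime);
    }
    return !duplicate;
  }
}
```

With an assumed retention of 10 minutes, a second message carrying the same 
id 5 minutes after the first is dropped, while one arriving 15 minutes after 
the first is emitted again. Which of the two happens for a given `x` is what 
the documentation should state.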


