Re: Count based triggers and latency
On Mon, Oct 12, 2020 at 9:23 PM KV 59 wrote:

> Does it mean, I as a pipeline developer cannot control how often the runner triggers?

(Please correct me if I am wrong.) Yes: as I recall, once the trigger conditions are met the runner is allowed to fire the trigger, but the runner can choose when it actually fires.
Re: Count based triggers and latency
Thanks for your responses. I have a follow-up question. When you say

> elementCountAtLeast means that the runner can buffer as many as it wants and can decide to offer a low latency pipeline by triggering often or better throughput through the use of buffering.

does it mean that I, as a pipeline developer, cannot control how often the runner triggers?

Kishore
Re: Count based triggers and latency
Another thing to keep in mind - apologies if it was already clear: triggering governs aggregation (GBK / Combine). It does not have any effect on stateful DoFn.
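[To make the distinction concrete, here is a minimal sketch, not taken from this thread; the class name CountPerKey is hypothetical. A stateful DoFn reads and writes state and emits output on every element it processes, per key and window; no trigger is consulted anywhere.]

import org.apache.beam.sdk.coders.VarLongCoder;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.ValueState;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

class CountPerKey extends DoFn<KV<String, Long>, KV<String, Long>> {
  @StateId("count")
  private final StateSpec<ValueState<Long>> countSpec =
      StateSpecs.value(VarLongCoder.of());

  @ProcessElement
  public void process(
      @Element KV<String, Long> element,
      @StateId("count") ValueState<Long> count,
      OutputReceiver<KV<String, Long>> out) {
    Long previous = count.read();
    long current = (previous == null ? 0L : previous) + 1;
    count.write(current);
    // Output happens here, once per input element; any upstream triggering
    // configuration plays no part in when this runs.
    out.output(KV.of(element.getKey(), current));
  }
}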
Re: Querying Dataflow job status via Java SDK
Thanks for the link, Steve - very helpful!
Re: Querying Dataflow job status via Java SDK
This is what I was referencing:
https://github.com/googleapis/google-api-java-client-services/tree/master/clients/google-api-services-dataflow/v1b3
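[A minimal sketch of polling a job's state with the v1b3 Java client linked above, assuming application-default credentials; the class name JobStatusPoller and the project/region/job IDs are placeholders, and error handling is elided.]

import com.google.api.client.googleapis.javanet.GoogleNetHttpTransport;
import com.google.api.client.json.gson.GsonFactory;
import com.google.api.services.dataflow.Dataflow;
import com.google.api.services.dataflow.model.Job;
import com.google.auth.http.HttpCredentialsAdapter;
import com.google.auth.oauth2.GoogleCredentials;

public class JobStatusPoller {
  public static void main(String[] args) throws Exception {
    GoogleCredentials credentials = GoogleCredentials.getApplicationDefault()
        .createScoped("https://www.googleapis.com/auth/cloud-platform");
    Dataflow dataflow = new Dataflow.Builder(
            GoogleNetHttpTransport.newTrustedTransport(),
            GsonFactory.getDefaultInstance(),
            new HttpCredentialsAdapter(credentials))
        .setApplicationName("job-status-poller")
        .build();

    // projects.locations.jobs.get: project, region, and job ID uniquely
    // identify the job, matching Kyle's comment downthread.
    Job job = dataflow.projects().locations().jobs()
        .get("my-project", "us-central1", "2020-10-12_00_00_00-1234567890")
        .execute();
    System.out.println(job.getCurrentState()); // e.g. JOB_STATE_RUNNING
  }
}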
Re: Querying Dataflow job status via Java SDK
Thanks for the replies, Lukasz and Steve!

Steve: do you have a link to the google client api wrappers? (I'm not sure I know what they are.)

Thank you!
Re: Querying Dataflow job status via Java SDK
We use the Dataflow API [1] directly, via the Google API client wrappers (both Python and Java), pretty extensively. It works well and doesn't require a dependency on Beam.

[1] https://cloud.google.com/dataflow/docs/reference/rest
Re: Querying Dataflow job status via Java SDK
It is your best way to do this right now, and this hasn't changed in a while (region was added to project and job IDs in the past 6 years).
Re: Querying Dataflow job status via Java SDK
Thanks for the reply, Kyle.

The DataflowClient::getJob method uses a Dataflow instance that's provided at construction time (via DataflowPipelineOptions::getDataflowClient). If that Dataflow instance can be obtained from a minimal instance of the options (i.e., containing only the project ID and region) then it looks like everything should work.

I suppose a secondary question here is whether or not this approach is the recommended way to solve my problem (but I don't know of any alternatives).
Re: Querying Dataflow job status via Java SDK
> I think the answer is to use a DataflowClient in the second service, but creating one requires DataflowPipelineOptions. Are these options supposed to be exactly the same as those used by the first service? Or do only some of the fields have to be the same?

Most options are not necessary for retrieving a job. In general, Dataflow jobs can always be uniquely identified by the project, region and job ID.
https://github.com/apache/beam/blob/ecedd3e654352f1b51ab2caae0fd4665403bd0eb/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowClient.java#L100
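[A minimal sketch of that approach, assuming the three identifiers Kyle names are all that is needed; the helper name jobState and the IDs are hypothetical.]

import java.io.IOException;
import com.google.api.services.dataflow.model.Job;
import org.apache.beam.runners.dataflow.DataflowClient;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

// Look up a job's state from a service that did not launch it.
static String jobState(String project, String region, String jobId) throws IOException {
  DataflowPipelineOptions options =
      PipelineOptionsFactory.as(DataflowPipelineOptions.class);
  options.setProject(project);
  options.setRegion(region);
  Job job = DataflowClient.create(options).getJob(jobId);
  return job.getCurrentState(); // e.g. "JOB_STATE_RUNNING"
}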
Querying Dataflow job status via Java SDK
Hello, Beam users!

Suppose I want to build two (Java) services, one that launches (long-running) dataflow jobs, and the other that monitors the status of dataflow jobs. Within a single service, I could simply track a PipelineResult for each dataflow run and periodically call getState. How can I monitor job status like this from a second, independent service?

I think the answer is to use a DataflowClient in the second service, but creating one requires DataflowPipelineOptions. Are these options supposed to be exactly the same as those used by the first service? Or do only some of the fields have to be the same?

Or maybe there's a better alternative than DataflowClient?

Thanks in advance!

Peter
Re: Count based triggers and latency
The default trigger will only fire when the global window closes, which does happen with sources whose watermark advances past GlobalWindow.MAX_TIMESTAMP, or during pipeline drain with partial results in streaming. Bounded sources commonly have their watermark advance to the end of time when they complete, and some unbounded sources can stop producing output if they detect the end.

Parallelization for stateful DoFns is per key and window. Parallelization for GBK is per key and window pane. Note that elementCountAtLeast means that the runner can buffer as many elements as it wants: it can decide to offer a low-latency pipeline by triggering often, or better throughput through the use of buffering.
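[If the goal is low latency without forcing a pane per element, a common middle ground (a sketch under stated assumptions, not something proposed in this thread) is a composite trigger that fires after a batch of elements or after a processing-time delay, whichever comes first. The element type MyEvent and both constants are placeholders to tune.]

import org.joda.time.Duration;
import org.apache.beam.sdk.transforms.windowing.AfterFirst;
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Window;

// Fire when 100 elements have arrived OR 5 seconds of processing time have
// passed since the first element in the pane, whichever happens first.
input.apply("WindowIntoGlobal",
    Window.<MyEvent>into(new GlobalWindows())
        .triggering(Repeatedly.forever(AfterFirst.of(
            AfterPane.elementCountAtLeast(100),
            AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(Duration.standardSeconds(5)))))
        .withAllowedLateness(Duration.standardDays(1))
        .discardingFiredPanes());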
Count based triggers and latency
Hi All,

I'm building a pipeline to process events as they come and do not really care about the event time and watermark. I'm more interested in not discarding events and in reducing latency. The downstream pipeline has a stateful DoFn. I understand that the default window strategy is Global Windows. I did not completely understand the default trigger; per https://beam.apache.org/releases/javadoc/2.23.0/org/apache/beam/sdk/transforms/windowing/DefaultTrigger.html it is Repeatedly.forever(AfterWatermark.pastEndOfWindow()). In the case of a global window, how does this work (there is no end of window)?

My source is Google PubSub and the pipeline is running on the Dataflow runner. I have defined my window transform as below:

input.apply(TRANSFORM_NAME, Window.into(new GlobalWindows())
    .triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(1)))
    .withAllowedLateness(Duration.standardDays(1))
    .discardingFiredPanes());

A couple of questions:

1. Is triggering after each element inefficient in terms of persistence (serialization) after each element, and also in terms of parallelism? Triggering after each element looks like serial execution.
2. How does Dataflow parallelize in such cases of triggers?

Thanks and appreciate the responses.

Kishore