Re: Count based triggers and latency

2020-10-12 Thread Rui Wang
On Mon, Oct 12, 2020 at 9:23 PM KV 59  wrote:

> Thanks for your responses.
>
> I have a follow-up question: when you say
>
>> elementCountAtLeast means the runner can buffer as many elements as it
>> wants and can decide to offer a low-latency pipeline by triggering often,
>> or better throughput through buffering.
>
>
> Does it mean that I, as a pipeline developer, cannot control how often the
> runner triggers?
>

(Please correct me if I am wrong.)

Yes, as I recall, once the trigger conditions are met the runner is allowed
to fire the trigger, but the runner can choose exactly when to fire.
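
For what it's worth, you can usually tighten that tradeoff yourself by
composing a count trigger with a processing-time trigger, so the runner gets
an explicit latency target instead of full discretion. A rough, untested
sketch (MyEvent, the element count and the delay are just placeholders):

import org.apache.beam.sdk.transforms.windowing.AfterFirst;
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;

// Fire a pane when either 1000 elements have buffered or 2 seconds of
// processing time have passed since the first element of the pane,
// whichever comes first.
input.apply("WindowAndTrigger",
    Window.<MyEvent>into(new GlobalWindows())
        .triggering(Repeatedly.forever(
            AfterFirst.of(
                AfterPane.elementCountAtLeast(1000),
                AfterProcessingTime.pastFirstElementInPane()
                    .plusDelayOf(Duration.standardSeconds(2)))))
        .withAllowedLateness(Duration.standardDays(1))
        .discardingFiredPanes());

The runner still decides exactly when a pane fires, but the processing-time
part gives it a concrete latency target rather than leaving the
latency/throughput tradeoff entirely to its discretion.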



>
> Kishore
>
> On Mon, Oct 12, 2020 at 2:15 PM Kenneth Knowles  wrote:
>
>> Another thing to keep in mind - apologies if it was already clear:
>> triggering governs aggregation (GBK / Combine). It does not have any effect
>> on stateful DoFn.
>>
>> On Mon, Oct 12, 2020 at 9:24 AM Luke Cwik  wrote:
>>
>>> The default trigger will only fire when the global window closes, which
>>> does happen with sources whose watermark advances beyond
>>> GlobalWindow.MAX_TIMESTAMP, or during a pipeline drain with partial
>>> results in streaming. Bounded sources commonly advance their watermark to
>>> the end of time when they complete, and some unbounded sources can stop
>>> producing output if they detect the end of their input.
>>>
>>> Parallelization for stateful DoFns is per key and window.
>>> Parallelization for GBK is per key and window pane. Note that
>>> elementCountAtLeast means the runner can buffer as many elements as it
>>> wants and can decide to offer a low-latency pipeline by triggering often,
>>> or better throughput through buffering.
>>>
>>>
>>>
>>> On Mon, Oct 12, 2020 at 8:22 AM KV 59  wrote:
>>>
 Hi All,

 I'm building a pipeline to process events as they come and do not really
 care about event time or the watermark. I'm more interested in not
 discarding events and in reducing latency. The downstream pipeline has a
 stateful DoFn. I understand that the default windowing strategy is Global
 Windows. I did not completely understand the default trigger as per

 https://beam.apache.org/releases/javadoc/2.23.0/org/apache/beam/sdk/transforms/windowing/DefaultTrigger.html
 which says Repeatedly.forever(AfterWatermark.pastEndOfWindow()). In the case
 of a global window, how does this work (there is no end of window)?

 My source is Google PubSub and the pipeline is running on the Dataflow
 runner. I have defined my window transform as below:

 input.apply(TRANSFORM_NAME, Window.into(new GlobalWindows())
> .triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(1)))
> .withAllowedLateness(Duration.standardDays(1)).discardingFiredPanes())


 A couple of questions

    1. Is triggering after each element inefficient in terms of
    persistence (serialization) after each element, and also in terms of
    parallelism? Triggering after each element looks like serial execution.
2. How does Dataflow parallelize in such cases of triggers?


 Thanks and appreciate the responses.

 Kishore

>>>


Re: Count based triggers and latency

2020-10-12 Thread KV 59
Thanks for your responses.

I have a follow-up question: when you say

> elementCountAtLeast means the runner can buffer as many elements as it
> wants and can decide to offer a low-latency pipeline by triggering often,
> or better throughput through buffering.


Does it mean that I, as a pipeline developer, cannot control how often the
runner triggers?

Kishore

On Mon, Oct 12, 2020 at 2:15 PM Kenneth Knowles  wrote:

> Another thing to keep in mind - apologies if it was already clear:
> triggering governs aggregation (GBK / Combine). It does not have any effect
> on stateful DoFn.
>
> On Mon, Oct 12, 2020 at 9:24 AM Luke Cwik  wrote:
>
>> The default trigger will only fire when the global window closes, which
>> does happen with sources whose watermark advances beyond
>> GlobalWindow.MAX_TIMESTAMP, or during a pipeline drain with partial
>> results in streaming. Bounded sources commonly advance their watermark to
>> the end of time when they complete, and some unbounded sources can stop
>> producing output if they detect the end of their input.
>>
>> Parallelization for stateful DoFns is per key and window.
>> Parallelization for GBK is per key and window pane. Note that
>> elementCountAtLeast means the runner can buffer as many elements as it
>> wants and can decide to offer a low-latency pipeline by triggering often,
>> or better throughput through buffering.
>>
>>
>>
>> On Mon, Oct 12, 2020 at 8:22 AM KV 59  wrote:
>>
>>> Hi All,
>>>
>>> I'm building a pipeline to process events as they come and do not really
>>> care about event time or the watermark. I'm more interested in not
>>> discarding events and in reducing latency. The downstream pipeline has a
>>> stateful DoFn. I understand that the default windowing strategy is Global
>>> Windows. I did not completely understand the default trigger as per
>>>
>>> https://beam.apache.org/releases/javadoc/2.23.0/org/apache/beam/sdk/transforms/windowing/DefaultTrigger.html
>>> which says Repeatedly.forever(AfterWatermark.pastEndOfWindow()). In the
>>> case of a global window, how does this work (there is no end of window)?
>>>
>>> My source is Google PubSub and the pipeline is running on the Dataflow
>>> runner. I have defined my window transform as below:
>>>
>>> input.apply(TRANSFORM_NAME, Window.into(new GlobalWindows())
 .triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(1)))
 .withAllowedLateness(Duration.standardDays(1)).discardingFiredPanes())
>>>
>>>
>>> A couple of questions
>>>
>>>    1. Is triggering after each element inefficient in terms of
>>>    persistence (serialization) after each element, and also in terms of
>>>    parallelism? Triggering after each element looks like serial execution.
>>>2. How does Dataflow parallelize in such cases of triggers?
>>>
>>>
>>> Thanks and appreciate the responses.
>>>
>>> Kishore
>>>
>>


Re: Count based triggers and latency

2020-10-12 Thread Kenneth Knowles
Another thing to keep in mind - apologies if it was already clear:
triggering governs aggregation (GBK / Combine). It does not have any effect
on stateful DoFn.
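
Put differently, a stateful DoFn decides for itself when to emit, in
@ProcessElement or from a timer; the upstream trigger does not gate its
output. A minimal, untested sketch just to illustrate (the key/value types
are placeholders):

import org.apache.beam.sdk.coders.VarLongCoder;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.ValueState;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

// Counts elements per key and emits on every element, regardless of any
// trigger configured on the upstream windowing.
class CountPerKeyFn extends DoFn<KV<String, String>, KV<String, Long>> {

  @StateId("count")
  private final StateSpec<ValueState<Long>> countSpec =
      StateSpecs.value(VarLongCoder.of());

  @ProcessElement
  public void processElement(ProcessContext c,
      @StateId("count") ValueState<Long> count) {
    Long previous = count.read();  // null the first time a key is seen
    long current = (previous == null ? 0L : previous) + 1;
    count.write(current);
    c.output(KV.of(c.element().getKey(), current));
  }
}

Triggers only come into play once the data reaches a GroupByKey or Combine;
they decide when those aggregations emit panes.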

On Mon, Oct 12, 2020 at 9:24 AM Luke Cwik  wrote:

> The default trigger will only fire when the global window closes, which
> does happen with sources whose watermark advances beyond
> GlobalWindow.MAX_TIMESTAMP, or during a pipeline drain with partial
> results in streaming. Bounded sources commonly advance their watermark to
> the end of time when they complete, and some unbounded sources can stop
> producing output if they detect the end of their input.
>
> Parallelization for stateful DoFns is per key and window. Parallelization
> for GBK is per key and window pane. Note that elementCountAtLeast means
> the runner can buffer as many elements as it wants and can decide to offer
> a low-latency pipeline by triggering often, or better throughput through
> buffering.
>
>
>
> On Mon, Oct 12, 2020 at 8:22 AM KV 59  wrote:
>
>> Hi All,
>>
>> I'm building a pipeline to process events as they come and do not really
>> care about event time or the watermark. I'm more interested in not
>> discarding events and in reducing latency. The downstream pipeline has a
>> stateful DoFn. I understand that the default windowing strategy is Global
>> Windows. I did not completely understand the default trigger as per
>>
>> https://beam.apache.org/releases/javadoc/2.23.0/org/apache/beam/sdk/transforms/windowing/DefaultTrigger.html
>> which says Repeatedly.forever(AfterWatermark.pastEndOfWindow()). In the
>> case of a global window, how does this work (there is no end of window)?
>>
>> My source is Google PubSub and the pipeline is running on the Dataflow
>> runner. I have defined my window transform as below:
>>
>> input.apply(TRANSFORM_NAME, Window.into(new GlobalWindows())
>>> .triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(1)))
>>> .withAllowedLateness(Duration.standardDays(1)).discardingFiredPanes())
>>
>>
>> A couple of questions
>>
>>    1. Is triggering after each element inefficient in terms of
>>    persistence (serialization) after each element, and also in terms of
>>    parallelism? Triggering after each element looks like serial execution.
>>2. How does Dataflow parallelize in such cases of triggers?
>>
>>
>> Thanks and appreciate the responses.
>>
>> Kishore
>>
>


Re: Querying Dataflow job status via Java SDK

2020-10-12 Thread Peter Littig
Thanks for the link, Steve - very helpful!

On Mon, Oct 12, 2020 at 11:31 AM Steve Niemitz  wrote:

> This is what I was referencing:
> https://github.com/googleapis/google-api-java-client-services/tree/master/clients/google-api-services-dataflow/v1b3
>
>
>
>
> On Mon, Oct 12, 2020 at 2:23 PM Peter Littig 
> wrote:
>
>> Thanks for the replies, Lukasz and Steve!
>>
>> Steve: do you have a link to the google client api wrappers? (I'm not sure
>> I know what they are.)
>>
>> Thank you!
>>
>> On Mon, Oct 12, 2020 at 11:04 AM Steve Niemitz 
>> wrote:
>>
>>> We use the Dataflow API [1] directly, via the google api client wrappers
>>> (both python and java), pretty extensively.  It works well and doesn't
>>> require a dependency on beam.
>>>
>>> [1] https://cloud.google.com/dataflow/docs/reference/rest
>>>
>>> On Mon, Oct 12, 2020 at 1:56 PM Luke Cwik  wrote:
>>>
 That is the best way to do this right now, and it hasn't changed in a
 while (region was added to the project and job IDs in the past six years).

 On Mon, Oct 12, 2020 at 10:53 AM Peter Littig 
 wrote:

> Thanks for the reply, Kyle.
>
> The DataflowClient::getJob method uses a Dataflow instance that's
> provided at construction time (via
> DataflowPipelineOptions::getDataflowClient). If that Dataflow instance can
> be obtained from a minimal instance of the options (i.e., containing only
> the project ID and region) then it looks like everything should work.
>
> I suppose a secondary question here is whether or not this approach is
> the recommended way to solve my problem (but I don't know of any
> alternatives).
>
> On Mon, Oct 12, 2020 at 9:55 AM Kyle Weaver 
> wrote:
>
>> > I think the answer is to use a DataflowClient in the second
>> service, but creating one requires DataflowPipelineOptions. Are these
>> options supposed to be exactly the same as those used by the first 
>> service?
>> Or do only some of the fields have to be the same?
>>
>> Most options are not necessary for retrieving a job. In general,
>> Dataflow jobs can always be uniquely identified by the project, region 
>> and
>> job ID.
>> https://github.com/apache/beam/blob/ecedd3e654352f1b51ab2caae0fd4665403bd0eb/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowClient.java#L100
>>
>> On Mon, Oct 12, 2020 at 9:31 AM Peter Littig 
>> wrote:
>>
>>> Hello, Beam users!
>>>
>>> Suppose I want to build two (Java) services, one that launches
>>> (long-running) dataflow jobs, and the other that monitors the status of
>>> dataflow jobs. Within a single service, I could simply track a
>>> PipelineResult for each dataflow run and periodically call getState. How
>>> can I monitor job status like this from a second, independent service?
>>>
>>> I think the answer is to use a DataflowClient in the second service,
>>> but creating one requires DataflowPipelineOptions. Are these options
>>> supposed to be exactly the same as those used by the first service? Or 
>>> do
>>> only some of the fields have to be the same?
>>>
>>> Or maybe there's a better alternative than DataflowClient?
>>>
>>> Thanks in advance!
>>>
>>> Peter
>>>
>>


Re: Querying Dataflow job status via Java SDK

2020-10-12 Thread Steve Niemitz
This is what I was referencing:
https://github.com/googleapis/google-api-java-client-services/tree/master/clients/google-api-services-dataflow/v1b3




On Mon, Oct 12, 2020 at 2:23 PM Peter Littig 
wrote:

> Thanks for the replies, Lukasz and Steve!
>
> Steve: do you have a link to the google client api wrappers? (I'm not sure
> I know what they are.)
>
> Thank you!
>
> On Mon, Oct 12, 2020 at 11:04 AM Steve Niemitz 
> wrote:
>
>> We use the Dataflow API [1] directly, via the google api client wrappers
>> (both python and java), pretty extensively.  It works well and doesn't
>> require a dependency on beam.
>>
>> [1] https://cloud.google.com/dataflow/docs/reference/rest
>>
>> On Mon, Oct 12, 2020 at 1:56 PM Luke Cwik  wrote:
>>
>>> That is the best way to do this right now, and it hasn't changed in a
>>> while (region was added to the project and job IDs in the past six years).
>>>
>>> On Mon, Oct 12, 2020 at 10:53 AM Peter Littig 
>>> wrote:
>>>
 Thanks for the reply, Kyle.

 The DataflowClient::getJob method uses a Dataflow instance that's
 provided at construction time (via
 DataflowPipelineOptions::getDataflowClient). If that Dataflow instance can
 be obtained from a minimal instance of the options (i.e., containing only
 the project ID and region) then it looks like everything should work.

 I suppose a secondary question here is whether or not this approach is
 the recommended way to solve my problem (but I don't know of any
 alternatives).

 On Mon, Oct 12, 2020 at 9:55 AM Kyle Weaver 
 wrote:

> > I think the answer is to use a DataflowClient in the second service,
> but creating one requires DataflowPipelineOptions. Are these options
> supposed to be exactly the same as those used by the first service? Or do
> only some of the fields have to be the same?
>
> Most options are not necessary for retrieving a job. In general,
> Dataflow jobs can always be uniquely identified by the project, region and
> job ID.
> https://github.com/apache/beam/blob/ecedd3e654352f1b51ab2caae0fd4665403bd0eb/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowClient.java#L100
>
> On Mon, Oct 12, 2020 at 9:31 AM Peter Littig 
> wrote:
>
>> Hello, Beam users!
>>
>> Suppose I want to build two (Java) services, one that launches
>> (long-running) dataflow jobs, and the other that monitors the status of
>> dataflow jobs. Within a single service, I could simply track a
>> PipelineResult for each dataflow run and periodically call getState. How
>> can I monitor job status like this from a second, independent service?
>>
>> I think the answer is to use a DataflowClient in the second service,
>> but creating one requires DataflowPipelineOptions. Are these options
>> supposed to be exactly the same as those used by the first service? Or do
>> only some of the fields have to be the same?
>>
>> Or maybe there's a better alternative than DataflowClient?
>>
>> Thanks in advance!
>>
>> Peter
>>
>


Re: Querying Dataflow job status via Java SDK

2020-10-12 Thread Peter Littig
Thanks for the replies, Lukasz and Steve!

Steve: do you have a link to the google client api wrappers? (I'm not sure
I know what they are.)

Thank you!

On Mon, Oct 12, 2020 at 11:04 AM Steve Niemitz  wrote:

> We use the Dataflow API [1] directly, via the google api client wrappers
> (both python and java), pretty extensively.  It works well and doesn't
> require a dependency on beam.
>
> [1] https://cloud.google.com/dataflow/docs/reference/rest
>
> On Mon, Oct 12, 2020 at 1:56 PM Luke Cwik  wrote:
>
>> That is the best way to do this right now, and it hasn't changed in a
>> while (region was added to the project and job IDs in the past six years).
>>
>> On Mon, Oct 12, 2020 at 10:53 AM Peter Littig 
>> wrote:
>>
>>> Thanks for the reply, Kyle.
>>>
>>> The DataflowClient::getJob method uses a Dataflow instance that's
>>> provided at construction time (via
>>> DataflowPipelineOptions::getDataflowClient). If that Dataflow instance can
>>> be obtained from a minimal instance of the options (i.e., containing only
>>> the project ID and region) then it looks like everything should work.
>>>
>>> I suppose a secondary question here is whether or not this approach is
>>> the recommended way to solve my problem (but I don't know of any
>>> alternatives).
>>>
>>> On Mon, Oct 12, 2020 at 9:55 AM Kyle Weaver  wrote:
>>>
 > I think the answer is to use a DataflowClient in the second service,
 but creating one requires DataflowPipelineOptions. Are these options
 supposed to be exactly the same as those used by the first service? Or do
 only some of the fields have to be the same?

 Most options are not necessary for retrieving a job. In general,
 Dataflow jobs can always be uniquely identified by the project, region and
 job ID.
 https://github.com/apache/beam/blob/ecedd3e654352f1b51ab2caae0fd4665403bd0eb/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowClient.java#L100

 On Mon, Oct 12, 2020 at 9:31 AM Peter Littig 
 wrote:

> Hello, Beam users!
>
> Suppose I want to build two (Java) services, one that launches
> (long-running) dataflow jobs, and the other that monitors the status of
> dataflow jobs. Within a single service, I could simply track a
> PipelineResult for each dataflow run and periodically call getState. How
> can I monitor job status like this from a second, independent service?
>
> I think the answer is to use a DataflowClient in the second service,
> but creating one requires DataflowPipelineOptions. Are these options
> supposed to be exactly the same as those used by the first service? Or do
> only some of the fields have to be the same?
>
> Or maybe there's a better alternative than DataflowClient?
>
> Thanks in advance!
>
> Peter
>



Re: Querying Dataflow job status via Java SDK

2020-10-12 Thread Steve Niemitz
We use the Dataflow API [1] directly, via the google api client wrappers
(both python and java), pretty extensively.  It works well and doesn't
require a dependency on beam.

[1] https://cloud.google.com/dataflow/docs/reference/rest
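
Roughly what that looks like with the generated Java client
(google-api-services-dataflow, v1b3); this is an untested sketch, and the
class name, project, region and job id are all placeholders:

import com.google.api.client.googleapis.javanet.GoogleNetHttpTransport;
import com.google.api.client.json.jackson2.JacksonFactory;
import com.google.api.services.dataflow.Dataflow;
import com.google.api.services.dataflow.model.Job;
import com.google.auth.http.HttpCredentialsAdapter;
import com.google.auth.oauth2.GoogleCredentials;
import java.util.Collections;

public class DataflowJobPoller {
  public static void main(String[] args) throws Exception {
    // Application-default credentials with the cloud-platform scope.
    GoogleCredentials credentials = GoogleCredentials.getApplicationDefault()
        .createScoped(Collections.singletonList(
            "https://www.googleapis.com/auth/cloud-platform"));

    Dataflow dataflow = new Dataflow.Builder(
            GoogleNetHttpTransport.newTrustedTransport(),
            JacksonFactory.getDefaultInstance(),
            new HttpCredentialsAdapter(credentials))
        .setApplicationName("job-status-monitor")  // placeholder name
        .build();

    // projects().locations().jobs().get(project, region, jobId)
    Job job = dataflow.projects().locations().jobs()
        .get("my-project", "us-central1", "the-job-id")
        .execute();

    System.out.println(job.getCurrentState());  // e.g. JOB_STATE_RUNNING
  }
}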

On Mon, Oct 12, 2020 at 1:56 PM Luke Cwik  wrote:

> That is the best way to do this right now, and it hasn't changed in a
> while (region was added to the project and job IDs in the past six years).
>
> On Mon, Oct 12, 2020 at 10:53 AM Peter Littig 
> wrote:
>
>> Thanks for the reply, Kyle.
>>
>> The DataflowClient::getJob method uses a Dataflow instance that's
>> provided at construction time (via
>> DataflowPipelineOptions::getDataflowClient). If that Dataflow instance can
>> be obtained from a minimal instance of the options (i.e., containing only
>> the project ID and region) then it looks like everything should work.
>>
>> I suppose a secondary question here is whether or not this approach is
>> the recommended way to solve my problem (but I don't know of any
>> alternatives).
>>
>> On Mon, Oct 12, 2020 at 9:55 AM Kyle Weaver  wrote:
>>
>>> > I think the answer is to use a DataflowClient in the second service,
>>> but creating one requires DataflowPipelineOptions. Are these options
>>> supposed to be exactly the same as those used by the first service? Or do
>>> only some of the fields have to be the same?
>>>
>>> Most options are not necessary for retrieving a job. In general,
>>> Dataflow jobs can always be uniquely identified by the project, region and
>>> job ID.
>>> https://github.com/apache/beam/blob/ecedd3e654352f1b51ab2caae0fd4665403bd0eb/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowClient.java#L100
>>>
>>> On Mon, Oct 12, 2020 at 9:31 AM Peter Littig 
>>> wrote:
>>>
 Hello, Beam users!

 Suppose I want to build two (Java) services, one that launches
 (long-running) dataflow jobs, and the other that monitors the status of
 dataflow jobs. Within a single service, I could simply track a
 PipelineResult for each dataflow run and periodically call getState. How
 can I monitor job status like this from a second, independent service?

 I think the answer is to use a DataflowClient in the second service,
 but creating one requires DataflowPipelineOptions. Are these options
 supposed to be exactly the same as those used by the first service? Or do
 only some of the fields have to be the same?

 Or maybe there's a better alternative than DataflowClient?

 Thanks in advance!

 Peter

>>>


Re: Querying Dataflow job status via Java SDK

2020-10-12 Thread Luke Cwik
That is the best way to do this right now, and it hasn't changed in a while
(region was added to the project and job IDs in the past six years).

On Mon, Oct 12, 2020 at 10:53 AM Peter Littig 
wrote:

> Thanks for the reply, Kyle.
>
> The DataflowClient::getJob method uses a Dataflow instance that's provided
> at construction time (via DataflowPipelineOptions::getDataflowClient). If
> that Dataflow instance can be obtained from a minimal instance of the
> options (i.e., containing only the project ID and region) then it looks
> like everything should work.
>
> I suppose a secondary question here is whether or not this approach is the
> recommended way to solve my problem (but I don't know of any alternatives).
>
> On Mon, Oct 12, 2020 at 9:55 AM Kyle Weaver  wrote:
>
>> > I think the answer is to use a DataflowClient in the second service,
>> but creating one requires DataflowPipelineOptions. Are these options
>> supposed to be exactly the same as those used by the first service? Or do
>> only some of the fields have to be the same?
>>
>> Most options are not necessary for retrieving a job. In general, Dataflow
>> jobs can always be uniquely identified by the project, region and job ID.
>> https://github.com/apache/beam/blob/ecedd3e654352f1b51ab2caae0fd4665403bd0eb/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowClient.java#L100
>>
>> On Mon, Oct 12, 2020 at 9:31 AM Peter Littig 
>> wrote:
>>
>>> Hello, Beam users!
>>>
>>> Suppose I want to build two (Java) services, one that launches
>>> (long-running) dataflow jobs, and the other that monitors the status of
>>> dataflow jobs. Within a single service, I could simply track a
>>> PipelineResult for each dataflow run and periodically call getState. How
>>> can I monitor job status like this from a second, independent service?
>>>
>>> I think the answer is to use a DataflowClient in the second service, but
>>> creating one requires DataflowPipelineOptions. Are these options supposed
>>> to be exactly the same as those used by the first service? Or do only some
>>> of the fields have to be the same?
>>>
>>> Or maybe there's a better alternative than DataflowClient?
>>>
>>> Thanks in advance!
>>>
>>> Peter
>>>
>>


Re: Querying Dataflow job status via Java SDK

2020-10-12 Thread Peter Littig
Thanks for the reply, Kyle.

The DataflowClient::getJob method uses a Dataflow instance that's provided
at construction time (via DataflowPipelineOptions::getDataflowClient). If
that Dataflow instance can be obtained from a minimal instance of the
options (i.e., containing only the project ID and region) then it looks
like everything should work.

I suppose a secondary question here is whether or not this approach is the
recommended way to solve my problem (but I don't know of any alternatives).

On Mon, Oct 12, 2020 at 9:55 AM Kyle Weaver  wrote:

> > I think the answer is to use a DataflowClient in the second service, but
> creating one requires DataflowPipelineOptions. Are these options supposed
> to be exactly the same as those used by the first service? Or do only some
> of the fields have to be the same?
>
> Most options are not necessary for retrieving a job. In general, Dataflow
> jobs can always be uniquely identified by the project, region and job ID.
> https://github.com/apache/beam/blob/ecedd3e654352f1b51ab2caae0fd4665403bd0eb/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowClient.java#L100
>
> On Mon, Oct 12, 2020 at 9:31 AM Peter Littig 
> wrote:
>
>> Hello, Beam users!
>>
>> Suppose I want to build two (Java) services, one that launches
>> (long-running) dataflow jobs, and the other that monitors the status of
>> dataflow jobs. Within a single service, I could simply track a
>> PipelineResult for each dataflow run and periodically call getState. How
>> can I monitor job status like this from a second, independent service?
>>
>> I think the answer is to use a DataflowClient in the second service, but
>> creating one requires DataflowPipelineOptions. Are these options supposed
>> to be exactly the same as those used by the first service? Or do only some
>> of the fields have to be the same?
>>
>> Or maybe there's a better alternative than DataflowClient?
>>
>> Thanks in advance!
>>
>> Peter
>>
>


Re: Querying Dataflow job status via Java SDK

2020-10-12 Thread Kyle Weaver
> I think the answer is to use a DataflowClient in the second service, but
creating one requires DataflowPipelineOptions. Are these options supposed
to be exactly the same as those used by the first service? Or do only some
of the fields have to be the same?

Most options are not necessary for retrieving a job. In general, Dataflow
jobs can always be uniquely identified by the project, region and job ID.
https://github.com/apache/beam/blob/ecedd3e654352f1b51ab2caae0fd4665403bd0eb/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowClient.java#L100
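
Something like the following should be enough; an untested sketch where the
class name, project, region and job id are placeholders, and the underlying
Dataflow client comes from the option's default factory (application-default
credentials):

import com.google.api.services.dataflow.model.Job;
import java.io.IOException;
import org.apache.beam.runners.dataflow.DataflowClient;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class JobStatusCheck {
  public static void main(String[] args) throws IOException {
    // Only the fields needed to identify the job are set.
    DataflowPipelineOptions options =
        PipelineOptionsFactory.as(DataflowPipelineOptions.class);
    options.setProject("my-project");       // placeholder
    options.setRegion("us-central1");       // placeholder

    DataflowClient client = DataflowClient.create(options);
    Job job = client.getJob("the-job-id");  // id captured when the job was launched
    System.out.println(job.getCurrentState());
  }
}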

On Mon, Oct 12, 2020 at 9:31 AM Peter Littig 
wrote:

> Hello, Beam users!
>
> Suppose I want to build two (Java) services, one that launches
> (long-running) dataflow jobs, and the other that monitors the status of
> dataflow jobs. Within a single service, I could simply track a
> PipelineResult for each dataflow run and periodically call getState. How
> can I monitor job status like this from a second, independent service?
>
> I think the answer is to use a DataflowClient in the second service, but
> creating one requires DataflowPipelineOptions. Are these options supposed
> to be exactly the same as those used by the first service? Or do only some
> of the fields have to be the same?
>
> Or maybe there's a better alternative than DataflowClient?
>
> Thanks in advance!
>
> Peter
>


Querying Dataflow job status via Java SDK

2020-10-12 Thread Peter Littig
Hello, Beam users!

Suppose I want to build two (Java) services, one that launches
(long-running) dataflow jobs, and the other that monitors the status of
dataflow jobs. Within a single service, I could simply track a
PipelineResult for each dataflow run and periodically call getState. How
can I monitor job status like this from a second, independent service?

I think the answer is to use a DataflowClient in the second service, but
creating one requires DataflowPipelineOptions. Are these options supposed
to be exactly the same as those used by the first service? Or do only some
of the fields have to be the same?

Or maybe there's a better alternative than DataflowClient?

Thanks in advance!

Peter


Re: Count based triggers and latency

2020-10-12 Thread Luke Cwik
The default trigger will only fire when the global window closes, which
does happen with sources whose watermark advances beyond
GlobalWindow.MAX_TIMESTAMP, or during a pipeline drain with partial results
in streaming. Bounded sources commonly advance their watermark to the end
of time when they complete, and some unbounded sources can stop producing
output if they detect the end of their input.

Parallelization for stateful DoFns is per key and window. Parallelization
for GBK is per key and window pane. Note that elementCountAtLeast means the
runner can buffer as many elements as it wants and can decide to offer a
low-latency pipeline by triggering often, or better throughput through
buffering.



On Mon, Oct 12, 2020 at 8:22 AM KV 59  wrote:

> Hi All,
>
> I'm building a pipeline to process events as they come and do not really
> care about event time or the watermark. I'm more interested in not
> discarding events and in reducing latency. The downstream pipeline has a
> stateful DoFn. I understand that the default windowing strategy is Global
> Windows. I did not completely understand the default trigger as per
>
> https://beam.apache.org/releases/javadoc/2.23.0/org/apache/beam/sdk/transforms/windowing/DefaultTrigger.html
> which says Repeatedly.forever(AfterWatermark.pastEndOfWindow()). In the
> case of a global window, how does this work (there is no end of window)?
>
> My source is Google PubSub and the pipeline is running on the Dataflow
> runner. I have defined my window transform as below:
>
> input.apply(TRANSFORM_NAME, Window.into(new GlobalWindows())
>> .triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(1)))
>> .withAllowedLateness(Duration.standardDays(1)).discardingFiredPanes())
>
>
> A couple of questions
>
>    1. Is triggering after each element inefficient in terms of
>    persistence (serialization) after each element, and also in terms of
>    parallelism? Triggering after each element looks like serial execution.
>2. How does Dataflow parallelize in such cases of triggers?
>
>
> Thanks and appreciate the responses.
>
> Kishore
>


Count based triggers and latency

2020-10-12 Thread KV 59
Hi All,

I'm building a pipeline to process events as they come and do not really
care about event time or the watermark. I'm more interested in not
discarding events and in reducing latency. The downstream pipeline has a
stateful DoFn. I understand that the default windowing strategy is Global
Windows. I did not completely understand the default trigger as per
https://beam.apache.org/releases/javadoc/2.23.0/org/apache/beam/sdk/transforms/windowing/DefaultTrigger.html
which says Repeatedly.forever(AfterWatermark.pastEndOfWindow()). In the case
of a global window, how does this work (there is no end of window)?

My source is Google PubSub and the pipeline is running on the Dataflow
runner. I have defined my window transform as below:

input.apply(TRANSFORM_NAME, Window.into(new GlobalWindows())
> .triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(1)))
> .withAllowedLateness(Duration.standardDays(1)).discardingFiredPanes())


A couple of questions

   1. Is triggering after each element inefficient in terms of
   persistence (serialization) after each element, and also in terms of
   parallelism? Triggering after each element looks like serial execution.
   2. How does Dataflow parallelize in such cases of triggers?


Thanks and appreciate the responses.

Kishore