Hi,

Thanks for the clarification.

What is the issue with applying a windowing/triggering strategy in your case?

The problem was actually not the trigger but the whole approach I took.


I guess fundamentally the whole issue boils down to the fact that with 
bounded pipelines we have quite a few approaches which can be taken to enrich 
data, while with unbounded pipelines we have very few. Obviously, in a bounded 
pipeline you don't really need to worry about refreshing your enrichment data 
either, since it is all built when the pipeline launches.

So, I had a perfectly working batch pipeline and everything fell apart 
when it became unbounded. In an ideal world we could mix an unbounded pipeline 
with a bounded one. The requirement was fairly simple: process a bunch of CSV 
lines when a new file arrives. The only unbounded element in this pipeline is 
the arrival of the file and its path and name. From the moment the file 
becomes available, everything else is batch processing.
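
For illustration, here is a minimal sketch of that file-arrival trigger using 
FileIO's continuous matching (the filepattern and poll interval are hypothetical):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.Watch;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

Pipeline p = Pipeline.create();

// Poll for new files; this is the only unbounded element in the pipeline.
PCollection<String> lines = p
    .apply(FileIO.match()
        .filepattern("s3://my-bucket/incoming/*.csv")  // hypothetical path
        .continuously(Duration.standardSeconds(30), Watch.Growth.never()))
    .apply(FileIO.readMatches())
    .apply(TextIO.readFiles());
// Each individual file is finite, but the resulting PCollection is unbounded.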

This left me looking at options for streaming pipelines.

The side input approach seemed to fit best, but since I needed a refreshing 
side input I quickly stumbled over the fact that you can't use the Beam IO 
connectors for this and need to write your own code to fetch the data! To me 
it made no sense: Beam has all these IO connectors, but I can't use them for 
my side input!

It could be that my statement no longer holds for IO connectors which 
implement ReadAll (I did not find any examples and have not tested it), but in 
any case I would have needed DynamoDBIO, which does not implement ReadAll.
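
For reference, the pattern I was following is the "slowly updating global 
window side inputs" one from the Beam docs, roughly as below, given the 
Pipeline p from earlier; the refresh interval is arbitrary and 
fetchEnrichmentData() is a hypothetical hand-rolled fetch, since no connector 
can be used here:

import java.util.Map;
import org.apache.beam.sdk.io.GenerateSequence;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.Latest;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollectionView;
import org.joda.time.Duration;

PCollectionView<Map<String, String>> enrichmentView = p
    // Emit a tick every five minutes to drive the refresh.
    .apply(GenerateSequence.from(0).withRate(1, Duration.standardMinutes(5)))
    .apply(ParDo.of(new DoFn<Long, Map<String, String>>() {
      @ProcessElement
      public void process(@Element Long tick,
                          OutputReceiver<Map<String, String>> out) {
        // No Beam connector usable here: fetch the enrichment data by hand.
        out.output(fetchEnrichmentData());  // hypothetical helper
      }
    }))
    // Re-trigger the global window on every tick so the view refreshes.
    .apply(Window.<Map<String, String>>into(new GlobalWindows())
        .triggering(Repeatedly.forever(
            AfterProcessingTime.pastFirstElementInPane()))
        .discardingFiredPanes())
    .apply(Latest.globally())
    .apply(View.asSingleton());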

So, after having spent the weekend playing around with various thoughts and 
ideas, I did end up coding a lookup against DynamoDB in a DoFn, using the AWS 
SDK :-)
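
In case it helps anyone else, the DoFn boils down to something like this 
(AWS SDK v2; the table name, key, and CSV layout are made up):

import java.util.Map;
import org.apache.beam.sdk.transforms.DoFn;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.GetItemRequest;

class EnrichFn extends DoFn<String, String> {
  private transient DynamoDbClient dynamo;

  @Setup
  public void setup() {
    // One client per DoFn instance, reused across bundles.
    dynamo = DynamoDbClient.create();
  }

  @ProcessElement
  public void process(@Element String csvLine, OutputReceiver<String> out) {
    // Use the first CSV column as the lookup key (illustrative).
    String key = csvLine.split(",")[0];
    Map<String, AttributeValue> item = dynamo.getItem(
        GetItemRequest.builder()
            .tableName("enrichment-table")  // hypothetical table
            .key(Map.of("id", AttributeValue.builder().s(key).build()))
            .build())
        .item();
    AttributeValue label = item.getOrDefault(
        "label", AttributeValue.builder().s("").build());
    out.output(csvLine + "," + label.s());
  }

  @Teardown
  public void teardown() {
    if (dynamo != null) {
      dynamo.close();
    }
  }
}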

Kind Thanks,
Serge




On 25 May 2021 at 18:18:31, Alexey Romanenko ([email protected]) wrote:

You don't need to use a windowing strategy or aggregation triggers to perform 
GbK-like transforms in a pipeline with a bounded source, but since you started 
to use an unbounded source, your PCollections became unbounded and you do need 
them. Otherwise, it is unknown at which point in time your GbK transforms will 
have received all the data to process (in theory, that will never happen, by 
definition of "unbounded").
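
For example, a minimal sketch of what that looks like, assuming input is your 
unbounded keyed PCollection (the window size is arbitrary):

import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// Assign elements to fixed one-minute windows so that every GroupByKey
// pane has a well-defined end.
PCollection<KV<String, String>> windowed = input
    .apply(Window.<KV<String, String>>into(
        FixedWindows.of(Duration.standardMinutes(1))));

PCollection<KV<String, Iterable<String>>> grouped =
    windowed.apply(GroupByKey.create());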

What is the issue with applying a windowing/triggering strategy in your case?

—
Alexey

On 24 May 2021, at 10:25, Sozonoff Serge <[email protected]> wrote:

Hi,

Referring to the explanation found at the following link under (Stream 
processing triggered from an external source)

https://beam.apache.org/documentation/patterns/file-processing/


While implementing this solution, I am trying to figure out how to deal with 
the fact that my pipeline, which was bounded, has now become unbounded. It 
exposes me to windowing/triggering concerns which I did not have to deal with 
before and which are in essence unnecessary, since I am still fundamentally 
dealing with bounded data. The only reason I have an unbounded source involved 
is as a trigger and provider of the file to be processed.

Since my pipeline uses GroupByKey transforms, I get the following error:

Exception in thread "main" java.lang.IllegalStateException: GroupByKey cannot 
be applied to non-bounded PCollection in the GlobalWindow without a trigger. 
Use a Window.into or Window.triggering transform prior to GroupByKey.

Do I really need to add windowing/triggering semantics to the PCollections 
which are built from bounded data?

Thanks for any pointers.

Serge
