Hi Ana,
what you describe sounds like logical grouping to me. For example - when
Beam runs a stateful operation (DoFn), every record has to be associated
with a _key_. All records with the same key are then processed by the
same worker. If you have some resources that need to be downloaded
(cached) from the outside of the Pipeline, one option would be to use a
stateful DoFn, which would look into its local cache (held in a state)
and download the required resource if it does not have it (or if it is
stale). There would probably be needed more logic around freeing the
state, but I'll leave that out for now.
Would that work for your case?
Jan
On 9/7/21 7:12 PM, Ana Markovic wrote:
Hi Jan,
Thanks for the fast reply! I came across an example that I wanted to
recreate in Beam, and I'm sharing the link below. Generally speaking,
nodes keep their favourite words and accept only jobs that involve
those favourites. This is a simple example but could be beneficial in
processing large pieces of data (for example, software repositories),
where nodes could work on the repositories they already processed (and
have some files already downloaded) and avoid downloading unnecessary
repository contents if another node already has them. This could be
enabled by allowing nodes to check their internal state and decide if
they want to accept/reject a certain repository as a job. I know that
the "more complicated" example might be a far fetch, but I wanted to
give you more context on what I'd want to know about Beam.
Thanks for all the insights!
Best,
Ana
[1]
https://github.com/crossflowlabs/crossflow/tree/master/org.crossflow.tests/src/org/crossflow/tests/opinionated
<https://github.com/crossflowlabs/crossflow/tree/master/org.crossflow.tests/src/org/crossflow/tests/opinionated>
On Tue, 7 Sept 2021 at 13:57, Jan Lukavský <[email protected]
<mailto:[email protected]>> wrote:
Hi Ana,
in general, worker nodes do not share any state, and cannot
themselves decide which work to accept and which to reject. How
the work is distributed to downstream processing is defined by a
runner, not the Beam model. On the other hand, what you ask for
might be possibly accomplished using a grouping operation - either
a GroupByKey or a stateful DoFn might help you with that. Can you
further describe your intent?
Best,
Jan
On 9/7/21 12:32 PM, Ana Markovic wrote:
To whom this may concern,
I've been looking into polyglot data processing frameworks
recently, and I read Beam's documentation as well as developed a
few examples to get some hands-on experience. I've been
wondering, and I haven't found this in the documentation, is
there a way to set up worker nodes so they are "opinionated" or
"smart" in a sense that they can decide for themselves which jobs
they will perform? For example, in a word count example, an
opinionated worker node could only decide to monitor
occurrences of a specific word if it's among the node's favourite
words.
I hope I explained it well, but please let me know if more
details are needed to answer this question.
Thankful in advance,
Ana
--
Best,
Ana