Re: [GENERAL QUESTION] How independent are worker nodes

Jan Lukavský Tue, 07 Sep 2021 11:29:24 -0700

Hi Ana,

what you describe sounds like logical grouping to me. For example, when Beam runs a stateful operation (a stateful DoFn), every record has to be associated with a _key_. All records with the same key are then processed by the same worker. If you have some resources that need to be downloaded (cached) from outside the Pipeline, one option would be to use a stateful DoFn, which would look into its local cache (held in state) and download the required resource if it does not have it (or if it is stale). More logic would probably be needed around freeing the state, but I'll leave that out for now.
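
To make the idea a bit more concrete, here is a minimal, library-free Python sketch of the behaviour described above. It deliberately does not use the Beam API. The names `worker_for`, `CachingWorker` and `run`, the CRC32-based routing, the fake download function and the TTL-based staleness check are all assumptions made up for illustration. In a real pipeline the runner decides where keys go, and the cache would live in the state of a stateful DoFn.

```python
import zlib
from typing import Callable, Dict, List, Tuple


def worker_for(key: str, num_workers: int) -> int:
    """Deterministically map a key to a worker index (stand-in for the runner)."""
    return zlib.crc32(key.encode("utf-8")) % num_workers


class CachingWorker:
    """One worker with per-key state, loosely analogous to a stateful DoFn."""

    def __init__(self, download: Callable[[str], str], ttl: float) -> None:
        self.download = download  # hypothetical fetch from outside the pipeline
        self.ttl = ttl            # hypothetical staleness threshold
        self.state: Dict[str, Tuple[str, float]] = {}  # key -> (resource, fetched_at)
        self.downloads = 0

    def process(self, key: str, record: str, now: float) -> Tuple[str, str, str]:
        cached = self.state.get(key)
        if cached is None or now - cached[1] > self.ttl:
            # Missing or stale: download the resource and remember it for this key.
            cached = (self.download(key), now)
            self.state[key] = cached
            self.downloads += 1
        return key, record, cached[0]


def run(records: List[Tuple[str, str]], num_workers: int, ttl: float) -> List[CachingWorker]:
    """Route each (key, record) pair to the worker that owns the key."""
    workers = [
        CachingWorker(lambda k: f"contents-of-{k}", ttl) for _ in range(num_workers)
    ]
    for now, (key, record) in enumerate(records):
        workers[worker_for(key, num_workers)].process(key, record, float(now))
    return workers
```

With records keyed by repository name, every record for a given repository lands on the same worker, so each repository is downloaded once until its cached copy goes stale. The workers never accept or reject anything themselves; the grouping by key does that for them.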


Would that work for your case?

 Jan

On 9/7/21 7:12 PM, Ana Markovic wrote:

Hi Jan,
Thanks for the fast reply! I came across an example that I wanted to recreate in Beam, and I'm sharing the link below. Generally speaking, nodes keep their favourite words and accept only jobs that involve those favourites. This is a simple example, but the same idea could be beneficial when processing large pieces of data (for example, software repositories), where nodes could work on the repositories they have already processed (and have some files already downloaded) and avoid downloading repository contents unnecessarily if another node already has them. This could be enabled by allowing nodes to check their internal state and decide whether they want to accept or reject a certain repository as a job. I know that the "more complicated" example might be a bit far-fetched, but I wanted to give you more context on what I'd like to know about Beam.
Thanks for all the insights!

Best,
Ana
[1] https://github.com/crossflowlabs/crossflow/tree/master/org.crossflow.tests/src/org/crossflow/tests/opinionated
On Tue, 7 Sept 2021 at 13:57, Jan Lukavský <[email protected]> wrote:
    Hi Ana,

    in general, worker nodes do not share any state, and cannot
    themselves decide which work to accept and which to reject. How
    the work is distributed to downstream processing is defined by a
    runner, not the Beam model. On the other hand, what you ask for
    might possibly be accomplished using a grouping operation: either
    a GroupByKey or a stateful DoFn might help you with that. Can you
    further describe your intent?

    Best,

     Jan

    On 9/7/21 12:32 PM, Ana Markovic wrote:
    To whom it may concern,

    I've been looking into polyglot data processing frameworks
    recently, and I read Beam's documentation as well as developed a
    few examples to get some hands-on experience. I've been
    wondering, and I haven't found this in the documentation: is
    there a way to set up worker nodes so they are "opinionated" or
    "smart" in the sense that they can decide for themselves which jobs
    they will perform? For example, in a word count example, an
    opinionated worker node could decide to monitor occurrences of a
    specific word only if it's among the node's favourite
    words.

    I hope I explained it well, but please let me know if more
    details are needed to answer this question.

    Thankful in advance,
    Ana
--
Best,
Ana
