Hi Ana,

I ran into this issue, where multiple workers need to access files from
downloaded repositories. In my experience, you could try an NFS disk so that
multiple workers share the same disk. Performance is slower than local disk,
so you could copy the repository to local disk before running git operations.

For a Flink on K8s cluster, setting up an NFS disk is quite easy. You can also
use AWS EFS or another AWS volume type that supports ReadWriteMany.
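
For illustration, the copy-to-local step could look roughly like the sketch
below. It is plain Python and not part of Beam or Flink; the function name
local_copy and the example paths are made up, and it assumes the shared disk
is already mounted on every worker.

```python
import shutil
import tempfile
from pathlib import Path


def local_copy(shared_repo: Path, local_root: Path) -> Path:
    """Return a local-disk copy of a repository stored on the shared mount.

    The copy is made once per worker; later calls reuse the existing copy.
    """
    target = local_root / shared_repo.name
    if target.exists():
        return target  # already cached on this worker's local disk

    local_root.mkdir(parents=True, exist_ok=True)
    # Copy into a staging directory first and rename at the end, so a
    # half-finished copy is never mistaken for a complete one.
    staging = Path(tempfile.mkdtemp(prefix=".staging-", dir=local_root))
    try:
        shutil.copytree(shared_repo, staging / "repo", symlinks=True)
        try:
            (staging / "repo").rename(target)
        except OSError:
            # Another process on this worker finished first; use its copy.
            if not target.exists():
                raise
    finally:
        shutil.rmtree(staging, ignore_errors=True)
    return target
```

A worker would call something like
local_copy(Path("/mnt/nfs/repos/myrepo"), Path("/tmp/repos")) and run git
against the returned path. Note that this sketch never refreshes the local
copy if the shared one changes later.
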
Best,
Thanh
On Sep 8 2021, at 12:12 am, Ana Markovic <[email protected]> wrote:
> Hi Jan,
>
> Thanks for the fast reply! I came across an example that I wanted to recreate 
> in Beam, and I'm sharing the link below. Generally speaking, nodes keep their 
> favourite words and accept only jobs that involve those favourites. This is a 
> simple example but could be beneficial in processing large pieces of data 
> (for example, software repositories), where nodes could work on the 
> repositories they already processed (and have some files already downloaded) 
> and avoid downloading unnecessary repository contents if another node already 
> has them. This could be enabled by allowing nodes to check their internal 
> state and decide if they want to accept/reject a certain repository as a job. 
> I know that the "more complicated" example might be far-fetched, but I wanted 
> to give you more context on what I'd want to know about Beam.
>
> Thanks for all the insights!
>
> Best,
> Ana
>
> [1] https://github.com/crossflowlabs/crossflow/tree/master/org.crossflow.tests/src/org/crossflow/tests/opinionated
>
> On Tue, 7 Sept 2021 at 13:57, Jan Lukavský <[email protected]> wrote:
> > Hi Ana,
> >
> > In general, worker nodes do not share any state and cannot decide for 
> > themselves which work to accept and which to reject. How the work is 
> > distributed to downstream processing is defined by a runner, not the Beam 
> > model. That said, what you ask for might be accomplished with a grouping 
> > operation: either a GroupByKey or a stateful DoFn might help you with 
> > that. Can you further describe your intent?
> > Best,
> > Jan
> > On 9/7/21 12:32 PM, Ana Markovic wrote:
> > > To whom it may concern,
> > >
> > > I've been looking into polyglot data processing frameworks recently, and 
> > > I read Beam's documentation as well as developed a few examples to get 
> > > some hands-on experience. I've been wondering, and I haven't found this 
> > > in the documentation, is there a way to set up worker nodes so they are 
> > > "opinionated" or "smart" in the sense that they can decide for themselves 
> > > which jobs they will perform? For example, in a word count example, an 
> > > opinionated worker node could decide to monitor occurrences of a specific 
> > > word only if it's among the node's favourite words.
> > >
> > > I hope I explained it well, but please let me know if more details are 
> > > needed to answer this question.
> > >
> > > Thanks in advance,
> > > Ana