Hi Lukasz,

thanks for the proposed solution. This was also one of the alternative implementations I had considered. When you talk about launching a job from another job, I understand this as making a system call from a Python job and retrieving the result by some means (for example, synchronously reading the output of the child jobs). Is that correct?

I'll first test this with the DirectRunner calling other DirectRunner(s), and afterwards on GCP with Dataflow.

Regarding nested pipelines, I can help build a demonstrator if I can get some support from the community.

Thanks again and very best regards,
Pascal
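A minimal sketch of the system-call approach described above, assuming each child job is a separate Python process that prints a one-line JSON summary on stdout. The names `CHILD_SOURCE` and `run_child_job`, and the JSON result format, are hypothetical and only for illustration; a real child would be a script that builds and runs its own Beam pipeline (for example with `--runner=DirectRunner`).

```python
import json
import subprocess
import sys

# Hypothetical stand-in for a child pipeline script. A real child would
# construct and run a Beam pipeline, then print a summary of its run.
CHILD_SOURCE = """
import json, sys
view_id = sys.argv[1]
print(json.dumps({"view_id": view_id, "status": "done"}))
"""


def run_child_job(view_id, timeout_s=60):
    """Launch one child job, block until it exits, and return its result."""
    completed = subprocess.run(
        [sys.executable, "-c", CHILD_SOURCE, view_id],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode stdout/stderr as text
        timeout=timeout_s,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    # Assumption: the last non-empty stdout line carries the JSON summary.
    last_line = completed.stdout.strip().splitlines()[-1]
    return json.loads(last_line)


if __name__ == "__main__":
    for view in ["view-a", "view-b"]:
        print(run_child_job(view))
```

Because `subprocess.run` blocks, the parent only sees a result once the child has finished, which matches the synchronous reading mentioned above. All runner and pipeline configuration would still have to be passed to the child explicitly, for instance as extra command-line arguments.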
On Thu, Aug 16, 2018 at 8:43 PM, Lukasz Cwik <[email protected]> wrote:

> You can launch another Dataflow job from within an existing Dataflow job.
> For all intents and purposes, Dataflow won't know that the jobs are related
> in any way, so they will only be "nested" because your outer pipeline knows
> about the inner pipeline.
>
> You should be able to do this for all runners (granted you need to
> propagate all runner/pipeline configuration through) and you should be able
> to take a job from one runner and launch a job on a different runner
> (you'll have to deal with the complexities of having two runners and their
> dependencies somehow though).
>
> There was some work investigating supporting nested graphs within Apache
> Beam and to support dynamic graph expansion during execution as a general
> concept. This was to support use cases such as recursion and loops but this
> didn't progress much more than the idea generation phase.
>
> On Thu, Aug 16, 2018 at 9:47 AM Pascal Gula <[email protected]> wrote:
>
>> Hi Robin,
>> this is unfortunate news, but I already anticipated such an answer with an
>> alternative implementation.
>> It would however be interesting to support such a feature since I am
>> probably not the first person asking for this.
>> Best regards,
>> Pascal
>>
>> On Thu, Aug 16, 2018 at 6:20 PM, Robin Qiu <[email protected]> wrote:
>>
>>> Hi Pascal,
>>>
>>> As far as I know, you can't create a sub-pipeline within a DoFn, i.e.
>>> nested pipelines are not supported.
>>>
>>> Best,
>>> Robin
>>>
>>> On Thu, Aug 16, 2018 at 7:03 AM Pascal Gula <[email protected]> wrote:
>>>
>>>> As a bonus, here is a simplified diagram view of the use-case:
>>>>
>>>> Cheers,
>>>> Pascal
>>>>
>>>> On Thu, Aug 16, 2018 at 3:12 PM, Pascal Gula <[email protected]> wrote:
>>>>
>>>>> Hello,
>>>>> I am currently evaluating Apache Beam (later executing on Google
>>>>> Dataflow), and for the first use-case I am working on, I have a
>>>>> design question to see if any of you has already had a similar one.
>>>>> Namely, we have a DB describing dashboard views, and for each view,
>>>>> we would like to perform some aggregation transform.
>>>>> My first approach would be to create a higher-level pipeline that will
>>>>> fetch all view configurations from our MongoDB (BTW, we released a MongoDB
>>>>> IO connector here: https://pypi.org/project/beam-extended/). With
>>>>> this views PColl, the idea is to have a ParDo with a DoFn that will create
>>>>> a sub-pipeline to perform the aggregation on data from our plant database
>>>>> with a query derived from the view configuration. Afterwards, the idea is
>>>>> to save, for the higher-level pipeline, some performance/data metrics
>>>>> related to the execution of the array of sub-pipelines.
>>>>> The main question is: are nested pipelines supported by the runner?
>>>>> I hope that my description was clear enough. I will work on a diagram
>>>>> view meanwhile.
>>>>> Very best regards,
>>>>> Pascal
>>>>>
>>>>> --
>>>>> Pascal Gula
>>>>> Senior Data Engineer / Scientist
>>>>> +49 (0)176 34232684
>>>>> www.plantix.net <http://plantix.net/>
>>>>> PEAT GmbH
>>>>> Kastanienallee 4
>>>>> 10435 Berlin // Germany
>>>>> Download the App! <https://play.google.com/store/apps/details?id=com.peat.GartenBank>
