Hi Nima,

I think the answer depends on the use case, but what you suggest is one of the practices that hurt pipeline scalability: spawning an external process for every element is an example of "Expensive Per-Record Operations". See the write-up here:
https://cloud.google.com/blog/products/gcp/writing-dataflow-pipelines-with-scalability-in-mind
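
For illustration, here is a rough sketch of what the per-element approach would look like in a Python DoFn. This is untested and assumes myscript.pl takes the record as an argument and prints its result to stdout; the table name and schema are placeholders:

  import subprocess

  import apache_beam as beam


  class RunPerlScript(beam.DoFn):
      """Runs the external perl script once per element.

      Note: this spawns a new process for every record, which is exactly
      the "expensive per-record operation" the article above warns about.
      """

      def process(self, element):
          # Assumes myscript.pl reads the record from its argument and
          # writes its result to stdout.
          result = subprocess.run(
              ['perl', 'myscript.pl', element],
              capture_output=True, text=True, check=True)
          yield {'result': result.stdout.strip()}


  with beam.Pipeline() as p:
      (p
       | 'Read' >> beam.Create(['record-1', 'record-2'])
       | 'RunPerl' >> beam.ParDo(RunPerlScript())
       | 'Write' >> beam.io.WriteToBigQuery(
           'my_project:my_dataset.my_table',
           schema='result:STRING'))

You would also need to get myscript.pl (and a perl interpreter) onto the workers, e.g. via a setup.py or a custom container, which is part of why batching the calls or putting the script behind a service tends to scale better.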
Cheers,
Sobi

Sent from my iPhone

> On Oct 23, 2018, at 23:35, Nima Mousavi <[email protected]> wrote:
>
> Hi,
>
> We have a Dataflow pipeline written in Apache Beam Python, and are wondering
> if we can run third-party code (written in Perl) in the pipeline. We
> basically want to run
>
> perl myscript.pl $DATA
>
> for every DATA in a PCollection passed to a DoFn,
> and write the result back into BigQuery. We could have set up a server for
> myscript.pl and sent HTTP/RPC requests to the server from each worker
> instead. But we are wondering if it is possible to run the script directly
> inside the Beam worker, or even through a Docker container packaging our Perl
> script? If yes, how? What do you think of this approach? Any caveats we
> should be aware of?
>
> Thanks!
