Hi Nima,

I think the answer depends on the use case, but what you suggest is on the list
of practices that hurt pipeline scalability, since it is an example of
“Expensive Per-Record Operations”; see the write-up here:
https://cloud.google.com/blog/products/gcp/writing-dataflow-pipelines-with-scalability-in-mind
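
For reference, here is a rough sketch of what that per-element approach looks
like in the Python SDK (the class name is illustrative; I'm assuming
myscript.pl takes the record as its only argument and prints its result to
stdout):

    import subprocess
    import apache_beam as beam

    class RunPerlScript(beam.DoFn):
        def process(self, element):
            # Forks a new Perl interpreter for every single record -- this is
            # the "expensive per-record operation" the article warns about.
            out = subprocess.check_output(['perl', 'myscript.pl', element])
            yield out.decode('utf-8').strip()

Spawning a process per element is exactly the per-record overhead described
there. If you do go this route, amortizing the cost helps a lot, e.g. passing
a batch of records to each perl invocation, or keeping one long-lived worker
process per bundle (started in start_bundle) and streaming records to it.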

Cheers,
Sobi

Sent from my iPhone

> On Oct 23, 2018, at 23:35, Nima Mousavi <[email protected]> wrote:
> 
> Hi, 
> 
> We have a Dataflow pipeline written with the Apache Beam Python SDK, and we 
> are wondering if we can run third-party code (written in Perl) in the 
> pipeline. We basically want to run 
> 
> perl myscript.pl $DATA 
> 
> for every DATA in a PCollection passed to a DoFn
> 
> and write the result back into BigQuery. We could set up a server for 
> myscript.pl and send HTTP/RPC requests to it from each worker instead. But we 
> are wondering if it is possible to run the script directly inside the Beam 
> worker, or even through a Docker container packaging our Perl script. If yes, 
> how? What do you think of this approach? Any caveats we should 
> be aware of?
> 
> Thanks!
