Another option here would be to make the perl script operate on batches. Your DoFn could buffer incoming records instead of outputting them right away, and periodically flush the buffer by running the whole batch through the perl script and emitting the results.
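For example, here is a rough, untested sketch of what that could look like in the Python SDK. It assumes the elements are plain strings without embedded newlines, that myscript.pl reads newline-delimited records on stdin and writes one result line per record, and that the pipeline uses the global window; the class name and batch size are just illustrative.

import subprocess

import apache_beam as beam
from apache_beam.transforms.window import GlobalWindow
from apache_beam.utils.windowed_value import WindowedValue


class BatchedPerlDoFn(beam.DoFn):
    """Buffers records and pipes each full batch through the perl script once."""

    def __init__(self, batch_size=500):
        self._batch_size = batch_size

    def start_bundle(self):
        self._buffer = []

    def process(self, element):
        self._buffer.append(element)
        if len(self._buffer) >= self._batch_size:
            for result in self._flush():
                yield result

    def finish_bundle(self):
        # finish_bundle has to emit WindowedValues rather than plain elements;
        # this assumes the global window.
        for result in self._flush():
            yield WindowedValue(result, GlobalWindow().max_timestamp(),
                                [GlobalWindow()])

    def _flush(self):
        if not self._buffer:
            return []
        # One perl startup per batch instead of one per record.
        completed = subprocess.run(
            ['perl', 'myscript.pl'],
            input='\n'.join(self._buffer).encode('utf-8'),
            stdout=subprocess.PIPE,
            check=True)
        self._buffer = []
        return completed.stdout.decode('utf-8').splitlines()

You'd then apply it with something like records | beam.ParDo(BatchedPerlDoFn()) and write the output to BigQuery as before.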
On Wed, Oct 24, 2018 at 3:03 PM Robert Bradshaw <[email protected]> wrote:

> While one does want to watch out for expensive per-record operations, this
> may still be preferable to (and cheaper than) setting up a server and
> making RPC requests. It depends on the nature of the operation. If
> executing the perl script is (say) 100ms of "startup" for 1ms of actually
> processing $DATA, then you'll be wasting a lot of cycles and a server may
> be the way to go, but if it's 1ms of startup for 100ms of processing $DATA
> then this startup cost won't matter at all.
>
> If the startup cost is prohibitive, you could also start up a local
> "server" on the worker in startBundle (or even setUp), shut it down in
> finishBundle, and communicate with it in your processElement.
>
> The other bit is actually shipping your perl script (and, more tricky, its
> dependencies). Currently that's very runner-dependent, and typically you
> end up packing it as data in your jars and then trying to unpack/install
> it on the workers at runtime. One of the goals of
> https://beam.apache.org/contribute/portability/ is to make this easier;
> specifically, you can set up your worker environment as a docker container
> with everything you need, and this will get used as the environment in
> which your DoFns are executed.
>
> On Wed, Oct 24, 2018 at 6:48 AM Sobhan Badiozamany <[email protected]> wrote:
>
>> Hi Nima,
>>
>> I think the answer depends on the use case, but what you suggest is on
>> the list of practices that hurt the scalability of pipelines, as it would
>> be an example of “Expensive Per-Record Operations”; look it up here:
>>
>> https://cloud.google.com/blog/products/gcp/writing-dataflow-pipelines-with-scalability-in-mind
>>
>> Cheers,
>> Sobi
>>
>> Sent from my iPhone
>>
>> On Oct 23, 2018, at 23:35, Nima Mousavi <[email protected]> wrote:
>>
>> Hi,
>>
>> We have a Dataflow pipeline written in the Apache Beam Python SDK, and we
>> are wondering if we can run third-party code (written in perl) in the
>> pipeline. We basically want to run
>>
>> perl myscript.pl $DATA
>>
>> for every DATA in a PCollection passed to a DoFn, and write the result
>> back into BigQuery. We could set up a server for myscript.pl and send
>> HTTP/RPC requests to it from each worker instead, but we are wondering if
>> it is possible to run the script directly inside the Beam worker, or even
>> through a docker container packaging our perl script. If yes, how? What
>> do you think of this approach? Any caveats we should be aware of?
>>
>> Thanks!
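To make the long-lived-process variant Robert describes concrete on the Python side, here is another rough, untested sketch. It assumes myscript.pl can be run in a (hypothetical) --line-mode loop that reads one record per stdin line, writes exactly one result line per record, and autoflushes its output ($| = 1 in perl); the real script may need changes to support that.

import subprocess

import apache_beam as beam


class PerlCoprocessDoFn(beam.DoFn):
    """Keeps one long-lived perl process per bundle and streams records through it."""

    def start_bundle(self):
        # Pay the perl startup cost once per bundle instead of once per element.
        # The --line-mode flag is hypothetical; substitute however your script
        # exposes a read-loop.
        self._proc = subprocess.Popen(
            ['perl', 'myscript.pl', '--line-mode'],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE)

    def process(self, element):
        self._proc.stdin.write((element + '\n').encode('utf-8'))
        self._proc.stdin.flush()
        # Assumes exactly one output line per input record.
        yield self._proc.stdout.readline().decode('utf-8').rstrip('\n')

    def finish_bundle(self):
        self._proc.stdin.close()
        self._proc.wait()

If your SDK version has DoFn.setup/teardown, the same idea can keep the process alive across bundles rather than restarting it per bundle.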
