Thanks for the tip. I don't think packaging the application will be a problem, since I should be able to use Docker.
However, the external program might use up to several gigabytes of RAM per run, and I am worried that Beam might schedule too many steps on a worker at the same time. Is there a way to control that to avoid OOM?

On Thu, Apr 9, 2020 at 6:29 PM Luke Cwik <[email protected]> wrote:

> Yes this is feasible and has been done by others.
> You can launch any process from within an Apache Beam DoFn using standard
> process libraries (e.g. ProcessBuilder in Java, subprocess in Python, ...).
>
> The trickier question is how do you ensure the environment that the
> "worker" is executing in has the process installed and this is somewhat
> runner dependent. Some solutions in the past have been:
> * for runners where you manage the worker pool: preinstall the process on
> all workers
> * for runners where you have a lot of permissions on the worker pool:
> install the process on demand on workers during DoFn setup
> * for any runner: build a statically linked version of the process and
> ship it with your pipeline and run that
> * for runners that support custom containers: extend the Apache Beam
> docker worker container and install your application there
>
>
> On Thu, Apr 9, 2020 at 4:21 AM Tadas Šubonis <[email protected]>
> wrote:
>
>> Hi all,
>>
>> I am processing tens of TBs of data in the form of binary files of
>> several hundred MB each. At the moment, I am using a custom-made Java
>> queue-worker system to process these files, but I would like to give
>> Apache Beam a go.
>>
>> The problem is that the files need to be read with an external program and
>> the whole communication is done via stdio. The system basically needs to
>> download a binary file (from a list of many), open it with a specified
>> program, read its data using stdio, and return the results for further
>> processing (the remainder of the processing happens in the program
>> that called the external one).
>>
>> Is it feasible to migrate this flow to Apache Beam? What would it take to
>> make Beam call the external program and communicate via stdio?
>>
>> In the end, I would like to have a pipeline that's easier to monitor and
>> that reschedules failed steps automatically.
>>
>> --
>>
>> Kind Regards,
>> Tadas Šubonis
>>
>

--
Kind Regards,
Tadas Šubonis
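
For reference, here is a minimal sketch in Python of the kind of call discussed above: send data to an external program over stdin, read its stdout, and cap how many copies run at once. It uses only the standard library. The names `run_external`, `MAX_CONCURRENT` and `_slots` are hypothetical and are not Beam APIs, and the child command is a stand-in for the real reader program. The assumption is that this function would be called from a DoFn's `process` method. The semaphore only bounds concurrency inside one Python process, and how many such processes a runner starts per machine is runner dependent, so it is not a complete answer to the OOM question.

```python
import subprocess
import sys
import threading

# Hypothetical cap: at most this many external processes at once within this
# Python process. A reasonable starting point is
# (RAM available to the worker) / (peak RAM of the external program).
MAX_CONCURRENT = 2
_slots = threading.BoundedSemaphore(MAX_CONCURRENT)


def run_external(cmd, payload, timeout=600):
    """Run cmd, write payload (bytes) to its stdin, return its stdout (bytes)."""
    with _slots:  # blocks while MAX_CONCURRENT processes are already running
        result = subprocess.run(
            cmd,
            input=payload,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout,  # raises TimeoutExpired if the program hangs
            check=True,       # raises CalledProcessError on non-zero exit
        )
    return result.stdout


if __name__ == "__main__":
    # Stand-in for the real program: a child Python that upper-cases stdin.
    cmd = [sys.executable, "-c",
           "import sys; sys.stdout.write(sys.stdin.read().upper())"]
    print(run_external(cmd, b"hello"))
```

Letting a non-zero exit code or a timeout raise an exception means the failure surfaces to the caller instead of being silently dropped, which fits the goal of having failed steps retried automatically.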
