Thanks for the tip. Packaging the application shouldn't be a problem, since
I can use Docker.

However, the external program may need several gigabytes of RAM per run, and
I am worried that Beam might schedule too many steps at once. Is there a way
to limit that and avoid OOM errors?
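
To make the concern concrete, here is a minimal sketch of the kind of guard I
have in mind. It is not a Beam feature: it assumes Python's standard
subprocess and threading modules, and the names MAX_CONCURRENT and
run_external are hypothetical. It only bounds concurrency inside a single
worker process, so it says nothing about how many worker processes a runner
starts per machine.

```python
import subprocess
import sys
import threading

# Hypothetical cap: at most this many external processes at once
# within one worker process.
MAX_CONCURRENT = 2
_slots = threading.BoundedSemaphore(MAX_CONCURRENT)


def run_external(cmd, payload, timeout=600):
    """Feed payload to cmd on stdin and return its stdout as bytes."""
    with _slots:  # blocks while MAX_CONCURRENT processes are already running
        result = subprocess.run(
            cmd,
            input=payload,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout,
            check=True,  # raise CalledProcessError on a non-zero exit code
        )
    return result.stdout


if __name__ == "__main__":
    # Stand-in for the real program: a child Python process that
    # upper-cases whatever it reads from stdin.
    demo_cmd = [
        sys.executable,
        "-c",
        "import sys; sys.stdout.write(sys.stdin.read().upper())",
    ]
    print(run_external(demo_cmd, b"hello"))
```

A DoFn could presumably call run_external from its process method, but
whether a per-process limit like this is enough depends on how the runner
maps steps to threads and processes on each worker, which is exactly the part
I am unsure about.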

On Thu, Apr 9, 2020 at 6:29 PM Luke Cwik <[email protected]> wrote:

> Yes this is feasible and has been done by others.
> You can launch any process from within an Apache Beam DoFn using standard
> process libraries (e.g. ProcessBuilder in Java, subprocess in Python, ...).
>
> The trickier question is how to ensure that the environment the "worker"
> is executing in has the process installed; this is somewhat runner
> dependent. Some solutions in the past have been:
> * for runners where you manage the worker pool: preinstall the process on
> all workers
> * for runners where you have a lot of permissions on the worker pool:
> install the process on demand on workers during DoFn setup
> * for any runner: build a statically linked version of the process and
> ship it with your pipeline and run that
> * for runners that support custom containers: extend the Apache Beam
> docker worker container and install your application there
>
>
> On Thu, Apr 9, 2020 at 4:21 AM Tadas Šubonis <[email protected]>
> wrote:
>
>> Hi all,
>>
>> I am processing tens of TBs of data in the form of binary files of
>> several hundred MB each. At the moment, I am using a custom-made Java
>> queue-worker system to process these files, but I would like to give
>> Apache Beam a go.
>>
>> The problem is that files need to be read with an external program and
>> the whole communication is done via stdio. The system basically needs to
>> download a binary file (from a list of many), open it with a specified
>> program, read its data using stdio, and return the results for further
>> processing (the remainder of the processing happens in the program
>> that called the external one).
>>
>> Is it feasible to migrate this flow to Apache Beam? What would it take to
>> make Beam call the external program and communicate via stdio?
>>
>> In the end, I would like to have a pipeline that is easier to monitor and
>> that reschedules failed steps automatically.
>>
>> --
>>
>> Kind Regards,
>> Tadas Šubonis
>>
>

-- 

Kind Regards,
Tadas Šubonis
