Hi all, I am processing tens of TBs of data stored as binary files of several hundred MB each. At the moment I use a custom-made Java queue-worker system to process these files, but I would like to give Apache Beam a go.
The problem is that the files have to be read with an external program, and all communication with it goes through stdio. The system basically needs to download a binary file (from a list of many), open it with the specified program, read its data over stdio, and return the results for further processing (the remainder of the processing happens in the program that called the external one). Is it feasible to migrate this flow to Apache Beam? What would it take to make Beam call the external program and communicate with it via stdio? In the end, I would like to have a pipeline that is easier to monitor and that reschedules failed steps automatically.

--
Kind Regards,
Tadas Šubonis
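To make the stdio round trip described above concrete, here is a minimal sketch in plain Java of launching an external program, feeding it bytes on stdin, and collecting its stdout. The class name `ExternalReader`, the method `run`, and the assumption that the program emits line-oriented text are all hypothetical; the real program and its protocol are not specified in this post. In a Beam pipeline, a helper like this could presumably be called from a `DoFn` method annotated with `@ProcessElement`, with a thrown exception signalling a failed element to the runner, but that wiring is not shown here.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: runs an external command and talks to it over stdio.
public class ExternalReader {

    // Sends `stdin` to the command and returns its stdout, split into lines.
    // Assumes the program writes UTF-8 text; a binary protocol would need a
    // different reader.
    public static List<String> run(List<String> command, byte[] stdin)
            throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        // Pass stderr through to the parent so a full stderr pipe cannot
        // block the child.
        builder.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process process = builder.start();

        // Write stdin on a separate thread: writing and reading on the same
        // thread can deadlock once the OS pipe buffers fill up.
        Thread writer = new Thread(() -> {
            try (OutputStream out = process.getOutputStream()) {
                out.write(stdin);
            } catch (IOException e) {
                // The child closed stdin early; the exit-code check below
                // reports the failure.
            }
        });
        writer.start();

        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }

        writer.join();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            // Surface the failure so the caller can retry or reschedule.
            throw new IOException("External program exited with code " + exitCode);
        }
        return lines;
    }
}
```

For example, `ExternalReader.run(List.of("cat"), data)` would echo `data` back line by line on a system where `cat` is available. For files of several hundred MB, streaming from the downloaded file instead of holding a `byte[]` in memory would likely be preferable.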
