Hello,
You could try using the mapPartitions function if you can send partial data to your C++ program:

mapPartitions(func):
Similar to map, but runs separately on each partition (block) of the RDD, so /func/ must be of type Iterator<T> => Iterator<U> when running on an RDD of type T.

That way you can write the partition data to a temp file, call your C++ app, then delete the temp file. Of course, each call would only see the rows in one partition.
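As a rough sketch, the function you pass to mapPartitions could look like this (a minimal Python example of the temp-file pattern; the Unix `sort` command stands in for your C++ app, and the Spark call at the bottom is shown only as a comment):

```python
import os
import subprocess
import tempfile

def run_external_app(rows):
    """Runs once per partition: dump the rows to a local temp file,
    call the external program on it, and yield its output lines."""
    fd, in_path = tempfile.mkstemp(suffix=".txt")
    try:
        # Write the partition's rows to the local file system.
        with os.fdopen(fd, "w") as f:
            for row in rows:
                f.write(row + "\n")
        # Call the external program; 'sort' stands in for the C++ app.
        result = subprocess.run(
            ["sort", in_path], capture_output=True, text=True, check=True
        )
        for line in result.stdout.splitlines():
            yield line
    finally:
        os.remove(in_path)  # clean up the temp file

# With Spark you would then call something like:
# output_rdd = rdd.mapPartitions(run_external_app)
```

Since the function both creates and deletes the temp file, nothing is left behind on the worker even if the external call fails.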

Also, the latest release of Spark (2.4.0) introduced barrier execution mode:
https://issues.apache.org/jira/browse/SPARK-24374

Maybe you could combine the two: mapPartitions alone gives you single-partition data only, and your app will be invoked separately on each partition, not necessarily at the same time.

Spark's strong point is parallel execution, so what you're trying to do somewhat defeats that. But if you do not need to combine all the data before calling your app, then it could work.
Or you could split your job into a Spark -> app -> Spark chain.
Good luck,

Joe



On 11/11/2018 02:13 PM, Steve Lewis wrote:
I have a problem where a critical step needs to be performed by a third-party C++ application. I can send or install this program on the worker nodes. I can construct a function holding all the data this program needs to process. The problem is that the program is designed to read and write from the local file system. I can call the program from Java and read its output as a local file, then delete all temporary files, but I doubt that it is possible to get the program to read from HDFS or any shared file system. My question is: can a function running on a worker node create temporary files and pass their names to a local process, assuming everything is cleaned up after the call?

--
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com


