Hello,
You could try using the mapPartitions function if you can send partial
data to your C++ program:
mapPartitions(func):
Similar to map, but runs separately on each partition (block) of the
RDD, so /func/ must be of type Iterator<T> => Iterator<U> when running
on an RDD of type T.
That way you can write the partition's data to a temp file, call your
C++ app, read its output, then delete the temp files. Of course each
call would only see the rows of one partition.
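As a rough sketch of that pattern (a hypothetical helper, not Spark API; `/bin/cat` stands in for your C++ binary and simply echoes each input line back):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Iterator;
import java.util.List;

public class PartitionProcessor {

    // Writes one partition's rows to a temp file, runs the external
    // program on it, reads the program's output file, and cleans up.
    // In Spark this would be the body of the function you pass to
    // rdd.mapPartitions(...). "binary" is your installed C++ app.
    public static Iterator<String> process(Iterator<String> rows, String binary)
            throws IOException, InterruptedException {
        Path input = Files.createTempFile("partition-", ".txt");
        try {
            // Dump the partition to the local file system.
            try (var writer = Files.newBufferedWriter(input)) {
                while (rows.hasNext()) {
                    writer.write(rows.next());
                    writer.newLine();
                }
            }
            // Call the external program; capture its stdout in a file.
            Path output = Files.createTempFile("result-", ".txt");
            try {
                Process p = new ProcessBuilder(binary, input.toString())
                        .redirectOutput(output.toFile())
                        .start();
                if (p.waitFor() != 0) {
                    throw new IOException(binary + " exited with " + p.exitValue());
                }
                // Materialize the results BEFORE deleting the temp files,
                // so the returned iterator does not depend on them.
                List<String> results = Files.readAllLines(output);
                return results.iterator();
            } finally {
                Files.deleteIfExists(output);
            }
        } finally {
            Files.deleteIfExists(input);
        }
    }

    public static void main(String[] args) throws Exception {
        // /bin/cat is a stand-in for the real C++ application.
        Iterator<String> out =
                process(List.of("a", "b", "c").iterator(), "/bin/cat");
        while (out.hasNext()) {
            System.out.println(out.next());
        }
    }
}
```

In a real job you would pass this (wrapped to handle the checked exceptions) as the function given to JavaRDD.mapPartitions, with `binary` pointing at the C++ executable you installed on the workers.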
Also the latest release of Spark (2.4.0) introduced barrier execution mode:
https://issues.apache.org/jira/browse/SPARK-24374
Maybe you could combine the two. Just using mapPartitions will give you
the data of a single partition only, and your app will be invoked once
per partition across the nodes, not necessarily at the same time.
Spark's strong point is parallel execution, so what you're trying to do
somewhat defeats that. But if you do not need to combine all the data
before calling your app, this approach should work.
Or you could split your job into a Spark -> app -> Spark chain.
Good luck,
Joe
On 11/11/2018 02:13 PM, Steve Lewis wrote:
I have a problem where a critical step needs to be performed by a
third-party C++ application. I can send or install this program on the
worker nodes. I can construct a function holding all the data this
program needs to process. The problem is that the program is designed
to read from and write to the local file system. I can call the program
from Java and read its output as a local file, then delete all the
temporary files, but I doubt it is possible to get the program to
read from HDFS or any shared file system.
My question is: can a function running on a worker node create
temporary files and pass their names to a local process, assuming
everything is cleaned up after the call?
--
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com
---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org