Re: Execute function once on each node

2016-07-19 Thread Rabin Banerjee
"I am working on a spark application that requires the ability to run a function on each node in the cluster" -- Use Apache Ignite instead of Spark. Trust me, it's awesome for this use case. Regards, Rabin Banerjee On Jul 19, 2016 3:27 AM, "joshuata" wrote: > I am working on a spark application

Re: Execute function once on each node

2016-07-19 Thread Josh Asplund
Technical limitations keep us from running another filesystem on the SSDs. We are running on a very large HPC cluster without control over low-level system components. We have tried setting up an ad-hoc HDFS cluster on the nodes in our allocation, but we have had very little luck. It ends up being

Re: Execute function once on each node

2016-07-19 Thread Josh Asplund
Thank you for that advice. I have tried similar techniques, but not that one. On Mon, Jul 18, 2016 at 11:42 PM Aniket Bhatnagar < aniket.bhatna...@gmail.com> wrote: > Thanks for the explanation. Try creating a custom RDD whose getPartitions > returns an array of custom partition objects of size n

Re: Execute function once on each node

2016-07-19 Thread Koert Kuipers
The whole point of a well-designed global filesystem is to not move the data. On Jul 19, 2016 10:07, "Koert Kuipers" wrote: > If you run hdfs on those ssds (with low replication factor) wouldn't it > also effectively write to local disk with low latency? > > On Jul 18, 2016 21:54, "Josh Asplund"

Re: Execute function once on each node

2016-07-18 Thread Aniket Bhatnagar
Thanks for the explanation. Try creating a custom RDD whose getPartitions returns an array of custom partition objects of size n (= number of nodes). In a custom partition object, you can have the file path and ip/hostname where the partition needs to be computed. Then, have getPreferredLocations r
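The approach described above can be sketched as a custom RDD. This is a minimal, hypothetical illustration (the class and field names are invented for the example, not from the thread): `getPartitions` returns one partition per node, each carrying a file path and a host, and `getPreferredLocations` hints the scheduler to run that partition on its designated host. Note that preferred locations are only hints; Spark may still schedule elsewhere if the preferred executor is busy or gone.

```scala
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical partition type: pairs a node-local file path with the
// hostname where that partition should preferably be computed.
case class NodeLocalPartition(index: Int, path: String, host: String)
    extends Partition

// One partition per (path, host) pair, i.e. one per node.
class NodeLocalRDD(sc: SparkContext, locations: Seq[(String, String)])
    extends RDD[String](sc, Nil) {

  override def getPartitions: Array[Partition] =
    locations.zipWithIndex.map { case ((path, host), i) =>
      NodeLocalPartition(i, path, host)
    }.toArray

  // Scheduling hint: prefer running each partition on its own host,
  // where the local SSD data lives.
  override def getPreferredLocations(split: Partition): Seq[String] =
    Seq(split.asInstanceOf[NodeLocalPartition].host)

  override def compute(split: Partition, context: TaskContext): Iterator[String] = {
    val p = split.asInstanceOf[NodeLocalPartition]
    // Read the node-local file on the executor that owns it.
    scala.io.Source.fromFile(p.path).getLines()
  }
}
```

Usage would look something like `new NodeLocalRDD(sc, Seq(("/local/ssd/out.dat", "node01"), ...)).count()`, which pulls each task toward the node holding its file instead of moving the data.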

Re: Execute function once on each node

2016-07-18 Thread Josh Asplund
The spark workers are running side-by-side with scientific simulation code. The code writes output to local SSDs to keep latency low. Due to the volume of data being moved (tens of terabytes+), it isn't really feasible to copy the data to a global filesystem. Executing a function on each node woul

Re: Execute function once on each node

2016-07-18 Thread Aniket Bhatnagar
You can't assume that the number of nodes will be constant as some may fail, hence you can't guarantee that a function will execute at most once or at least once on a node. Can you explain your use case in a bit more detail? On Mon, Jul 18, 2016, 10:57 PM joshuata wrote: > I am working on a spark