Thanks! So I assume I can safely run a function *F* of mine within the Spark
driver program, without dispatching it to the cluster, and thereby keep one
piece of code for *both* a real cluster run over big data and for small
on-demand runs on a single input (now and then), with both scenarios using
the same code and the same application-specific configuration of my business
logic. Is that correct?
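
To make it concrete, here is roughly what I have in mind, as a minimal
sketch (process(), the paths and the app name are placeholders for my
real business logic):

import org.apache.spark.{SparkConf, SparkContext}

object MyApp {
  // Shared business logic, used by both scenarios below.
  def process(record: String): String = record.toUpperCase  // placeholder

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("MyApp"))

    // Scenario 1: the real cluster run over the big dataset.
    sc.textFile("hdfs:///data/input")          // placeholder path
      .map(process)
      .saveAsTextFile("hdfs:///data/output")   // placeholder path

    // Scenario 2: a small on-demand run on a single input -- a plain
    // local call in the driver, with no cluster work dispatched.
    println(process("one new datum"))

    sc.stop()
  }
}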
Can I still write its output the same way Spark actions allow for a real
distributed task? Would I see it as a task in the monitoring UI
(http://<driver-node>:4040) of the driver? Thanks for the newb support.

Thanks,
Matan

On Wed, Jan 28, 2015 at 12:19 PM, Sean Owen <[email protected]> wrote:
> Processing one object isn't a distributed operation, and doesn't
> really involve Spark. Just invoke your function on your object in the
> driver; there's no magic at all to that.
>
> You can make an RDD of one object and invoke a distributed Spark
> operation on it, but assuming you mean you have it on the driver,
> that's wasteful. It just copies the object to another machine to
> invoke the function.
>
> On Wed, Jan 28, 2015 at 10:14 AM, Matan Safriel <[email protected]>
> wrote:
> > Hi,
> >
> > How would I run a given function in Spark, over a single input object?
> > Would I first add the input to the file system, then somehow invoke the
> > Spark function on just that input? Or should I rather twist the Spark
> > Streaming API for it?
> >
> > Assume I'd like to run a piece of computation that normally runs over a
> > large dataset, over just one new added datum. I'm a bit reticent about
> > adapting my code to Spark without knowing the limits of this scenario.
> >
> > Many thanks!
> > Matan
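
P.S. Re the one-object RDD you mention: just to check I read it right,
I take it you mean something like the below (sketch only, e.g. in
spark-shell where sc is predefined; process() and the output path are
placeholders)?

// Same placeholder business-logic function as in my sketch above.
def process(record: String): String = record.toUpperCase

// Wrap the single driver-side object in a one-element RDD, then run an
// ordinary distributed action on it. This does show up as a job in the
// :4040 UI, but it just copies the object out to an executor to invoke
// the function there.
sc.parallelize(Seq("one new datum"))
  .map(process)
  .saveAsTextFile("hdfs:///tmp/one-off-output")  // placeholder path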
