No. Most RDDs partition the input data, so each task reads only its own partition rather than the whole dataset.

On Tue, Jan 6, 2015 at 1:41 PM, Franc Carter <franc.car...@rozettatech.com> wrote:
> One more question, to clarify: will every node pull in all the data?
>
> thanks
>
> On Tue, Jan 6, 2015 at 12:56 PM, Cody Koeninger <c...@koeninger.org> wrote:
>
>> Unless you co-locate Spark executor processes on the machines where the
>> data is stored and use an RDD that knows which node to prefer when
>> scheduling a task, yes, the data will be pulled over the network.
>>
>> Of the options you listed, S3 and DynamoDB cannot have Spark running on
>> the same machines. Cassandra can be run on the same nodes as Spark, and
>> recent versions of the Spark Cassandra connector implement preferred
>> locations. You can run an RDBMS on the same nodes as Spark, but JdbcRDD
>> doesn't implement preferred locations.
>>
>> On Mon, Jan 5, 2015 at 6:25 PM, Franc Carter <franc.car...@rozettatech.com> wrote:
>>
>>> Hi,
>>>
>>> I'm trying to understand how a Spark cluster behaves when the data it is
>>> processing resides on a centralized/remote store (S3, Cassandra, DynamoDB,
>>> RDBMS etc).
>>>
>>> Does every node in the cluster retrieve all the data from the central
>>> store?
>>>
>>> thanks
>>>
>>> --
>>> *Franc Carter* | Systems Architect | Rozetta Technology
>>> franc.car...@rozettatech.com | www.rozettatechnology.com
>>> Tel: +61 2 8355 2515
>>> Level 4, 55 Harrington St, The Rocks NSW 2000
>>> PO Box H58, Australia Square, Sydney NSW 1215
>>> AUSTRALIA
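
For anyone reading this thread later, here is a rough sketch of the two points made above: partitioning means each executor reads only its share of the data, and preferred locations decide whether those reads are local or go over the network. This is a toy model in plain Python, not Spark's API. The names `Partition`, `make_partitions`, `schedule`, `rows_read`, `remote_rows` and the hosts `node-a`, `node-b`, `node-c` are made up for illustration, and the least-loaded fallback is an assumption, not how Spark's scheduler actually picks a node.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Partition:
    index: int
    rows: range                           # slice of the dataset this partition covers
    preferred_host: Optional[str] = None  # host holding the data, if the source reports one


def make_partitions(total_rows: int, num_partitions: int,
                    hosts: Optional[List[str]] = None) -> List[Partition]:
    """Split [0, total_rows) into contiguous, non-overlapping partitions."""
    size, extra = divmod(total_rows, num_partitions)
    parts, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        host = hosts[i % len(hosts)] if hosts else None
        parts.append(Partition(i, range(start, end), host))
        start = end
    return parts


def schedule(partitions: List[Partition],
             executors: List[str]) -> Dict[str, List[Partition]]:
    """Assign each partition to exactly one executor.

    A partition goes to its preferred host when that host runs an executor;
    otherwise it goes to the currently least-loaded executor (an assumption).
    """
    plan: Dict[str, List[Partition]] = {e: [] for e in executors}
    for p in partitions:
        if p.preferred_host in plan:
            target = p.preferred_host
        else:
            target = min(executors, key=lambda e: len(plan[e]))
        plan[target].append(p)
    return plan


def rows_read(plan: Dict[str, List[Partition]]) -> Dict[str, int]:
    """Number of rows each executor reads."""
    return {e: sum(len(p.rows) for p in parts) for e, parts in plan.items()}


def remote_rows(plan: Dict[str, List[Partition]]) -> int:
    """Rows read by an executor that is not the host holding them."""
    return sum(len(p.rows)
               for e, parts in plan.items()
               for p in parts
               if p.preferred_host != e)


if __name__ == "__main__":
    executors = ["node-a", "node-b", "node-c"]

    # Remote store with no locality information (the S3 / DynamoDB / JdbcRDD case).
    remote_plan = schedule(make_partitions(90, 6), executors)
    print(rows_read(remote_plan), "remote rows:", remote_rows(remote_plan))

    # Co-located store that reports preferred locations (the Cassandra case).
    local_plan = schedule(make_partitions(90, 6, hosts=executors), executors)
    print(rows_read(local_plan), "remote rows:", remote_rows(local_plan))
```

In both runs each executor reads a third of the rows, never the full dataset. The only difference is whether those rows come over the network or from the local machine.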