No, most RDDs partition the input data, so each task reads only its own partition rather than the whole dataset.
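
To make the partitioning and preferred-location ideas in this thread concrete, here is a rough toy model in plain Python. It is not Spark's API: the function names, host names, and the round-robin fallback are all made up for illustration, and a real scheduler is more involved. It only shows that a key range is split into partitions, that each task reads one partition, and that a task gets a local read only when its preferred host is also running an executor.

```python
# Toy model only: plain Python, not Spark's API. All names are hypothetical.

def split_into_partitions(lower, upper, num_partitions):
    """Split the inclusive key range [lower, upper] into contiguous sub-ranges."""
    length = upper - lower + 1
    bounds = []
    for i in range(num_partitions):
        start = lower + (i * length) // num_partitions
        end = lower + ((i + 1) * length) // num_partitions - 1
        bounds.append((start, end))
    return bounds


def assign_tasks(partitions, executors, preferred=None):
    """Assign one task per partition to an executor.

    If `preferred` maps a partition index to a host that is also an executor,
    the task goes there (a local read). Otherwise fall back to round-robin
    (an assumption of this sketch), which implies a read over the network.
    """
    preferred = preferred or {}
    plan = []
    for i, bounds in enumerate(partitions):
        host = preferred.get(i)
        local = host in executors
        if not local:
            host = executors[i % len(executors)]
        plan.append((bounds, host, local))
    return plan


# Four partitions over keys 1..100, two executors. Partition 0 prefers a host
# that runs an executor; partition 3 prefers a host that does not.
partitions = split_into_partitions(1, 100, 4)
plan = assign_tasks(partitions, ["node-a", "node-b"],
                    preferred={0: "node-b", 3: "node-z"})
for bounds, host, local in plan:
    print(bounds, host, "local read" if local else "network read")
```

In this model no task ever touches more than its own sub-range, which is the point of the answer above: the data may cross the network, but each node pulls only the partitions assigned to it.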

On Tue, Jan 6, 2015 at 1:41 PM, Franc Carter <franc.car...@rozettatech.com>
wrote:

>
> One more question, to clarify: will every node pull in all the data?
>
> thanks
>
> On Tue, Jan 6, 2015 at 12:56 PM, Cody Koeninger <c...@koeninger.org>
> wrote:
>
>> Unless you are co-locating Spark executor processes on the same machines
>> where the data is stored, and using an RDD that knows which node to
>> prefer when scheduling a task, yes, the data will be pulled over the network.
>>
>> Of the options you listed, S3 and DynamoDB cannot have Spark running on
>> the same machines. Cassandra can be run on the same nodes as Spark, and
>> recent versions of the Spark Cassandra connector implement preferred
>> locations. You can run an RDBMS on the same nodes as Spark, but JdbcRDD
>> doesn't implement preferred locations.
>>
>> On Mon, Jan 5, 2015 at 6:25 PM, Franc Carter <
>> franc.car...@rozettatech.com> wrote:
>>
>>>
>>> Hi,
>>>
>>> I'm trying to understand how a Spark Cluster behaves when the data it is
>>> processing resides on a centralized/remote store (S3, Cassandra, DynamoDB,
>>> RDBMS etc).
>>>
>>> Does every node in the cluster retrieve all the data from the central
>>> store?
>>>
>>> thanks
>>>
>>> --
>>>
>>> *Franc Carter* | Systems Architect | Rozetta Technology
>>>
>>> franc.car...@rozettatech.com  <franc.car...@rozettatech.com>|
>>> www.rozettatechnology.com
>>>
>>> Tel: +61 2 8355 2515
>>>
>>> Level 4, 55 Harrington St, The Rocks NSW 2000
>>>
>>> PO Box H58, Australia Square, Sydney NSW 1215
>>>
>>> AUSTRALIA
>>>
>>>
>>
>
