Hi Phil,

The short answer is that a Spark job has one driver (which schedules the
tasks and coordinates how the data is distributed) and a number of worker
nodes (which receive data and perform the actual tasks). That said, some
steps have to run on the driver because they need all of the data in one
place, for example when the final results are collected.
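
To make the split concrete, here is a toy sketch in plain Python. It is not
Spark code and does not use any Spark API: threads stand in for worker
nodes, the function names (split_into_partitions, worker_task, driver_mean)
are made up for illustration, and the default of 8 workers is only there to
mirror the eight machines in your question. The point is the shape of the
work: the driver partitions the data, each worker sees only its own
partition, and the last step runs on the driver because it needs every
partial result.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(rows, num_partitions):
    """Driver side: cut the dataset into roughly equal partitions."""
    return [rows[i::num_partitions] for i in range(num_partitions)]


def worker_task(partition):
    """Worker side: compute a partial (sum, count) from one partition only."""
    return sum(partition), len(partition)


def driver_mean(rows, num_workers=8):
    """Driver side: hand out partitions, then combine the partial results."""
    partitions = split_into_partitions(rows, num_workers)

    # Each partition is processed independently, as a worker would do it.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(worker_task, partitions))

    # This final step needs every partial result, so it happens on the driver.
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


if __name__ == "__main__":
    print(driver_mean(list(range(1, 101))))
```

Note that only the small (sum, count) pairs travel back to the driver in
this sketch, not the rows themselves.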

I'd recommend taking a look at the video below, which will explain this
concept in much greater detail. It also goes through an example and shows
you how to use the logging tools to understand what is happening within
your program.

https://www.youtube.com/watch?v=dmL0N3qfSc8

Thanks,
Kevin

On Thu, Jan 28, 2016 at 4:41 AM, Philip Lee <[email protected]> wrote:

> Hi,
>
> A simple question about how Spark distributes a small dataset.
>
> Let's say I have 8 machines with 48 cores and 48GB of RAM as a cluster.
> The dataset (ORC format, written by Hive) is small, about 1GB, but I
> copied it to HDFS.
>
> 1) If spark-sql runs on the dataset, which is distributed over HDFS on
> each machine, what happens to the job? Does one machine handle the whole
> dataset because it is so small?
>
> 2) But the dataset is already distributed across the machines. Or does
> each machine handle its part of the distributed dataset and send it to
> the Master Node?
>
> Could you explain in detail how this works in a distributed setting?
>
> Best,
> Phil
>
>
>
>
