Well described, thanks!

On 04-May-2017 4:07 AM, "JayeshLalwani" <jayesh.lalw...@capitalone.com>
wrote:

> In any distributed application, you scale up by splitting execution across
> multiple machines. The way Spark does this is by slicing the data into
> partitions and spreading them over multiple machines. Logically, an RDD is
> exactly that: data split up and spread around on multiple machines. When
> you perform an operation on an RDD, Spark tells all the machines to perform
> that operation on their own slice of the data. So, for example, if you
> perform a filter operation (or, if you are using SQL, you run /SELECT *
> FROM tablename WHERE col = colval/), Spark tells each machine to look for
> rows that match your filter criteria in its own slice of data. This
> operation results in another distributed dataset that contains the filtered
> records. Note that when you do a filter operation, Spark doesn't move data
> off the machines it resides on. It keeps the filtered records on the same
> machine. This ability of Spark to keep data in place is what provides
> scalability. As long as your operations keep data in place, you can scale
> up indefinitely: if you get 10x more records, you can add 10x more machines
> and get the same performance.
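>
> As a minimal Scala sketch of a narrow operation like this (the table path
> and column names are made up for illustration):
>
>     import org.apache.spark.sql.SparkSession
>     import org.apache.spark.sql.functions.col
>
>     val spark = SparkSession.builder.appName("filter-example").getOrCreate()
>
>     // Hypothetical table; its partitions are spread across the cluster.
>     val df = spark.read.parquet("/data/tablename")
>
>     // filter is a "narrow" transformation: each machine filters the
>     // partitions it already holds, so nothing crosses the network.
>     val filtered = df.filter(col("col") === "colval")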
>
> However, the problem is that a lot of operations cannot be done while
> keeping data in place. For example, let's say you have 2 tables/dataframes.
> Spark will slice both up and spread them around the machines. Now let's say
> you join the two tables. It may happen that the slice of data that resides
> on one machine has matching records on another machine. So now Spark has to
> bring data over from one machine to another. This is what Spark calls a
> /shuffle/. Spark does this intelligently. However, whenever data leaves one
> machine and goes to other machines, you cannot scale infinitely. There will
> be a point at which you overwhelm the network, and adding more machines
> isn't going to improve performance.
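>
> You can actually see the shuffle in the query plan. A sketch, again with
> made-up table and column names:
>
>     // Two hypothetical tables, each sliced into partitions.
>     val orders = spark.read.parquet("/data/orders")
>     val users  = spark.read.parquet("/data/users")
>
>     val joined = orders.join(users, "user_id")
>
>     // Every "Exchange" operator in the printed plan is a shuffle:
>     // rows being repartitioned and sent across the network.
>     joined.explain()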
>
> So, the point is that you have to avoid shuffles as much as possible. You
> cannot eliminate shuffles altogether, but you can reduce them.
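>
> One common way to reduce them, for what it's worth: if one side of a join
> is small, broadcast it instead of shuffling both sides. A sketch, reusing
> the hypothetical tables above:
>
>     import org.apache.spark.sql.functions.broadcast
>
>     // The small users table is copied whole to every machine, so the
>     // big orders table's partitions never have to move.
>     val joined2 = orders.join(broadcast(users), "user_id")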
>
> Now, /collect/ is the granddaddy of all shuffles. It causes Spark to bring
> all the data that it has distributed over the machines onto a single
> machine. If you call collect on a large table, it's analogous to drinking
> from a firehose: you are going to drown. Calling collect on a small table
> is fine, because very little data will move.
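>
> If you only want to peek at a big table, there are gentler calls than
> collect (reusing the hypothetical df from the first sketch):
>
>     // Dangerous on a large table: pulls every row onto one machine.
>     // val everything = df.collect()
>
>     val firstRows = df.take(20)  // fetches only the first 20 rows
>     df.show(20)                  // or just print them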
>
> Usually, it's recommended to run all your aggregations in Spark SQL, and
> once the data is boiled down to a size small enough to be presented to a
> human, you can call collect on it to fetch it and present it to the user.
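>
> For instance (the column names are invented):
>
>     import org.apache.spark.sql.functions.{avg, count}
>
>     // The groupBy still shuffles, but the result is tiny -- one row
>     // per country -- so collecting it onto one machine is harmless.
>     val summary = orders
>       .groupBy("country")
>       .agg(count("*").as("orders"), avg("amount").as("avg_amount"))
>
>     val rows = summary.collect()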
>
