Well described, thanks!

On 04-May-2017 4:07 AM, "JayeshLalwani" <jayesh.lalw...@capitalone.com> wrote:
> In any distributed application, you scale up by splitting execution across multiple machines. The way Spark does this is by slicing the data into partitions and spreading them over multiple machines. Logically, an RDD is exactly that: data split up and spread around on multiple machines. When you perform an operation on an RDD, Spark tells all the machines to perform that operation on their own slice of the data. So, for example, if you perform a filter operation (or, if you are using SQL, /Select * from tablename where col=colval/), Spark tells each machine to look for rows that match your filter criteria in its own slice of the data. This operation results in another distributed dataset that contains the filtered records. Note that when you do a filter operation, Spark doesn't move data off the machines it resides on; it keeps the filtered records on the same machine. This ability of Spark to keep data in place is what provides scalability. As long as your operations keep data in place, you can scale up almost without limit: if you get 10x more records, you can add 10x more machines and get the same performance.
>
> However, a lot of operations cannot be done by keeping data in place. For example, say you have two tables/dataframes. Spark will slice both up and spread them around the machines. Now say you join the two tables. It may happen that a slice of data residing on one machine has matching records on another machine, so Spark has to bring data over from one machine to another. This is what Spark calls a /shuffle/. Spark does this intelligently, but whenever data leaves one machine and goes to another, you cannot scale infinitely. There will be a point at which you overwhelm the network, and adding more machines isn't going to improve performance.
>
> So the point is that you have to avoid shuffles as much as possible.
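To make the filter-vs-join distinction concrete, here is a toy sketch in plain Python (not the real Spark API): each "machine" is modeled as a list holding one partition of the rows. The function names and the hash-partitioning scheme are illustrative assumptions, but they mirror the behavior described above: a filter runs independently per partition, while a join first has to shuffle rows between partitions.

```python
# Toy model of Spark-style partitioning; names and structure are
# illustrative, not the actual Spark API.

def filter_in_place(partitions, predicate):
    """A filter runs independently on every partition: each machine keeps
    only its own matching rows, and no data crosses partition boundaries."""
    return [[row for row in part if predicate(row)] for part in partitions]

def shuffle_by_key(partitions, key, n):
    """A shuffle re-distributes rows so all rows sharing a key land in the
    same partition (hash partitioning) -- this is where data moves."""
    out = [[] for _ in range(n)]
    for part in partitions:          # rows leave their original "machine" here
        for row in part:
            out[hash(key(row)) % n].append(row)
    return out

def join(left, right, key):
    """Join two datasets by shuffling both on the join key, then matching
    rows partition-by-partition."""
    n = max(len(left), len(right))
    l = shuffle_by_key(left, key, n)
    r = shuffle_by_key(right, key, n)
    return [[(a, b) for a in lp for b in rp if key(a) == key(b)]
            for lp, rp in zip(l, r)]
```

The asymmetry is the whole point: `filter_in_place` never moves a row between partitions, so it scales with the number of machines, while `join` must first call `shuffle_by_key` on both inputs, paying a network cost that grows with the data.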
> You cannot eliminate shuffles altogether, but you can reduce them.
>
> Now, /collect/ is the granddaddy of all shuffles. It causes Spark to bring all the data that it has distributed over the machines onto a single machine. If you call collect on a large table, it's analogous to drinking from a firehose: you are going to drown. Calling collect on a small table is fine, because very little data will move.
>
> Usually, it's recommended to run all your aggregations using Spark SQL, and once the data is boiled down to a size small enough to present to a human, you can call collect on it to fetch it and present it to the user.
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-collect-function-tp28644p28647.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
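The "aggregate first, collect last" advice can be sketched with the same toy partition model (plain Python, not the actual Spark API; the function names are assumptions for illustration). Each machine computes a small partial aggregate over its own partition, and only those partials, rather than the raw rows, cross the network to the driver.

```python
# Toy model, not the real Spark API: aggregate within each partition first,
# then "collect" only the small per-partition results onto the driver.

def count_by_key(partitions):
    """Each machine counts its own rows per key (a partial aggregate),
    touching no data outside its own partition."""
    partials = []
    for part in partitions:
        counts = {}
        for key in part:
            counts[key] = counts.get(key, 0) + 1
        partials.append(counts)
    return partials

def collect_and_merge(partials):
    """Only the tiny partial counts move to the driver, which merges them
    into the final answer. Collecting the raw rows instead would move all
    of the data."""
    total = {}
    for counts in partials:
        for key, c in counts.items():
            total[key] = total.get(key, 0) + c
    return total
```

However large the partitions grow, what reaches the driver is bounded by the number of distinct keys times the number of partitions, which is exactly why collecting an aggregated result is safe where collecting the raw table is not.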