For 2): if the input is a Range, Spark only needs the start and end values
for each partition, so the overhead of a Range is small. For an ArrayBuffer,
however, Spark has to serialize all of the data into the task. That's why
the task size is huge in your case.
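A rough sketch of the difference, assuming a SparkContext `sc` as in the
spark-shell (variable names are illustrative):

```scala
import scala.collection.mutable.ArrayBuffer

// A Range is fully described by (start, end, step), so each task only
// needs to carry those numbers plus its partition boundaries:
val fromRange = sc.parallelize(1 to 1000000)   // small task size

// An ArrayBuffer has no such compact description; every element must be
// serialized into each task:
val buf = ArrayBuffer.range(1, 1000001)
val fromBuffer = sc.parallelize(buf)           // task size grows with the data
```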

For 1): Spark does not always ship the data to the executors; it only sends
the task. When creating an RDD from HDFS files, only the file metadata is
sent in the task. However, parallelize(ArrayBuffer) is an exception: by
design it has to send the data in the ArrayBuffer. When you call a second
action in the driver on the same RDD, if the data is not persisted, Spark
needs to load the data again. You can call RDD.cache to persist the RDD in
memory.
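For example (a sketch, again assuming a SparkContext `sc` as in the
spark-shell, with `data` from your snippet):

```scala
// Mark the RDD for in-memory persistence before the first action:
val distData = sc.parallelize(data).cache()

distData.collect()  // first action: partitions are computed, then cached on executors
distData.count()    // later actions are served from executor memory instead of recomputing
```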


Best Regards,
Shixiong Zhu

2014-11-06 11:35 GMT+08:00 nsareen <nsar...@gmail.com>:

> I noticed a behaviour where, if I'm using
> val temp = sc.parallelize(1 to 100000)
>
> temp.collect
>
> the task size will be in bytes, let's say 1120 bytes.
>
> But if I change this to a for loop
>
> import scala.collection.mutable.ArrayBuffer
> val data = new ArrayBuffer[Integer]()
> for (i <- 1 to 1000000) data += i
> val distData = sc.parallelize(data)
> distData.collect
>
> Here the task size is in MBs: 5000120 bytes.
>
> Any inputs here would be appreciated; this is really confusing!
>
> 1) Why does the data travel from the Driver to the Executor every time an
> Action is performed? (I thought the data exists in the Executor's memory,
> and only the code is pushed from driver to executor.)
>
> 2) Why does Range not increase the task size, whereas any other collection
> increases the size exponentially?
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Task-size-variation-while-using-Range-Vs-List-tp18243.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
