Spark is very hot right now, but after reading the paper I found its core
concept surprisingly similar to Pig's: an RDD is just a relation (set) in
Pig's terminology.

I think a great strength of Spark is that it tries to merge multiple
"narrow dependency" stages together to avoid excessive I/O. Does Pig do that
too? Apart from that, I can't figure out what other major design differences
would lead to a huge performance difference if Spark also uses on-disk
storage. The overhead of starting a MapReduce task should not be that big.
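
To make concrete what I mean by merging narrow-dependency stages, here is a
toy sketch in plain Python. It is not Spark's or Pig's actual API; the
function names (run_staged, run_pipelined) and the three operators are
made up for illustration, and counting materialized records is only a
stand-in for intermediate I/O.

```python
# Toy illustration only: not Spark's or Pig's real API.

def run_staged(records, ops):
    """Materialize the full output of each operator before starting the
    next one, as a chain of separate jobs with intermediate writes would."""
    materialized = 0
    data = list(records)
    for op in ops:
        data = list(op(data))
        materialized += len(data)  # stands in for an intermediate write
    return data, materialized

def run_pipelined(records, ops):
    """Chain the operators lazily so each record flows through all of
    them; only the final result is materialized."""
    stream = iter(records)
    for op in ops:
        stream = op(stream)  # generators compose without buffering
    result = list(stream)
    return result, len(result)

# Three per-record ("narrow") operators, written as generator functions.
def double(xs):
    return (x * 2 for x in xs)

def keep_multiples_of_four(xs):
    return (x for x in xs if x % 4 == 0)

def add_one(xs):
    return (x + 1 for x in xs)

ops = [double, keep_multiples_of_four, add_one]
staged_result, staged_count = run_staged(range(10), ops)
pipelined_result, pipelined_count = run_pipelined(range(10), ops)
```

Both variants produce the same result, but the staged one materializes
every intermediate data set while the pipelined one materializes only the
final output. That gap is the saving I am asking about.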
