Re: Data locality across jobs

2015-04-02 Thread Sandy Ryza
This isn't currently a capability that Spark has, though it has definitely
been discussed: https://issues.apache.org/jira/browse/SPARK-1061.  The
primary obstacle at this point is that Hadoop's FileInputFormat doesn't
guarantee that each file corresponds to a single split, so the records
corresponding to a particular partition at the end of the first job can end
up split across multiple partitions in the second job.
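
That said, a workaround in the spirit of what the JIRA discusses is to force
one split per file and then re-assert the partitioner by hand. This is a
sketch, not a supported API: it assumes that "mapred.min.split.size" set to
Long.MaxValue really does yield exactly one split per part-file, that
FileInputFormat returns splits in file-name order (so split i holds the keys
of original partition i), and that hourlyRdd stands in for the hourly job's
Tuple2 RDD. AssumePartitionedRDD is a made-up class, and Spark never
verifies the claim it makes:

  import org.apache.spark.{HashPartitioner, Partition, Partitioner, TaskContext}
  import org.apache.spark.SparkContext._   // pair-RDD implicits in Spark 1.2
  import org.apache.spark.rdd.RDD

  // Declares a partitioner on an RDD whose physical layout we *believe*
  // already matches it, without shuffling. Wrong results follow silently
  // if that belief is false.
  class AssumePartitionedRDD[K, V](prev: RDD[(K, V)], part: Partitioner)
      extends RDD[(K, V)](prev) {
    override val partitioner = Some(part)
    override protected def getPartitions: Array[Partition] = prev.partitions
    override def compute(split: Partition, ctx: TaskContext): Iterator[(K, V)] =
      prev.iterator(split, ctx)
  }

  val numParts = 64
  val part = new HashPartitioner(numParts)

  // Hourly job: partition before saving, so each part-file holds one partition.
  hourlyRdd.partitionBy(part).saveAsObjectFile("hdfs:///hourly/2015-04-01-07")

  // Daily job: prevent files from being split, read back, re-assert the
  // partitioner only if the partition count survived the round trip.
  sc.hadoopConfiguration.set("mapred.min.split.size", Long.MaxValue.toString)
  val raw = sc.objectFile[(String, Long)]("hdfs:///hourly/2015-04-01-07")
  val keyed =
    if (raw.partitions.length == numParts) new AssumePartitionedRDD(raw, part)
    else raw.partitionBy(part)  // assumption broken; fall back to a shuffle

The partition-count check catches a file that got split into several pieces,
but not splits coming back in a different order, which is why this remains a
hack rather than a feature.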

-Sandy

On Wed, Apr 1, 2015 at 9:09 PM, kjsingh <kanwaljit.si...@guavus.com> wrote:

 Hi,

 We are running an hourly job using Spark 1.2 on Yarn. It saves an RDD of
 Tuple2. At the end of the day, a daily job is launched, which works on the
 outputs of the hourly jobs.

 For data locality and speed, we wish that when the daily job launches, it
 finds all instances of a given key on a single executor rather than fetching
 them from other executors during a shuffle.

 Is it possible to maintain key partitioning across jobs? We can control
 partitioning within one job, but how do we send keys to the executors of the
 same node manager across jobs? And when saving data to HDFS, are the blocks
 allocated on the same data node as the executor that writes the partition?






Data locality across jobs

2015-04-01 Thread kjsingh
Hi,

We are running an hourly job using Spark 1.2 on Yarn. It saves an RDD of
Tuple2. At the end of the day, a daily job is launched, which works on the
outputs of the hourly jobs.

For data locality and speed, we wish that when the daily job launches, it
finds all instances of a given key on a single executor rather than fetching
them from other executors during a shuffle.

Is it possible to maintain key partitioning across jobs? We can control
partitioning within one job (a sketch follows below), but how do we send keys
to the executors of the same node manager across jobs? And when saving data to
HDFS, are the blocks allocated on the same data node as the executor that
writes the partition?
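
For concreteness, fixing the partitioning inside one job looks roughly like
this (a sketch: the path, the tab-separated key extraction, the count of 64,
and otherHourly are all made up):

  import org.apache.spark.HashPartitioner
  import org.apache.spark.SparkContext._   // pair-RDD implicits in Spark 1.2

  // Build (key, count) pairs and fix their partitioning in one pass.
  val hourly = sc.textFile("hdfs:///logs/2015-04-01-07")
    .map(line => (line.split('\t')(0), 1L))
    .reduceByKey(new HashPartitioner(64), _ + _)

  // Any RDD partitioned by an equal HashPartitioner(64) now joins with no
  // shuffle: HashPartitioners compare equal by partition count.
  val joined = hourly.join(otherHourly)   // hypothetical co-partitioned RDD

The partitioner, though, lives only in memory: it is lost once the RDD is
saved to HDFS and read back, which is the gap described in the reply above.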


