Hi,
one of the major selling points of HDFS is (was?) that a Hadoop job can be
scheduled close to the data it operates on. I am not using HDFS, but I was
wondering if/how Mesos supports scheduling a job onto a machine that already
has a certain file/dataset locally, as opposed to a machine that would have
to access it over the network or download it to local disk first.
I was wondering whether Mesos attributes could be used for this: I could
define an attribute `datasets` of type `set`, so that node A has {dataset1,
dataset17, dataset3} and node B has {dataset17, dataset5}, and during
scheduling I could decide based on this attribute where to run a task.
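To make this concrete, here is a rough sketch of the framework-side check I
have in mind, using the Python bindings (mesos.interface). The attribute name
`datasets` and the shape of `pending_tasks` are just placeholders, and I am
not sure whether a set-typed attribute can actually be configured on an
agent, so the sketch also accepts a comma-separated text attribute:

    from mesos.interface import Scheduler, mesos_pb2

    def offered_datasets(offer):
        # Collect dataset names advertised by a `datasets` attribute, if present.
        for attr in offer.attributes:
            if attr.name != "datasets":
                continue
            if attr.type == mesos_pb2.Value.SET:
                return set(attr.set.item)
            if attr.type == mesos_pb2.Value.TEXT:
                return set(attr.text.value.split(","))
        return set()

    class LocalityScheduler(Scheduler):
        def __init__(self, pending_tasks):
            # pending_tasks: list of (task_id, required_dataset) pairs (placeholder).
            self.pending_tasks = pending_tasks

        def resourceOffers(self, driver, offers):
            for offer in offers:
                local = offered_datasets(offer)
                # Only consider tasks whose dataset is already on this node.
                runnable = [t for t in self.pending_tasks if t[1] in local]
                if not runnable:
                    driver.declineOffer(offer.id)
                    continue
                # ... build TaskInfos for `runnable` and launch them on this offer ...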
However, I was wondering whether dynamic changes to such attributes are
possible. Imagine that node A deletes dataset17 from its local cache and
downloads dataset5 instead; I would then like to update the `datasets`
attribute dynamically, without affecting the jobs that are already running
on node A. Is such a thing possible?
Is there an approach other than attributes to describe the data that
resides on a node in order to achieve data locality?
Thanks
Tobias