I would like to know if Spark has any facility by which particular tasks
can be scheduled to run on chosen nodes.
The use case: we have a large custom-format database. It is partitioned
and the segments are stored on local SSD on multiple nodes. Incoming
queries are matched against the database: either each key is sent to the
node holding the matching segment, or a batch is broadcast to all nodes
simultaneously, each node filters the batch against its own segment, and
the partial results are merged at the end.
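To make the two dispatch strategies concrete, here is a plain-Python sketch (no Spark involved); the segment contents, partition map, and function names below are all made-up illustrations, and in the real system each lookup would of course run on the node holding that segment:

```python
# Hypothetical layout: node -> the segment stored on that node's local SSD.
SEGMENTS = {
    "node-a": {"k1": 1, "k3": 3},
    "node-b": {"k2": 2, "k4": 4},
}
# Which node owns each key's segment (made-up example data).
PARTITION_MAP = {"k1": "node-a", "k3": "node-a", "k2": "node-b", "k4": "node-b"}

def route_per_key(queries):
    """Strategy 1: send each key only to the node owning its segment."""
    results = {}
    for key in queries:
        node = PARTITION_MAP[key]          # pick the one correct node
        if key in SEGMENTS[node]:          # lookup runs on that node
            results[key] = SEGMENTS[node][key]
    return results

def scatter_gather(queries):
    """Strategy 2: broadcast the whole batch to every node; each node
    filters the batch against its own segment; merge the partial results."""
    results = {}
    for node, segment in SEGMENTS.items():  # every node sees all queries
        hits = {k: segment[k] for k in queries if k in segment}
        results.update(hits)                # merge step
    return results
```

Both strategies give the same answers; they differ only in how much data moves to which nodes, which is exactly why locality-aware scheduling matters here.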
Currently we are doing this with HTCondor, using a DAG to define the
workflow and requirements expressions to match particular jobs to
particular databases. However, it's coarse-grained and better suited to
batch processing than real-time work, as well as being cumbersome to
define and manage.
I wonder whether Spark would suit this workflow, and if so how?
It seems we would either need to schedule parts of our jobs on the
appropriate nodes, which I can't see how to do:
http://spark.apache.org/docs/latest/job-scheduling.html
Or possibly we could define our partitioned database as a custom type of
RDD; however, we would then need to define operations which work on two
RDDs simultaneously (i.e. the static database and the incoming set of
queries), which doesn't seem to fit Spark well AFAICS.
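For what it's worth, the two-RDD operation I have in mind is roughly the following, sketched in plain Python rather than Spark (Spark does offer pair-RDD operations like join, and zipPartitions over co-partitioned RDDs, though whether they would give the locality we need is exactly my question); the data and partitioner below are made-up:

```python
NUM_PARTITIONS = 2

def partitioner(key):
    # Both datasets must use the same partitioner so partitions line up.
    return sum(key.encode()) % NUM_PARTITIONS

# Static database: one segment per partition (hypothetical contents).
database = [{} for _ in range(NUM_PARTITIONS)]
for key, value in [("k1", 1), ("k2", 2), ("k3", 3)]:
    database[partitioner(key)][key] = value

def run_batch(queries):
    """Partition the incoming query batch the same way as the database,
    process each (segment, queries) pair independently, then merge --
    the shape of a join/zipPartitions over two co-partitioned datasets."""
    query_parts = [[] for _ in range(NUM_PARTITIONS)]
    for q in queries:
        query_parts[partitioner(q)].append(q)
    merged = {}
    for segment, qs in zip(database, query_parts):
        merged.update({q: segment[q] for q in qs if q in segment})
    return merged
```

The awkward part is that the database side is static and pinned to particular machines, while the query side arrives continuously, which is what makes me unsure this maps onto Spark's model.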
Any other ideas how we could approach this, either with Spark or
suggestions for other frameworks to look at? (We would actually prefer
non-Java frameworks but are happy to look at all options)
Thanks,
Brian Candler.