I would like to know if Spark has any facility by which particular tasks
can be scheduled to run on chosen nodes.
The use case: we have a large custom-format database. It is partitioned
and the segments are stored on local SSD on multiple nodes. Incoming
queries are matched against the database: either each key is sent to the
node holding the matching segment, or a batch is broadcast to all nodes
simultaneously, each node filters the batch against its own segment, and
the partial results are merged at the end.
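To make the two dispatch strategies concrete, here is a plain-Python sketch (no Spark involved); the segment contents, partition map, and function names below are all made-up illustrations, and in the real system each lookup would of course run on the node holding that segment:

```python
# Hypothetical layout: node -> the segment stored on that node's local SSD.
SEGMENTS = {
    "node-a": {"k1": 1, "k3": 3},
    "node-b": {"k2": 2, "k4": 4},
}
# Which node owns each key's segment (made-up example data).
PARTITION_MAP = {"k1": "node-a", "k3": "node-a", "k2": "node-b", "k4": "node-b"}

def route_per_key(queries):
    """Strategy 1: send each key only to the node owning its segment."""
    results = {}
    for key in queries:
        node = PARTITION_MAP[key]          # pick the one correct node
        if key in SEGMENTS[node]:          # lookup runs on that node
            results[key] = SEGMENTS[node][key]
    return results

def scatter_gather(queries):
    """Strategy 2: broadcast the whole batch to every node; each node
    filters the batch against its own segment; merge the partial results."""
    results = {}
    for node, segment in SEGMENTS.items():  # every node sees all queries
        hits = {k: segment[k] for k in queries if k in segment}
        results.update(hits)                # merge step
    return results
```

Both strategies give the same answers; they differ only in how much data moves to which nodes, which is exactly why locality-aware scheduling matters here.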
Currently we are doing this with HTCondor, using a DAG to define the
workflow and requirements expressions to match particular jobs to
particular databases. However, it's coarse-grained and better suited to
batch processing than real-time work, as well as being cumbersome to
define and manage.
I wonder whether Spark would suit this workflow, and if so how?
It seems we would either need to schedule parts of our jobs on the
appropriate nodes, which I can't see how to do:
http://spark.apache.org/docs/latest/job-scheduling.html
Or possibly we could define our partitioned database as a custom type of
RDD; however, we would then need to define operations which work on two
RDDs simultaneously (i.e. the static database and the incoming set of
queries), which doesn't seem to fit Spark well AFAICS.
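For what it's worth, the two-RDD operation I have in mind is roughly the following, sketched in plain Python rather than Spark (Spark does offer pair-RDD operations like join, and zipPartitions over co-partitioned RDDs, though whether they would give the locality we need is exactly my question); the data and partitioner below are made-up:

```python
NUM_PARTITIONS = 2

def partitioner(key):
    # Both datasets must use the same partitioner so partitions line up.
    return sum(key.encode()) % NUM_PARTITIONS

# Static database: one segment per partition (hypothetical contents).
database = [{} for _ in range(NUM_PARTITIONS)]
for key, value in [("k1", 1), ("k2", 2), ("k3", 3)]:
    database[partitioner(key)][key] = value

def run_batch(queries):
    """Partition the incoming query batch the same way as the database,
    process each (segment, queries) pair independently, then merge --
    the shape of a join/zipPartitions over two co-partitioned datasets."""
    query_parts = [[] for _ in range(NUM_PARTITIONS)]
    for q in queries:
        query_parts[partitioner(q)].append(q)
    merged = {}
    for segment, qs in zip(database, query_parts):
        merged.update({q: segment[q] for q in qs if q in segment})
    return merged
```

The awkward part is that the database side is static and pinned to particular machines, while the query side arrives continuously, which is what makes me unsure this maps onto Spark's model.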
Any other ideas how we could approach this, either with Spark or
suggestions for other frameworks to look at? (We would actually prefer
non-Java frameworks but are happy to look at all options)
Thanks,
Brian Candler.