It is a bit hacky but possible. A lot depends on what kinds of queries, etc., you want to run. You could write a data source that reads your data and keeps it partitioned the way you want, then use mapPartitions() to execute your code…
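Something like this minimal sketch, assuming one RDD partition per server. Here parallelize() just stands in for the custom data source, and fetchAndAggregate is a hypothetical helper that would query the server-local store and return only an aggregate, never raw rows:

import org.apache.spark.sql.SparkSession

object PerServerAggregate {
  // Hypothetical helper: in a real deployment this would open a JDBC (or
  // similar) connection to the server-local store and run the aggregate
  // query there, shipping back only the aggregate value.
  def fetchAndAggregate(server: String): Long = 0L

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("per-server-aggregate").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical endpoints; one partition per server so that each task
    // is pinned to exactly one organization's data.
    val servers = Seq("server1.example.org", "server2.example.org")
    val rdd = sc.parallelize(servers, numSlices = servers.size)

    // mapPartitions runs once per partition (here: once per server); each
    // task queries its own server and emits only the aggregate.
    val aggregates = rdd.mapPartitions { iter =>
      iter.map(server => (server, fetchAndAggregate(server)))
    }

    aggregates.collect().foreach { case (s, n) => println(s"$s -> $n") }
    spark.stop()
  }
}

The point of the sketch is that each partition maps to exactly one server, so only aggregates ever travel back to the driver, which seems to match your privacy constraint.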
Mohit Jaggi
Founder, Data Orchard LLC
www.dataorchardllc.com

> On Aug 22, 2016, at 7:59 AM, Larry White <ljw1...@gmail.com> wrote:
>
> Hi,
>
> I have a bit of an unusual use case and would greatly appreciate some
> feedback as to whether it is a good fit for Spark.
>
> I have a network of compute/data servers configured as a tree, as shown
> below:
>
> controller
>   server 1
>   server 2
>   server 3
>   etc.
>
> There are ~20 servers, but the number is increasing to ~100.
>
> Each server contains a different dataset, all in the same format. Each is
> hosted by a different organization, and the data on every individual
> server is unique to that organization.
>
> Data cannot be replicated across servers, using RDDs or any other means,
> for privacy/ownership reasons.
>
> Data cannot be retrieved to the controller, except in aggregate form (as
> the result of a query, for example).
>
> Because of this, there are currently no operations that treat the data as
> if it were a single dataset: we could run a classifier on each site
> individually, but cannot, for legal reasons, pull all the data into a
> single physical dataframe to run the classifier on all of it together.
>
> The servers are located across a wide geographic region (thousands of
> miles).
>
> We would like to send jobs from the controller to be executed in parallel
> on all the servers, and retrieve the results to the controller. The jobs
> would consist of SQL-heavy Java code for 'production' queries, and Python
> or R code for ad-hoc queries and predictive modeling.
>
> Spark seems to have the capability to meet many of the individual
> requirements, but is it a reasonable platform overall for building this
> application?
>
> Thank you very much for your assistance.
>
> Larry