It is a bit hacky but possible. A lot depends on what kind of queries, etc., you 
want to run. You could write a data source that reads your data and keeps it 
partitioned the way you want, then use mapPartitions() to execute your code…
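
A rough sketch of the mapPartitions() shape, in Scala, under some assumptions: 
queryLocalServer() is a hypothetical placeholder for whatever per-server query 
logic you have, and in practice you would also need a custom RDD or data source 
with preferred locations so each task is actually scheduled onto the server that 
hosts its data. Only small per-server aggregates come back to the controller.

import org.apache.spark.sql.SparkSession

case class ServerResult(serverId: String, aggregate: Double)

object PerServerAggregation {
  // Hypothetical stand-in for the per-server query logic; replace with
  // whatever runs the SQL/Java job against the locally hosted dataset.
  def queryLocalServer(serverId: String, sql: String): Double = 0.0

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("per-server-aggregation")
      .getOrCreate()
    val sc = spark.sparkContext

    // One element per server and one partition per element, so that
    // mapPartitions() runs exactly one task per server.
    val servers = Seq("server1", "server2", "server3") // ~20-100 in practice
    val serverRDD = sc.parallelize(servers, numSlices = servers.size)

    val results = serverRDD.mapPartitions { ids =>
      ids.map { id =>
        // Run the query where the data lives and return only the aggregate,
        // never the raw rows.
        ServerResult(id, queryLocalServer(id, "SELECT count(*) FROM events"))
      }
    }.collect() // only the small aggregates travel back to the controller

    results.foreach(println)
    spark.stop()
  }
}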


Mohit Jaggi
Founder,
Data Orchard LLC
www.dataorchardllc.com




> On Aug 22, 2016, at 7:59 AM, Larry White <ljw1...@gmail.com> wrote:
> 
> Hi,
> 
> I have a bit of an unusual use-case and would greatly appreciate some 
> feedback as to whether it is a good fit for Spark.
> 
> I have a network of compute/data servers configured as a tree, as shown below:
> 
>   controller
>     server 1
>     server 2
>     server 3
>     etc.
> There are ~20 servers, but the number is increasing to ~100. 
> 
> Each server contains a different dataset, all in the same format. Each is 
> hosted by a different organization, and the data on every individual server 
> is unique to that organization.
> 
> Data cannot be replicated across servers using RDDs or any other means, for 
> privacy/ownership reasons.
> 
> Data cannot be retrieved to the controller, except in aggregate form, as the 
> result of a query, for example. 
> 
> Because of this, there are currently no operations that treat the data as if 
> it were a single data set: we could run a classifier on each site 
> individually, but cannot, for legal reasons, pull all the data into a single 
> physical dataframe to run the classifier on all of it together. 
> 
> The servers are located across a wide geographic region (1,000s of miles).
> 
> We would like to send jobs from the controller to be executed in parallel on 
> all the servers, and retrieve the results to the controller. The jobs would 
> consist of SQL-heavy Java code for 'production' queries, and Python or R code 
> for ad-hoc queries and predictive modeling. 
> 
> Spark seems to have the capability to meet many of the individual 
> requirements, but is it a reasonable platform overall for building this 
> application?
> 
> Thank you very much for your assistance. 
> 
> Larry 
>  
