Re: Using spark to distribute jobs to standalone servers
imho, you'll need to implement a custom RDD with your locality settings (i.e. a custom implementation of discovering where each partition is located), plus a setting for spark.locality.wait.

On 24 August 2016 at 03:48, Mohit Jaggi wrote:

> It is a bit hacky but possible. A lot depends on what kind of queries etc.
> you want to run. You could write a data source that reads your data and
> keeps it partitioned the way you want, then use mapPartitions() to execute
> your code…
>
> Mohit Jaggi
> Founder, Data Orchard LLC
> www.dataorchardllc.com
>
>> On Aug 22, 2016, at 7:59 AM, Larry White wrote:
>>
>> Hi,
>>
>> I have a bit of an unusual use-case and would *greatly* *appreciate* some
>> feedback as to whether it is a good fit for Spark. [...]
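To make the custom-RDD suggestion concrete, here is a minimal Scala sketch. It assumes one partition per remote server and a hypothetical `ServerClient.query` standing in for whatever query API each site actually exposes; neither is a real Spark or library API.

```scala
import org.apache.spark.{Partition, SparkConf, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// One partition per standalone server.
case class ServerPartition(index: Int, host: String) extends Partition

// Sketch of a custom RDD whose partitions map 1:1 onto the remote servers.
// ServerClient is hypothetical, standing in for each site's query endpoint.
class ServerRDD(sc: SparkContext, hosts: Seq[String])
    extends RDD[String](sc, Nil) {

  override protected def getPartitions: Array[Partition] =
    hosts.zipWithIndex
      .map { case (h, i) => ServerPartition(i, h) }
      .toArray[Partition]

  // The locality hook: tell the scheduler which host holds each partition,
  // so the task travels to the server instead of the data moving.
  override protected def getPreferredLocations(split: Partition): Seq[String] =
    Seq(split.asInstanceOf[ServerPartition].host)

  override def compute(split: Partition, context: TaskContext): Iterator[String] = {
    val p = split.asInstanceOf[ServerPartition]
    ServerClient.query(p.host) // returns only aggregate results, never raw rows
  }
}

// Raise spark.locality.wait so the scheduler keeps waiting for the preferred
// host rather than falling back to running the task on another node.
val conf = new SparkConf()
  .setAppName("distributed-sites")
  .set("spark.locality.wait", "30s")
```

With tasks pinned this way, each partition's computation runs on the server that owns the data, and only its return value travels back to the controller.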
Re: Using spark to distribute jobs to standalone servers
It is a bit hacky but possible. A lot depends on what kind of queries etc. you want to run. You could write a data source that reads your data and keeps it partitioned the way you want, then use mapPartitions() to execute your code…

Mohit Jaggi
Founder, Data Orchard LLC
www.dataorchardllc.com

> On Aug 22, 2016, at 7:59 AM, Larry White wrote:
>
> Hi,
>
> I have a bit of an unusual use-case and would greatly appreciate some
> feedback as to whether it is a good fit for Spark. [...]
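A sketch of the mapPartitions() idea in Scala, assuming `records` is an RDD whose partitions line up one-to-one with the individual servers (the per-site computation shown, a simple count, is only a placeholder for the real query or model-fitting code):

```scala
import org.apache.spark.rdd.RDD

// records: one partition per server, built so that each partition is
// computed on the server that owns the data.
def perSiteCounts(records: RDD[Map[String, Any]]): Array[(Int, Long)] =
  records
    .mapPartitionsWithIndex { (serverIdx, rows) =>
      // This closure runs on the server holding the partition; the raw
      // rows are consumed here and never leave the site.
      Iterator((serverIdx, rows.size.toLong))
    }
    .collect() // only (server index, count) pairs return to the controller
```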
Retrying: Using spark to distribute jobs to standalone servers
(apologies if this appears twice. I sent it 24 hours ago and it hasn't hit the list yet)

Hi,

I have a bit of an unusual use-case and would greatly appreciate some feedback from experienced Sparklers as to whether it is a good fit for Spark.

I have a network of compute/data servers configured as a tree as shown below:

- controller
  - server 1
  - server 2
  - server 3
  - etc.

There are ~20 servers, with the number increasing to near 100.

Each server contains a different dataset, all in the same format. Each is hosted by a different organization, and the data on every individual server is unique to that organization. Essentially, each server hosts a single partition.

Data *cannot* be replicated across servers using RDDs or any other means, for privacy/ownership reasons.

Raw data *cannot* be retrieved to the controller, except in summary form.

We would like to send jobs from the controller to be executed in parallel on all the servers, and retrieve the results to the controller. The jobs would consist of SQL-heavy Java code for 'production' queries, and Python code for ad-hoc queries and predictive modeling.

There are no operations that treat the data as if it were a single data set: we could run a classifier on each site individually, but cannot, for legal reasons, pull all the data into a single *physical* dataframe to run the classifier on all of it together.

The servers are located across a wide geographic region (1,000s of miles).

Spark seems to have the capability to meet many of the individual requirements, but is it a reasonable platform overall for building this application? In particular, I'm wondering about:

1. Possible issues distributing queries to a set of servers that don't constitute a typical Spark cluster.
2. Support for executing jobs written in Java on the remote servers.

Thank you very much for your assistance.

Larry
Using spark to distribute jobs to standalone servers
Hi,

I have a bit of an unusual use-case and would *greatly* *appreciate* some feedback as to whether it is a good fit for Spark.

I have a network of compute/data servers configured as a tree as shown below:

- controller
  - server 1
  - server 2
  - server 3
  - etc.

There are ~20 servers, but the number is increasing to ~100.

Each server contains a different dataset, all in the same format. Each is hosted by a different organization, and the data on every individual server is unique to that organization.

Data *cannot* be replicated across servers using RDDs or any other means, for privacy/ownership reasons.

Data *cannot* be retrieved to the controller, except in aggregate form, as the result of a query, for example.

Because of this, there are currently no operations that treat the data as if it were a single data set: we could run a classifier on each site individually, but cannot, for legal reasons, pull all the data into a single *physical* dataframe to run the classifier on all of it together.

The servers are located across a wide geographic region (1,000s of miles).

We would like to send jobs from the controller to be executed in parallel on all the servers, and retrieve the results to the controller. The jobs would consist of SQL-heavy Java code for 'production' queries, and Python or R code for ad-hoc queries and predictive modeling.

Spark seems to have the capability to meet many of the individual requirements, but is it a reasonable platform overall for building this application?

Thank you very much for your assistance.

Larry
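As an aside on the "aggregate form only" constraint: in Spark SQL terms, the queries permitted to leave a server would look something like the following sketch. The table and column names are invented for illustration and assume a `SparkSession` named `spark`.

```scala
// Illustrative aggregate-only query: grouped counts and averages may leave
// the server, individual rows may not. `local_records`, `category`, and
// `value` are invented names.
val summary = spark.sql(
  """SELECT category, COUNT(*) AS n, AVG(value) AS avg_value
    |FROM local_records
    |GROUP BY category""".stripMargin)
```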