Re: Multithreaded vs Spark Executor

2015-09-11 Thread Richard Eggert
Parallel processing is what Spark was made for. Let it do its job. Spawning
your own threads independently of what Spark is doing seems like asking for
trouble.

I think you can accomplish what you want by taking the Cartesian product of
the data element RDD and the feature list RDD, and then performing the
computation as a map operation that takes each (data element, feature) tuple
as input.
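
Something like the following (untested) sketch, where computeFeature and the
String stand-ins are placeholders for your actual data element and feature
types, and local[*] is just for trying it out:

    import java.util.Arrays;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class CartesianFeatures {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                    .setAppName("cartesian-features")
                    .setMaster("local[*]");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                // Stand-ins for your data: each string is one data element
                // from the stream; each feature name is one of the ~100
                // independent feature computations.
                JavaRDD<String> elements = sc.parallelize(
                        Arrays.asList("record-1", "record-2"));
                JavaRDD<String> features = sc.parallelize(
                        Arrays.asList("featureA", "featureB", "featureC"));

                // Every (element, feature) pair becomes its own task input,
                // so Spark spreads the feature computations across executors
                // instead of a hand-rolled thread pool on one node.
                JavaPairRDD<String, String> pairs = elements.cartesian(features);
                JavaRDD<String> results = pairs.map(
                        pair -> computeFeature(pair._1(), pair._2()));

                results.collect().forEach(System.out::println);
            }
        }

        // Placeholder for the real per-feature computation.
        private static String computeFeature(String element, String feature) {
            return feature + "(" + element + ")";
        }
    }
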

Rich
On Sep 11, 2015 11:07 PM, "Rachana Srivastava" <
rachana.srivast...@markmonitor.com> wrote:

> Hello all,
>
> We are getting a stream of input data from a Kafka queue using the Spark
> Streaming API.  For each data element we want to run parallel threads to
> process a set of feature lists (nearly 100 features or more).  Since
> feature list creation is independent of the others, we would like to
> execute these feature lists in parallel on the input data that we get
> from the Kafka queue.
>
> *Question is:*
>
> 1. Should we write a thread pool and manage the execution of these
> features on different threads in parallel?  Our only concern is data
> locality: because we are confined to the node assigned the input data
> from the Kafka stream, we cannot leverage distributed nodes to process
> these features for a single input record.
>
> 2. Or, since we are using a JavaRDD as the feature list, will the
> execution of these features be managed internally by the Spark
> executors?
>
> Thanks,
>
> Rachana
>


Multithreaded vs Spark Executor

2015-09-11 Thread Rachana Srivastava
Hello all,

We are getting a stream of input data from a Kafka queue using the Spark
Streaming API.  For each data element we want to run parallel threads to
process a set of feature lists (nearly 100 features or more).  Since feature
list creation is independent of the others, we would like to execute these
feature lists in parallel on the input data that we get from the Kafka queue.

Question is:

1. Should we write a thread pool and manage the execution of these features
on different threads in parallel?  Our only concern is data locality: because
we are confined to the node assigned the input data from the Kafka stream, we
cannot leverage distributed nodes to process these features for a single
input record.

2. Or, since we are using a JavaRDD as the feature list, will the execution
of these features be managed internally by the Spark executors?

Thanks,

Rachana