Re: Spark streaming multi-tasking during I/O

2015-08-22 Thread Sateesh Kavuri
Hi Rishitesh,

We are not using any RDDs to parallelize the processing; the entire
algorithm runs on a single core (and in a single thread). The parallelism
is handled at the user level.

The disk I/O can be started in a separate thread, but then the executor will not be
able to take up more jobs, since that is how I believe Spark is designed by
default.
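
For what it's worth, here is a minimal, self-contained sketch of what "starting the I/O in a separate thread" inside a task could look like, assuming a Scala Future-based approach; processRecord, writeToDb and handlePartition are placeholder names, not part of any Spark API, and the executor slot of course stays occupied for the duration of the task:

  import scala.concurrent.{Await, Future}
  import scala.concurrent.duration._
  import scala.concurrent.ExecutionContext.Implicits.global

  // Placeholder work functions standing in for the real algorithm and I/O:
  def processRecord(r: String): String = r.toUpperCase    // CPU-bound step
  def writeToDb(r: String): Unit = Thread.sleep(100)      // blocking DB/disk I/O

  // Intended to be called from e.g. rdd.foreachPartition(handlePartition _):
  def handlePartition(records: Iterator[String]): Unit = {
    val pendingWrites = records.map { record =>
      val result = processRecord(record)  // CPU work runs on the task thread
      Future { writeToDb(result) }        // I/O is pushed to a background thread
    }.toList                              // force the lazy iterator so the Futures start
    // Wait for the outstanding writes before the task is reported as finished.
    pendingWrites.foreach(f => Await.result(f, 5.minutes))
  }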

On Sat, Aug 22, 2015 at 12:51 AM, Rishitesh Mishra rishi80.mis...@gmail.com
 wrote:

 Hi Sateesh,
 It is interesting to know how you determined that the DStream runs on
 a single core. Did you mean the receivers?

 Coming back to your question, could you not start the disk I/O in a separate
 thread, so that the scheduler can go ahead and assign other tasks?
 On 21 Aug 2015 16:06, Sateesh Kavuri sateesh.kav...@gmail.com wrote:

 Hi,

 My scenario goes like this:
 I have an algorithm running in Spark Streaming mode on a 4-core virtual
 machine. The majority of the time, the algorithm does disk I/O and database
 I/O. The question is: during the I/O, when the CPU is not heavily loaded,
 is it possible to run any other task/thread so as to efficiently utilize
 the CPU?

 Note that one DStream of the algorithm runs completely on a single CPU

 Thank you,
 Sateesh




Re: Spark streaming multi-tasking during I/O

2015-08-22 Thread Sateesh Kavuri
Hi Akhil,

Think of the scenario as running a piece of code in plain Java with
multiple threads. Let's say there are 4 threads spawned by a Java process to
handle reading from the database, some processing, and storing back to the
database. In this process, while one thread is performing database I/O, the
CPU can schedule another thread to do the processing, thus using the
resources efficiently.

In the case of Spark, while a node's executor is running the same read from DB =>
process data => store to DB flow, during the read-from-DB and store-to-DB
phases the CPU is not given to other requests in the queue, since the executor
allocates its resources completely to the current ongoing request.

Doesn't the spark.streaming.concurrentJobs flag enable this kind of scenario,
or is there any other way to achieve what I am looking for?

Thanks,
Sateesh
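
For reference, a minimal sketch of how the flag is usually supplied via SparkConf; the master URL, app name and batch interval below are placeholders, and the setting itself is experimental and undocumented:

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  // Allow more than one streaming job (batch output operation) to run at a time.
  val conf = new SparkConf()
    .setMaster("local[4]")                       // placeholder master URL
    .setAppName("io-heavy-streaming-app")        // placeholder app name
    .set("spark.streaming.concurrentJobs", "2")  // default is 1
  val ssc = new StreamingContext(conf, Seconds(10))  // placeholder batch interval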

On Sat, Aug 22, 2015 at 7:26 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:

 Hmm, for a single-core VM you will have to run it in local mode (specifying
 master=local[4]). The flag is available in all versions of Spark, I
 guess.
 On Aug 22, 2015 5:04 AM, Sateesh Kavuri sateesh.kav...@gmail.com
 wrote:

 Thanks, Akhil. Does this mean that the executor running in the VM can
 spawn two concurrent jobs on the same core? If this is the case, this is
 exactly what we are looking for. Also, which version of Spark is this flag available in?

 Thanks,
 Sateesh

 On Sat, Aug 22, 2015 at 1:44 AM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 You can look at spark.streaming.concurrentJobs; by default it runs a
 single job. If you set it to 2, it can run 2 jobs in parallel. It's an
 experimental flag, but go ahead and give it a try.
 On Aug 21, 2015 3:36 AM, Sateesh Kavuri sateesh.kav...@gmail.com
 wrote:

 Hi,

 My scenario goes like this:
 I have an algorithm running in Spark Streaming mode on a 4-core virtual
 machine. The majority of the time, the algorithm does disk I/O and database
 I/O. The question is: during the I/O, when the CPU is not heavily loaded,
 is it possible to run any other task/thread so as to efficiently utilize
 the CPU?

 Note that one DStream of the algorithm runs completely on a single CPU

 Thank you,
 Sateesh





Re: Spark streaming multi-tasking during I/O

2015-08-22 Thread Sateesh Kavuri
Thanks, Akhil. Does this mean that the executor running in the VM can spawn
two concurrent jobs on the same core? If this is the case, this is exactly what we
are looking for. Also, which version of Spark is this flag available in?

Thanks,
Sateesh

On Sat, Aug 22, 2015 at 1:44 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:

 You can look at spark.streaming.concurrentJobs; by default it runs a
 single job. If you set it to 2, it can run 2 jobs in parallel. It's an
 experimental flag, but go ahead and give it a try.
 On Aug 21, 2015 3:36 AM, Sateesh Kavuri sateesh.kav...@gmail.com
 wrote:

 Hi,

 My scenario goes like this:
 I have an algorithm running in Spark Streaming mode on a 4-core virtual
 machine. The majority of the time, the algorithm does disk I/O and database
 I/O. The question is: during the I/O, when the CPU is not heavily loaded,
 is it possible to run any other task/thread so as to efficiently utilize
 the CPU?

 Note that one DStream of the algorithm runs completely on a single CPU

 Thank you,
 Sateesh




Spark streaming multi-tasking during I/O

2015-08-21 Thread Sateesh Kavuri
Hi,

My scenario goes like this:
I have an algorithm running in Spark Streaming mode on a 4-core virtual
machine. The majority of the time, the algorithm does disk I/O and database
I/O. The question is: during the I/O, when the CPU is not heavily loaded,
is it possible to run any other task/thread so as to efficiently utilize
the CPU?

Note that one DStream of the algorithm runs completely on a single CPU

Thank you,
Sateesh


Re: Spark or Storm

2015-06-16 Thread Sateesh Kavuri
Probably overloading the question a bit.

In Storm, Bolts have the functionality of getting triggered on events. Is
that kind of functionality possible with Spark Streaming? During each phase
of the data processing, the transformed data is stored to the database, and
this transformed data should then be sent to a new pipeline for further
processing.

How can this be achieved using Spark?
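
A hedged sketch of one way such a staged, Bolt-like pipeline is often expressed in Spark Streaming, chaining DStream transformations and persisting each stage with foreachRDD; the source, the enrich/saveToDb functions and the filter condition are placeholders, not anything prescribed by the thread:

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  // Placeholder stage logic standing in for the real enrichment/persistence:
  def enrich(e: String): String = e + "|enriched"
  def saveToDb(e: String): Unit = println(s"saved: $e")

  val conf = new SparkConf().setMaster("local[2]").setAppName("staged-pipeline")
  val ssc  = new StreamingContext(conf, Seconds(5))

  val events   = ssc.socketTextStream("localhost", 9999)  // placeholder source
  val enriched = events.map(enrich)                        // stage 1: transform

  // Persist the intermediate result, then keep transforming the same DStream --
  // roughly analogous to chaining Bolts in a Storm topology.
  enriched.foreachRDD(_.foreach(saveToDb))
  val complex = enriched.filter(_.contains("enriched"))    // stage 2: further processing
  complex.foreachRDD(_.foreach(saveToDb))

  ssc.start()
  ssc.awaitTermination()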



On Wed, Jun 17, 2015 at 10:10 AM, Spark Enthusiast sparkenthusi...@yahoo.in
 wrote:

 I have a use case where a stream of incoming events has to be aggregated
 and joined to create complex events. The aggregation will have to happen at
 an interval of 1 minute (or less).

 The pipeline is:
 Upstream services --(send events)--> Kafka --> Event Stream Processor --(enrich event)--> Complex Event Processor --> Elastic Search.

 From what I understand, Storm will make a very good ESP and Spark
 Streaming will make a good CEP.

 But, we are also evaluating Storm with Trident.

 How does Spark Streaming compare with Storm with Trident?

 Sridhar Chellappa







   On Wednesday, 17 June 2015 10:02 AM, ayan guha guha.a...@gmail.com
 wrote:


 I have a similar scenario where we need to bring data from Kinesis to
 HBase. The data velocity is 20k per 10 minutes. A little manipulation of the data
 will be required, but that is regardless of the tool, so we will be writing that
 piece as a Java POJO.
 The whole environment is on AWS. HBase is on a long-running EMR cluster and Kinesis
 is on a separate cluster.
 TIA.
 Best
 Ayan
 On 17 Jun 2015 12:13, Will Briggs wrbri...@gmail.com wrote:

 The programming models for the two frameworks are conceptually rather
 different; I haven't worked with Storm for quite some time, but based on my
 old experience with it, I would equate Spark Streaming more with Storm's
 Trident API, rather than with the raw Bolt API. Even then, there are
 significant differences, but it's a bit closer.

 If you can share your use case, we might be able to provide better
 guidance.

 Regards,
 Will

 On June 16, 2015, at 9:46 PM, asoni.le...@gmail.com wrote:

 Hi All,

 I am evaluating Spark vs. Storm (Spark Streaming) and I am not able to
 see what the equivalent of Storm's Bolt is inside Spark.

 Any help on this will be appreciated.

 Thanks ,
 Ashish








Spark ML decision list

2015-06-04 Thread Sateesh Kavuri
Hi,

I have used the Weka machine learning library to generate a model for my
training set, specifically the PART algorithm (decision lists) from Weka.

Now, I would like to use Spark ML to run the PART algorithm on my training set and
could not find a parallel. Could anyone point out the corresponding
algorithm, or whether it is available in Spark ML at all?

Thanks,
Sateesh


Re: Connection pooling in spark jobs

2015-04-02 Thread Sateesh Kavuri
Right, I am aware of how to use connection pooling with Oracle, but the
specific question is how to use it in the context of Spark job execution.
On 2 Apr 2015 17:41, Ted Yu yuzhih...@gmail.com wrote:

 http://docs.oracle.com/cd/B10500_01/java.920/a96654/connpoca.htm

 The question doesn't seem to be Spark specific, btw




  On Apr 2, 2015, at 4:45 AM, Sateesh Kavuri sateesh.kav...@gmail.com
 wrote:
 
  Hi,
 
  We have a case where we will have to run concurrent jobs (for the same
 algorithm) on different data sets. These jobs can run in parallel, and
 each one of them would be fetching data from the database.
  We would like to optimize the database connections by making use of
 connection pooling. Any suggestions / best-known ways on how to achieve
 this? The database in question is Oracle.
 
  Thanks,
  Sateesh



Connection pooling in spark jobs

2015-04-02 Thread Sateesh Kavuri
Hi,

We have a case where we will have to run concurrent jobs (for the same
algorithm) on different data sets. These jobs can run in parallel, and
each one of them would be fetching data from the database.
We would like to optimize the database connections by making use of
connection pooling. Any suggestions / best-known ways on how to achieve
this? The database in question is Oracle.

Thanks,
Sateesh


Re: Connection pooling in spark jobs

2015-04-02 Thread Sateesh Kavuri
But this basically means that the pool is confined to the job (of a single
app) in question and is not sharable across multiple apps?
The setup we have is a job server (the spark-jobserver) that creates jobs.
Currently, we have each job opening and closing a connection to the
database. What we would like to achieve is for each of the jobs to obtain a
connection from a DB pool.

Any directions on how this can be achieved?

--
Sateesh

On Thu, Apr 2, 2015 at 7:00 PM, Cody Koeninger c...@koeninger.org wrote:

 Connection pools aren't serializable, so you generally need to set them up
 inside of a closure.  Doing that for every item is wasteful, so you
 typically want to use mapPartitions or foreachPartition

 rdd.mapPartitions { part =>
   setupPool
   part.map { ...



 See Design Patterns for using foreachRDD in
 http://spark.apache.org/docs/latest/streaming-programming-guide.html#transformations-on-dstreams
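
 To make the pattern above concrete, here is a hedged, self-contained sketch of the foreachPartition variant; the connection factory, JDBC URL, credentials, table and SQL are placeholders rather than anything prescribed by Spark:

   import java.sql.{Connection, DriverManager}
   import org.apache.spark.rdd.RDD

   // Placeholder connection factory -- in practice this would hand out
   // connections from a real pool rather than opening a raw connection.
   def createConnection(): Connection =
     DriverManager.getConnection("jdbc:oracle:thin:@//dbhost:1521/svc", "user", "pass")

   def saveRdd(rdd: RDD[(Int, String)]): Unit =
     rdd.foreachPartition { rows =>
       val conn = createConnection()            // one connection per partition, not per row
       try {
         val stmt = conn.prepareStatement("INSERT INTO results (id, value) VALUES (?, ?)")
         rows.foreach { case (id, value) =>
           stmt.setInt(1, id)
           stmt.setString(2, value)
           stmt.executeUpdate()
         }
         stmt.close()
       } finally {
         conn.close()                           // release once the partition is done
       }
     }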

 On Thu, Apr 2, 2015 at 7:52 AM, Sateesh Kavuri sateesh.kav...@gmail.com
 wrote:

 Right, I am aware of how to use connection pooling with Oracle, but the
 specific question is how to use it in the context of Spark job execution.
 On 2 Apr 2015 17:41, Ted Yu yuzhih...@gmail.com wrote:

 http://docs.oracle.com/cd/B10500_01/java.920/a96654/connpoca.htm

 The question doesn't seem to be Spark specific, btw




  On Apr 2, 2015, at 4:45 AM, Sateesh Kavuri sateesh.kav...@gmail.com
 wrote:
 
  Hi,
 
  We have a case where we will have to run concurrent jobs (for the same
 algorithm) on different data sets. These jobs can run in parallel, and
 each one of them would be fetching data from the database.
  We would like to optimize the database connections by making use of
 connection pooling. Any suggestions / best-known ways on how to achieve
 this? The database in question is Oracle.
 
  Thanks,
  Sateesh





Re: Connection pooling in spark jobs

2015-04-02 Thread Sateesh Kavuri
Each executor runs for about 5 seconds, during which time the DB connection can
potentially be open. Each executor will have one connection open.
Connection pooling surely has its advantages in performance and in not hitting
the DB server for every open/close. The database in question is not used just
by the Spark jobs; it is shared by other systems, so the Spark
jobs have to be better at managing the resources.

I am not really looking for a DB connection counter (I will let the DB
handle that part), but rather for a pool of connections on the Spark end so
that the connections can be reused across jobs.
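
One way this is commonly approached -- offered here only as a hedged sketch, not something Spark or the spark-jobserver provides out of the box -- is a lazily initialized pool held in a singleton object, so that every task running in the same executor JVM (including tasks from different jobs submitted to the same long-running context) draws from the same pool. The Tomcat JDBC pool is used here as one possible pooling library; the URL and credentials are placeholders:

  import java.sql.Connection
  import org.apache.tomcat.jdbc.pool.{DataSource, PoolProperties}

  // Initialized at most once per executor JVM; later tasks reuse the same pool.
  object ExecutorDbPool {
    lazy val dataSource: DataSource = {
      val props = new PoolProperties()
      props.setUrl("jdbc:oracle:thin:@//dbhost:1521/svc")  // placeholder URL
      props.setUsername("user")                            // placeholder credentials
      props.setPassword("pass")
      props.setMaxActive(10)                               // cap connections per executor
      val ds = new DataSource()
      ds.setPoolProperties(props)
      ds
    }
    def connection(): Connection = dataSource.getConnection
  }

  // Usage inside a task; close() returns the connection to the pool:
  // rdd.foreachPartition { rows =>
  //   val conn = ExecutorDbPool.connection()
  //   try { /* write rows using conn */ } finally { conn.close() }
  // }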


On Fri, Apr 3, 2015 at 10:21 AM, Charles Feduke charles.fed...@gmail.com
wrote:

 How long does each executor keep the connection open for? How many
 connections does each executor open?

 Are you certain that connection pooling is a performant and suitable
 solution? Are you running out of resources on the database server and
 cannot tolerate each executor having a single connection?

 If you need a solution that limits the number of open connections
 [resource starvation on the DB server] I think you'd have to fake it with a
 centralized counter of active connections, and logic within each executor
 that blocks when the counter is at a given threshold. If the counter is not
 at threshold, then an active connection can be created (after incrementing
 the shared counter). You could use something like ZooKeeper to store the
 counter value. This would have the overall effect of decreasing performance
 if your required number of connections outstrips the database's resources.

 On Fri, Apr 3, 2015 at 12:22 AM Sateesh Kavuri sateesh.kav...@gmail.com
 wrote:

 But this basically means that the pool is confined to the job (of a
 single app) in question and is not sharable across multiple apps?
 The setup we have is a job server (the spark-jobserver) that creates
 jobs. Currently, we have each job opening and closing a connection to the
 database. What we would like to achieve is for each of the jobs to obtain a
 connection from a DB pool.

 Any directions on how this can be achieved?

 --
 Sateesh

 On Thu, Apr 2, 2015 at 7:00 PM, Cody Koeninger c...@koeninger.org
 wrote:

 Connection pools aren't serializable, so you generally need to set them
 up inside of a closure.  Doing that for every item is wasteful, so you
 typically want to use mapPartitions or foreachPartition

 rdd.mapPartitions { part =>
   setupPool
   part.map { ...



 See Design Patterns for using foreachRDD in
 http://spark.apache.org/docs/latest/streaming-programming-guide.html#transformations-on-dstreams

 On Thu, Apr 2, 2015 at 7:52 AM, Sateesh Kavuri sateesh.kav...@gmail.com
  wrote:

 Right, I am aware of how to use connection pooling with Oracle, but the
 specific question is how to use it in the context of Spark job execution.
 On 2 Apr 2015 17:41, Ted Yu yuzhih...@gmail.com wrote:

 http://docs.oracle.com/cd/B10500_01/java.920/a96654/connpoca.htm

 The question doesn't seem to be Spark specific, btw




  On Apr 2, 2015, at 4:45 AM, Sateesh Kavuri sateesh.kav...@gmail.com
 wrote:
 
  Hi,
 
  We have a case where we will have to run concurrent jobs (for the same
 algorithm) on different data sets. These jobs can run in parallel, and
 each one of them would be fetching data from the database.
  We would like to optimize the database connections by making use of
 connection pooling. Any suggestions / best-known ways on how to achieve
 this? The database in question is Oracle.
 
  Thanks,
  Sateesh