Re: Lifecycle of RDD in spark-streaming

2014-11-27 Thread Gerard Maas
Hi TD,

We also struggled with this error for a long while. The recurring scenario
is when the job takes longer to compute than the job interval and a backlog
starts to pile up.
Hint: check whether the DStream storage level is set to MEMORY_ONLY_SER. If
it is and memory runs out, you will get a 'Cannot compute split: Missing
block ...' error.

What I don't know at the moment is whether the new data is dropped or the
LRU policy evicts older data in favor of the incoming data. In any case,
the DStream processing still thinks the data is there at the moment the job
is scheduled to run, and the job fails.

In our case, changing the storage level to MEMORY_AND_DISK_SER solved the
problem, and our streaming job can get through tough times without issues.
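A minimal sketch of that change, assuming the receiver-based Kafka API from
Spark 1.x (the ZooKeeper quorum, consumer group, and topic below are
placeholders, not values from our setup): the storage level is passed
explicitly when the stream is created.

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(new SparkConf().setAppName("stream"), Seconds(60))

val stream = KafkaUtils.createStream(
  ssc,
  "zkhost:2181",                     // ZooKeeper quorum (placeholder)
  "example-group",                   // consumer group id (placeholder)
  Map("events" -> 1),                // topic -> number of receiver threads
  StorageLevel.MEMORY_AND_DISK_SER)  // spill to disk instead of losing blocks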

Regularly checking 'scheduling delay' and 'total delay' on the Streaming
tab in the UI is a must. (And soon we will have those on the metrics report
as well!! :-) )
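If you'd rather watch those delays programmatically than in the UI, a
StreamingListener is one option. A sketch, assuming 'ssc' is your
StreamingContext:

import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// Print the delays mentioned above after every completed batch.
ssc.addStreamingListener(new StreamingListener {
  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
    val info = batch.batchInfo
    println(s"scheduling delay: ${info.schedulingDelay.getOrElse(-1L)} ms, " +
      s"total delay: ${info.totalDelay.getOrElse(-1L)} ms")
  }
})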

-kr, Gerard.



Re: Lifecycle of RDD in spark-streaming

2014-11-27 Thread Bill Jay
Gerard,

That is a good observation. However, the strange thing I see is that if I
use MEMORY_AND_DISK_SER, the job fails even earlier. In my case, it takes
10 seconds to process each batch's data, and the batch interval is one
minute. It still fails after 10 hours with the 'cannot compute split' error.

Bill


Re: Lifecycle of RDD in spark-streaming

2014-11-27 Thread Tathagata Das
If it regularly fails after 8 hours, could you get me the log4j logs?
To limit their size, set the default log level to WARN and the level for
all classes in the package o.a.s.streaming to DEBUG. Then I can take a look.
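For reference, one way to set those levels (a sketch using the log4j 1.x
API that Spark 1.x bundles; an equivalent log4j.properties entry works just
as well):

import org.apache.log4j.{Level, Logger}

Logger.getRootLogger.setLevel(Level.WARN)                              // default level: WARN
Logger.getLogger("org.apache.spark.streaming").setLevel(Level.DEBUG)   // streaming classes: DEBUG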

Re: Lifecycle of RDD in spark-streaming

2014-11-27 Thread Harihar Nahak
When new data comes in on a stream, Spark uses its streaming classes to
convert it into an RDD, and as you mention this is followed by
transformations and finally actions. As far as I have experienced, all RDDs
remain in memory as long as the application is alive and the user does not
destroy them.




-- 
Regards,
Harihar Nahak
BigData Developer
Wynyard
Email: hna...@wynyardgroup.com | Extn: 8019





Re: Lifecycle of RDD in spark-streaming

2014-11-26 Thread lalit1303
Hi Mukesh,

Once you create a streaming job, a DAG is created which contains your job
plan, i.e. all the map transformations and all the action operations to be
performed on each batch of the streaming application.

So, once your job is started, the input DStream takes its data from the
specified source, and all the transformations/actions are performed
according to the DAG that was created. Once all the operations on the
DStream are performed, the DStream is destroyed in LRU fashion.




-
Lalit Yadav
la...@sigmoidanalytics.com




Re: Lifecycle of RDD in spark-streaming

2014-11-26 Thread tian zhang
I have found a paper that seems to answer most of the questions about RDD
lifetime:
https://www.cs.berkeley.edu/~matei/papers/2012/hotcloud_spark_streaming.pdf

Tian 


Re: Lifecycle of RDD in spark-streaming

2014-11-26 Thread Bill Jay
Just to add one more point: if Spark Streaming knows when an RDD will not
be used any more, I believe Spark will not try to retrieve data it no
longer needs. However, in practice I often encounter the 'cannot compute
split' error. Based on my understanding, this is because Spark cleared out
data that would still be used. In my case, the data volume (30 MB/s, with a
60-second batch size) is much smaller than the memory (20 GB per executor).
If Spark only kept RDDs that are in use, I would not expect this error to
happen.

Bill

On Wed, Nov 26, 2014 at 4:02 PM, Tathagata Das tathagata.das1...@gmail.com
wrote:

 Let me further clarify Lalit's point on when RDDs generated by
 DStreams are destroyed, and hopefully that will answer your original
 questions.

 1. How does Spark (Streaming) guarantee that all the actions are taken
 on each input RDD/batch?
 This isn't hard! By the time you call streamingContext.start(), you
 have already set up the output operations (foreachRDD, saveAs***Files,
 etc.) that you want to do with the DStream. There are RDD actions
 inside the DStream output operations that need to be done every batch
 interval. So all the system does is this: after every batch interval,
 it puts all the output operations (which will call RDD actions) in a
 job queue, and then keeps executing stuff in the queue. If there is
 any failure in running the jobs, the streaming context will stop.

 2. How does Spark determine that the life-cycle of an RDD is complete?
 Is there any chance that an RDD will be cleaned out of RAM before all
 actions are taken on it?
 Spark Streaming knows when all the processing related to batch T has
 been completed. It also keeps track of how far back it needs to
 remember and keep previous RDDs around in the cache, based on what
 DStream operations have been done. For example, if you are using a
 1-minute window, the system knows that it needs to keep around at
 least the last 1 minute of data in memory. Accordingly, it cleans up
 the input data (actively unpersisted) and cached RDDs (simply
 dereferenced from DStream metadata, and then Spark unpersists them as
 the RDD objects get garbage-collected by the JVM).

 TD
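To make the window case above concrete, an illustrative sketch ('lines' is
an assumed DStream[String]; the 1-minute window and 10-second slide are
arbitrary choices):

import org.apache.spark.streaming.{Minutes, Seconds}
import org.apache.spark.streaming.StreamingContext._  // pair-DStream operations in Spark 1.x

val counts = lines
  .flatMap(_.split(" "))
  .map(word => (word, 1L))
  // A 1-minute window sliding every 10 seconds: the last minute of input
  // batches must stay cached until no window needs them any more.
  .reduceByKeyAndWindow((a: Long, b: Long) => a + b, Minutes(1), Seconds(10))
counts.print()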





Re: Lifecycle of RDD in spark-streaming

2014-11-26 Thread Tathagata Das
Can you elaborate on the usage pattern that led to 'cannot compute split'?
Are you using the RDDs generated by a DStream outside the DStream logic?
Something like running interactive Spark jobs (independent of the Spark
Streaming ones) on RDDs generated by DStreams? If that is the case, what is
happening is that Spark Streaming is not aware that some of the RDDs (and
the raw input data they will need) will be used by Spark jobs unrelated to
Spark Streaming. Hence Spark Streaming will actively clear out the raw
data, leading to failures in the unrelated Spark jobs that use that data.

In case this is your use case, the cleanest way to solve it is to ask Spark
Streaming to remember stuff for longer, using
streamingContext.remember(duration). This ensures that Spark Streaming
keeps everything around for at least that duration.
Hope this helps.

TD
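A minimal sketch of that call ('ssc' is assumed to be the application's
StreamingContext, and the 5-minute duration is an arbitrary example):

import org.apache.spark.streaming.Minutes

ssc.remember(Minutes(5))  // keep generated RDDs for at least 5 minutes
ssc.start()
ssc.awaitTermination()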




Re: Lifecycle of RDD in spark-streaming

2014-11-26 Thread Bill Jay
Hi TD,

I am using Spark Streaming to consume data from Kafka, do some aggregation,
and ingest the results into RDS. I do use foreachRDD in the program. I am
planning to use Spark Streaming in our production pipeline, and it performs
well in generating the results. Unfortunately, we plan to run the
production pipeline 24/7, and the Spark Streaming job usually fails after
8-20 hours due to the exception 'cannot compute split'. In other cases, the
Kafka receiver fails and the program runs without producing any results.

In my pipeline, the batch size is 1 minute and the data volume per minute
from Kafka is 3 GB. I have been struggling with this issue for more than a
month. It would be great if you could provide some solutions for this.
Thanks!

Bill
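For illustration, a hedged sketch of the pattern Bill describes,
aggregating each batch and writing it out via foreachRDD (the JDBC
endpoint, table, and the 'aggregated' DStream[(String, Long)] are
placeholders, not details from his pipeline):

import java.sql.DriverManager

aggregated.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    // One connection per partition, created on the executor: connections
    // are not serializable, so they cannot be created on the driver.
    val conn = DriverManager.getConnection("jdbc:mysql://rds-host:3306/db", "user", "pass")
    try {
      val stmt = conn.prepareStatement("INSERT INTO counts (k, c) VALUES (?, ?)")
      partition.foreach { case (key, count) =>
        stmt.setString(1, key)
        stmt.setLong(2, count)
        stmt.executeUpdate()
      }
      stmt.close()
    } finally {
      conn.close()
    }
  }
}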



Lifecycle of RDD in spark-streaming

2014-11-25 Thread Mukesh Jha
Hey Experts,

I wanted to understand in detail about the lifecycle of rdd(s) in a
streaming app.

From my current understanding:
- An RDD gets created out of the realtime input stream.
- Transformation functions are applied in a lazy fashion on the RDD to
transform it into other RDDs.
- Actions are taken on the final transformed RDDs to get the data out of
the system.

Also, RDDs are stored in the cluster's RAM (and on disk, if configured so)
and are cleaned in LRU fashion.
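A minimal runnable sketch of that lifecycle (the socket source and the
10-second batch interval are arbitrary choices for illustration):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object LifecycleSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("rdd-lifecycle-sketch"), Seconds(10))

    val lines = ssc.socketTextStream("localhost", 9999)  // one RDD per 10s batch
    val words = lines.flatMap(_.split(" "))              // lazy transformation
    words.count().print()                                // output action, run every batch

    ssc.start()
    ssc.awaitTermination()
  }
}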

So I have the following questions on the same:
- How does Spark (Streaming) guarantee that all the actions are taken on
each input RDD/batch?
- How does Spark determine that the life-cycle of an RDD is complete? Is
there any chance that an RDD will be cleaned out of RAM before all actions
are taken on it?

Thanks in advance for all your help. Also, I'm relatively new to Scala &
Spark, so pardon me in case these are naive questions/assumptions.

-- 
Thanks & Regards,

*Mukesh Jha me.mukesh@gmail.com*


Re: Lifecycle of RDD in spark-streaming

2014-11-25 Thread Mukesh Jha
Any pointers guys?



-- 


Thanks & Regards,

*Mukesh Jha me.mukesh@gmail.com*