Re: [Spark Streaming 1.4.0] SPARK-5063, Checkpointing and queuestream

2015-06-23 Thread Tathagata Das
queue stream does not support driver checkpoint recovery since the RDDs in the queue are arbitrary generated by the user and its hard for Spark Streaming to keep track of the data in the RDDs (thats necessary for recovering from checkpoint). Anyways queue stream is meant of testing and

Re: [Spark Streaming 1.4.0] SPARK-5063, Checkpointing and queuestream

2015-06-22 Thread Shaanan Cohney
Thanks, I've updated my code to use updateStateByKey but am still getting these errors when I resume from a checkpoint. One thought of mine was that I used sc.parallelize to generate the RDDs for the queue, but perhaps on resume, it doesn't recreate the context needed? -- Shaanan Cohney PhD

Re: [Spark Streaming 1.4.0] SPARK-5063, Checkpointing and queuestream

2015-06-22 Thread Benjamin Fradet
Where does task_batches come from? On 22 Jun 2015 4:48 pm, Shaanan Cohney shaan...@gmail.com wrote: Thanks, I've updated my code to use updateStateByKey but am still getting these errors when I resume from a checkpoint. One thought of mine was that I used sc.parallelize to generate the RDDs

Re: [Spark Streaming 1.4.0] SPARK-5063, Checkpointing and queuestream

2015-06-22 Thread Shaanan Cohney
It's a generated set of shell commands to run (written in C, highly optimized numerical computer), which is create from a set of user provided parameters. The snippet above is: task_outfiles_to_cmds = OrderedDict(run_sieving.leftover_tasks)

Re: [Spark Streaming 1.4.0] SPARK-5063, Checkpointing and queuestream

2015-06-22 Thread Shaanan Cohney
Counts is a list (counts = []) in the driver, used to collect the results. It seems like it's also not the best way to be doing things, but I'm new to spark and editing someone else's code so still learning. Thanks! def update_state(out_files, counts, curr_rdd): try: for c in

[Spark Streaming 1.4.0] SPARK-5063, Checkpointing and queuestream

2015-06-22 Thread Shaanan Cohney
I'm receiving the SPARK-5063 error (RDD transformations and actions can only be invoked by the driver, not inside of other transformations) whenever I try and restore from a checkpoint in spark streaming on my app. I'm using python3 and my RDDs are inside a queuestream DStream. This is the

Re: [Spark Streaming 1.4.0] SPARK-5063, Checkpointing and queuestream

2015-06-22 Thread Benjamin Fradet
What does counts refer to? Could you also paste the code of your update_state function? On 22 Jun 2015 12:48 pm, Shaanan Cohney shaan...@gmail.com wrote: I'm receiving the SPARK-5063 error (RDD transformations and actions can only be invoked by the driver, not inside of other transformations)

Re: [Spark Streaming 1.4.0] SPARK-5063, Checkpointing and queuestream

2015-06-22 Thread Benjamin Fradet
I would suggest you have a look at the updateStateByKey transformation in the Spark Streaming programming guide which should fit your needs better than your update_state function. On 22 Jun 2015 1:03 pm, Shaanan Cohney shaan...@gmail.com wrote: Counts is a list (counts = []) in the driver, used