got it. seems like i'd better stay away from this feature for now.
On Wed, Mar 5, 2014 at 5:55 PM, Mayur Rustagi <mayur.rust...@gmail.com> wrote:

> One issue is that job cancellation is posted on the event loop, so it is
> possible that subsequent jobs submitted to the job queue beat the job
> cancellation event, and hence the cancellation event may end up closing
> them too. So there is definitely a race condition you are risking, even
> if you are not running into it.
>
> Mayur Rustagi
> Ph: +1 (760) 203 3257
> http://www.sigmoidanalytics.com
> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>
>
> On Wed, Mar 5, 2014 at 2:40 PM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> SparkContext.cancelJobGroup
>>
>>
>> On Wed, Mar 5, 2014 at 5:32 PM, Mayur Rustagi <mayur.rust...@gmail.com> wrote:
>>
>>> How do you cancel the job? Which API do you use?
>>>
>>>
>>> On Wed, Mar 5, 2014 at 2:29 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>>
>>>> i also noticed that jobs (with a new JobGroupId) which i run after this
>>>> and which use the same RDDs get very confused. i see lots of cancelled
>>>> stages and retries that go on forever.
>>>>
>>>>
>>>> On Tue, Mar 4, 2014 at 5:02 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>>>
>>>>> i have a running job that i cancel while keeping the spark context
>>>>> alive.
>>>>>
>>>>> at the time of cancellation the active stage is 14.
>>>>>
>>>>> i see in the logs:
>>>>> 2014/03/04 16:43:19 INFO scheduler.DAGScheduler: Asked to cancel job group 3a25db23-2e39-4497-b7ab-b26b2a976f9c
>>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 10
>>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 14
>>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Stage 14 was cancelled
>>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Remove TaskSet 14.0 from pool x
>>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 13
>>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 12
>>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 11
>>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 15
>>>>>
>>>>> so far it all looks good. then i get a lot of messages like this:
>>>>> 2014/03/04 16:43:20 INFO scheduler.TaskSchedulerImpl: Ignoring update with state FINISHED from TID 883 because its task set is gone
>>>>> 2014/03/04 16:43:24 INFO scheduler.TaskSchedulerImpl: Ignoring update with state KILLED from TID 888 because its task set is gone
>>>>>
>>>>> after this, stage 14 hangs around in active stages without any sign of
>>>>> progress or cancellation. it just sits there forever, stuck. looking at
>>>>> the logs of the executors confirms this: the tasks seem to be still
>>>>> running, but nothing is happening. for example (by the time i look at
>>>>> this it is 4:58, so this task hasn't done anything in 15 mins):
>>>>>
>>>>> 14/03/04 16:43:16 INFO Executor: Serialized size of result for 943 is 1007
>>>>> 14/03/04 16:43:16 INFO Executor: Sending result for 943 directly to driver
>>>>> 14/03/04 16:43:16 INFO Executor: Finished task ID 943
>>>>> 14/03/04 16:43:16 INFO Executor: Serialized size of result for 945 is 1007
>>>>> 14/03/04 16:43:16 INFO Executor: Sending result for 945 directly to driver
>>>>> 14/03/04 16:43:16 INFO Executor: Finished task ID 945
>>>>> 14/03/04 16:43:19 INFO BlockManager: Removing RDD 66
>>>>>
>>>>> not sure what to make of this. any suggestions? best, koert
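The cancellation API under discussion is used roughly as below. This is a minimal sketch assuming a Spark 0.9-era setup; the app name, group id, RDD, and timings are all illustrative. setJobGroup sets a thread-local property, so it must be called on the thread that submits the job, while cancelJobGroup can be called from any other thread:

import org.apache.spark.{SparkConf, SparkContext}

object CancelDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("cancel-demo").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // run the job in a separate thread so the main thread stays free to cancel it
    val worker = new Thread {
      override def run(): Unit = {
        // the group id is a thread-local property, so set it on the submitting thread
        sc.setJobGroup("my-group", "long running job")
        try {
          sc.parallelize(1 to 1000000, 100).map { i => Thread.sleep(1); i }.count()
        } catch {
          // a cancelled action fails with an exception in the submitting thread
          case e: Exception => println("job ended: " + e.getMessage)
        }
      }
    }
    worker.start()

    Thread.sleep(5000)            // let a few stages get going
    sc.cancelJobGroup("my-group") // asks the scheduler to cancel all active jobs in the group
    worker.join()
    sc.stop()
  }
}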
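And a sketch of the workaround Mayur's explanation suggests: since cancelJobGroup only posts a cancellation event that is handled asynchronously, a follow-up job can reach the scheduler before the cancellation is processed. Tagging follow-up work with a fresh group id means a late cancellation event cannot match it. The helper below is hypothetical, the sleep is a crude guard rather than a guarantee, and per Koert's report above this would not by itself fix the confusion seen when new jobs reuse the same RDDs:

import java.util.UUID
import org.apache.spark.SparkContext

def cancelAndResubmit(sc: SparkContext, oldGroupId: String): Long = {
  // only *posts* a cancellation event on the scheduler's event loop
  sc.cancelJobGroup(oldGroupId)

  // crude guard: give the event loop a moment to process the cancellation
  Thread.sleep(1000)

  // fresh group id: even if the old cancellation is still in flight,
  // it cannot match jobs tagged with this new id
  val newGroupId = UUID.randomUUID().toString
  sc.setJobGroup(newGroupId, "resubmitted after cancellation")
  sc.parallelize(1 to 1000, 10).count() // the follow-up job
}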