Re: [Discussion] Race on heartbeat update in checkpoint thread

2021-08-18 Thread Andrey Gura
Alexey, Ilya,

I took a look at the problem and corresponding fixes. It seems that
you are both right. But Alexey's fix looks like some kind of hack to
me.

We have two problems:

1. Heartbeat update from thread that will complete a future.

I agree with Ilya and Andrey. Only a critical worker itself can update
heartbeat. So, removal of `res.listen(fut ->
heartbeatUpdater.updateHeartbeat());` from
CheckpointContextImpl#executor() method is a good idea. It will avoid
a race and makes Alexey's fix redundant.

2. blockingSection in checkpointer thread really can never detect the
lack of pending tasks progress.

Alexey is right. It is a problem. But what I actually see:
checkpointWritePageThreads and checkpointCollectInfoThreads which
execute pending tasks, are actually critical themselves and should be
also monitored by WorkersRegistry. In such a case the checkpointer
thread can safely use blocking sections at places designated by Ilya
and forementioned threads should also use blocking sections on IO
operations.

So proposal:

- Remove Alexey's fix
- Remove asynchronous heartbeat update
- Add blocking sections (fix of Ilya)
- Monitor checkpointWritePageThreads and checkpointCollectInfoThreads
if they exist (it depends on configuration)

WDYT?

On Thu, Jul 29, 2021 at 5:48 AM Ilya Kazakov  wrote:
>
> Guys. I have discussed just these changes. Look at my PR, please. What do
> you think?
>
> https://issues.apache.org/jira/browse/IGNITE-15192
> https://github.com/apache/ignite/pull/9280/files
>
> чт, 22 июл. 2021 г. в 22:30, Andrey Mashenkov :
>
> > Hi guys,
> >
> > I think updateHeartBeat() method was misused in the future listener and
> > this must be fixed.
> >
> > Actually, GridWorker.heartbeatTs Javadoc says that field is updated by the
> > worker itself.
> > It is consistent with WorkProgressDispatcher.updateHearbeat() javadoc,
> > which said "Notifying dispatcher that work is in progress and thread didn't
> > freeze."
> > GridWorker.heartbeatTs field is marked volatile just to provide consistent
> > visibility guarantee to a watcher.
> >
> > So, CheckpointPagesWriter violates this contract, when runs in another
> > thread(s).
> > But it works fine just because the main (Checkpointer) thread waits for
> > all the Writers to finish their work before it can fall into the blocking
> > section, therefore there is no race possible.
> >
> > As for the broken case, there are 2 possible solutions,
> > 1. Main thread must wait for all children's tasks to finish before going
> > into the blocking section.
> > 2. Or make updateHeartbeat consistent with the blockingSectionBegin/End.
> > Seems, there is no need to mark the method as synchronized, but use call
> > AtomicLongFieldUpdater.getAndUpdate(this, v -> v == Long.MAX_VALUE ? v :
> > U.currentTimeMillis()).
> >
> > I'd suggest
> > * Remove the "thread" mentioning from WorkProgressDispatcher interface to
> > avoid confusion with the current usage and make WorkProgressDispatcher more
> > general.
> > * Avoid unsafe heartbeatUpdater leak to the outside via wrapping
> > CheckpointContextImpl.heartbeatUpdater instance to perform consistent
> > heartbeat update from p.2 above.
> >
> > Does it make sense?
> >
> > On Wed, Jul 21, 2021 at 11:12 AM Alex Plehanov 
> > wrote:
> >
> >> Ilya,
> >>
> >> Race only affects the future listener (which only updates heartbeat), but
> >> not checkpoint listeners, so no other bugs possible caused by this race
> >> except wrong heartbeat update.
> >>
> >> As I've written before, I think it's not a good idea to wait for other
> >> threads by the critical thread in the blocking section. If these threads
> >> are not monitored we can never detect lack of progress and never trigger
> >> the failure handler. Perhaps a good solution will be to register async
> >> threads as critical threads (additionally to waiting for these threads in
> >> the blocing section) and update their own heartbeat. But the current fix
> >> is
> >> required too, to avoid such problems for other critical threads in the
> >> future.
> >>
> >> ср, 21 июл. 2021 г. в 06:29, Ilya Kazakov :
> >>
> >> > 1.
> >> > I mean calling listeners in CheckpointWorkflow.markCheckpointBegin():
> >> >
> >> > This
> >> > --
> >> > for (CheckpointListener lsnr : dbLsnrs)
> >> > lsnr.beforeCheckpointBegin(ctx0);
> >> >
> >> > ctx0.awaitPendingTasksFinished();
> >> > --
> >> >
> >> > And this:
> >> >
> >> > --
> >> > // Listeners must be invoked before we write checkpoint record to WAL.
> >> > for (CheckpointListener lsnr : dbLsnrs)
> >> > lsnr.onMarkCheckpointBegin(ctx0);
> >> >
> >> > ctx0.awaitPendingTasksFinished();
> >> > --
> >> >
> >> > Inside lsnr.beforeCheckpointBegin(ctx0) and
> >> > lsnr.onMarkCheckpointBegin(ctx0) we call
> >> CheckpointContextImpl.executor()
> >> > which submit heartbeat update tasks in 

Re: [Discussion] Race on heartbeat update in checkpoint thread

2021-07-28 Thread Ilya Kazakov
Guys. I have discussed just these changes. Look at my PR, please. What do
you think?

https://issues.apache.org/jira/browse/IGNITE-15192
https://github.com/apache/ignite/pull/9280/files

чт, 22 июл. 2021 г. в 22:30, Andrey Mashenkov :

> Hi guys,
>
> I think updateHeartBeat() method was misused in the future listener and
> this must be fixed.
>
> Actually, GridWorker.heartbeatTs Javadoc says that field is updated by the
> worker itself.
> It is consistent with WorkProgressDispatcher.updateHearbeat() javadoc,
> which said "Notifying dispatcher that work is in progress and thread didn't
> freeze."
> GridWorker.heartbeatTs field is marked volatile just to provide consistent
> visibility guarantee to a watcher.
>
> So, CheckpointPagesWriter violates this contract, when runs in another
> thread(s).
> But it works fine just because the main (Checkpointer) thread waits for
> all the Writers to finish their work before it can fall into the blocking
> section, therefore there is no race possible.
>
> As for the broken case, there are 2 possible solutions,
> 1. Main thread must wait for all children's tasks to finish before going
> into the blocking section.
> 2. Or make updateHeartbeat consistent with the blockingSectionBegin/End.
> Seems, there is no need to mark the method as synchronized, but use call
> AtomicLongFieldUpdater.getAndUpdate(this, v -> v == Long.MAX_VALUE ? v :
> U.currentTimeMillis()).
>
> I'd suggest
> * Remove the "thread" mentioning from WorkProgressDispatcher interface to
> avoid confusion with the current usage and make WorkProgressDispatcher more
> general.
> * Avoid unsafe heartbeatUpdater leak to the outside via wrapping
> CheckpointContextImpl.heartbeatUpdater instance to perform consistent
> heartbeat update from p.2 above.
>
> Does it make sense?
>
> On Wed, Jul 21, 2021 at 11:12 AM Alex Plehanov 
> wrote:
>
>> Ilya,
>>
>> Race only affects the future listener (which only updates heartbeat), but
>> not checkpoint listeners, so no other bugs possible caused by this race
>> except wrong heartbeat update.
>>
>> As I've written before, I think it's not a good idea to wait for other
>> threads by the critical thread in the blocking section. If these threads
>> are not monitored we can never detect lack of progress and never trigger
>> the failure handler. Perhaps a good solution will be to register async
>> threads as critical threads (additionally to waiting for these threads in
>> the blocing section) and update their own heartbeat. But the current fix
>> is
>> required too, to avoid such problems for other critical threads in the
>> future.
>>
>> ср, 21 июл. 2021 г. в 06:29, Ilya Kazakov :
>>
>> > 1.
>> > I mean calling listeners in CheckpointWorkflow.markCheckpointBegin():
>> >
>> > This
>> > --
>> > for (CheckpointListener lsnr : dbLsnrs)
>> > lsnr.beforeCheckpointBegin(ctx0);
>> >
>> > ctx0.awaitPendingTasksFinished();
>> > --
>> >
>> > And this:
>> >
>> > --
>> > // Listeners must be invoked before we write checkpoint record to WAL.
>> > for (CheckpointListener lsnr : dbLsnrs)
>> > lsnr.onMarkCheckpointBegin(ctx0);
>> >
>> > ctx0.awaitPendingTasksFinished();
>> > --
>> >
>> > Inside lsnr.beforeCheckpointBegin(ctx0) and
>> > lsnr.onMarkCheckpointBegin(ctx0) we call
>> CheckpointContextImpl.executor()
>> > which submit heartbeat update tasks in threadpool. But due to a bug in
>> > registration, ctx0.awaitPendingTasksFinished() do not work correctly.
>> > Checpoint thread does not wait for all tasks to complete and moves on.
>> >
>> > This could lead to other bugs because as written in the comment "//
>> > Listeners must be invoked before we write checkpoint record to WAL."
>> >
>> > 2.
>> > About the fix.
>> > Yes, the fix resolves the issue, but does not resolve the root cause - a
>> > race between checkpoint thread and threads run in asyncRunner. Also, as
>> I
>> > understand, there should be no attempts to update heartbeat inside
>> > blockingSection, but the fix does not exclude such attempts but blocks
>> them.
>> >
>> > 3.
>> > But my main point is that it looks strange to update the heartbeat of
>> > thread A from thread B. It's like doing artificial respiration and chest
>> > compressions. Thread A is waiting on async tasks completion, but these
>> > tasks are updating progress of thread A. I suppose that blockingSection
>> was
>> > designed for such situations when the thread is waiting for something
>> and
>> > does not perform any progress.
>> >
>> > вт, 20 июл. 2021 г. в 21:43, Ivan Daschinsky :
>> >
>> >> +1 For current fix. Code is clean and understandable. I suppose that
>> the
>> >> current fix is a correct variant to update heartbeatTs.
>> >>
>> >> вт, 20 июл. 2021 г. в 16:13, Alex Plehanov :
>> >>
>> >> > Hello, Ilya
>> >> >
>> >> > > But anyway, I propose to remove the update of the heartbeat from
>> 

Re: [Discussion] Race on heartbeat update in checkpoint thread

2021-07-22 Thread Andrey Mashenkov
Hi guys,

I think updateHeartBeat() method was misused in the future listener and
this must be fixed.

Actually, GridWorker.heartbeatTs Javadoc says that field is updated by the
worker itself.
It is consistent with WorkProgressDispatcher.updateHearbeat() javadoc,
which said "Notifying dispatcher that work is in progress and thread didn't
freeze."
GridWorker.heartbeatTs field is marked volatile just to provide consistent
visibility guarantee to a watcher.

So, CheckpointPagesWriter violates this contract, when runs in another
thread(s).
But it works fine just because the main (Checkpointer) thread waits for all
the Writers to finish their work before it can fall into the blocking
section, therefore there is no race possible.

As for the broken case, there are 2 possible solutions,
1. Main thread must wait for all children's tasks to finish before going
into the blocking section.
2. Or make updateHeartbeat consistent with the blockingSectionBegin/End.
Seems, there is no need to mark the method as synchronized, but use call
AtomicLongFieldUpdater.getAndUpdate(this, v -> v == Long.MAX_VALUE ? v :
U.currentTimeMillis()).

I'd suggest
* Remove the "thread" mentioning from WorkProgressDispatcher interface to
avoid confusion with the current usage and make WorkProgressDispatcher more
general.
* Avoid unsafe heartbeatUpdater leak to the outside via wrapping
CheckpointContextImpl.heartbeatUpdater instance to perform consistent
heartbeat update from p.2 above.

Does it make sense?

On Wed, Jul 21, 2021 at 11:12 AM Alex Plehanov 
wrote:

> Ilya,
>
> Race only affects the future listener (which only updates heartbeat), but
> not checkpoint listeners, so no other bugs possible caused by this race
> except wrong heartbeat update.
>
> As I've written before, I think it's not a good idea to wait for other
> threads by the critical thread in the blocking section. If these threads
> are not monitored we can never detect lack of progress and never trigger
> the failure handler. Perhaps a good solution will be to register async
> threads as critical threads (additionally to waiting for these threads in
> the blocing section) and update their own heartbeat. But the current fix is
> required too, to avoid such problems for other critical threads in the
> future.
>
> ср, 21 июл. 2021 г. в 06:29, Ilya Kazakov :
>
> > 1.
> > I mean calling listeners in CheckpointWorkflow.markCheckpointBegin():
> >
> > This
> > --
> > for (CheckpointListener lsnr : dbLsnrs)
> > lsnr.beforeCheckpointBegin(ctx0);
> >
> > ctx0.awaitPendingTasksFinished();
> > --
> >
> > And this:
> >
> > --
> > // Listeners must be invoked before we write checkpoint record to WAL.
> > for (CheckpointListener lsnr : dbLsnrs)
> > lsnr.onMarkCheckpointBegin(ctx0);
> >
> > ctx0.awaitPendingTasksFinished();
> > --
> >
> > Inside lsnr.beforeCheckpointBegin(ctx0) and
> > lsnr.onMarkCheckpointBegin(ctx0) we call CheckpointContextImpl.executor()
> > which submit heartbeat update tasks in threadpool. But due to a bug in
> > registration, ctx0.awaitPendingTasksFinished() do not work correctly.
> > Checpoint thread does not wait for all tasks to complete and moves on.
> >
> > This could lead to other bugs because as written in the comment "//
> > Listeners must be invoked before we write checkpoint record to WAL."
> >
> > 2.
> > About the fix.
> > Yes, the fix resolves the issue, but does not resolve the root cause - a
> > race between checkpoint thread and threads run in asyncRunner. Also, as I
> > understand, there should be no attempts to update heartbeat inside
> > blockingSection, but the fix does not exclude such attempts but blocks
> them.
> >
> > 3.
> > But my main point is that it looks strange to update the heartbeat of
> > thread A from thread B. It's like doing artificial respiration and chest
> > compressions. Thread A is waiting on async tasks completion, but these
> > tasks are updating progress of thread A. I suppose that blockingSection
> was
> > designed for such situations when the thread is waiting for something and
> > does not perform any progress.
> >
> > вт, 20 июл. 2021 г. в 21:43, Ivan Daschinsky :
> >
> >> +1 For current fix. Code is clean and understandable. I suppose that the
> >> current fix is a correct variant to update heartbeatTs.
> >>
> >> вт, 20 июл. 2021 г. в 16:13, Alex Plehanov :
> >>
> >> > Hello, Ilya
> >> >
> >> > > But anyway, I propose to remove the update of the heartbeat from
> other
> >> > threads altogether and wrap the call to listeners in a
> blockingSection.
> >> > I don't quite understand your proposal. Which call to listeners do you
> >> > mean? If we wrap the listener into the blocking section the result
> will
> >> be
> >> > the same.
> >> > Alternatively, I think we can wrap awaitPendingTasksFinished into the
> >> > blocking section, this will also solve the 

Re: [Discussion] Race on heartbeat update in checkpoint thread

2021-07-21 Thread Alex Plehanov
Ilya,

Race only affects the future listener (which only updates heartbeat), but
not checkpoint listeners, so no other bugs possible caused by this race
except wrong heartbeat update.

As I've written before, I think it's not a good idea to wait for other
threads by the critical thread in the blocking section. If these threads
are not monitored we can never detect lack of progress and never trigger
the failure handler. Perhaps a good solution will be to register async
threads as critical threads (additionally to waiting for these threads in
the blocing section) and update their own heartbeat. But the current fix is
required too, to avoid such problems for other critical threads in the
future.

ср, 21 июл. 2021 г. в 06:29, Ilya Kazakov :

> 1.
> I mean calling listeners in CheckpointWorkflow.markCheckpointBegin():
>
> This
> --
> for (CheckpointListener lsnr : dbLsnrs)
> lsnr.beforeCheckpointBegin(ctx0);
>
> ctx0.awaitPendingTasksFinished();
> --
>
> And this:
>
> --
> // Listeners must be invoked before we write checkpoint record to WAL.
> for (CheckpointListener lsnr : dbLsnrs)
> lsnr.onMarkCheckpointBegin(ctx0);
>
> ctx0.awaitPendingTasksFinished();
> --
>
> Inside lsnr.beforeCheckpointBegin(ctx0) and
> lsnr.onMarkCheckpointBegin(ctx0) we call CheckpointContextImpl.executor()
> which submit heartbeat update tasks in threadpool. But due to a bug in
> registration, ctx0.awaitPendingTasksFinished() do not work correctly.
> Checpoint thread does not wait for all tasks to complete and moves on.
>
> This could lead to other bugs because as written in the comment "//
> Listeners must be invoked before we write checkpoint record to WAL."
>
> 2.
> About the fix.
> Yes, the fix resolves the issue, but does not resolve the root cause - a
> race between checkpoint thread and threads run in asyncRunner. Also, as I
> understand, there should be no attempts to update heartbeat inside
> blockingSection, but the fix does not exclude such attempts but blocks them.
>
> 3.
> But my main point is that it looks strange to update the heartbeat of
> thread A from thread B. It's like doing artificial respiration and chest
> compressions. Thread A is waiting on async tasks completion, but these
> tasks are updating progress of thread A. I suppose that blockingSection was
> designed for such situations when the thread is waiting for something and
> does not perform any progress.
>
> вт, 20 июл. 2021 г. в 21:43, Ivan Daschinsky :
>
>> +1 For current fix. Code is clean and understandable. I suppose that the
>> current fix is a correct variant to update heartbeatTs.
>>
>> вт, 20 июл. 2021 г. в 16:13, Alex Plehanov :
>>
>> > Hello, Ilya
>> >
>> > > But anyway, I propose to remove the update of the heartbeat from other
>> > threads altogether and wrap the call to listeners in a blockingSection.
>> > I don't quite understand your proposal. Which call to listeners do you
>> > mean? If we wrap the listener into the blocking section the result will
>> be
>> > the same.
>> > Alternatively, I think we can wrap awaitPendingTasksFinished into the
>> > blocking section, this will also solve the problem, but in this case we
>> can
>> > never detect lack of progress by async executor threads and checkpointer
>> > thread hangs due to this reason.
>> > What's wrong with the current fix? It solves the current problem and I
>> hope
>> > will protect us from problems like this in the future.
>> >
>> > вт, 20 июл. 2021 г. в 15:28, Ilya Kazakov :
>> >
>> > > Hi Igniters, hi Alexey.
>> > >
>> > > I want to discuss this issue:
>> > > https://issues.apache.org/jira/browse/IGNITE-15099. I have caught it
>> > too.
>> > >
>> > > I was able to determine where there is a race.
>> > >
>> > > The update of the heartbeat happens asynchronously into the listener
>> > code.
>> > > But we always wait in the checkpoint thread for all pending async
>> > > tasks. And this is reasonable.
>> > >
>> > > for (CheckpointListener lsnr : dbLsnrs)
>> > >   lsnr.beforeCheckpointBegin(ctx0);
>> > >
>> > > ctx0.awaitPendingTasksFinished();
>> > >
>> > > The race was because of inappropriate order of future registration. In
>> > > CheckpointContextImpl.executor () (inside listeners execution)
>> > >
>> > > GridFutureAdapter res = new GridFutureAdapter<>();
>> > > res.listen(fut -> heartbeatUpdater.updateHeartbeat());
>> > > asyncRunner.execute(U.wrapIgniteFuture(cmd, res));
>> > > pendingTaskFuture.add(res);
>> > >
>> > > Here we create a task, submit a task to the executor, and only after
>> this
>> > > do we register the task. Thus we got a situation where checkpointer
>> > thread
>> > > was moving on after ctx0.awaitPendingTasksFinished(); and still,
>> > > the unregistered asyncRunner task was moving on in parallel.
>> > >
>> > > But anyway, I propose to remove the update of the heartbeat from other
>> > > 

Re: [Discussion] Race on heartbeat update in checkpoint thread

2021-07-20 Thread Ilya Kazakov
1.
I mean calling listeners in CheckpointWorkflow.markCheckpointBegin():

This
--
for (CheckpointListener lsnr : dbLsnrs)
lsnr.beforeCheckpointBegin(ctx0);

ctx0.awaitPendingTasksFinished();
--

And this:

--
// Listeners must be invoked before we write checkpoint record to WAL.
for (CheckpointListener lsnr : dbLsnrs)
lsnr.onMarkCheckpointBegin(ctx0);

ctx0.awaitPendingTasksFinished();
--

Inside lsnr.beforeCheckpointBegin(ctx0) and lsnr.onMarkCheckpointBegin(ctx0)
we call CheckpointContextImpl.executor() which submit heartbeat update
tasks in threadpool. But due to a bug in registration,
ctx0.awaitPendingTasksFinished() do not work correctly. Checpoint thread
does not wait for all tasks to complete and moves on.

This could lead to other bugs because as written in the comment "//
Listeners must be invoked before we write checkpoint record to WAL."

2.
About the fix.
Yes, the fix resolves the issue, but does not resolve the root cause - a
race between checkpoint thread and threads run in asyncRunner. Also, as I
understand, there should be no attempts to update heartbeat inside
blockingSection, but the fix does not exclude such attempts but blocks them.

3.
But my main point is that it looks strange to update the heartbeat of
thread A from thread B. It's like doing artificial respiration and chest
compressions. Thread A is waiting on async tasks completion, but these
tasks are updating progress of thread A. I suppose that blockingSection was
designed for such situations when the thread is waiting for something and
does not perform any progress.

вт, 20 июл. 2021 г. в 21:43, Ivan Daschinsky :

> +1 For current fix. Code is clean and understandable. I suppose that the
> current fix is a correct variant to update heartbeatTs.
>
> вт, 20 июл. 2021 г. в 16:13, Alex Plehanov :
>
> > Hello, Ilya
> >
> > > But anyway, I propose to remove the update of the heartbeat from other
> > threads altogether and wrap the call to listeners in a blockingSection.
> > I don't quite understand your proposal. Which call to listeners do you
> > mean? If we wrap the listener into the blocking section the result will
> be
> > the same.
> > Alternatively, I think we can wrap awaitPendingTasksFinished into the
> > blocking section, this will also solve the problem, but in this case we
> can
> > never detect lack of progress by async executor threads and checkpointer
> > thread hangs due to this reason.
> > What's wrong with the current fix? It solves the current problem and I
> hope
> > will protect us from problems like this in the future.
> >
> > вт, 20 июл. 2021 г. в 15:28, Ilya Kazakov :
> >
> > > Hi Igniters, hi Alexey.
> > >
> > > I want to discuss this issue:
> > > https://issues.apache.org/jira/browse/IGNITE-15099. I have caught it
> > too.
> > >
> > > I was able to determine where there is a race.
> > >
> > > The update of the heartbeat happens asynchronously into the listener
> > code.
> > > But we always wait in the checkpoint thread for all pending async
> > > tasks. And this is reasonable.
> > >
> > > for (CheckpointListener lsnr : dbLsnrs)
> > >   lsnr.beforeCheckpointBegin(ctx0);
> > >
> > > ctx0.awaitPendingTasksFinished();
> > >
> > > The race was because of inappropriate order of future registration. In
> > > CheckpointContextImpl.executor () (inside listeners execution)
> > >
> > > GridFutureAdapter res = new GridFutureAdapter<>();
> > > res.listen(fut -> heartbeatUpdater.updateHeartbeat());
> > > asyncRunner.execute(U.wrapIgniteFuture(cmd, res));
> > > pendingTaskFuture.add(res);
> > >
> > > Here we create a task, submit a task to the executor, and only after
> this
> > > do we register the task. Thus we got a situation where checkpointer
> > thread
> > > was moving on after ctx0.awaitPendingTasksFinished(); and still,
> > > the unregistered asyncRunner task was moving on in parallel.
> > >
> > > But anyway, I propose to remove the update of the heartbeat from other
> > > threads altogether and wrap the call to listeners in a blockingSection.
> > >
> > > As I understand heartbeat was designed just to indicate self-progress
> by
> > a
> > > worker. If a worker can not indicate self-progress we should wrap such
> > code
> > > into blockingSections. In case of listeners, worker can not indicate
> > > self-progress, thus let's wrap it into blockingSection.
> > >
> > > Guys, what do you think about this?
> > >
> > > -
> > > Ilya Kazakov
> > >
> >
>
>
> --
> Sincerely yours, Ivan Daschinskiy
>


Re: [Discussion] Race on heartbeat update in checkpoint thread

2021-07-20 Thread Ivan Daschinsky
+1 For current fix. Code is clean and understandable. I suppose that the
current fix is a correct variant to update heartbeatTs.

вт, 20 июл. 2021 г. в 16:13, Alex Plehanov :

> Hello, Ilya
>
> > But anyway, I propose to remove the update of the heartbeat from other
> threads altogether and wrap the call to listeners in a blockingSection.
> I don't quite understand your proposal. Which call to listeners do you
> mean? If we wrap the listener into the blocking section the result will be
> the same.
> Alternatively, I think we can wrap awaitPendingTasksFinished into the
> blocking section, this will also solve the problem, but in this case we can
> never detect lack of progress by async executor threads and checkpointer
> thread hangs due to this reason.
> What's wrong with the current fix? It solves the current problem and I hope
> will protect us from problems like this in the future.
>
> вт, 20 июл. 2021 г. в 15:28, Ilya Kazakov :
>
> > Hi Igniters, hi Alexey.
> >
> > I want to discuss this issue:
> > https://issues.apache.org/jira/browse/IGNITE-15099. I have caught it
> too.
> >
> > I was able to determine where there is a race.
> >
> > The update of the heartbeat happens asynchronously into the listener
> code.
> > But we always wait in the checkpoint thread for all pending async
> > tasks. And this is reasonable.
> >
> > for (CheckpointListener lsnr : dbLsnrs)
> >   lsnr.beforeCheckpointBegin(ctx0);
> >
> > ctx0.awaitPendingTasksFinished();
> >
> > The race was because of inappropriate order of future registration. In
> > CheckpointContextImpl.executor () (inside listeners execution)
> >
> > GridFutureAdapter res = new GridFutureAdapter<>();
> > res.listen(fut -> heartbeatUpdater.updateHeartbeat());
> > asyncRunner.execute(U.wrapIgniteFuture(cmd, res));
> > pendingTaskFuture.add(res);
> >
> > Here we create a task, submit a task to the executor, and only after this
> > do we register the task. Thus we got a situation where checkpointer
> thread
> > was moving on after ctx0.awaitPendingTasksFinished(); and still,
> > the unregistered asyncRunner task was moving on in parallel.
> >
> > But anyway, I propose to remove the update of the heartbeat from other
> > threads altogether and wrap the call to listeners in a blockingSection.
> >
> > As I understand heartbeat was designed just to indicate self-progress by
> a
> > worker. If a worker can not indicate self-progress we should wrap such
> code
> > into blockingSections. In case of listeners, worker can not indicate
> > self-progress, thus let's wrap it into blockingSection.
> >
> > Guys, what do you think about this?
> >
> > -
> > Ilya Kazakov
> >
>


-- 
Sincerely yours, Ivan Daschinskiy


Re: [Discussion] Race on heartbeat update in checkpoint thread

2021-07-20 Thread Alex Plehanov
Hello, Ilya

> But anyway, I propose to remove the update of the heartbeat from other
threads altogether and wrap the call to listeners in a blockingSection.
I don't quite understand your proposal. Which call to listeners do you
mean? If we wrap the listener into the blocking section the result will be
the same.
Alternatively, I think we can wrap awaitPendingTasksFinished into the
blocking section, this will also solve the problem, but in this case we can
never detect lack of progress by async executor threads and checkpointer
thread hangs due to this reason.
What's wrong with the current fix? It solves the current problem and I hope
will protect us from problems like this in the future.

вт, 20 июл. 2021 г. в 15:28, Ilya Kazakov :

> Hi Igniters, hi Alexey.
>
> I want to discuss this issue:
> https://issues.apache.org/jira/browse/IGNITE-15099. I have caught it too.
>
> I was able to determine where there is a race.
>
> The update of the heartbeat happens asynchronously into the listener code.
> But we always wait in the checkpoint thread for all pending async
> tasks. And this is reasonable.
>
> for (CheckpointListener lsnr : dbLsnrs)
>   lsnr.beforeCheckpointBegin(ctx0);
>
> ctx0.awaitPendingTasksFinished();
>
> The race was because of inappropriate order of future registration. In
> CheckpointContextImpl.executor () (inside listeners execution)
>
> GridFutureAdapter res = new GridFutureAdapter<>();
> res.listen(fut -> heartbeatUpdater.updateHeartbeat());
> asyncRunner.execute(U.wrapIgniteFuture(cmd, res));
> pendingTaskFuture.add(res);
>
> Here we create a task, submit a task to the executor, and only after this
> do we register the task. Thus we got a situation where checkpointer thread
> was moving on after ctx0.awaitPendingTasksFinished(); and still,
> the unregistered asyncRunner task was moving on in parallel.
>
> But anyway, I propose to remove the update of the heartbeat from other
> threads altogether and wrap the call to listeners in a blockingSection.
>
> As I understand heartbeat was designed just to indicate self-progress by a
> worker. If a worker can not indicate self-progress we should wrap such code
> into blockingSections. In case of listeners, worker can not indicate
> self-progress, thus let's wrap it into blockingSection.
>
> Guys, what do you think about this?
>
> -
> Ilya Kazakov
>


[Discussion] Race on heartbeat update in checkpoint thread

2021-07-20 Thread Ilya Kazakov
Hi Igniters, hi Alexey.

I want to discuss this issue:
https://issues.apache.org/jira/browse/IGNITE-15099. I have caught it too.

I was able to determine where there is a race.

The update of the heartbeat happens asynchronously into the listener code.
But we always wait in the checkpoint thread for all pending async
tasks. And this is reasonable.

for (CheckpointListener lsnr : dbLsnrs)
  lsnr.beforeCheckpointBegin(ctx0);

ctx0.awaitPendingTasksFinished();

The race was because of inappropriate order of future registration. In
CheckpointContextImpl.executor () (inside listeners execution)

GridFutureAdapter res = new GridFutureAdapter<>();
res.listen(fut -> heartbeatUpdater.updateHeartbeat());
asyncRunner.execute(U.wrapIgniteFuture(cmd, res));
pendingTaskFuture.add(res);

Here we create a task, submit a task to the executor, and only after this
do we register the task. Thus we got a situation where checkpointer thread
was moving on after ctx0.awaitPendingTasksFinished(); and still,
the unregistered asyncRunner task was moving on in parallel.

But anyway, I propose to remove the update of the heartbeat from other
threads altogether and wrap the call to listeners in a blockingSection.

As I understand heartbeat was designed just to indicate self-progress by a
worker. If a worker can not indicate self-progress we should wrap such code
into blockingSections. In case of listeners, worker can not indicate
self-progress, thus let's wrap it into blockingSection.

Guys, what do you think about this?

-
Ilya Kazakov