Re: [HelenOS-devel] Deadlock while booting

2017-11-06 Thread Ondřej Hlavatý
Hi,

On 05.11., Jakub Jermář wrote:
> On 10/27/2017 12:08 AM, Ondřej Hlavatý wrote:
> > I started to experience deadlock while booting HelenOS. It does not
> > happen every time, and when I add some debug prints, the deadlock
> > disappears completely.
> > 
> > The issue starts at ps2mouse driver, which adds mouse function from its
> > device_add operation. This remote call goes all the way to fun_online,
> > in which it is holding the writelock (blocking other drivers) and,
> > because the function is exposed, probably waiting inside
> > loc_register_tree_function, respectively in loc_service_register.
> > 
> > Looking at this function, it seems to be very similar to what Jakub
> > Jermar describes at:
> > 
> > http://jakubsuniversalblog.blogspot.cz/2011/09/debugging-file-system-hang-using.html?q=deadlock
> > 
> > As far as I understand the issue, this shall not be the case - this is
> > the sender, not the receiver, and there is no cycle of messages waiting
> > for themselves. But after swapping the order of exch release and waiting
> > for answer, the deadlock no longer occurs.
> > 
> > Can someone please confirm, that the order there is correct?
> 
> I don't think that changing the mutual ordering of loc_exchange_end()
> and async_wait_for() will fix this on its own. See #700 for details.

Everything you wrote there makes sense to me.

> Btw, when you were testing your fix, did you change the order only in
> loc_service_register() or also in other places? I would be actually very
> surprised if this changed anything because in all the deadlocks for
> which we have some data in #700 (i.e. your .svg and my log files), the
> LOC_SERVICE_REGISTER was the second request, not the first. The first
> must have been LOC_SERVICE_ADD_TO_CAT. locsrv did not even start
> processing the second one.

Well, the deadlock occurred very randomly, so it is possible that it
just didn't show up because of timing differences in the preceding
requests. Adding a debug print somewhere completely unrelated often
made the deadlock disappear.

Also, it wasn't clear to me at all why it should fix the deadlock,
which is why I asked instead of sending a patch.

OH

___
HelenOS-devel mailing list
HelenOS-devel@lists.modry.cz
http://lists.modry.cz/listinfo/helenos-devel


Re: [HelenOS-devel] Deadlock while booting

2017-11-05 Thread Jakub Jermář
On 10/27/2017 12:08 AM, Ondřej Hlavatý wrote:
> I started to experience deadlock while booting HelenOS. It does not
> happen every time, and when I add some debug prints, the deadlock
> disappears completely.
> 
> The issue starts at ps2mouse driver, which adds mouse function from its
> device_add operation. This remote call goes all the way to fun_online,
> in which it is holding the writelock (blocking other drivers) and,
> because the function is exposed, probably waiting inside
> loc_register_tree_function, respectively in loc_service_register.
> 
> Looking at this function, it seems to be very similar to what Jakub
> Jermar describes at:
>   
> http://jakubsuniversalblog.blogspot.cz/2011/09/debugging-file-system-hang-using.html?q=deadlock
> 
> As far as I understand the issue, this shall not be the case - this is
> the sender, not the receiver, and there is no cycle of messages waiting
> for themselves. But after swapping the order of exch release and waiting
> for answer, the deadlock no longer occurs.
> 
> Can someone please confirm, that the order there is correct?

I don't think that changing the mutual ordering of loc_exchange_end()
and async_wait_for() will fix this on its own. See #700 for details.

Btw, when you were testing your fix, did you change the order only in
loc_service_register() or also in other places? I would be actually very
surprised if this changed anything because in all the deadlocks for
which we have some data in #700 (i.e. your .svg and my log files), the
LOC_SERVICE_REGISTER was the second request, not the first. The first
must have been LOC_SERVICE_ADD_TO_CAT. locsrv did not even start
processing the second one.
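
To illustrate the point about request ordering, here is a toy model in Python (not HelenOS code). It assumes a server that handles requests strictly one at a time, in arrival order. The function name `run_single_fibril_server` and the `can_complete` set are hypothetical. Why the first handler blocks is not modeled; it is simply given as an input.

```python
from collections import deque


def run_single_fibril_server(requests, can_complete):
    """Process requests one at a time, in arrival order.

    requests: request names in arrival order.
    can_complete: names of requests whose handler is able to finish
        (an assumption fed in by the caller, not derived from anything).
    Returns (answered, stuck, never_started).
    """
    queue = deque(requests)
    answered = []
    while queue:
        current = queue.popleft()
        if current not in can_complete:
            # The only worker is blocked inside this handler, so nothing
            # queued behind it is ever picked up.
            return answered, current, list(queue)
        answered.append(current)
    return answered, None, []


answered, stuck, never_started = run_single_fibril_server(
    ["LOC_SERVICE_ADD_TO_CAT", "LOC_SERVICE_REGISTER"],
    can_complete={"LOC_SERVICE_REGISTER"},
)
print("stuck:", stuck)
print("never started:", never_started)
```

In this model the second request never starts, no matter how its sender orders its own calls, which is why a change confined to the second request would be a surprising fix.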

Jakub





Re: [HelenOS-devel] Deadlock while booting

2017-10-27 Thread Jakub Jermář
On 10/27/2017 12:08 AM, Ondřej Hlavatý wrote:
> I started to experience deadlock while booting HelenOS. It does not
> happen every time, and when I add some debug prints, the deadlock
> disappears completely.
> 
> The issue starts at ps2mouse driver, which adds mouse function from its
> device_add operation. This remote call goes all the way to fun_online,
> in which it is holding the writelock (blocking other drivers) and,
> because the function is exposed, probably waiting inside
> loc_register_tree_function, respectively in loc_service_register.
> 
> Looking at this function, it seems to be very similar to what Jakub
> Jermar describes at:
>   
> http://jakubsuniversalblog.blogspot.cz/2011/09/debugging-file-system-hang-using.html?q=deadlock
> 
> As far as I understand the issue, this shall not be the case - this is
> the sender, not the receiver, and there is no cycle of messages waiting
> for themselves. But after swapping the order of exch release and waiting
> for answer, the deadlock no longer occurs.
> 
> Can someone please confirm, that the order there is correct?

For the record, here is the IRC conversation between Ondrej and me:

 can you see some active calls from locsrv to devman?
 tbf i cannot reproduce it anymore
 but i think the only active calls were 4 to ethip
 from locsrv?
 there was a chain of stuck messages, ending at ns
 but ns wasn't sending anything
 that might be important
 ns was recently rewritten to use async framework
 another interesting thing is that unlike in my blog, the
connection between devman and locsrv uses only one phone
 but I still fail to see anything that would prevent receiving
the answer to LOC_SERVICE_REGISTER forever
 it would be good if you could collect the ipc  for all
interesting task ID's
 I also made the observation that until the LOC_SERVICE_REGISTER
call is answered, locsrv cannot start processing another call
 because there is only one fibril and it is currently busy
processing LOC_SERVICE_REGISTER
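
The "chain of stuck messages" can be examined mechanically once the active calls of each task are known. The following Python sketch is a toy model only: the function `follow_wait_chain` is hypothetical, and the edges in the example are an assumed chain for illustration, not data taken from the actual hang. It follows "waits for an answer from" edges and reports whether the chain closes into a cycle or ends at a task that is not waiting on anyone.

```python
def follow_wait_chain(waits_for, start):
    """Follow 'task -> task it awaits an answer from' edges from start.

    waits_for: dict mapping a task name to the task it is blocked on.
    Returns (chain, verdict), where verdict is "cycle" if a task repeats,
    or "ends at idle task" if the last task waits on nobody.
    """
    chain = [start]
    seen = {start}
    current = start
    while current in waits_for:
        current = waits_for[current]
        chain.append(current)
        if current in seen:
            return chain, "cycle"
        seen.add(current)
    return chain, "ends at idle task"


# Assumed edges, for illustration only.
example = {"ps2mouse": "devman", "devman": "locsrv", "locsrv": "ns"}
chain, verdict = follow_wait_chain(example, "ps2mouse")
print(" -> ".join(chain), "|", verdict)
```

A chain that ends at an idle task, rather than in a cycle, points at a request that was received but never answered, so the last task in the chain is the one worth inspecting.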

J.
