> -----Original Message-----
> From: Ying Xue [mailto:[email protected]]
> Sent: Monday, January 08, 2018 08:42
> To: Jon Maloy <[email protected]>; [email protected]
> Cc: [email protected]
> Subject: Re: [net-next PATCH 1/2] tipc: prevent swork item from queuing
> workqueue after connection is closed
> 
> Very sorry for the late response, Jon.
> 
> On 12/12/2017 12:52 AM, Jon Maloy wrote:
> >>>                 flush_work(&con->swork);
> >>>                 flush_work(&con->rwork);
> >> Sorry, we cannot move the flush_work() calls to tipc_conn_kref_release();
> >> otherwise, the name table soft lockup solved by 9dc3abdd1f7e ("tipc:
> >> fix nametbl_lock soft lockup at module exit") may occur again, because
> >> tipc_conn_kref_release() may be called in tipc_nametbl_withdraw()
> >> context rather than tipc_topsrv_stop() context.
> > It is both possible and necessary, as I see it. The key to achieving this
> > is to move all deletes, i.e., all calls to xxx_put() of any kind, to the
> > receiving context of either work queue. Never in an interrupt context, and
> > never from the event-triggering side of the send work queue.
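
To make this concrete, a minimal sketch of the pattern (the names recv_wq,
rwork and conn_put() are hypothetical here, not a claim about the current
code): the interrupt side only queues work and never calls any xxx_put();
the one final put always runs in the work handler, i.e., in process context.

static void tipc_recv_work(struct work_struct *work)
{
	struct tipc_conn *con = container_of(work, struct tipc_conn, rwork);

	/* ... handle the received data ... */

	conn_put(con);	/* only here: workqueue (process) context */
}

/* Runs in softirq context; it must never drop the last reference. */
static void tipc_data_ready(struct sock *sk)
{
	struct tipc_conn *con = sk->sk_user_data;

	queue_work(recv_wq, &con->rwork);	/* queue only, no put here */
}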
> 
> In fact, if we enforce so many restrictions on where we are allowed to
> release the "con" object by calling con_put(), the kref variable declared
> in the con structure loses its real purpose.

That wouldn't be a bad thing, in my view. But unfortunately we still need it,
since we have numerous subscriptions in both sending and receiving jobs
referring to it.

> 
> The current design model of the TIPC internal topology server just follows
> the design philosophy of many other internal servers implemented in other
> subsystems, such as nfs, drbd, dlm and ocfs2. In other words, the design
> is not a new invention.
> 
> In this design model, any event/connection request is handled
> asynchronously. It means that before we use any object, we have to first
> increase its refcnt, and we must decrease its refcnt when we don't use it
> any more. But no matter which context we are sitting in, we can decrease
> its refcnt anywhere.

That is the problem; we can't.
If we decrease it in interrupt context, which we do, we get the problem I
referred to in my patch ("[net-next v3 04/11] tipc: eliminate struct
tipc_subscriber") from Nov 4th.
This problem is real, and even if others using the same template design do it
like this, it is still not right.
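
For reference, the refcount model we are both describing is the standard
kref pattern; a generic sketch (not the actual TIPC code) of exactly where
the context problem bites:

#include <linux/kref.h>
#include <linux/slab.h>

struct tipc_conn {
	struct kref kref;
	/* ... */
};

static void tipc_conn_kref_release(struct kref *kref)
{
	struct tipc_conn *con = container_of(kref, struct tipc_conn, kref);

	/* Everything here runs in whatever context dropped the LAST
	 * reference.  If that is interrupt/softirq context, any cleanup
	 * that sleeps or takes sleeping locks is a bug. */
	kfree(con);
}

static void conn_put(struct tipc_conn *con)
{
	kref_put(&con->kref, tipc_conn_kref_release);
}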

> 
> Although our situation in the TIPC module might be a bit more complex than
> in other components, I don't think it needs a radical overhaul. In my
> opinion, I cannot find any other potential issue in the topology server
> after the two patches are applied.
> 
> Of course, when the TIPC server was designed, there were two internal
> servers: the topology server and the remote configuration server. That's
> why we abstracted one common server, and the two TIPC internal servers were
> just implemented on top of the private interfaces exposed by the common
> server.

Yes, I was aware of that.

> Now
> the remote configuration server has been deleted, so we can remove some
> redundant code from the TIPC server. But if one day we have a new
> requirement to implement another special internal server, we may have to
> return to the current design.

Personally I would prefer to take the pain of doing that if it so happens,
just to have more comprehensible code right now. I don't foresee any new
servers in TIPC in the near (or even remote) future.

> 
> Lastly, we have solved several critical issues in the TIPC server many
> times, but after my careful analysis, I believe we did not completely
> understand their real root causes in those solutions.
> 
> In your solution, we significantly expand the sk->sk_callback_lock
> protection scope, and we introduce another new lock - sub_lock.
> Meanwhile, the RCU lock becomes almost meaningless. However, although RCU
> locking is a bit more complex than a RW lock, it has two significant
> advantages:
> - Good performance.
> - Almost immune to deadlock risk.
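
For reference, the RCU pattern being defended here looks roughly like this
(generic sketch; the subscriber/subscription field names are hypothetical,
and it assumes a struct rcu_head 'rcu' member in the subscription):

/* Reader side: lockless, cheap, and cannot take part in a deadlock. */
static void sub_deliver(struct tipc_subscriber *s)
{
	struct tipc_subscription *sub;

	rcu_read_lock();
	list_for_each_entry_rcu(sub, &s->sub_list, list) {
		/* ... match and deliver the event ... */
	}
	rcu_read_unlock();
}

/* Updater side: updaters serialize among themselves with a spinlock,
 * but they never block or spin against the readers above. */
static void sub_remove(struct tipc_subscriber *s, struct tipc_subscription *sub)
{
	spin_lock_bh(&s->sub_lock);
	list_del_rcu(&sub->list);
	spin_unlock_bh(&s->sub_lock);
	kfree_rcu(sub, rcu);	/* freed only after current readers finish */
}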

But the event race problem is real too, and has to be solved. It would be nice 
if we could solve it with an rcu lock, but I cannot see how. 
Maybe we can fix it with an atomic counter instead?
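
One hedged sketch of what that atomic counter could look like (the
'inactive' flag and the helper are inventions for illustration, not
existing code):

/* Hypothetical: an atomic_t 'inactive' flag, initialized to 0, added
 * to struct tipc_subscription. */
static bool tipc_sub_try_deactivate(struct tipc_subscription *sub)
{
	/* atomic_xchg() returns the previous value, so only the first
	 * caller sees 0; timeout, cancel and connection shutdown may
	 * race freely, but exactly one of them wins. */
	return atomic_xchg(&sub->inactive, 1) == 0;
}

The winner would then be the only path allowed to generate the final event
and schedule the deletion in work queue context.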
 
> 
> We once spent much time dropping several RW locks designed into TIPC 1.7.7
> and other older versions. But now it seems more and more RW/spin locks are
> being introduced again. This is not a good trend.

True, although this is only around one single flag.

> 
> In all, I still think the current locking policy of the tipc server has no
> problem in principle. So it's better that we follow the current design
> model to improve the TIPC server code.
> 
> If you agree with my idea above, I will merge the two patches into one
> patch and add the suggestions you commented on patch #1,

Not sure which of my comments you are referring to, but please go ahead.

Regards
///jon

> and submit another
> new version of the patch to this mailing list. If the patch is fine, I will
> submit it to net-next as soon as possible. After that, we can continue to
> further simplify the TIPC server code with the patches of your series
> ("tipc: simplify topology server").
> 
> Thanks,
> Ying
> 
> > If you look at my patch #4 (v3) that I sent out two weeks ago, I do exactly
> > this, but for a different reason: because this might happen even at a
> > subscription timeout event, although with less fatal consequences.
> >
> > You don't have to eliminate the tipc_subscriber structure to achieve this,
> > as I did (although I strongly recommend it; but that can be done later).
> > You only need to send some hint along with the subscription event that
> > this is the final event for this subscription, and that it should be
> > deleted, in work queue context, after the event has been posted on the
> > socket (or dropped).
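
A hedged sketch of what such a hint could look like (the 'last' flag and the
two helpers are hypothetical, just to illustrate the idea): the triggering
side only marks the queued event, and the deletion itself happens in the
send work handler once the event has been posted:

struct outqueue_entry {
	struct list_head list;
	struct tipc_event evt;
	bool last;	/* hint: final event, delete the sub after sending */
};

static void tipc_send_work(struct work_struct *work)
{
	struct tipc_conn *con = container_of(work, struct tipc_conn, swork);
	struct outqueue_entry *e;

	while ((e = grab_next_outq_entry(con)) != NULL) {  /* hypothetical */
		send_event_to_sock(con, &e->evt);          /* hypothetical */
		if (e->last)
			tipc_sub_delete(con, &e->evt);  /* workqueue context */
		kfree(e);
	}
}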
> >
> >> Moreover, before a work item is queued into a workqueue, the con refcnt
> >> needs to be held first. So when tipc_conn_kref_release() is called,
> >> it means there is no user left for the con. As a consequence, there is
> >> no work item still queued in the workqueues, so we don't need
> >> to flush the works at all.
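
Spelled out, that invariant is the classic "get before queue" pattern
(generic sketch; the workqueue name is hypothetical): every pending work
item owns one reference, so by the time the refcount can reach zero there
cannot be a queued work item left, and flushing becomes unnecessary.

static void tipc_conn_queue_work(struct tipc_conn *con,
				 struct work_struct *work)
{
	conn_get(con);			/* the ref travels with the item */
	if (!queue_work(tipc_wq, work))	/* false: already queued ... */
		conn_put(con);		/* ... its queuer already holds a ref */
}

/* And every work handler ends with: */
static void tipc_recv_work_fn(struct work_struct *work)
{
	struct tipc_conn *con = container_of(work, struct tipc_conn, rwork);

	/* ... do the actual work ... */

	conn_put(con);			/* drop the work item's reference */
}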
> > Even better. So, we never need to flush any work queues, only the send
> > event queue.
> >
> > Regards
> > ///jon
> >