[tipc-discussion] [net-next PATCH 1/1] tipc: add back tipc prefix to log messages

2017-12-11 Thread John Thompson
The tipc prefix to log messages generated by tipc was
removed in commit 07f6c4bc0 ("tipc: convert tipc reference
table to use generic rhashtable").

This is still a useful prefix and so add it back.

Acked-by: Jon Maloy <jon.ma...@ericsson.com>
Signed-off-by: John Thompson <thompa@gmail.com>
---
 net/tipc/core.c | 2 --
 net/tipc/core.h | 2 ++
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/tipc/core.c b/net/tipc/core.c
index 0b982d048fb9..4cd9e57446ec 100644
--- a/net/tipc/core.c
+++ b/net/tipc/core.c
@@ -34,8 +34,6 @@
  * POSSIBILITY OF SUCH DAMAGE.
  */

-#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
-
 #include "core.h"
 #include "name_table.h"
 #include "subscr.h"
diff --git a/net/tipc/core.h b/net/tipc/core.h
index 964342689f2c..f89f9a3c18c2 100644
--- a/net/tipc/core.h
+++ b/net/tipc/core.h
@@ -37,6 +37,8 @@
 #ifndef _TIPC_CORE_H
 #define _TIPC_CORE_H

+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
 #include 
 #include 
 #include 
-- 
2.15.0
--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
tipc-discussion mailing list
tipc-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/tipc-discussion


[tipc-discussion] [PATCH] tipc: add back tipc prefix to log messages

2017-11-20 Thread John Thompson
The tipc prefix to log messages generated by tipc was
removed in commit 07f6c4bc0 ("tipc: convert tipc reference
table to use generic rhashtable").

This is still a useful prefix and so add it back.

Signed-off-by: John Thompson <thompa@gmail.com>
---
 net/tipc/core.c | 2 --
 net/tipc/core.h | 2 ++
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/tipc/core.c b/net/tipc/core.c
index 0b982d048fb9..4cd9e57446ec 100644
--- a/net/tipc/core.c
+++ b/net/tipc/core.c
@@ -34,8 +34,6 @@
  * POSSIBILITY OF SUCH DAMAGE.
  */

-#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
-
 #include "core.h"
 #include "name_table.h"
 #include "subscr.h"
diff --git a/net/tipc/core.h b/net/tipc/core.h
index 964342689f2c..f89f9a3c18c2 100644
--- a/net/tipc/core.h
+++ b/net/tipc/core.h
@@ -37,6 +37,8 @@
 #ifndef _TIPC_CORE_H
 #define _TIPC_CORE_H

+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
 #include 
 #include 
 #include 
-- 
2.15.0
--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
tipc-discussion mailing list
tipc-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/tipc-discussion


[tipc-discussion] Why was "tipc:" removed from tipc printk messages?

2017-11-20 Thread John Thompson
Hi,

In commit 07f6c4bc0 ("tipc: convert tipc reference table to use generic
rhashtable") the definition of pr_fmt within the tipc area was moved from
core.h to core.c.  This caused most of the printk messages to no longer be
prefixed by "tipc:".
There is no reference to this change in the commit description.

I make use of tipc prefix to redirect the messages to other log that allow
me to piece together with other log outputs what happens when the cluster
separates.  I cannot do this when I cannot redirect the tipc logs.

Was there a particular reason to move the definition of pr_fmt to core.c?

Regards,
John
--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
tipc-discussion mailing list
tipc-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/tipc-discussion


Re: [tipc-discussion] [PATCH v2 net-next 6/6] tipc: delete expired subscription

2017-03-19 Thread John Thompson
On Fri, Mar 17, 2017 at 3:44 PM, John Thompson <thompa@gmail.com> wrote:

>
>
> On Fri, Mar 17, 2017 at 9:35 AM, John Thompson <thompa@gmail.com>
> wrote:
>
>>
>>
>> On Thu, Mar 16, 2017 at 1:45 AM, Parthasarathy Bhuvaragan <
>> parthasarathy.bhuvara...@ericsson.com> wrote:
>>>
>>> Hi John,
>>>
>>> Can you run your test with my series and provide some feedback?
>>>
>> Hi Partha
>> I have started running some tests and will be in touch later today with
>> initial results.
>> Thanks
>> Jt
>>
>
> Hi Partha,
> My testing is looking good.  I have generally seen problems reasonably
> quickly.  This patch series seems good.  I will run a longer run test over
> the weekend to be certain but it should be ok.
>
> Thanks,
> JT
>
Hi Partha,

My testing over the weekend has looked good. I have not seen any soft
lockups or other errors of any form.

JT
--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
tipc-discussion mailing list
tipc-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/tipc-discussion


Re: [tipc-discussion] [PATCH v2 net-next 6/6] tipc: delete expired subscription

2017-03-16 Thread John Thompson
On Fri, Mar 17, 2017 at 9:35 AM, John Thompson <thompa@gmail.com> wrote:

>
>
> On Thu, Mar 16, 2017 at 1:45 AM, Parthasarathy Bhuvaragan <
> parthasarathy.bhuvara...@ericsson.com> wrote:
>>
>> Hi John,
>>
>> Can you run your test with my series and provide some feedback?
>>
> Hi Partha
> I have started running some tests and will be in touch later today with
> initial results.
> Thanks
> Jt
>

Hi Partha,
My testing is looking good.  I have generally seen problems reasonably
quickly.  This patch series seems good.  I will run a longer run test over
the weekend to be certain but it should be ok.

Thanks,
JT
--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
tipc-discussion mailing list
tipc-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/tipc-discussion


Re: [tipc-discussion] [PATCH v2 net-next 6/6] tipc: delete expired subscription

2017-03-16 Thread John Thompson
On Thu, Mar 16, 2017 at 1:45 AM, Parthasarathy Bhuvaragan <
parthasarathy.bhuvara...@ericsson.com> wrote:
>
> Hi John,
>
> Can you run your test with my series and provide some feedback?
>
Hi Partha
I have started running some tests and will be in touch later today with
initial results.
Thanks
Jt

>
> [PATCH v3 net-next 0/3] solve two deadlock issues:
>
>   tipc: advance the time of calling tipc_nametbl_unsubscribe
>   tipc: advance the time of deleting subscription from
> subscriber->subscrp_list
>   tipc: adjust the policy of holding subscription kref
>
> regards
> partha
> --
> *From:* Ying Xue 
> *Sent:* Wednesday, March 15, 2017 12:33 PM
> *To:* Parthasarathy Bhuvaragan; Jon Maloy; thompa@gmail.com
> *Cc:* tipc-discussion@lists.sourceforge.net
> *Subject:* Re: [tipc-discussion] [PATCH v2 net-next 6/6] tipc: delete
> expired subscription
>
> Hi Partha,
>
> Thank you for the review and improvement.
>
> On 03/14/2017 01:43 AM, Parthasarathy Bhuvaragan wrote:
> > Hi Ying,
> >
> > I have a new patch sets which fixes this issue using fixes from your
> > patches. It deviates from your patch the following way:
> > In my solution, the subscription refcount keeps track of a subscription
> > with or without timer. I do not increment refcount for timer, and use
> > the subscriber lock plus the del_timer to find outstanding timers. I
> > will send the series shortly.
>
> I have carefully reviewed your solution as well as your revised patches.
> Your method is pretty simpler than mine. Instead the thing I image is
> too complex. Now I can confirm that it's unnecessary to increase
> subscription refcount before its timer is started, and it's absolutely
> safe for us now. Good work Parth!
>
> >
> > I applied your series and ran some tests with it. If I run test against
> > each patch individually, they seem to introduce new warnings/panics. I
> > think every patch should be as correct as possible. I tried to moved
> > around the patches in this series to get every patch correct, which led
> > me to my series above.
> >
>
> If we can ensure every single patch of a patchset can independently work
> very well, that's very good. But in many cases, it's hard to reach that
> goal. The most reason is that on one hand, we must have patch easily
> readable for reviewer, on another hand, we must keep every single work
> well. So it is sometimes very hard.
>
> Anyway, your revised patches are very good. If Jon or other guys have no
> any different opinion, please submit them to net-next as soon as possible.
>
> Of course, if possible, please let John verify them again.
>
> Thanks,
> Ying
>
>
--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
tipc-discussion mailing list
tipc-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/tipc-discussion


Re: [tipc-discussion] [PATCH v2 net-next 0/6] solve two deadlock issues

2017-03-09 Thread John Thompson
Thanks Ying.
I am away for a week and will run some tests next when back.
Regards
John

On 10/03/2017 3:22 AM, "Ying Xue"  wrote:

> commit d094c4d5f5 ("tipc: add subscription refcount to avoid invalid
> delete") accidently introduce the following deadlock scenarios:
>
>CPU1: CPU2:
> -- 
> tipc_nametbl_publish
> spin_lock_bh(>nametbl_lock)
> tipc_nametbl_insert_publ
> tipc_nameseq_insert_publ
> tipc_subscrp_report_overlap
> tipc_subscrp_get
> tipc_subscrp_send_event
>  tipc_close_conn
>  tipc_subscrb_release_cb
>  tipc_subscrb_delete
>  tipc_subscrp_put
> tipc_subscrp_put
> tipc_subscrp_kref_release
> tipc_nametbl_unsubscribe
> spin_lock_bh(>nametbl_lock)
> <>
>
>CPU1:  CPU2:
> -- 
> tipc_nametbl_stop
> spin_lock_bh(>nametbl_lock)
> tipc_purge_publications
> tipc_nameseq_remove_publ
> tipc_subscrp_report_overlap
> tipc_subscrp_get
> tipc_subscrp_send_event
>  tipc_close_conn
>  tipc_subscrb_release_cb
>  tipc_subscrb_delete
>  tipc_subscrp_put
> tipc_subscrp_put
> tipc_subscrp_kref_release
> tipc_nametbl_unsubscribe
> spin_lock_bh(>nametbl_lock)
> <>
>
> The root cause of two deadlocks is that we have to hold nametbl lock
> when subscription is freed in tipc_subscrp_kref_release(). In order to
> eliminate the need of taking nametbl lock in
> tipc_subscrp_kref_release(), the functions protected by nametbl lock
> in tipc_subscrp_kref_release() are moved to other places step by step
> in the series.
>
> Change log:
> v2:
>  As Parth's comments, subscription is still present name table after
>  it's expired. To fix it, we introduce a workqueue, and when
>  subscription's timer is expired,  the subscription will be pushed to
>  the workqueue through its work. When the work is scheduled, the
>  subscription be deleted finally.
>
> Ying Xue (6):
>   tipc: advance the time of deleting subscription from
> subscriber->subscrp_list
>   tipc: adjust the policy of holding subscription kref
>   tipc: adjust policy that sub->timer holds subscription kref
>   tipc: advance the time of calling tipc_nametbl_unsubscribe
>   tipc: remove unnecessary increasement of subscription refcount
>   tipc: delete expired subscription
>
>  net/tipc/name_table.c |  2 ++
>  net/tipc/server.h |  2 ++
>  net/tipc/subscr.c | 64 ++
> ++---
>  net/tipc/subscr.h |  4 
>  4 files changed, 54 insertions(+), 18 deletions(-)
>
> --
> 2.7.4
>
>
--
Announcing the Oxford Dictionaries API! The API offers world-renowned
dictionary content that is easy and intuitive to access. Sign up for an
account today to start using our lexical data to power your apps and
projects. Get started today and enter our developer competition.
http://sdm.link/oxford
___
tipc-discussion mailing list
tipc-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/tipc-discussion


Re: [tipc-discussion] [net 0/5] solve two deadlock issues

2017-02-22 Thread John Thompson
My overnight testing has not shown any problems with this patch set.
JT

On Thu, Feb 23, 2017 at 1:29 AM, Jon Maloy  wrote:

> Ok. So go for it to net-next.
> Reviewed-by: Jon
> ///jon
>
> > -Original Message-
> > From: Ying Xue [mailto:ying@windriver.com]
> > Sent: Wednesday, February 22, 2017 06:47 AM
> > To: Jon Maloy ; Parthasarathy Bhuvaragan
> > ; thompa@gmail.com
> > Cc: tipc-discussion@lists.sourceforge.net
> > Subject: Re: [tipc-discussion] [net 0/5] solve two deadlock issues
> >
> > At least, net-next tree is still open as David is reviewing patches
> submitted to
> > net-next.
> >
> > Hope we have a window to submit the series to net-next.
> >
> > Thanks,
> > Ying
> >
> > On 02/22/2017 07:42 PM, Xue, Ying wrote:
> > > Hi Jon,
> > >
> > > I understood your concern.
> > >
> > > I have checked the possibility of merging patch #1, #4 and #5 as one.
> > However, just merging the three patch is insufficient, and at least #2
> seems
> > necessary too, otherwise, another deadlock still exists due to two
> premature
> > 'return's in subcsrb_report_overlap(). Even if we merged them as one, it
> will
> > lose my initial purpose of dividing the series as so small patches.
> Although
> > each patch is made a small change, it's often related to a policy
> adjustment of
> > locking or holding refcount. Moreover, as our locking policy associated
> with
> > topserver becomes complex, I want to use the comments in each patch
> > header to record what policy has been adjusted. In the future, the
> > information can guide whether our changes comply with the adjusted policy
> > or not.
> > >
> > > In fact, all changes contained in the series is not big. But if we
> merged them
> > as one, all useful messages will be lost forever.
> > >
> > > Additionally, "net-next" tree reaches 4.10-rc8, and "net" tree is
> 4.10-rc7
> > now. I saw today there was one developer who submitted a patch to net-
> > next and David also accepted it. However, if John's testing proved the
> series
> > is okay tomorrow, probably I can send the series to net-next tomorrow.
> Even
> > for the worst case, we cannot submit the series until net-next is open
> again.
> > But I have checked nobody would maintain 4.10 as a stable version. So
> even
> > if there is a big long time gap, it seems not to cause a series issue.
> > >
> > > Regards,
> > > Ying
> > >
> > > -Original Message-
> > > From: Jon Maloy [mailto:jon.ma...@ericsson.com]
> > > Sent: Tuesday, February 21, 2017 7:12 PM
> > > To: Xue, Ying; Parthasarathy Bhuvaragan; thompa@gmail.com
> > > Cc: tipc-discussion@lists.sourceforge.net
> > > Subject: RE: [net 0/5] solve two deadlock issues
> > >
> > > Hi Ying,
> > > These are good design changes, that definitely should go in asap.
> However,
> > I feel deeply uncomfortable with such a big change going into 'net',
> especially
> > since our previous, exceptionally large, contribution now has turned out
> to
> > have problems. I wonder if we could not get away with something simpler
> > for 'net'.
> > >
> > > Looking closer at your series, it seems to me that only patches ## 1,
> 4, and
> > the lock removal part of #5 are really needed to solve the problem we
> have
> > at hand now. Why not merge those into one patch and deliver this to
> 'net',
> > while reference count redesign parts can go into net-next ?
> > >
> > > Regards
> > > ///jon
> > >
> > >
> > >> -Original Message-
> > >> From: Ying Xue [mailto:ying@windriver.com]
> > >> Sent: Monday, February 20, 2017 06:39 AM
> > >> To: Jon Maloy ; Parthasarathy Bhuvaragan
> > >> ; thompa@gmail.com
> > >> Cc: tipc-discussion@lists.sourceforge.net
> > >> Subject: [net 0/5] solve two deadlock issues
> > >>
> > >> Commit d094c4d5f5 ("tipc: add subscription refcount to avoid invalid
> > >> delete") accidently introduce the following deadlock scenarios:
> > >>
> > >>CPU1: CPU2:
> > >> -- 
> > >> tipc_nametbl_publish
> > >> spin_lock_bh(>nametbl_lock)
> > >> tipc_nametbl_insert_publ
> > >> tipc_nameseq_insert_publ
> > >> tipc_subscrp_report_overlap
> > >> tipc_subscrp_get
> > >> tipc_subscrp_send_event
> > >> tipc_close_conn
> > >> tipc_subscrb_release_cb
> > >> tipc_subscrb_delete
> > >> tipc_subscrp_put
> > >> tipc_subscrp_put
> > >> tipc_subscrp_kref_release
> > >> tipc_nametbl_unsubscribe
> > >> spin_lock_bh(>nametbl_lock)
> > >> <>
> > >>
> > >>CPU1:  CPU2:
> > >> -- 
> > >> tipc_nametbl_stop
> > >> spin_lock_bh(>nametbl_lock)
> > >> tipc_purge_publications
> > >> tipc_nameseq_remove_publ
> > >> tipc_subscrp_report_overlap
> > >> tipc_subscrp_get

Re: [tipc-discussion] [net 0/5] solve two deadlock issues

2017-02-22 Thread John Thompson
I am running a test overnight on the patches as they are and will update
later on.
JT

On Wed, Feb 22, 2017 at 11:56 AM, Jon Maloy <jon.ma...@ericsson.com> wrote:

> Hi John.
>
> Yes you are right. But I still would prefer a condensed patch where we
> don’t touch the refcounts when this goes into ‘net’. I am awaiting a
> comment from Ying.
>
>
>
> ///jon
>
>
>
>
>
> *From:* John Thompson [mailto:thompa@gmail.com]
> *Sent:* Tuesday, February 21, 2017 04:18 PM
> *To:* Jon Maloy <jon.ma...@ericsson.com>
> *Cc:* Ying Xue <ying@windriver.com>; Parthasarathy Bhuvaragan <
> parthasarathy.bhuvara...@ericsson.com>; tipc-discussion@lists.
> sourceforge.net
> *Subject:* Re: [net 0/5] solve two deadlock issues
>
>
>
> Patch #2 removes the tipc_subscrp_get() and _put()
> from tipc_subscrp_report_overlap().
>
> This prevents the problem of the early returns.
>
> JT
>
>
>
>
>
> On Wed, Feb 22, 2017 at 1:42 AM, Jon Maloy <jon.ma...@ericsson.com> wrote:
>
> I don't see that you remove the two premature 'return's in
> subcsrb_report_overlap() in your series. These are also genuine bugs that
> must be fixed.
>
> ///jon
>
>
> > -Original Message-
> > From: Jon Maloy [mailto:jon.ma...@ericsson.com]
> > Sent: Tuesday, February 21, 2017 06:12 AM
> > To: Ying Xue <ying@windriver.com>; Parthasarathy Bhuvaragan
> > <parthasarathy.bhuvara...@ericsson.com>; thompa@gmail.com
> > Cc: tipc-discussion@lists.sourceforge.net
>
> > Subject: Re: [tipc-discussion] [net 0/5] solve two deadlock issues
> >
> > Hi Ying,
> > These are good design changes, that definitely should go in asap.
> However, I
> > feel deeply uncomfortable with such a big change going into 'net',
> especially
> > since our previous, exceptionally large, contribution now has turned out
> to
> > have problems. I wonder if we could not get away with something simpler
> > for 'net'.
> >
> > Looking closer at your series, it seems to me that only patches ## 1, 4,
> and
> > the lock removal part of #5 are really needed to solve the problem we
> have
> > at hand now. Why not merge those into one patch and deliver this to
> 'net',
> > while reference count redesign parts can go into net-next ?
> >
> > Regards
> > ///jon
> >
> >
> > > -Original Message-
> > > From: Ying Xue [mailto:ying@windriver.com]
> > > Sent: Monday, February 20, 2017 06:39 AM
> > > To: Jon Maloy <jon.ma...@ericsson.com>; Parthasarathy Bhuvaragan
> > > <parthasarathy.bhuvara...@ericsson.com>; thompa@gmail.com
> > > Cc: tipc-discussion@lists.sourceforge.net
> > > Subject: [net 0/5] solve two deadlock issues
> > >
> > > Commit d094c4d5f5 ("tipc: add subscription refcount to avoid invalid
> > > delete") accidently introduce the following deadlock scenarios:
> > >
> > >CPU1: CPU2:
> > > -- 
> > > tipc_nametbl_publish
> > > spin_lock_bh(>nametbl_lock)
> > > tipc_nametbl_insert_publ
> > > tipc_nameseq_insert_publ
> > > tipc_subscrp_report_overlap
> > > tipc_subscrp_get
> > > tipc_subscrp_send_event
> > >  tipc_close_conn
> > >  tipc_subscrb_release_cb
> > >  tipc_subscrb_delete
> > >  tipc_subscrp_put
> > > tipc_subscrp_put
> > > tipc_subscrp_kref_release
> > > tipc_nametbl_unsubscribe
> > > spin_lock_bh(>nametbl_lock)
> > > <>
> > >
> > >CPU1:  CPU2:
> > > -- 
> > > tipc_nametbl_stop
> > > spin_lock_bh(>nametbl_lock)
> > > tipc_purge_publications
> > > tipc_nameseq_remove_publ
> > > tipc_subscrp_report_overlap
> > > tipc_subscrp_get
> > > tipc_subscrp_send_event
> > >  tipc_close_conn
> > >  tipc_subscrb_release_cb
> > >  tipc_subscrb_delete
> > >  tipc_subscrp_put
> > > tipc_subscrp_put
> > > tipc_subscrp_kref_release
> > > tipc_nametbl_unsubscribe
> > > spin_lock_bh(>nametbl_lock)
> > > <>
> > >
> > > The root cause of two deadlocks is that we have to hold nametbl lock
> > > w

Re: [tipc-discussion] [net 0/5] solve two deadlock issues

2017-02-21 Thread John Thompson
Patch #2 removes the tipc_subscrp_get() and _put()
from tipc_subscrp_report_overlap().
This prevents the problem of the early returns.
JT


On Wed, Feb 22, 2017 at 1:42 AM, Jon Maloy  wrote:

> I don't see that you remove the two premature 'return's in
> subcsrb_report_overlap() in your series. These are also genuine bugs that
> must be fixed.
>
> ///jon
>
>
> > -Original Message-
> > From: Jon Maloy [mailto:jon.ma...@ericsson.com]
> > Sent: Tuesday, February 21, 2017 06:12 AM
> > To: Ying Xue ; Parthasarathy Bhuvaragan
> > ; thompa@gmail.com
> > Cc: tipc-discussion@lists.sourceforge.net
> > Subject: Re: [tipc-discussion] [net 0/5] solve two deadlock issues
> >
> > Hi Ying,
> > These are good design changes, that definitely should go in asap.
> However, I
> > feel deeply uncomfortable with such a big change going into 'net',
> especially
> > since our previous, exceptionally large, contribution now has turned out
> to
> > have problems. I wonder if we could not get away with something simpler
> > for 'net'.
> >
> > Looking closer at your series, it seems to me that only patches ## 1, 4,
> and
> > the lock removal part of #5 are really needed to solve the problem we
> have
> > at hand now. Why not merge those into one patch and deliver this to
> 'net',
> > while reference count redesign parts can go into net-next ?
> >
> > Regards
> > ///jon
> >
> >
> > > -Original Message-
> > > From: Ying Xue [mailto:ying@windriver.com]
> > > Sent: Monday, February 20, 2017 06:39 AM
> > > To: Jon Maloy ; Parthasarathy Bhuvaragan
> > > ; thompa@gmail.com
> > > Cc: tipc-discussion@lists.sourceforge.net
> > > Subject: [net 0/5] solve two deadlock issues
> > >
> > > Commit d094c4d5f5 ("tipc: add subscription refcount to avoid invalid
> > > delete") accidently introduce the following deadlock scenarios:
> > >
> > >CPU1: CPU2:
> > > -- 
> > > tipc_nametbl_publish
> > > spin_lock_bh(>nametbl_lock)
> > > tipc_nametbl_insert_publ
> > > tipc_nameseq_insert_publ
> > > tipc_subscrp_report_overlap
> > > tipc_subscrp_get
> > > tipc_subscrp_send_event
> > >  tipc_close_conn
> > >  tipc_subscrb_release_cb
> > >  tipc_subscrb_delete
> > >  tipc_subscrp_put
> > > tipc_subscrp_put
> > > tipc_subscrp_kref_release
> > > tipc_nametbl_unsubscribe
> > > spin_lock_bh(>nametbl_lock)
> > > <>
> > >
> > >CPU1:  CPU2:
> > > -- 
> > > tipc_nametbl_stop
> > > spin_lock_bh(>nametbl_lock)
> > > tipc_purge_publications
> > > tipc_nameseq_remove_publ
> > > tipc_subscrp_report_overlap
> > > tipc_subscrp_get
> > > tipc_subscrp_send_event
> > >  tipc_close_conn
> > >  tipc_subscrb_release_cb
> > >  tipc_subscrb_delete
> > >  tipc_subscrp_put
> > > tipc_subscrp_put
> > > tipc_subscrp_kref_release
> > > tipc_nametbl_unsubscribe
> > > spin_lock_bh(>nametbl_lock)
> > > <>
> > >
> > > The root cause of two deadlocks is that we have to hold nametbl lock
> > > when subscription is freed in tipc_subscrp_kref_release(). In order to
> > > eliminate the need of taking nametbl lock in
> > > tipc_subscrp_kref_release(), the functions protected by nametbl lock
> > > in tipc_subscrp_kref_release() are moved to other places step by step
> in
> > the series.
> > >
> > > Ying Xue (5):
> > >   tipc: advance the time of deleting subscription from
> > > subscriber->subscrp_list
> > >   tipc: adjust the policy of holding subscription kref
> > >   tipc: adjust policy that sub->timer holds subscription kref
> > >   tipc: advance the time of calling tipc_nametbl_unsubscribe
> > >   tipc: remove unnecessary increasement of subscription refcount
> > >
> > >  net/tipc/name_table.c |  2 ++
> > >  net/tipc/subscr.c | 32 ++--
> > >  net/tipc/subscr.h |  3 +++
> > >  3 files changed, 19 insertions(+), 18 deletions(-)
> > >
> > > --
> > > 2.7.4
> >
> >
> > 
> --
> > Check out the vibrant tech community on one of the world's most engaging
> > tech sites, SlashDot.org! http://sdm.link/slashdot
> > ___
> > tipc-discussion mailing list
> > tipc-discussion@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/tipc-discussion
>
--
Check out the vibrant tech community on one of the world's most
engaging tech sites, SlashDot.org! http://sdm.link/slashdot
___

Re: [tipc-discussion] [net 5/5] tipc: remove unnecessary increasement of subscription refcount

2017-02-21 Thread John Thompson
Sorry, I was mistaken.  You are right that you have removed the sub->lock
from
tipc_subscrp_kref_release() and so there is no chance of a deadlock.

JT

On Tue, Feb 21, 2017 at 11:22 PM, Ying Xue <ying@windriver.com> wrote:

> On 02/21/2017 06:13 AM, John Thompson wrote:
> >
> >
> > On Tue, Feb 21, 2017 at 12:39 AM, Ying Xue <ying@windriver.com
> > <mailto:ying@windriver.com>> wrote:
> >
> > As the policy of holding subscription in subscriber time has been
> > adjusted, it's unnecessary to hold subscription refcount before
> > tipc_subscrp_delete() is called. As a consequence, subscriber->lock
> > can be safely removed to avoid the following two deadlock scenarios:
> >
> >CPU1: CPU2:
> > -- 
> > tipc_nametbl_publish
> > spin_lock_bh(>nametbl_lock)
> > tipc_nametbl_insert_publ
> > tipc_nameseq_insert_publ
> > tipc_subscrp_report_overlap
> > tipc_subscrp_get
> > tipc_subscrp_send_event
> >  tipc_close_conn
> >  tipc_subscrb_release_cb
> >  tipc_subscrb_delete
> >  tipc_subscrp_put
> > tipc_subscrp_put
> > tipc_subscrp_kref_release
> > tipc_nametbl_unsubscribe
> > spin_lock_bh(>nametbl_lock)
> > <>
> >
> >CPU1:  CPU2:
> > -- 
> > tipc_nametbl_stop
> > spin_lock_bh(>nametbl_lock)
> > tipc_purge_publications
> > tipc_nameseq_remove_publ
> > tipc_subscrp_report_overlap
> > tipc_subscrp_get
> > tipc_subscrp_send_event
> >  tipc_close_conn
> >  tipc_subscrb_release_cb
> >  tipc_subscrb_delete
> >  tipc_subscrp_put
> > tipc_subscrp_put
> > tipc_subscrp_kref_release
> > tipc_nametbl_unsubscribe
> > spin_lock_bh(>nametbl_lock)
> > <>
> >
> > Reported-by: John Thompson <thompa@gmail.com
> > <mailto:thompa@gmail.com>>
> > Reported-by: Jon Maloy <jon.ma...@ericsson.com
> > <mailto:jon.ma...@ericsson.com>>
> > Fixes: d094c4d5f5 ("tipc: add subscription refcount to avoid invalid
> > delete")
> > Signed-off-by: Ying Xue <ying@windriver.com
> > <mailto:ying@windriver.com>>
> > ---
> >  net/tipc/subscr.c | 6 --
> >  1 file changed, 6 deletions(-)
> >
> > diff --git a/net/tipc/subscr.c b/net/tipc/subscr.c
> > index 6624232..ea67cce 100644
> > --- a/net/tipc/subscr.c
> > +++ b/net/tipc/subscr.c
> > @@ -168,9 +168,7 @@ static void tipc_subscrp_kref_release(struct
> > kref *kref)
> > struct tipc_net *tn = net_generic(sub->net, tipc_net_id);
> > struct tipc_subscriber *subscriber = sub->subscriber;
> >
> > -   spin_lock_bh(>lock);
> > atomic_dec(>subscription_count);
> > -   spin_unlock_bh(>lock);
> > kfree(sub);
> > tipc_subscrb_put(subscriber);
> >  }
> > @@ -202,11 +200,7 @@ static void tipc_subscrb_subscrp_delete(struct
> > tipc_subscriber *subscriber,
> > list_del(>subscrp_list);
> >
> > tipc_nametbl_unsubscribe(sub);
> > -   tipc_subscrp_get(sub);
> > -   spin_unlock_bh(>lock);
> >
> >
> > By removing the unlocking at this point couldn't you end up with a
> > deadlock on subscriber->lock
> > due to tipc_subscrp_delete() putting the subscrp twice (which could end
> > up with kref == 0) and
> > as a result calling tipc_subscrp_kref_release() which gets
> subscriber->lock?
> >
>
> I think there is no deadlock risk even if tipc_subscrp_put() will be
> called twice in tipc_subscrp_delete() because
> tipc_subscrp_kref_release() doesn't hold subscriber->lock after the change.
>
> Please take a look at tipc_subscrp_kref_release() code:
> static void tipc_subscrp_kref_release(struct kref *kref)
> {
> struct tipc_subscription *sub = container_of(kref,
>  struct
> tipc

Re: [tipc-discussion] [net 5/5] tipc: remove unnecessary increasement of subscription refcount

2017-02-20 Thread John Thompson
Hi Ying,

Sorry I sent that reply before signing it off.
The rest of the changes look like they are doing the right things but this
change looks like it has the potential to cause a deadlock.

Regards,
John

On Tue, Feb 21, 2017 at 11:13 AM, John Thompson <thompa@gmail.com>
wrote:

>
>
> On Tue, Feb 21, 2017 at 12:39 AM, Ying Xue <ying@windriver.com> wrote:
>
>> As the policy of holding subscription in subscriber time has been
>> adjusted, it's unnecessary to hold subscription refcount before
>> tipc_subscrp_delete() is called. As a consequence, subscriber->lock
>> can be safely removed to avoid the following two deadlock scenarios:
>>
>>CPU1: CPU2:
>> -- 
>> tipc_nametbl_publish
>> spin_lock_bh(>nametbl_lock)
>> tipc_nametbl_insert_publ
>> tipc_nameseq_insert_publ
>> tipc_subscrp_report_overlap
>> tipc_subscrp_get
>> tipc_subscrp_send_event
>>  tipc_close_conn
>>  tipc_subscrb_release_cb
>>  tipc_subscrb_delete
>>  tipc_subscrp_put
>> tipc_subscrp_put
>> tipc_subscrp_kref_release
>> tipc_nametbl_unsubscribe
>> spin_lock_bh(>nametbl_lock)
>> <>
>>
>>CPU1:  CPU2:
>> -- 
>> tipc_nametbl_stop
>> spin_lock_bh(>nametbl_lock)
>> tipc_purge_publications
>> tipc_nameseq_remove_publ
>> tipc_subscrp_report_overlap
>> tipc_subscrp_get
>> tipc_subscrp_send_event
>>  tipc_close_conn
>>  tipc_subscrb_release_cb
>>  tipc_subscrb_delete
>>  tipc_subscrp_put
>> tipc_subscrp_put
>> tipc_subscrp_kref_release
>> tipc_nametbl_unsubscribe
>> spin_lock_bh(>nametbl_lock)
>> <>
>>
>> Reported-by: John Thompson <thompa@gmail.com>
>> Reported-by: Jon Maloy <jon.ma...@ericsson.com>
>> Fixes: d094c4d5f5 ("tipc: add subscription refcount to avoid invalid
>> delete")
>> Signed-off-by: Ying Xue <ying@windriver.com>
>> ---
>>  net/tipc/subscr.c | 6 --
>>  1 file changed, 6 deletions(-)
>>
>> diff --git a/net/tipc/subscr.c b/net/tipc/subscr.c
>> index 6624232..ea67cce 100644
>> --- a/net/tipc/subscr.c
>> +++ b/net/tipc/subscr.c
>> @@ -168,9 +168,7 @@ static void tipc_subscrp_kref_release(struct kref
>> *kref)
>> struct tipc_net *tn = net_generic(sub->net, tipc_net_id);
>> struct tipc_subscriber *subscriber = sub->subscriber;
>>
>> -   spin_lock_bh(>lock);
>> atomic_dec(>subscription_count);
>> -   spin_unlock_bh(>lock);
>> kfree(sub);
>> tipc_subscrb_put(subscriber);
>>  }
>> @@ -202,11 +200,7 @@ static void tipc_subscrb_subscrp_delete(struct
>> tipc_subscriber *subscriber,
>> list_del(>subscrp_list);
>>
>> tipc_nametbl_unsubscribe(sub);
>> -   tipc_subscrp_get(sub);
>> -   spin_unlock_bh(>lock);
>>
>
> By removing the unlocking at this point couldn't you end up with a
> deadlock on subscriber->lock
> due to tipc_subscrp_delete() putting the subscrp twice (which could end up
> with kref == 0) and
> as a result calling tipc_subscrp_kref_release() which gets
> subscriber->lock?
>
>
>
>> tipc_subscrp_delete(sub);
>> -   tipc_subscrp_put(sub);
>> -   spin_lock_bh(>lock);
>>
>> if (s)
>> break;
>> --
>> 2.7.4
>>
>>
>
--
Check out the vibrant tech community on one of the world's most
engaging tech sites, SlashDot.org! http://sdm.link/slashdot
___
tipc-discussion mailing list
tipc-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/tipc-discussion


Re: [tipc-discussion] [net 5/5] tipc: remove unnecessary increasement of subscription refcount

2017-02-20 Thread John Thompson
On Tue, Feb 21, 2017 at 12:39 AM, Ying Xue <ying@windriver.com> wrote:

> As the policy of holding subscription in subscriber time has been
> adjusted, it's unnecessary to hold subscription refcount before
> tipc_subscrp_delete() is called. As a consequence, subscriber->lock
> can be safely removed to avoid the following two deadlock scenarios:
>
>CPU1: CPU2:
> -- 
> tipc_nametbl_publish
> spin_lock_bh(>nametbl_lock)
> tipc_nametbl_insert_publ
> tipc_nameseq_insert_publ
> tipc_subscrp_report_overlap
> tipc_subscrp_get
> tipc_subscrp_send_event
>  tipc_close_conn
>  tipc_subscrb_release_cb
>  tipc_subscrb_delete
>  tipc_subscrp_put
> tipc_subscrp_put
> tipc_subscrp_kref_release
> tipc_nametbl_unsubscribe
> spin_lock_bh(>nametbl_lock)
> <>
>
>CPU1:  CPU2:
> -- 
> tipc_nametbl_stop
> spin_lock_bh(>nametbl_lock)
> tipc_purge_publications
> tipc_nameseq_remove_publ
> tipc_subscrp_report_overlap
> tipc_subscrp_get
> tipc_subscrp_send_event
>  tipc_close_conn
>  tipc_subscrb_release_cb
>  tipc_subscrb_delete
>  tipc_subscrp_put
> tipc_subscrp_put
> tipc_subscrp_kref_release
> tipc_nametbl_unsubscribe
> spin_lock_bh(>nametbl_lock)
> <>
>
> Reported-by: John Thompson <thompa@gmail.com>
> Reported-by: Jon Maloy <jon.ma...@ericsson.com>
> Fixes: d094c4d5f5 ("tipc: add subscription refcount to avoid invalid
> delete")
> Signed-off-by: Ying Xue <ying@windriver.com>
> ---
>  net/tipc/subscr.c | 6 --
>  1 file changed, 6 deletions(-)
>
> diff --git a/net/tipc/subscr.c b/net/tipc/subscr.c
> index 6624232..ea67cce 100644
> --- a/net/tipc/subscr.c
> +++ b/net/tipc/subscr.c
> @@ -168,9 +168,7 @@ static void tipc_subscrp_kref_release(struct kref
> *kref)
> struct tipc_net *tn = net_generic(sub->net, tipc_net_id);
> struct tipc_subscriber *subscriber = sub->subscriber;
>
> -   spin_lock_bh(>lock);
> atomic_dec(>subscription_count);
> -   spin_unlock_bh(>lock);
> kfree(sub);
> tipc_subscrb_put(subscriber);
>  }
> @@ -202,11 +200,7 @@ static void tipc_subscrb_subscrp_delete(struct
> tipc_subscriber *subscriber,
> list_del(>subscrp_list);
>
> tipc_nametbl_unsubscribe(sub);
> -   tipc_subscrp_get(sub);
> -   spin_unlock_bh(>lock);
>

By removing the unlocking at this point couldn't you end up with a deadlock
on subscriber->lock
due to tipc_subscrp_delete() putting the subscrp twice (which could end up
with kref == 0) and
as a result calling tipc_subscrp_kref_release() which gets subscriber->lock?



> tipc_subscrp_delete(sub);
> -   tipc_subscrp_put(sub);
> -   spin_lock_bh(>lock);
>
> if (s)
> break;
> --
> 2.7.4
>
>
--
Check out the vibrant tech community on one of the world's most
engaging tech sites, SlashDot.org! http://sdm.link/slashdot
___
tipc-discussion mailing list
tipc-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/tipc-discussion


[tipc-discussion] Nametable soft lockup

2017-02-08 Thread John Thompson
Hi,

I have been using the patches Partha had provided for the nametable soft
lockup, and that I had tested.  This was seen when testing on a SMP system.

Unfortunately I have come across another nametable soft lockup:

<0>NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [AIS listener:1591]
<6>Modules linked in: tipc jitterentropy_rng echainiv drbg
platform_driver(O) ipifwd(PO)
<6>CPU: 0 PID: 1591 Comm: AIS listener Tainted: P   O
<6>task: ae393600 ti: ae286000 task.ti: ae286000
<6>NIP: 806952bc LR: c160bfe0 CTR: 80695280
<6>REGS: ae287b40 TRAP: 0901   Tainted: P   O
<6>MSR: 00029002   CR: 48002484  XER: 
<6>
<6>GPR00: c160a64c ae287bf0 ae393600 a20f18ac   ae064fbc
0030
<6>GPR08: 01001006 0001 0001 0006 80695280
<6>NIP [806952bc] _raw_spin_lock_bh+0x3c/0x70
<6>LR [c160bfe0] tipc_nametbl_unsubscribe+0x50/0x120 [tipc]
<6>Call Trace:
<6>[ae287c10] [c160a64c] tipc_named_reinit+0x33c/0x8a0 [tipc]
<6>[ae287c30] [c160ad44] tipc_subscrp_report_overlap+0xc4/0xe0 [tipc]
<6>[ae287c70] [c160b30c] tipc_topsrv_stop+0x45c/0x4f0 [tipc]
<6>[ae287ca0] [c160b838] tipc_nametbl_remove_publ+0x58/0x110 [tipc]
<6>[ae287cd0] [c160bcf8] tipc_nametbl_withdraw+0x68/0x140 [tipc]
<6>[ae287d00] [c1613cd4] tipc_nl_node_dump_link+0x1904/0x45d0 [tipc]
<6>[ae287d30] [c16148e8] tipc_nl_node_dump_link+0x2518/0x45d0 [tipc]
<6>[ae287d70] [804f5a40] sock_release+0x30/0xf0
<6>[ae287d80] [804f5b14] sock_close+0x14/0x30
<6>[ae287d90] [80105844] __fput+0x94/0x200
<6>[ae287db0] [8003dca4] task_work_run+0xd4/0x100
<6>[ae287dd0] [80023620] do_exit+0x280/0x980
<6>[ae287e10] [80024c48] do_group_exit+0x48/0xb0
<6>[ae287e30] [80030344] get_signal+0x244/0x4f0
<6>[ae287e80] [80007734] do_signal+0x34/0x1c0
<6>[ae287f30] [800079a8] do_notify_resume+0x68/0x80
<6>[ae287f40] [8000fa1c] do_user_signal+0x74/0xc4


I have gone through the code and I think I have found a place where there
is a potential soft lockup.
The call chain is:
tipc_nametbl_stop() Grabs nametbl_lock
   tipc_purge_publications()
  tipc_nameseq_remove_publ()
 tipc_subscrp_report_overlap()
tipc_subscrp_put() Calls kref_put when kref == 0 -- could have
been put by a different CPU
   tipc_subscrp_kref_release()
  tipc_nametbl_unsubscribe()
 << lockup occurs as it grabs the
 nametbl_lock again >>


Another possible issue is in tipc_subscrp_report_overlap(), there are 2
early returns after a tipc_subscrp_get() before the tipc_subscrp_put().
Could this end up with an incorrect kref?

JT
--
Check out the vibrant tech community on one of the world's most
engaging tech sites, SlashDot.org! http://sdm.link/slashdot
___
tipc-discussion mailing list
tipc-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/tipc-discussion


Re: [tipc-discussion] [PATCH net-next v7 0/6] topology server fixes for nametable soft lockup

2017-01-23 Thread John Thompson
Hi,

My testing has gone ok overnight and have observed no kernel dumps.

The patches are looking good.
Thanks,
JT


On Tue, Jan 24, 2017 at 8:00 AM, Jon Maloy <jon.ma...@ericsson.com> wrote:

> You can even add an "acked-by" from me. Let's hope we are finally done
> with those problems.
>
> ///jon
>
>
> > -Original Message-
> > From: Xue, Ying [mailto:ying@windriver.com]
> > Sent: Monday, 23 January, 2017 05:56
> > To: Parthasarathy Bhuvaragan <parthasarathy.bhuvara...@ericsson.com>;
> tipc-
> > discuss...@lists.sourceforge.net; Jon Maloy <jon.ma...@ericsson.com>;
> > thompa@gmail.com
> > Subject: RE: [PATCH net-next v7 0/6] topology server fixes for nametable
> soft
> > lockup
> >
> > Hi Partha,
> >
> > As long as you can confirm the series can be applied into the latest
> "net" tree, it's
> > better to deliver to "net".
> >
> > By the way, if John's testing is passed, I suggest you can add John's
> "Tested-by"
> > tag into the whole series.
> >
> > Regards,
> > Ying
> >
> > -Original Message-
> > From: Parthasarathy Bhuvaragan
> > [mailto:parthasarathy.bhuvara...@ericsson.com]
> > Sent: Friday, January 20, 2017 9:06 PM
> > To: Xue, Ying; tipc-discussion@lists.sourceforge.net;
> jon.ma...@ericsson.com;
> > thompa@gmail.com
> > Subject: Re: [PATCH net-next v7 0/6] topology server fixes for nametable
> soft
> > lockup
> >
> > Hi Ying,
> >
> > Sure, I will wait for John as he seems to be able to trigger the race
> conditions i
> > couldn't.
> >
> > Moreover, I was thinking of posting this series to "net" instead of
> net-next as
> > these are primarily bug-fixes.
> > /Partha
> >
> > On 01/20/2017 01:12 PM, Xue, Ying wrote:
> > > Thanks, it's very good now. You can add my ack-by flag to the whole
> series.
> > >
> > > But if possible, please let John help us verify again.
> > >
> > > Regards,
> > > Ying
> > >
> > > -Original Message-
> > > From: Parthasarathy Bhuvaragan
> > > [mailto:parthasarathy.bhuvara...@ericsson.com]
> > > Sent: Friday, January 20, 2017 3:52 AM
> > > To: tipc-discussion@lists.sourceforge.net; jon.ma...@ericsson.com;
> > > Xue, Ying; thompa@gmail.com
> > > Subject: [PATCH net-next v7 0/6] topology server fixes for nametable
> > > soft lockup
> > >
> > > In this series, we revert the commit 333f796235a527 ("tipc: fix a race
> condition
> > leading to subscriber refcnt bug") and provide an alternate solution to
> fix the race
> > conditions in commits 2-4.
> > >
> > > We have to do this as the above commit introduced a nametbl soft
> lockup at
> > module exit as described by patch#4.
> > >
> > > ---
> > > v7: Following updates in Patch #2:
> > > Fix incorrect deletion of all prior subscriptions until the
> specified subscription.
> > > Protect exported tipc_subscrp_report_overlap() with subscription
> refcount.
> > > Ensure that subscription can be freed correctly at subscription
> timer expiry.
> > The
> > > earlier patch#2 in v5/v6, had refcount bug which prevents the
> above. This
> > was
> > > introduced when we skipped get/put refcount in
> > tipc_subscrb_subscrp_delete(), but
> > > instead do get in tipc_subscrp_subscribe() before starting the
> timer. Thus the
> > > subscription_create() initialized the refcount and
> tipc_subscrp_subscribe
> > steps it
> > > to 2. At subscription timeout, we perform put only once and we
> cannot
> > compensate for
> > > this additional refcount safely.
> > > v6: Address krefcount warning for John Thompson in Patch#3
> > > v5: Address Ying's comment in Patch #2 to remove del_timer_sync().
> > > v4: Address Ying's comment by introducing subscription refcount.
> > >
> > > Parthasarathy Bhuvaragan (6):
> > >   tipc: fix nametbl_lock soft lockup at node/link events
> > >   tipc: add subscription refcount to avoid invalid delete
> > >   tipc: fix connection refcount error
> > >   tipc: fix nametbl_lock soft lockup at module exit
> > >   tipc: ignore requests when the connection state is not CONNECTED
> > >   tipc: fix cleanup at module unload
> > >
> > >  net/tipc/node.c   |   9 +++-
> > >  net/tipc/server.c |  48 +  net/tipc/subscr.c | 124
> > ++
> > >  net/tipc/subscr.h |   1 +
> > >  4 files changed, 99 insertions(+), 83 deletions(-)
> > >
> > > --
> > > 2.1.4
> > >
>
--
Check out the vibrant tech community on one of the world's most
engaging tech sites, SlashDot.org! http://sdm.link/slashdot
___
tipc-discussion mailing list
tipc-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/tipc-discussion


Re: [tipc-discussion] [PATCH net-next v7 0/6] topology server fixes for nametable soft lockup

2017-01-22 Thread John Thompson
Hi,

So far my tests are looking good.  I will leave them running over night to
be doubly sure.

Thanks,
JT


On Mon, Jan 23, 2017 at 1:57 PM, John Thompson <thompa@gmail.com> wrote:

> Hi,
>
> I am running my tests now.  I should have some results later today, but
> more conclusively tomorrow.
>
> Regards,
> JT
>
>
> On Sat, Jan 21, 2017 at 2:05 AM, Parthasarathy Bhuvaragan <
> parthasarathy.bhuvara...@ericsson.com> wrote:
>
>> Hi Ying,
>>
>> Sure, I will wait for John as he seems to be able to trigger the race
>> conditions i couldn't.
>>
>> Moreover, I was thinking of posting this series to "net" instead of
>> net-next as these are primarily bug-fixes.
>>
>> /Partha
>>
>>
>> On 01/20/2017 01:12 PM, Xue, Ying wrote:
>>
>>> Thanks, it's very good now. You can add my ack-by flag to the whole
>>> series.
>>>
>>> But if possible, please let John help us verify again.
>>>
>>> Regards,
>>> Ying
>>>
>>> -Original Message-
>>> From: Parthasarathy Bhuvaragan [mailto:parthasarathy.bhuvarag
>>> a...@ericsson.com]
>>> Sent: Friday, January 20, 2017 3:52 AM
>>> To: tipc-discussion@lists.sourceforge.net; jon.ma...@ericsson.com; Xue,
>>> Ying; thompa@gmail.com
>>> Subject: [PATCH net-next v7 0/6] topology server fixes for nametable
>>> soft lockup
>>>
>>> In this series, we revert the commit 333f796235a527 ("tipc: fix a race
>>> condition leading to subscriber refcnt bug") and provide an alternate
>>> solution to fix the race conditions in commits 2-4.
>>>
>>> We have to do this as the above commit introduced a nametbl soft lockup
>>> at module exit as described by patch#4.
>>>
>>> ---
>>> v7: Following updates in Patch #2:
>>> Fix incorrect deletion of all prior subscriptions until the
>>> specified subscription.
>>> Protect exported tipc_subscrp_report_overlap() with subscription
>>> refcount.
>>> Ensure that subscription can be freed correctly at subscription
>>> timer expiry. The
>>> earlier patch#2 in v5/v6, had refcount bug which prevents the above.
>>> This was
>>> introduced when we skipped get/put refcount in
>>> tipc_subscrb_subscrp_delete(), but
>>> instead do get in tipc_subscrp_subscribe() before starting the
>>> timer. Thus the
>>> subscription_create() initialized the refcount and
>>> tipc_subscrp_subscribe steps it
>>> to 2. At subscription timeout, we perform put only once and we
>>> cannot compensate for
>>> this additional refcount safely.
>>> v6: Address krefcount warning for John Thompson in Patch#3
>>> v5: Address Ying's comment in Patch #2 to remove del_timer_sync().
>>> v4: Address Ying's comment by introducing subscription refcount.
>>>
>>> Parthasarathy Bhuvaragan (6):
>>>   tipc: fix nametbl_lock soft lockup at node/link events
>>>   tipc: add subscription refcount to avoid invalid delete
>>>   tipc: fix connection refcount error
>>>   tipc: fix nametbl_lock soft lockup at module exit
>>>   tipc: ignore requests when the connection state is not CONNECTED
>>>   tipc: fix cleanup at module unload
>>>
>>>  net/tipc/node.c   |   9 +++-
>>>  net/tipc/server.c |  48 +  net/tipc/subscr.c | 124
>>> ++
>>>  net/tipc/subscr.h |   1 +
>>>  4 files changed, 99 insertions(+), 83 deletions(-)
>>>
>>> --
>>> 2.1.4
>>>
>>>
>
--
Check out the vibrant tech community on one of the world's most
engaging tech sites, SlashDot.org! http://sdm.link/slashdot
___
tipc-discussion mailing list
tipc-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/tipc-discussion


Re: [tipc-discussion] [PATCH net-next v7 0/6] topology server fixes for nametable soft lockup

2017-01-22 Thread John Thompson
Hi,

I am running my tests now.  I should have some results later today, but
more conclusively tomorrow.

Regards,
JT


On Sat, Jan 21, 2017 at 2:05 AM, Parthasarathy Bhuvaragan <
parthasarathy.bhuvara...@ericsson.com> wrote:

> Hi Ying,
>
> Sure, I will wait for John as he seems to be able to trigger the race
> conditions i couldn't.
>
> Moreover, I was thinking of posting this series to "net" instead of
> net-next as these are primarily bug-fixes.
>
> /Partha
>
>
> On 01/20/2017 01:12 PM, Xue, Ying wrote:
>
>> Thanks, it's very good now. You can add my ack-by flag to the whole
>> series.
>>
>> But if possible, please let John help us verify again.
>>
>> Regards,
>> Ying
>>
>> -Original Message-
>> From: Parthasarathy Bhuvaragan [mailto:parthasarathy.bhuvarag
>> a...@ericsson.com]
>> Sent: Friday, January 20, 2017 3:52 AM
>> To: tipc-discussion@lists.sourceforge.net; jon.ma...@ericsson.com; Xue,
>> Ying; thompa@gmail.com
>> Subject: [PATCH net-next v7 0/6] topology server fixes for nametable soft
>> lockup
>>
>> In this series, we revert the commit 333f796235a527 ("tipc: fix a race
>> condition leading to subscriber refcnt bug") and provide an alternate
>> solution to fix the race conditions in commits 2-4.
>>
>> We have to do this as the above commit introduced a nametbl soft lockup
>> at module exit as described by patch#4.
>>
>> ---
>> v7: Following updates in Patch #2:
>> Fix incorrect deletion of all prior subscriptions until the specified
>> subscription.
>> Protect exported tipc_subscrp_report_overlap() with subscription
>> refcount.
>> Ensure that subscription can be freed correctly at subscription timer
>> expiry. The
>> earlier patch#2 in v5/v6, had refcount bug which prevents the above.
>> This was
>> introduced when we skipped get/put refcount in
>> tipc_subscrb_subscrp_delete(), but
>> instead do get in tipc_subscrp_subscribe() before starting the timer.
>> Thus the
>> subscription_create() initialized the refcount and
>> tipc_subscrp_subscribe steps it
>> to 2. At subscription timeout, we perform put only once and we cannot
>> compensate for
>> this additional refcount safely.
>> v6: Address krefcount warning for John Thompson in Patch#3
>> v5: Address Ying's comment in Patch #2 to remove del_timer_sync().
>> v4: Address Ying's comment by introducing subscription refcount.
>>
>> Parthasarathy Bhuvaragan (6):
>>   tipc: fix nametbl_lock soft lockup at node/link events
>>   tipc: add subscription refcount to avoid invalid delete
>>   tipc: fix connection refcount error
>>   tipc: fix nametbl_lock soft lockup at module exit
>>   tipc: ignore requests when the connection state is not CONNECTED
>>   tipc: fix cleanup at module unload
>>
>>  net/tipc/node.c   |   9 +++-
>>  net/tipc/server.c |  48 +  net/tipc/subscr.c | 124
>> ++
>>  net/tipc/subscr.h |   1 +
>>  4 files changed, 99 insertions(+), 83 deletions(-)
>>
>> --
>> 2.1.4
>>
>>
--
Check out the vibrant tech community on one of the world's most
engaging tech sites, SlashDot.org! http://sdm.link/slashdot
___
tipc-discussion mailing list
tipc-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/tipc-discussion


Re: [tipc-discussion] [PATCH net-next v4 0/6] topology server fixes for nametable soft lockup

2017-01-17 Thread John Thompson
Hi Partha,

Thanks for the new patches.  I have tested them and had a kernel warning as
per below.  This is running on a SMP system with 4 cores.
The warning reads as though one thread has gone to free the item while
another thread has gotten a reference to it.
The suggestion is to use kref_get_unless_zero() instead of kref_get().

[ cut here ]
WARNING: at /home/johnt/views/main/linux/include/linux/kref.h:46
Modules linked in: tipc jitterentropy_rng echainiv drbg platform_driver(O)
CPU: 1 PID: 1527 Comm: intrTokenReset Tainted: P   O
task: a52c8600 ti: a4c98000 task.ti: a4c98000
NIP: c161809c LR: c1618120 CTR: 80695270
REGS: a4c99b30 TRAP: 0700   Tainted: P   O
MSR: 00029002   CR: 28002482  XER: 

GPR00: c1618694 a4c99be0 a52c8600 a4c4e840 a2763aa0  a4ab623c
0030
GPR08: 01001005 0001 c162 0005 80695270 107d12f0 6c23e000
0009
GPR16:  808e 808e 00040100 00040006 807c9428 808f
a2e39ac0
GPR24: 808dba00 01001005 a4ab623c a50fb300 0030 a50fb31c a50fb300
a4c4e840
NIP [c161809c] tipc_nl_publ_dump+0x93c/0xf10 [tipc]
LR [c1618120] tipc_nl_publ_dump+0x9c0/0xf10 [tipc]
Call Trace:
[a4c99be0] [800b9e2c] free_pages_prepare+0x18c/0x2a0 (unreliable)
[a4c99c00] [c1618694] tipc_conn_sendmsg+0x24/0x150 [tipc]
[a4c99c30] [c160ad5c] tipc_subscrp_report_overlap+0xbc/0xd0 [tipc]
[a4c99c70] [c160b31c] tipc_topsrv_stop+0x45c/0x4f0 [tipc]
[a4c99ca0] [c160b848] tipc_nametbl_remove_publ+0x58/0x110 [tipc]
[a4c99cd0] [c160bd08] tipc_nametbl_withdraw+0x68/0x140 [tipc]
[a4c99d00] [c1613ce4] tipc_nl_node_dump_link+0x1904/0x45d0 [tipc]
[a4c99d30] [c16148f8] tipc_nl_node_dump_link+0x2518/0x45d0 [tipc]
[a4c99d70] [804f5a40] sock_release+0x30/0xf0
[a4c99d80] [804f5b14] sock_close+0x14/0x30
[a4c99d90] [80105844] __fput+0x94/0x200
[a4c99db0] [8003dca4] task_work_run+0xd4/0x100
[a4c99dd0] [80023620] do_exit+0x280/0x980
[a4c99e10] [80024c48] do_group_exit+0x48/0xb0
[a4c99e30] [80030344] get_signal+0x244/0x4f0
[a4c99e80] [80007734] do_signal+0x34/0x1c0
[a4c99f30] [800079a8] do_notify_resume+0x68/0x80
[a4c99f40] [8000fa1c] do_user_signal+0x74/0xc4
--- interrupt: c00 at 0xf5b0cfc
LR = 0xf5b0ce8
Instruction dump:
4ba8 7c0004ac 7d201828 31290001 7d20192d 40a2fff4 7c0004ac 2f890001
4dbd0020 3d40c162 892ac11e 69290001 <0f09> 2f89 4dbe0020 3921
---[ end trace 544bc785f9258108 ]---

JT


On Thu, Jan 12, 2017 at 1:19 AM, Parthasarathy Bhuvaragan <
parthasarathy.bhuvara...@ericsson.com> wrote:

> In this series, we revert the commit 333f796235a527 ("tipc: fix a
> race condition leading to subscriber refcnt bug") and provide an
> alternate solution to fix the race conditions in commits 2-4.
>
> We have to do this as the above commit introduced a nametbl soft
> lockup at module exit as described by patch#4.
>
> ---
> v3: introduce cleanup workqueue to fix nametbl soft lockup.
>
> Parthasarathy Bhuvaragan (6):
>   tipc: fix nametbl_lock soft lockup at node/link events
>   tipc: add subscription refcount
>   tipc: fix connection refcount error
>   tipc: fix nametbl_lock soft lockup at module exit
>   tipc: ignore requests when the connection state is not CONNECTED
>   tipc: fix cleanup at module unload
>
>  net/tipc/node.c   |   9 -
>  net/tipc/server.c |  44 +
>  net/tipc/subscr.c | 116 ++
> ++--
>  net/tipc/subscr.h |   1 +
>  4 files changed, 94 insertions(+), 76 deletions(-)
>
> --
> 2.1.4
>
>
--
Check out the vibrant tech community on one of the world's most
engaging tech sites, SlashDot.org! http://sdm.link/slashdot
___
tipc-discussion mailing list
tipc-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/tipc-discussion


Re: [tipc-discussion] [PATCH net v2 2/2] tipc: fix nametbl_lock soft lockup at module exit

2016-12-12 Thread John Thompson
Hi Ying,

I tested with the 3 patches applied:
1/3: tipc: fix nametbl_lock soft lockup at node/link events
2/3: tipc: fix nametbl_lock soft lockup at module exit
3/3: tipc: move connection cleanup to a workqueue

In my case the soft lockup I was seeing was resolved by patch 3 ("tipc:
move connection cleanup to a workqueue").

I have not tested without patch 2 applied.

Regards,
John



On Tue, Dec 13, 2016 at 12:37 AM, Ying Xue  wrote:

> Hi Parth,
>
> Sorry for late response.
>
> As I could not find your v3 version, I just give comments based on the
> version.
>
> On 11/22/2016 12:27 AM, Parthasarathy Bhuvaragan wrote:
>
>> Commit 333f796235a527 ("tipc: fix a race condition leading to
>> subscriber refcnt bug") reveals a soft lockup while acquiring
>> nametbl_lock.
>>
>> Before commit 333f796235a527, we call tipc_conn_shutdown() from
>> tipc_close_conn() in the context of tipc_topsrv_stop(). In that
>> context, we are allowed to grab the nametbl_lock.
>>
>> In commit 333f796235a527, i moved the tipc_conn_release (renamed from
>> tipc_conn_shutdown) to the connection refcount cleanup.
>>
>
> Can you please confirm whether the soft lockup doesn't happen any more if
> we don't adjust the sequence of tipc_conn_release?
>
> If yes, I think we can propose other method to fix the issue solved by
> commit 333f796235a527, but in the new method it's unnecessary to adjust the
> order of tipc_conn_release.
>
> In fact the order of tipc_conn_shutdown, tipc_unregister_callbacks and
> tipc_conn_release is very important. When I wrote that code, I spent much
> time considering how to carefully close connection.
>
> In my opinion, the ideal order is still as belows:
>
> 1, Close connection;
> 2. Call tipc_unregister_callbacks to let sk->sk_user_data. As long as
> sk->sk_user_data is 0, no more data will be submitted to
> con->rwork/on->swork works.
> 3. Release socket.
>
> Regards,
> Ying
>
>
>  This allows
>
>> either tipc_nametbl_withdraw() or tipc_topsrv_stop() to perform
>> tipc_sock_release().
>>
>> Since tipc_exit_net() first calls tipc_topsrv_stop() and then
>> tipc_nametble_withdraw() increases the chances for the later to
>> perform the connection cleanup.
>>
>> The soft lockup occurs in the call chain of tipc_nametbl_withdraw(),
>> when it performs the tipc_conn_kref_release() as it tries to grab
>> nametbl_lock again while holding it already.
>> tipc_nametbl_withdraw() grabs nametbl_lock
>>   tipc_nametbl_remove_publ()
>> tipc_subscrp_report_overlap()
>>   tipc_subscrp_send_event()
>> tipc_conn_sendmsg()
>>   << if (con->flags != CF_CONNECTED) we do conn_put(),
>>  triggering the cleanup as refcount=0. >>
>>   tipc_conn_kref_release
>> tipc_sock_release
>>   tipc_conn_release
>> tipc_subscrb_delete
>>   tipc_subscrp_delete
>> tipc_nametbl_unsubscribe << Soft Lockup >>
>>
>> Until now, tipc_server_stop() grabs and releases the idr_lock twice
>> for every connection. Once to find the connection and second to unset
>> connection flag, remove it from list and decrement the refcount.
>> The above lockup occurs when tipc_nametbl_withdraw() grabs the
>> connection in between the two, thus owning the connection
>> and triggering the cleanup while decrementing the refcount later.
>>
>> In this commit, we perform:
>> - call tipc_nametbl_withdraw() before tipc_topsrv_stop() to avoid:
>>   1. soft lockup
>>   2. its too late to actually notify the subscribers, as the topology
>>  server might already have started shutting down.
>> - In tipc_server_stop(), we remove all the connections from connection
>> list in the scope of the idr_lock to prevent any other thread finding
>> a connection which has unset CF_CONNECTED in its flags.
>>
>> Fixes: 333f796235a52727 ("tipc: fix a race condition leading to
>> subscriber refcnt bug")
>> Signed-off-by: Parthasarathy Bhuvaragan > sson.com>
>> ---
>> v2: commit message update.
>> cleanup tipc_server_stop() as per Ying Xue.
>> ---
>>  net/tipc/core.c   |  4 
>>  net/tipc/net.c|  2 --
>>  net/tipc/server.c | 13 -
>>  3 files changed, 12 insertions(+), 7 deletions(-)
>>
>> diff --git a/net/tipc/core.c b/net/tipc/core.c
>> index 236b043a4156..3ea1452e2f06 100644
>> --- a/net/tipc/core.c
>> +++ b/net/tipc/core.c
>> @@ -93,6 +93,10 @@ static int __net_init tipc_init_net(struct net *net)
>>
>>  static void __net_exit tipc_exit_net(struct net *net)
>>  {
>> +   struct tipc_net *tn = net_generic(net, tipc_net_id);
>> +
>> +   tipc_nametbl_withdraw(net, TIPC_CFG_SRV, tn->own_addr, 0,
>> + tn->own_addr);
>> tipc_topsrv_stop(net);
>> tipc_net_stop(net);
>> tipc_bcast_stop(net);
>> diff --git a/net/tipc/net.c b/net/tipc/net.c
>> index 28bf4feeb81c..92696cc6e763 100644
>> --- a/net/tipc/net.c
>> +++ b/net/tipc/net.c
>> @@ -130,8 +130,6 @@ 

Re: [tipc-discussion] v4.7: soft lockup when releasing a socket

2016-11-28 Thread John Thompson
Hi Partha,

I tested with the latest 3 patches last night and observed no soft lockups.

Thanks,
John


On Fri, Nov 25, 2016 at 11:50 AM, John Thompson <thompa@gmail.com>
wrote:

> Hi Partha,
>
> I rebuilt afresh and retried the test with the same lockup kernel dumps.
> Yes I have multiple tipc clients subscribed to the topology server, at
> least 10 clients.
> They all use a subscription timeout of TIPC_WAIT_FOREVER
>
> I will try the kernel command line parameter next week.
> JT
>
>
> On Fri, Nov 25, 2016 at 3:07 AM, Parthasarathy Bhuvaragan <
> parthasarathy.bhuvara...@ericsson.com> wrote:
>
>> Hi John,
>>
>> Do you have several tipc clients subscribed to topology server?
>> What subscription timeout do they use?
>>
>> Please enable kernel command line parameter:
>> softlockup_all_cpu_backtrace=1
>>
>> /Partha
>>
>> On 11/23/2016 11:04 PM, John Thompson wrote:
>>
>>> Hi Partha,
>>>
>>> I tested overnight with the 2 patches you provided yesterday.
>>> Testing is still showing problems, here is one of the soft lockups, the
>>> other is the same as I sent the other day.
>>> I am going to redo my build as I expected some change in behaviour with
>>> your patches.
>>>
>>> It is possible that I am doing some dumping of nodes or links as I am
>>> not certain of all the code or paths.
>>> I have found that we do a tipc-config -nt and tipc-config -ls in some
>>> situations but it shouldn 't be initiated in this
>>> reboot case.
>>>
>>> <0>NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [pimd:1220]
>>> <0>NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [rpc.13:2419]
>>> <6>Modules linked in:
>>> <0>NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [AIS
>>> listener:1600]
>>> <6> tipc
>>> <6>Modules linked in:
>>> <6> jitterentropy_rng
>>> <6> tipc
>>> <6> echainiv
>>> <6> jitterentropy_rng
>>> <6> drbg
>>> <6> echainiv
>>> <6> platform_driver(O)
>>> <6> drbg
>>> <6> platform_driver(O)
>>> <6>
>>> <6>CPU: 0 PID: 2419 Comm: rpc.13 Tainted: P   O
>>> <6>CPU: 2 PID: 1600 Comm: AIS listener Tainted: P   O
>>> <6>task: aed76d20 ti: ae70c000 task.ti: ae70c000
>>> <6>task: aee3ced0 ti: ae686000 task.ti: ae686000
>>> <6>NIP: 8069257c LR: c13ebc4c CTR: 80692540
>>> <6>NIP: 80692578 LR: c13ebf50 CTR: 80692540
>>> <6>REGS: ae70dc20 TRAP: 0901   Tainted: P   O
>>> <6>REGS: ae687ad0 TRAP: 0901   Tainted: P   O
>>> <6>MSR: 00029002
>>> <6>MSR: 00029002
>>> <6><
>>> <6><
>>> <6>CE
>>> <6>CE
>>> <6>,EE
>>> <6>,EE
>>> <6>,ME
>>> <6>,ME
>>> <6>>
>>> <6>>
>>> <6>  CR: 42002484  XER: 2000
>>> <6>  CR: 48002444  XER: 
>>> <6>
>>> <6>GPR00:
>>> <6>
>>> <6>GPR00:
>>> <6>c13f3c34
>>> <6>c13ea408
>>> <6>ae70dcd0
>>> <6>ae687b80
>>> <6>aed76d20
>>> <6>aee3ced0
>>> <6>ae55c8ec
>>> <6>ae55c8ec
>>> <6>2711
>>> <6>
>>> <6>0005
>>> <6>a30e7264
>>> <6>8666592a
>>> <6>ae5e070c
>>> <6>8666592b
>>> <6>fffd
>>> <6>
>>> <6>GPR08:
>>> <6>
>>> <6>GPR08:
>>> <6>ae9dad20
>>> <6>ae72fbc8
>>> <6>0001
>>> <6>0001
>>> <6>0001
>>> <6>0001
>>> <6>
>>> <6>0004
>>> <6>80692540
>>> <6>80692540
>>> <6>
>>> <6>
>>> <6>NIP [8069257c] _raw_spin_lock_bh+0x3c/0x70
>>> <6>NIP [80692578] _raw_spin_lock_bh+0x38/0x70
>>> <6>LR [c13ebc4c] tipc_nametbl_withdraw+0x4c/0x140 [tipc]
>>> <6>LR [c13ebf50] tipc_nametbl_unsubscribe+0x50/0x120 [tipc]
>>> <6>Call Trace:
>>> <6>Call Trace:
>>> <6>[ae70dcd0] [a85d99a0] 0xa85d99a0
>>> <6>[ae687b80] [800fa258] check_object+0xc8/0x270
>>> <6

Re: [tipc-discussion] v4.7: soft lockup when releasing a socket

2016-11-24 Thread John Thompson
Hi Partha,

I rebuilt afresh and retried the test with the same lockup kernel dumps.
Yes I have multiple tipc clients subscribed to the topology server, at
least 10 clients.
They all use a subscription timeout of TIPC_WAIT_FOREVER

I will try the kernel command line parameter next week.
JT


On Fri, Nov 25, 2016 at 3:07 AM, Parthasarathy Bhuvaragan <
parthasarathy.bhuvara...@ericsson.com> wrote:

> Hi John,
>
> Do you have several tipc clients subscribed to topology server?
> What subscription timeout do they use?
>
> Please enable kernel command line parameter:
> softlockup_all_cpu_backtrace=1
>
> /Partha
>
> On 11/23/2016 11:04 PM, John Thompson wrote:
>
>> Hi Partha,
>>
>> I tested overnight with the 2 patches you provided yesterday.
>> Testing is still showing problems, here is one of the soft lockups, the
>> other is the same as I sent the other day.
>> I am going to redo my build as I expected some change in behaviour with
>> your patches.
>>
>> It is possible that I am doing some dumping of nodes or links as I am
>> not certain of all the code or paths.
>> I have found that we do a tipc-config -nt and tipc-config -ls in some
>> situations but it shouldn 't be initiated in this
>> reboot case.
>>
>> <0>NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [pimd:1220]
>> <0>NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [rpc.13:2419]
>> <6>Modules linked in:
>> <0>NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [AIS
>> listener:1600]
>> <6> tipc
>> <6>Modules linked in:
>> <6> jitterentropy_rng
>> <6> tipc
>> <6> echainiv
>> <6> jitterentropy_rng
>> <6> drbg
>> <6> echainiv
>> <6> platform_driver(O)
>> <6> drbg
>> <6> platform_driver(O)
>> <6>
>> <6>CPU: 0 PID: 2419 Comm: rpc.13 Tainted: P   O
>> <6>CPU: 2 PID: 1600 Comm: AIS listener Tainted: P   O
>> <6>task: aed76d20 ti: ae70c000 task.ti: ae70c000
>> <6>task: aee3ced0 ti: ae686000 task.ti: ae686000
>> <6>NIP: 8069257c LR: c13ebc4c CTR: 80692540
>> <6>NIP: 80692578 LR: c13ebf50 CTR: 80692540
>> <6>REGS: ae70dc20 TRAP: 0901   Tainted: P   O
>> <6>REGS: ae687ad0 TRAP: 0901   Tainted: P   O
>> <6>MSR: 00029002
>> <6>MSR: 00029002
>> <6><
>> <6><
>> <6>CE
>> <6>CE
>> <6>,EE
>> <6>,EE
>> <6>,ME
>> <6>,ME
>> <6>>
>> <6>>
>> <6>  CR: 42002484  XER: 2000
>> <6>  CR: 48002444  XER: 
>> <6>
>> <6>GPR00:
>> <6>
>> <6>GPR00:
>> <6>c13f3c34
>> <6>c13ea408
>> <6>ae70dcd0
>> <6>ae687b80
>> <6>aed76d20
>> <6>aee3ced0
>> <6>ae55c8ec
>> <6>ae55c8ec
>> <6>2711
>> <6>
>> <6>0005
>> <6>a30e7264
>> <6>8666592a
>> <6>ae5e070c
>> <6>8666592b
>> <6>fffd
>> <6>
>> <6>GPR08:
>> <6>
>> <6>GPR08:
>> <6>ae9dad20
>> <6>ae72fbc8
>> <6>0001
>> <6>0001
>> <6>0001
>> <6>0001
>> <6>
>> <6>0004
>> <6>80692540
>> <6>80692540
>> <6>
>> <6>
>> <6>NIP [8069257c] _raw_spin_lock_bh+0x3c/0x70
>> <6>NIP [80692578] _raw_spin_lock_bh+0x38/0x70
>> <6>LR [c13ebc4c] tipc_nametbl_withdraw+0x4c/0x140 [tipc]
>> <6>LR [c13ebf50] tipc_nametbl_unsubscribe+0x50/0x120 [tipc]
>> <6>Call Trace:
>> <6>Call Trace:
>> <6>[ae70dcd0] [a85d99a0] 0xa85d99a0
>> <6>[ae687b80] [800fa258] check_object+0xc8/0x270
>> <6> (unreliable)
>> <6> (unreliable)
>> <6>
>> <6>
>> <6>[ae70dd00] [c13f3c34] tipc_nl_node_dump_link+0x1904/0x45d0 [tipc]
>> <6>[ae687ba0] [c13ea408] tipc_named_reinit+0xf8/0x820 [tipc]
>> <6>
>> <6>
>> <6>[ae70dd30] [c13f4848] tipc_nl_node_dump_link+0x2518/0x45d0 [tipc]
>> <6>[ae687bb0] [c13ea6c0] tipc_named_reinit+0x3b0/0x820 [tipc]
>> <6>
>> <6>
>> <6>[ae70dd70] [804f29e0] sock_release+0x30/0xf0
>> <6>[ae687bd0] [c13f7bbc] tipc_nl_publ_dump+0x50c/0xed0 [tipc]
>> <6>
>> <6>
>> <6>[

Re: [tipc-discussion] v4.7: soft lockup when releasing a socket

2016-11-21 Thread John Thompson
Hi Partha,

My test has 4 nodes, 2 of which are alternately rebooting.  When the
rebooted node rejoins a few minutes pass and then the other node is
rebooted.
I am not printing out link stats and believe that the the other code is not
doing so either, when nodes leave or rejoin.

JT


On Tue, Nov 22, 2016 at 2:22 AM, Parthasarathy Bhuvaragan <
parthasarathy.bhuvara...@ericsson.com> wrote:

> Hi,
>
> There is an other branch where softlockup for nametbl_lock occurs.
>
> tipc_named_rcv() Grabs nametbl_lock
>   tipc_update_nametbl() (publish/withdraw)
> tipc_node_subscribe()/unsubscribe()
>   tipc_node_write_unlock()
>  << lockup occurs if it needs to process NODE UP/DOWN LINK
> UP/DOWN, as it grabs nametbl_lock again >>
>
> /Partha
>
>
> On 11/21/2016 01:04 PM, Parthasarathy Bhuvaragan wrote:
>
>> Hi,
>>
>> tipc_nametbl_withdraw() triggers the softlockup as it tries to grab
>> nametbl_lock twice if the node triggered a TIPC_NOTIFY_LINK_DOWN event
>> while its is running. The erroneous call chain is:
>>
>>   tipc_nametbl_withdraw() Grab nametbl_lock
>> tipc_named_process_backlog()
>>   tipc_update_nametbl()
>> if (dtype == WITHDRAWAL) tipc_node_unsubscribe()
>>   tipc_node_write_unlock()
>> if (flags & TIPC_NOTIFY_LINK_DOWN) tipc_nametbl_withdraw()
>>spin_lock_bh(>nametbl_lock);  << Soft Lockup >>
>>
>> Three callers which can cause this under module exit:
>>
>> Case1:
>>   tipc_exit_net()
>> tipc_nametbl_withdraw() Grab nametbl_lock
>>
>> Case2:
>>   tipc_server_stop()
>> tipc_conn_kref_release
>>   tipc_sock_release
>> sock_release()
>>   tipc_release()
>> tipc_sk_withdraw()
>>   tipc_nametbl_withdraw()
>>
>> Case3:
>>   tipc_server_stop()
>> tipc_conn_kref_release()
>>   kernel_bind()
>> tipc_bind()
>>   tipc_sk_withdraw()
>> tipc_nametbl_withdraw()
>>
>> I will work on a solution for this.
>>
>> What kind of test were you performing when this occurred (linkup/down)?
>> Do you read link statistics periodically in your tests?
>>
>> /Partha
>>
>> On 11/21/2016 05:30 AM, John Thompson wrote:
>>
>>> Hi Partha,
>>>
>>> I was doing some some more testing today and have still observed the
>>> problem (contrary to what I had emailed earlier).
>>>
>>> Here is the kernel dump.
>>>
>>> <0>NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [Pluggable
>>> Serve:2221]
>>> <6>Modules linked in: tipc jitterentropy_rng echainiv drbg
>>> platform_driver(O)
>>> <6>CPU: 0 PID: 2221 Comm: Pluggable Serve Tainted: P   O
>>> <6>task: ae54ced0 ti: aec42000 task.ti: aec42000
>>> <6>NIP: 8069257c LR: c13ebf50 CTR: 80692540
>>> <6>REGS: aec43ad0 TRAP: 0901   Tainted: P   O
>>> <6>MSR: 00029002 <CE,EE,ME>  CR: 48002444  XER: 
>>> <6>
>>> <6>GPR00: c13ea408 aec43b80 ae54ced0 a624690c  a6271d84 a39a60cc
>>> fffd
>>> <6>GPR08: aeefbbc8 0001 0001 0004 80692540
>>> <6>NIP [8069257c] _raw_spin_lock_bh+0x3c/0x70
>>> <6>LR [c13ebf50] tipc_nametbl_unsubscribe+0x50/0x120 [tipc]
>>> <6>Call Trace:
>>> <6>[aec43b80] [800fa258] check_object+0xc8/0x270 (unreliable)
>>> <6>[aec43ba0] [c13ea408] tipc_named_reinit+0xf8/0x820 [tipc]
>>> <6>[aec43bb0] [c13ea6c0] tipc_named_reinit+0x3b0/0x820 [tipc]
>>> <6>[aec43bd0] [c13f7bbc] tipc_nl_publ_dump+0x50c/0xed0 [tipc]
>>> <6>[aec43c00] [c13f865c] tipc_conn_sendmsg+0xdc/0x170 [tipc]
>>> <6>[aec43c30] [c13eacbc] tipc_subscrp_report_overlap+0xbc/0xd0 [tipc]
>>> <6>[aec43c70] [c13eb27c] tipc_topsrv_stop+0x45c/0x4f0 [tipc]
>>> <6>[aec43ca0] [c13eb7a8] tipc_nametbl_remove_publ+0x58/0x110 [tipc]
>>> <6>[aec43cd0] [c13ebc68] tipc_nametbl_withdraw+0x68/0x140 [tipc]
>>> <6>[aec43d00] [c13f3c34] tipc_nl_node_dump_link+0x1904/0x45d0 [tipc]
>>> <6>[aec43d30] [c13f4848] tipc_nl_node_dump_link+0x2518/0x45d0 [tipc]
>>>
>> TIPC_CMD_SHOW_LINK_STATS or TIPC_NL_LINK_GET
>>
>>> <6>[aec43d70] [804f29e0] sock_release+0x30/0xf0
>>> <6>[aec43d80] [804f2ab4] sock_close+0x14/0x30
>>> <6>[aec43d90] [80105844] __fput+0x94/0x200
>>> <6>[aec4

Re: [tipc-discussion] v4.7: soft lockup when releasing a socket

2016-11-20 Thread John Thompson
Hi Partha,

In my testing over the weekend the patch performed well - I didn't see any
kernel dumps due to this issue.

Thanks for the quick response.
JT


On Fri, Nov 18, 2016 at 10:34 AM, John Thompson <thompa@gmail.com>
wrote:

> Hi,
>
> I will be able to have some test results by the start of next week on the
> first patch.
>
> Regards,
> JT
>
>
> On Thu, Nov 17, 2016 at 11:27 PM, Ying Xue <ying@windriver.com> wrote:
>
>> On 11/17/2016 07:04 AM, John Thompson wrote:
>>
>>> Hi Partha / Ying,
>>>
>>> I will try out the patch and let you know how it goes.
>>> I also note about providing the other CPU core dumps - in one of my
>>> cases I
>>> didn't have them but in others I did but
>>> they were interleaved and so were difficult to interpret.
>>>
>>
>> Thanks, it's unnecessary for us to collect more logs as its soft lockup
>> scenario should be just what Partha described.
>>
>> Regards,
>> Ying
>>
>>
>>
>>> Thanks for getting a patch together so quickly.
>>>
>>> JT
>>>
>>> On Wed, Nov 16, 2016 at 10:23 PM, Parthasarathy Bhuvaragan <
>>> parthasarathy.bhuvara...@ericsson.com> wrote:
>>>
>>> Hi Ying / John,
>>>>
>>>> The soft lock is the call chain of tipc_nametbl_withdraw(), when it
>>>> performs the tipc_conn_kref_release() as it tries to grab nametbl_lock
>>>> again while holding it already.
>>>>
>>>>> tipc_nametbl_withdraw
>>>>>   spin_lock_bh(>nametbl_lock);
>>>>>   tipc_nametbl_remove_publ
>>>>>  spin_lock_bh(>lock);
>>>>>  tipc_nameseq_remove_publ
>>>>>tipc_subscrp_report_overlap
>>>>>  tipc_subscrp_send_event
>>>>> tipc_conn_sendmsg
>>>>>
>>>> << Here, the (test_bit(CF_CONNECTED, >flags)) Fails, leading to the
>>>> else case where do do a conn_put() and that triggers the cleanup as
>>>> refcount reached 0. Leading the call chain below : >>
>>>> tipc_conn_kref_release
>>>>tipc_sock_release
>>>>  tipc_conn_release
>>>> tipc_subscrb_delete
>>>>tipc_subscrp_delete
>>>>   tipc_nametbl_unsubscribe
>>>>  spin_lock_bh(>nametbl_lock);  << !! Soft Lockup >>
>>>>
>>>> One cause is that tipc_exit_net() calls first calls tipc_topsrv_stop()
>>>> and
>>>> then tipc_nametbl_withdraw() in scope of tipc_net_stop().
>>>>
>>>> The above chain will only occur in a narrow window for a given
>>>> connection:
>>>> CPU#1:
>>>> tipc_nametbl_withdraw() manages to perform tipc_conn_lookup() and steps
>>>> the refcount to 2, while in CPU#2 the following occurs:
>>>> CPU#2:
>>>> tipc_server_stop() calls tipc_close_conn(con). This performs a
>>>> conn_put()
>>>> decrementing refcount to 1.
>>>> Now, CPU#1 continues and detects that the connection is not CF_CONNECTED
>>>> and does a conn_put(), triggering the release callback.
>>>>
>>>> Before commit 333f796235a527, the above wont happen.
>>>>
>>>> /Partha
>>>>
>>>>
>>>> On 11/15/2016 04:11 PM, Xue, Ying wrote:
>>>>
>>>> Hi John,
>>>>>
>>>>> Regarding the stack trace you provided below, I get the two potential
>>>>> call chains:
>>>>>
>>>>> tipc_nametbl_withdraw
>>>>>   spin_lock_bh(>nametbl_lock);
>>>>>   tipc_nametbl_remove_publ
>>>>>  spin_lock_bh(>lock);
>>>>>  tipc_nameseq_remove_publ
>>>>>tipc_subscrp_report_overlap
>>>>>  tipc_subscrp_send_event
>>>>> tipc_conn_sendmsg
>>>>>spin_lock_bh(>outqueue_lock);
>>>>>list_add_tail(>list, >outqueue);
>>>>>
>>>>>
>>>>> tipc_topsrv_stop
>>>>>   tipc_server_stop
>>>>> tipc_close_conn
>>>>>   kernel_sock_shutdown
>>>>> tipc_subscrb_delete
>>>>>   spin_lock_bh(>lock);
>>>>>   tipc_nametbl_unsubscribe(sub);
>>>>>spin_lock_bh(>nametbl_lock);
>>>>

[tipc-discussion] v4.7: soft lockup when releasing a socket

2016-11-14 Thread John Thompson
Hi,

I am seeing an occasional kernel soft lockup.  I have TIPC v4.7 and the
kernel dump occurs
when the system is going down for a reboot.

The kernel dump is:

<0>NMI watchdog: BUG: soft lockup - CPU#2 stuck for 23s! [exfx:1474]
<6>Modules linked in: tipc jitterentropy_rng echainiv drbg
platform_driver(O) ipifwd(PO)
...
<6>
<6>GPR00: c15333e8 a4e0fb80 a4ee3600 a51748ac  ae475024
a537feec fffd
<6>GPR08: a2197408 0001 0001 0004 80691c00
<6>NIP [80691c40] _raw_spin_lock_bh+0x40/0x70
<6>LR [c1534f30] tipc_nametbl_unsubscribe+0x50/0x120 [tipc]
<6>Call Trace:
<6>[a4e0fba0] [c15333e8] tipc_named_reinit+0xf8/0x820 [tipc]
<6>[a4e0fbb0] [c15336a0] tipc_named_reinit+0x3b0/0x820 [tipc]
<6>[a4e0fbd0] [c1540bac] tipc_nl_publ_dump+0x50c/0xed0 [tipc]
<6>[a4e0fc00] [c154164c] tipc_conn_sendmsg+0xdc/0x170 [tipc]
<6>[a4e0fc30] [c1533c9c] tipc_subscrp_report_overlap+0xbc/0xd0 [tipc]
<6>[a4e0fc70] [c153425c] tipc_topsrv_stop+0x45c/0x4f0 [tipc]
<6>[a4e0fca0] [c1534788] tipc_nametbl_remove_publ+0x58/0x110 [tipc]
<6>[a4e0fcd0] [c1534c48] tipc_nametbl_withdraw+0x68/0x140 [tipc]
<6>[a4e0fd00] [c153cc24] tipc_nl_node_dump_link+0x1904/0x45d0 [tipc]
<6>[a4e0fd30] [c153d838] tipc_nl_node_dump_link+0x2518/0x45d0 [tipc]
<6>[a4e0fd70] [804f2870] sock_release+0x30/0xf0
<6>[a4e0fd80] [804f2944] sock_close+0x14/0x30
<6>[a4e0fd90] [80105844] __fput+0x94/0x200
<6>[a4e0fdb0] [8003dca4] task_work_run+0xd4/0x100
<6>[a4e0fdd0] [80023620] do_exit+0x280/0x980
<6>[a4e0fe10] [80024c48] do_group_exit+0x48/0xb0
<6>[a4e0fe30] [80030344] get_signal+0x244/0x4f0
<6>[a4e0fe80] [80007734] do_signal+0x34/0x1c0
<6>[a4e0ff30] [800079a8] do_notify_resume+0x68/0x80
<6>[a4e0ff40] [8000fa1c] do_user_signal+0x74/0xc4


>From the stack dump it looks like tipc_named_reinit is trying to
acquire nametbl_lock.

>From looking at the call chain I can see that tipc_conn_sendmsg can
send up calling conn_put

which will go on and call the tipc_named_reinit via tipc_sock_release.

As tipc_nametbl_withdraw (from the stack dump) has already acquired
the nametbl_lock, tipc_named_reinit

cannot get it and so the process hangs.


The call to tipc_sock_release (added in Commit 333f796235a527
)
seems to have changed the behaviour

such that it tries to do a lot more when shutting the connection down.


If there is other information I can provide please let me know.

Regards,

John
--
___
tipc-discussion mailing list
tipc-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/tipc-discussion


Re: [tipc-discussion] [PATCH net v2 1/1] tipc: fix broadcast link synchronization problem

2016-10-27 Thread John Thompson
Hi Jon,

My overnight testing has shown no problems with the fix.  I think it is ok.


JT

On Thu, Oct 27, 2016 at 4:30 PM, John Thompson <thompa@gmail.com> wrote:

> Hi Jon,
>
> I have had some help testing this but unfortunately the testing has not
> been able to be done yet.  It is to be done tonight a long run overnight
> test to see if we can reproduce the problem.
>
> Sorry about the delay in getting back to you but the people assisting with
> the testing have been on other things.
>
> I will be in touch tomorrow.
> JT
>
>
> On Tue, Oct 25, 2016 at 2:15 PM, Jon Maloy <ma...@donjonn.com> wrote:
>
>> Hi John,
>>
>> Any news about this? I would like to post this patch if you can confirm
>> that it is working.
>>
>> ///jon
>>
>> On 10/12/2016 06:55 PM, John Thompson wrote:
>>
>> Hi,
>>
>> My initial testing has looked good with this patch.  I am going to be
>> performing more
>> in depth testing over the weekend and will be in touch next week about
>> how it has
>> held up.
>>
>> Thanks,
>> JT
>>
>>
>> On Sat, Oct 1, 2016 at 10:23 PM, Jon Maloy <jon.ma...@ericsson.com>
>> wrote:
>>
>>> In commit 2d18ac4ba745 ("tipc: extend broadcast link initialization
>>> criteria") we tried to fix a problem with the initial synchronization
>>> of broadcast link acknowledge values. Unfortunately that solution is
>>> not sufficient to solve the issue.
>>>
>>> We have seen it happen that LINK_PROTOCOL/STATE packets with a valid
>>> non-zero unicast acknowledge number may bypass BCAST_PROTOCOL
>>> initialization, NAME_DISTRIBUTOR and other STATE packets with invalid
>>> broadcast acknowledge numbers, leading to premature opening of the
>>> broadcast link. When the bypassed packets finally arrive, they are
>>> inadvertently accepted, and the already correctly initialized
>>> acknowledge number in the broadcast receive link is overwritten by
>>> the invalid (zero) value of the said packets. After this the broadcast
>>> link goes stale.
>>>
>>> We now fix this by marking the packets where we know the acknowledge
>>> value is or may be invalid, and then ignoring the acks from those.
>>>
>>> To this purpose, we claim an unused bit in the header to indicate that
>>> the value is invalid. We set the bit to 1 in the initial BCAST_PROTOCOL
>>> synchronization packet and all initial ("bulk") NAME_DISTRIBUTOR
>>> packets, plus those LINK_PROTOCOL packets sent out before the broadcast
>>> links are fully synchronized.
>>>
>>> This minor protocol update is fully backwards compatible.
>>>
>>> Reported-by: John Thompson <thompa@gmail.com>
>>> Signed-off-by: Jon Maloy <jon.ma...@ericsson.com>
>>> ---
>>>  net/tipc/bcast.c  | 14 ++
>>>  net/tipc/bcast.h  |  3 ++-
>>>  net/tipc/link.c   |  2 ++
>>>  net/tipc/msg.h| 17 +
>>>  net/tipc/name_distr.c |  1 +
>>>  net/tipc/node.c   |  2 +-
>>>  6 files changed, 33 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/net/tipc/bcast.c b/net/tipc/bcast.c
>>> index 753f774..aa1babb 100644
>>> --- a/net/tipc/bcast.c
>>> +++ b/net/tipc/bcast.c
>>> @@ -247,11 +247,17 @@ int tipc_bcast_rcv(struct net *net, struct
>>> tipc_link *l, struct sk_buff *skb)
>>>   *
>>>   * RCU is locked, no other locks set
>>>   */
>>> -void tipc_bcast_ack_rcv(struct net *net, struct tipc_link *l, u32 acked)
>>> +void tipc_bcast_ack_rcv(struct net *net, struct tipc_link *l,
>>> +   struct tipc_msg *hdr)
>>>  {
>>> struct sk_buff_head *inputq = _bc_base(net)->inputq;
>>> +   u16 acked = msg_bcast_ack(hdr);
>>> struct sk_buff_head xmitq;
>>>
>>> +   /* Ignore bc acks sent by peer before bcast synch point was
>>> received */
>>> +   if (msg_bc_ack_invalid(hdr))
>>> +   return;
>>> +
>>> __skb_queue_head_init();
>>>
>>> tipc_bcast_lock(net);
>>> @@ -279,11 +285,11 @@ int tipc_bcast_sync_rcv(struct net *net, struct
>>> tipc_link *l,
>>> __skb_queue_head_init();
>>>
>>> tipc_bcast_lock(net);
>>> -   if (msg_type(hdr) == STATE_MSG) {
>>> +   if (msg_type(hdr) != STATE_MSG) {
>>> +   tipc_link_bc_init_rcv(l, hdr

Re: [tipc-discussion] [PATCH net v2 1/1] tipc: fix broadcast link synchronization problem

2016-10-26 Thread John Thompson
Hi Jon,

I have had some help testing this but unfortunately the testing has not
been able to be done yet.  It is to be done tonight a long run overnight
test to see if we can reproduce the problem.

Sorry about the delay in getting back to you but the people assisting with
the testing have been on other things.

I will be in touch tomorrow.
JT


On Tue, Oct 25, 2016 at 2:15 PM, Jon Maloy <ma...@donjonn.com> wrote:

> Hi John,
>
> Any news about this? I would like to post this patch if you can confirm
> that it is working.
>
> ///jon
>
> On 10/12/2016 06:55 PM, John Thompson wrote:
>
> Hi,
>
> My initial testing has looked good with this patch.  I am going to be
> performing more
> in depth testing over the weekend and will be in touch next week about how
> it has
> held up.
>
> Thanks,
> JT
>
>
> On Sat, Oct 1, 2016 at 10:23 PM, Jon Maloy <jon.ma...@ericsson.com> wrote:
>
>> In commit 2d18ac4ba745 ("tipc: extend broadcast link initialization
>> criteria") we tried to fix a problem with the initial synchronization
>> of broadcast link acknowledge values. Unfortunately that solution is
>> not sufficient to solve the issue.
>>
>> We have seen it happen that LINK_PROTOCOL/STATE packets with a valid
>> non-zero unicast acknowledge number may bypass BCAST_PROTOCOL
>> initialization, NAME_DISTRIBUTOR and other STATE packets with invalid
>> broadcast acknowledge numbers, leading to premature opening of the
>> broadcast link. When the bypassed packets finally arrive, they are
>> inadvertently accepted, and the already correctly initialized
>> acknowledge number in the broadcast receive link is overwritten by
>> the invalid (zero) value of the said packets. After this the broadcast
>> link goes stale.
>>
>> We now fix this by marking the packets where we know the acknowledge
>> value is or may be invalid, and then ignoring the acks from those.
>>
>> To this purpose, we claim an unused bit in the header to indicate that
>> the value is invalid. We set the bit to 1 in the initial BCAST_PROTOCOL
>> synchronization packet and all initial ("bulk") NAME_DISTRIBUTOR
>> packets, plus those LINK_PROTOCOL packets sent out before the broadcast
>> links are fully synchronized.
>>
>> This minor protocol update is fully backwards compatible.
>>
>> Reported-by: John Thompson <thompa@gmail.com>
>> Signed-off-by: Jon Maloy <jon.ma...@ericsson.com>
>> ---
>>  net/tipc/bcast.c  | 14 ++
>>  net/tipc/bcast.h  |  3 ++-
>>  net/tipc/link.c   |  2 ++
>>  net/tipc/msg.h| 17 +
>>  net/tipc/name_distr.c |  1 +
>>  net/tipc/node.c   |  2 +-
>>  6 files changed, 33 insertions(+), 6 deletions(-)
>>
>> diff --git a/net/tipc/bcast.c b/net/tipc/bcast.c
>> index 753f774..aa1babb 100644
>> --- a/net/tipc/bcast.c
>> +++ b/net/tipc/bcast.c
>> @@ -247,11 +247,17 @@ int tipc_bcast_rcv(struct net *net, struct
>> tipc_link *l, struct sk_buff *skb)
>>   *
>>   * RCU is locked, no other locks set
>>   */
>> -void tipc_bcast_ack_rcv(struct net *net, struct tipc_link *l, u32 acked)
>> +void tipc_bcast_ack_rcv(struct net *net, struct tipc_link *l,
>> +   struct tipc_msg *hdr)
>>  {
>> struct sk_buff_head *inputq = _bc_base(net)->inputq;
>> +   u16 acked = msg_bcast_ack(hdr);
>> struct sk_buff_head xmitq;
>>
>> +   /* Ignore bc acks sent by peer before bcast synch point was
>> received */
>> +   if (msg_bc_ack_invalid(hdr))
>> +   return;
>> +
>> __skb_queue_head_init();
>>
>> tipc_bcast_lock(net);
>> @@ -279,11 +285,11 @@ int tipc_bcast_sync_rcv(struct net *net, struct
>> tipc_link *l,
>> __skb_queue_head_init();
>>
>> tipc_bcast_lock(net);
>> -   if (msg_type(hdr) == STATE_MSG) {
>> +   if (msg_type(hdr) != STATE_MSG) {
>> +   tipc_link_bc_init_rcv(l, hdr);
>> +   } else if (!msg_bc_ack_invalid(hdr)) {
>> tipc_link_bc_ack_rcv(l, msg_bcast_ack(hdr), );
>> rc = tipc_link_bc_sync_rcv(l, hdr, );
>> -   } else {
>> -   tipc_link_bc_init_rcv(l, hdr);
>> }
>> tipc_bcast_unlock(net);
>>
>> diff --git a/net/tipc/bcast.h b/net/tipc/bcast.h
>> index 5ffe344..855d53c 100644
>> --- a/net/tipc/bcast.h
>> +++ b/net/tipc/bcast.h
>> @@ -55,7 +55,8 @@ void tipc_bcast_dec_bearer_dst_cnt(struct net *net,
>> int bearer_id);
>>  i

Re: [tipc-discussion] [PATCH net 1/1] tipc: fix broadcast link synchronization problem

2016-09-07 Thread John THompson
Hi Jon,

I will give this patch a test and get back in touch.  I am on vacation for
a few days and so will be in touch next week.

Thanks,
JT

On Thu, Sep 8, 2016 at 12:28 AM, Jon Maloy <jon.ma...@ericsson.com> wrote:

> Hi John,
> I am unable to reproduce the scenario you are describing, but from your
> debug data it looks clear enough: a STATE packet with ack value !=0 has
> bypassed the first in-sequence packets, which still have an invalid
> pre-synch ack value.
>
> If that diagnosis is right, this patch should solve the problem. Please
> test this and give feedback.
>
> BR
> ///jon
>
>
> > -Original Message-
> > From: Jon Maloy [mailto:jon.ma...@ericsson.com]
> > Sent: Wednesday, 07 September, 2016 08:18
> > To: tipc-discussion@lists.sourceforge.net; Parthasarathy Bhuvaragan
> > <parthasarathy.bhuvara...@ericsson.com>; Ying Xue
> > <ying@windriver.com>; Richard Alpe <richard.a...@ericsson.com>; Jon
> > Maloy <jon.ma...@ericsson.com>
> > Cc: ma...@donjonn.com; gbala...@gmail.com
> > Subject: [PATCH net 1/1] tipc: fix broadcast link synchronization problem
> >
> > In commit 2d18ac4ba745 ("tipc: extend broadcast link initialization
> > criteria") we tried to fix a problem with the initial synchronization
> > of broadcast link acknowledge values. Unfortunately that solution is
> > not sufficient to solve the problem.
> >
> > We have seen it happen that a LINK_PROTOCOL/STATE message with a valid
> > non-zero unicast acknowledge number may bypass BCAST_PROTOCOL
> > initialization and NAME_DISRIBUTOR packets with invalid broadcast
> > acknowledge numbers, leading to premature opening of the broadcast link.
> > When this happens, the already correctly initialized acknowledge number
> > of the broadcast receive link will be overwritten by the invalid (zero)
> > value of the said packets, and the broadcast link risks going stale.
> >
> > We now fix this by ignoring acknowledges that we know potentially may
> > be invalid. Identifying such acknowledges is easy, since their broadcast
> > acknowledge value always is zero, and since this can only happen to
> > BCAST_PROTOCOL and NAME_DISTRIBUTOR messages.
> >
> > Reported-by: John Thompson <thompa@gmail.com>
> > Signed-off-by: Jon Maloy <jon.ma...@ericsson.com>
> > ---
> >  net/tipc/bcast.c | 7 ++-
> >  net/tipc/bcast.h | 3 ++-
> >  net/tipc/node.c  | 2 +-
> >  3 files changed, 9 insertions(+), 3 deletions(-)
> >
> > diff --git a/net/tipc/bcast.c b/net/tipc/bcast.c
> > index ae469b3..27ffa37 100644
> > --- a/net/tipc/bcast.c
> > +++ b/net/tipc/bcast.c
> > @@ -247,11 +247,16 @@ int tipc_bcast_rcv(struct net *net, struct
> tipc_link *l,
> > struct sk_buff *skb)
> >   *
> >   * RCU is locked, no other locks set
> >   */
> > -void tipc_bcast_ack_rcv(struct net *net, struct tipc_link *l, u32 acked)
> > +void tipc_bcast_ack_rcv(struct net *net, struct tipc_link *l,
> > + int usr, u16 acked)
> >  {
> >   struct sk_buff_head *inputq = _bc_base(net)->inputq;
> >   struct sk_buff_head xmitq;
> >
> > + /* Ignore acks potentially sent before bcast synch point was
> received */
> > + if (!acked && ((usr == BCAST_PROTOCOL) || (usr ==
> > NAME_DISTRIBUTOR)))
> > + return;
> > +
> >   __skb_queue_head_init();
> >
> >   tipc_bcast_lock(net);
> > diff --git a/net/tipc/bcast.h b/net/tipc/bcast.h
> > index d5e79b3..0529cb5 100644
> > --- a/net/tipc/bcast.h
> > +++ b/net/tipc/bcast.h
> > @@ -55,7 +55,8 @@ void tipc_bcast_dec_bearer_dst_cnt(struct net *net,
> int
> > bearer_id);
> >  int  tipc_bcast_get_mtu(struct net *net);
> >  int tipc_bcast_xmit(struct net *net, struct sk_buff_head *list);
> >  int tipc_bcast_rcv(struct net *net, struct tipc_link *l, struct sk_buff
> *skb);
> > -void tipc_bcast_ack_rcv(struct net *net, struct tipc_link *l, u32
> acked);
> > +void tipc_bcast_ack_rcv(struct net *net, struct tipc_link *l,
> > + int usr, u16 acked);
> >  void tipc_bcast_sync_rcv(struct net *net, struct tipc_link *l,
> >struct tipc_msg *hdr);
> >  int tipc_nl_add_bc_link(struct net *net, struct tipc_nl_msg *msg);
> > diff --git a/net/tipc/node.c b/net/tipc/node.c
> > index 2197419..42f98af 100644
> > --- a/net/tipc/node.c
> > +++ b/net/tipc/node.c
> > @@ -1507,7 +1507,7 @@ void tipc_rcv(struct net *net, struct sk_buff
> *skb, struct
> > tipc_bearer *b)
> >  

Re: [tipc-discussion] BC rcv link acked stuck after receiving a named with a BC ACK of 0

2016-09-06 Thread John THompson
Hi Jon,

My comments are below.

JT

On Wed, Sep 7, 2016 at 12:48 AM, Jon Maloy <jon.ma...@ericsson.com> wrote:

> Hi John,
>
> See below.
>
>
>
> ///jon
>
>
>
>
>
> *From:* John THompson [mailto:thompa@gmail.com]
> *Sent:* Tuesday, 06 September, 2016 00:14
> *To:* Jon Maloy <ma...@donjonn.com>
> *Cc:* Jon Maloy <jon.ma...@ericsson.com>; tipc-discussion@lists.
> sourceforge.net
> *Subject:* Re: [tipc-discussion] BC rcv link acked stuck after receiving
> a named with a BC ACK of 0
>
>
>
> Hi Jon,
>
>
>
> The packet I see the error happening on is when receiving a usr 11
>
> (NAME_DISTRIBUTOR) over the unicast link.
>
> The reception of this packet is happening interleaved with processing
>
> a packet (or packets) on the BC link that has brought the peer up.
>
> The BC link packet processing has the tipc_bcast_lock and the unicast
>
> pkt processing cannot get the bcast lock for a while.
>
> When it can get the lock it processes the BC ack == 0 from the
> NAME_DISTRIBUTOR
>
> packet and sets the acked field on the BC link to 0.
>
>
>
> The debug / call trace below is me trying to show from the debug I
> captured what happens.
>
> If I add debug for each pkt the problem doesn't reproduce.
>
>
>
>
>
> tipc_rcv 1.1.5:vcs_mgmt-1.1.18:vcs_mgmt bc ack rcv 0 uc seq 3 ack 0 user
> 11 type 0
>
>   + calls tipc_bcast_ack_rcv
>
> tipc_rcv
>
>   + tipc_bcast_ack_rcv
>
> + tipc_link_bc_ack_rcv broadcast-link-5-18 bc ack 53574 - can't ack as
> link not up 1 or peer not up 1
>
>
>
> What kind of packet was this?
>
This is a LINK_PROTOCOL STATE msg received on the unicast link
Dbg of the packet:
broadcast-link-5-18 bc ack 0 orig_l 1.1.5:vcs_mgmt-1.1.18:vcs_mgmt (hdr usr
7 type 0 bc ack 0 seq 32771 ack 0)- can't ack as link not up 0 or peer not
up 1



>
> tipc_rcv
>
>   + tipc_bcast_ack_rcv
>
> + tipc_link_bc_ack_rcv broadcast-link-5-18 bc ack 53574 - can't ack
> as link not up 0 or peer not up 1
>
>
>
> And this?
>
This is a LINK_PROTOCOL STATE msg received on the unicast link, with ack ==
1, therefore this packet will be the first pkt that sets bc_peer_is_up to
true.
Dbg of the packet:
broadcast-link-5-18 bc ack 43402 orig_l 1.1.5:vcs_mgmt-1.1.18:vcs_mgmt (hdr
usr 7 type 0 bc ack 43402 seq 32771 ack 1)- can't ack as link not up 0 or
peer not up 1


>
>
> ===
>
> Somewhere at this point bc_peer_is_up gets set
>
> ===
>
> tipc_rcv
>
>   + tipc_bcast_ack_rcv
>
> + tipc_link_bc_ack_rcv broadcast-link-5-18 bc ack - acked (53574) less
> than it was previously (53574)
>
> tipc_rcv
>
>   + tipc_bcast_ack_rcv
>
> + tipc_link_bc_ack_rcv broadcast-link-5-18 bc ack - acked (53574) less
> than it was previously (53574)
>
>
>
>   + from tipc_rcv on unicast link
>
> + tipc_bcast_ack_rcv Going to set BC ACK outside window, new 0 old
> 53574 win 200
>
>   - dump_stack
>
> CPU: 2 PID: 19 Comm: ksoftirqd/2 Tainted: P   O4.4.6-at1 #3
>
> Call Trace:
>
> [a3093a80] [806943b0] dump_stack+0x84/0xb0 (unreliable)
>
> [a3093a90] [c1507314] tipc_link_bc_ack_rcv+0x244/0x250 [tipc]
>
> [a3093ab0] [c1501b04] tipc_bcast_ack_rcv+0x74/0xd0 [tipc]
>
> [a3093ae0] [c1511a08] tipc_rcv+0x468/0xa30 [tipc]
>
> [a3093b80] [c150218c] tipc_bcast_stop+0xfc/0x7b0 [tipc]
>
> [a3093b90] [8050d6a8] __netif_receive_skb_core+0x468/0xa10
>
> [a3093c30] [80510b6c] netif_receive_skb_internal+0x3c/0xe0
>
> [a3093c60] [8064b2b8] br_handle_frame_finish+0x1d8/0x4d0
>
> [a3093cd0] [8064b7a0] br_handle_frame+0x1f0/0x330
>
> [a3093d20] [8050d738] __netif_receive_skb_core+0x4f8/0xa10
>
> [a3093dc0] [805119f0] process_backlog+0x90/0x140
>
> [a3093df0] [8051103c] net_rx_action+0x15c/0x320
>
> [a3093e50] [8002594c] __do_softirq+0x13c/0x250
>
> [a3093eb0] [80025ab0] run_ksoftirqd+0x50/0x80
>
> [a3093ec0] [800434c4] smpboot_thread_fn+0x1e4/0x1f0
>
> [a3093ef0] [8003fb38] kthread+0xc8/0xe0
>
> [a3093f40] [8000eed8] ret_from_kernel_thread+0x5c/0x64
>
>
>
> I am going to send in a patch that adds checking for a valid BC ack (being
> within the window size) to
>
> tipc_link_bc_ack_rcv.
>
>
>
> Not sure that is a good idea. Even if #0 happens to be within a valid
> range it is still invalid, and may lead to an inadvertent release of
> packets which are not ready to be released yet.  I’ll try to take a
> closer look at this today.
>
I thought that #0 was a valid value when the seqno wraps around?  There
doesn't appear to be any special handling for the wrap around case.  I
agree with your statement that it might lead to inadvertent release 

Re: [tipc-discussion] BC rcv link acked stuck after receiving a named with a BC ACK of 0

2016-09-05 Thread John THompson
Hi Jon,

The packet I see the error happening on is when receiving a usr 11
(NAME_DISTRIBUTOR) over the unicast link.
The reception of this packet is happening interleaved with processing
a packet (or packets) on the BC link that has brought the peer up.
The BC link packet processing has the tipc_bcast_lock and the unicast
pkt processing cannot get the bcast lock for a while.
When it can get the lock it processes the BC ack == 0 from the
NAME_DISTRIBUTOR
packet and sets the acked field on the BC link to 0.

The debug / call trace below is me trying to show from the debug I captured
what happens.
If I add debug for each pkt the problem doesn't reproduce.


tipc_rcv 1.1.5:vcs_mgmt-1.1.18:vcs_mgmt bc ack rcv 0 uc seq 3 ack 0 user 11
type 0
  + calls tipc_bcast_ack_rcv
tipc_rcv
  + tipc_bcast_ack_rcv
+ tipc_link_bc_ack_rcv broadcast-link-5-18 bc ack 53574 - can't ack as
link not up 1 or peer not up 1
tipc_rcv
  + tipc_bcast_ack_rcv
+ tipc_link_bc_ack_rcv broadcast-link-5-18 bc ack 53574 - can't ack as
link not up 0 or peer not up 1
===
Somewhere at this point bc_peer_is_up gets set
===
tipc_rcv
  + tipc_bcast_ack_rcv
+ tipc_link_bc_ack_rcv broadcast-link-5-18 bc ack - acked (53574) less
than it was previously (53574)
tipc_rcv
  + tipc_bcast_ack_rcv
+ tipc_link_bc_ack_rcv broadcast-link-5-18 bc ack - acked (53574) less
than it was previously (53574)

  + from tipc_rcv on unicast link
+ tipc_bcast_ack_rcv Going to set BC ACK outside window, new 0 old
53574 win 200
  - dump_stack
CPU: 2 PID: 19 Comm: ksoftirqd/2 Tainted: P   O4.4.6-at1 #3
Call Trace:
[a3093a80] [806943b0] dump_stack+0x84/0xb0 (unreliable)
[a3093a90] [c1507314] tipc_link_bc_ack_rcv+0x244/0x250 [tipc]
[a3093ab0] [c1501b04] tipc_bcast_ack_rcv+0x74/0xd0 [tipc]
[a3093ae0] [c1511a08] tipc_rcv+0x468/0xa30 [tipc]
[a3093b80] [c150218c] tipc_bcast_stop+0xfc/0x7b0 [tipc]
[a3093b90] [8050d6a8] __netif_receive_skb_core+0x468/0xa10
[a3093c30] [80510b6c] netif_receive_skb_internal+0x3c/0xe0
[a3093c60] [8064b2b8] br_handle_frame_finish+0x1d8/0x4d0
[a3093cd0] [8064b7a0] br_handle_frame+0x1f0/0x330
[a3093d20] [8050d738] __netif_receive_skb_core+0x4f8/0xa10
[a3093dc0] [805119f0] process_backlog+0x90/0x140
[a3093df0] [8051103c] net_rx_action+0x15c/0x320
[a3093e50] [8002594c] __do_softirq+0x13c/0x250
[a3093eb0] [80025ab0] run_ksoftirqd+0x50/0x80
[a3093ec0] [800434c4] smpboot_thread_fn+0x1e4/0x1f0
[a3093ef0] [8003fb38] kthread+0xc8/0xe0
[a3093f40] [8000eed8] ret_from_kernel_thread+0x5c/0x64

I am going to send in a patch that adds checking for a valid BC ack (being
within the window size) to
tipc_link_bc_ack_rcv.

Cheers,
JT

On Wed, Aug 31, 2016 at 10:57 PM, John THompson <thompa@gmail.com>
wrote:

> Hi Jon,
>
> I have verified that the patch is included in my build.
> 2d18ac4ba7454a426047 (“ tipc: extend broadcast link initialization
> criteria”)
>
> I am trying to verify which packets are received when the problem occurs
> but I am having trouble getting the information out of my system at the
> moment.
>
> I will keep trying.
> Thanks,
> JT
>
>
> On Tue, Aug 30, 2016 at 6:20 PM, Jon Maloy <ma...@donjonn.com> wrote:
>
>>
>>
>> On 08/29/2016 06:48 PM, Jon Maloy wrote:
>>
>>> Hi John,
>>> Sorry for my late answer; I was on vacation for a few days.
>>> It seems I gave you the wrong commit reference in my previous mail. The
>>> one I really meant was
>>> 2d18ac4ba7454a426047 (“ tipc: extend broadcast link initialization
>>> criteria”)
>>>
>>> This one explains why the first packets sometimes get an invalid ack
>>> number, but also remedies it, and I simply cannot see how an invalid ack #0
>>> can ever be accepted when this patch is applied.
>>> I see no reason why this patch shouldn’t also be present in you code,
>>> but just to make sure, can you confirm this?
>>>
>>> I am right now wondering if a retransmission is the problem:
>>> 1: we receive pkt #2 which contains ack #1, so we set bc_peer_is_up to
>>> true.
>>>
>> Since only LINK_PROTO/STATE messages can cause bc_peer_is_up to go true,
>> the likely sequence is rather
>> 1: We receive a STATE message with unicast ack #1. This message should
>> also contain a valid, with high probability non-zero, bc_ack. bc_peer_is_up
>> is set to true.
>> 2: We receive unicast pkt#1 (BCAST init or NAMED) which contains the
>> invalid unicast ack #0. This one is now accepted.
>>
>> I believe this may happen, because STATE messages, contrary to data
>> packets, are sent as TC_PRIO_CONTROL, and may sometimes bypass data
>> messages, but I cannot see it happening as often and consistently as you
>> seem to be observing it. Another possibility is that bc_ack in the received
>>

Re: [tipc-discussion] BC rcv link acked stuck after receiving a named with a BC ACK of 0

2016-08-31 Thread John THompson
Hi Jon,

I have verified that the patch is included in my build.
2d18ac4ba7454a426047 (“ tipc: extend broadcast link initialization
criteria”)

I am trying to verify which packets are received when the problem occurs
but I am having trouble getting the information out of my system at the
moment.

I will keep trying.
Thanks,
JT


On Tue, Aug 30, 2016 at 6:20 PM, Jon Maloy <ma...@donjonn.com> wrote:

>
>
> On 08/29/2016 06:48 PM, Jon Maloy wrote:
>
>> Hi John,
>> Sorry for my late answer; I was on vacation for a few days.
>> It seems I gave you the wrong commit reference in my previous mail. The
>> one I really meant was
>> 2d18ac4ba7454a426047 (“ tipc: extend broadcast link initialization
>> criteria”)
>>
>> This one explains why the first packets sometimes get an invalid ack
>> number, but also remedies it, and I simply cannot see how an invalid ack #0
>> can ever be accepted when this patch is applied.
>> I see no reason why this patch shouldn’t also be present in you code, but
>> just to make sure, can you confirm this?
>>
>> I am right now wondering if a retransmission is the problem:
>> 1: we receive pkt #2 which contains ack #1, so we set bc_peer_is_up to
>> true.
>>
> Since only LINK_PROTO/STATE messages can cause bc_peer_is_up to go true,
> the likely sequence is rather
> 1: We receive a STATE message with unicast ack #1. This message should
> also contain a valid, with high probability non-zero, bc_ack. bc_peer_is_up
> is set to true.
> 2: We receive unicast pkt#1 (BCAST init or NAMED) which contains the
> invalid unicast ack #0. This one is now accepted.
>
> I believe this may happen, because STATE messages, contrary to data
> packets, are sent as TC_PRIO_CONTROL, and may sometimes bypass data
> messages, but I cannot see it happening as often and consistently as you
> seem to be observing it. Another possibility is that bc_ack in the received
> STATE message also is an invalid zero, although I cannot see how this can
> happen either.
>
> Regards
> ///jon
>
> 2: we receive pkt #1 retransmitted with ack #0. This now gets accepted,
>> and we are in trouble.
>>
>> I’ll try to figure out a solution to this, but it may be possible for you
>> to verify this first.
>>
>> BR
>> ///jon
>>
>>
>>
>> From: John THompson [mailto:thompa@gmail.com]
>> Sent: Wednesday, 24 August, 2016 16:22
>> To: Jon Maloy <jon.ma...@ericsson.com>
>> Cc: tipc-discussion@lists.sourceforge.net
>> Subject: Re: [tipc-discussion] BC rcv link acked stuck after receiving a
>> named with a BC ACK of 0
>>
>> Hi Jon,
>>
>> To clarify my previous email regarding the behaviour observed,
>>
>> What happens over time:
>> + remove bc peer
>> ...
>> some time until peer rejoins
>> ...
>> + add bc peer
>> + tipc_link_bc_ack_rcv
>>link is up = false, node is up = false
>>(this gets called a number of times until both the link and node are
>> up)
>>
>> + tipc_link_bc_ack_rcv
>>l->acked set to valid ack
>> ...
>> + tipc_rcv - usr 5 or 11, bc_ack = 0
>>+ tipc_bcast_ack_rcv
>>  + tipc_link_bc_ack_rcv
>>sets l->acked to 0
>>
>> Regards,
>> JT
>>
>>
>> On Thu, Aug 25, 2016 at 8:06 AM, John THompson <thompa@gmail.com
>> <mailto:thompa@gmail.com>> wrote:
>> Hi Jon,
>>
>> It is a similar problem in terms of what happens to the bc link.  I do
>> have that patch applied.
>>
>> I have added debug through the remove bc peer and various other functions
>> and the setting of the acked field to 0 is occurring when processing a
>> packet from named (msg user 11) or BCAST protocol (msg user 5).
>>
>> Thanks,
>> JT
>>
>> On Wed, Aug 24, 2016 at 10:23 PM, Jon Maloy <jon.ma...@ericsson.com
>> <mailto:jon.ma...@ericsson.com>> wrote:
>> Hi John,
>> This sounds a lot like the problem I tried to fix in
>> a71eb720355c2 ("tipc: ensure correct broadcast send buffer release when
>> peer is lost")
>> So, either that patch is not present in your kernel (if it is 4.7 it is
>> supposed to be) or my solution somehow hasn't solved the problem.
>> Can you confirm that the patch is there?
>>
>> BR
>> ///jon
>>
>> -Original Message-
>>> From: John THompson [mailto:thompa@gmail.com>> thompa@gmail.com>]
>>> Sent: Tuesday, 23 August, 2016 20:21
>>> To: tipc-discussion@lists.sourceforge.net<mailto:tipc-discussion
>>> @lists.sourceforge.net&g

Re: [tipc-discussion] BC rcv link acked stuck after receiving a named with a BC ACK of 0

2016-08-24 Thread John THompson
Hi Jon,

To clarify my previous email regarding the behaviour observed,

What happens over time:
+ remove bc peer
...
some time until peer rejoins
...
+ add bc peer
+ tipc_link_bc_ack_rcv
  link is up = false, node is up = false
  (this gets called a number of times until both the link and node are up)

+ tipc_link_bc_ack_rcv
  l->acked set to valid ack
...
+ tipc_rcv - usr 5 or 11, bc_ack = 0
  + tipc_bcast_ack_rcv
+ tipc_link_bc_ack_rcv
  sets l->acked to 0

Regards,
JT


On Thu, Aug 25, 2016 at 8:06 AM, John THompson <thompa@gmail.com> wrote:

> Hi Jon,
>
> It is a similar problem in terms of what happens to the bc link.  I do
> have that patch applied.
>
> I have added debug through the remove bc peer and various other functions
> and the setting of the acked field to 0 is occurring when processing a
> packet from named (msg user 11) or BCAST protocol (msg user 5).
>
> Thanks,
> JT
>
> On Wed, Aug 24, 2016 at 10:23 PM, Jon Maloy <jon.ma...@ericsson.com>
> wrote:
>
>> Hi John,
>> This sounds a lot like the problem I tried to fix in
>> a71eb720355c2 ("tipc: ensure correct broadcast send buffer release when
>> peer is lost")
>> So, either that patch is not present in your kernel (if it is 4.7 it is
>> supposed to be) or my solution somehow hasn't solved the problem.
>> Can you confirm that the patch is there?
>>
>> BR
>> ///jon
>>
>> > -Original Message-
>> > From: John THompson [mailto:thompa@gmail.com]
>> > Sent: Tuesday, 23 August, 2016 20:21
>> > To: tipc-discussion@lists.sourceforge.net
>> > Subject: [tipc-discussion] BC rcv link acked stuck after receiving a
>> named with a BC
>> > ACK of 0
>> >
>> > Hi,
>> >
>> > I am running TIPC 2.0 on Linux 4.7 on a cluster of Freescale QorIQ P2040
>> > and Marvell Armada-XP processors.  There are 10 nodes in all.
>> > When 2 of the nodes are removed, then rejoin the cluster we sometimes
>> see
>> > behaviour where the TIPC BC link gets stuck and eventually the backlog
>> gets
>> > full.  the 2 nodes that are joining have already connected together.
>> >
>> > The problem occurs when the BC link sndnxt value is greater than 32k on
>> one
>> > of the nodes (call it NODE1) and 2 nodes begin to join.
>> > When NODE1 detects the joining nodes, at some early point after they
>> have
>> > joined, NODE1 receives a NAMED publication with a BC ack of 0.  NODE1
>> > immediately sets its BC acked to 0 and tries to ack packets off the
>> > transmq.  No packets get removed as the new ack value doesn't match any
>> of
>> > the packets that need to be acked.
>> >
>> > The problem doesn't recover because in tipc_link_bc_ack_rcv it ensures
>> that
>> > the new acked value is more than the old acked value.  When the values
>> are
>> > greater than 32k apart this means that 0 can indeed be greater than
>> > 40,000.  So when new packets are processed the new BC ack value is
>> > considered less than the stored one (0).
>> >
>> > This results in the BC transmq getting full and the backlog getting
>> full,
>> > thereby preventing communication over the BC link between nodes.
>> >
>> > I am persisting in trying to work out why the NAMED publication has a BC
>> > ack of 0, which I think is the root cause of the problem.
>> >
>> > I think that tipc_link_bc_ack_rcv needs an extra check to ensure that an
>> > invalid BC ack value cannot be set.  I am defining invalid as being an
>> > acked value that is greater than the current BC acked value + the link
>> > window.
>> >
>> > Thanks,
>> > John
>> > 
>> --
>> > ___
>> > tipc-discussion mailing list
>> > tipc-discussion@lists.sourceforge.net
>> > https://lists.sourceforge.net/lists/listinfo/tipc-discussion
>>
>
>
--
___
tipc-discussion mailing list
tipc-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/tipc-discussion


Re: [tipc-discussion] BC rcv link acked stuck after receiving a named with a BC ACK of 0

2016-08-24 Thread John THompson
Hi Jon,

It is a similar problem in terms of what happens to the bc link.  I do have
that patch applied.

I have added debug through the remove bc peer and various other functions
and the setting of the acked field to 0 is occurring when processing a
packet from named (msg user 11) or BCAST protocol (msg user 5).

Thanks,
JT

On Wed, Aug 24, 2016 at 10:23 PM, Jon Maloy <jon.ma...@ericsson.com> wrote:

> Hi John,
> This sounds a lot like the problem I tried to fix in
> a71eb720355c2 ("tipc: ensure correct broadcast send buffer release when
> peer is lost")
> So, either that patch is not present in your kernel (if it is 4.7 it is
> supposed to be) or my solution somehow hasn't solved the problem.
> Can you confirm that the patch is there?
>
> BR
> ///jon
>
> > -Original Message-
> > From: John THompson [mailto:thompa@gmail.com]
> > Sent: Tuesday, 23 August, 2016 20:21
> > To: tipc-discussion@lists.sourceforge.net
> > Subject: [tipc-discussion] BC rcv link acked stuck after receiving a
> named with a BC
> > ACK of 0
> >
> > Hi,
> >
> > I am running TIPC 2.0 on Linux 4.7 on a cluster of Freescale QorIQ P2040
> > and Marvell Armada-XP processors.  There are 10 nodes in all.
> > When 2 of the nodes are removed, then rejoin the cluster we sometimes see
> > behaviour where the TIPC BC link gets stuck and eventually the backlog
> gets
> > full.  the 2 nodes that are joining have already connected together.
> >
> > The problem occurs when the BC link sndnxt value is greater than 32k on
> one
> > of the nodes (call it NODE1) and 2 nodes begin to join.
> > When NODE1 detects the joining nodes, at some early point after they have
> > joined, NODE1 receives a NAMED publication with a BC ack of 0.  NODE1
> > immediately sets its BC acked to 0 and tries to ack packets off the
> > transmq.  No packets get removed as the new ack value doesn't match any
> of
> > the packets that need to be acked.
> >
> > The problem doesn't recover because in tipc_link_bc_ack_rcv it ensures
> that
> > the new acked value is more than the old acked value.  When the values
> are
> > greater than 32k apart this means that 0 can indeed be greater than
> > 40,000.  So when new packets are processed the new BC ack value is
> > considered less than the stored one (0).
> >
> > This results in the BC transmq getting full and the backlog getting full,
> > thereby preventing communication over the BC link between nodes.
> >
> > I am persisting in trying to work out why the NAMED publication has a BC
> > ack of 0, which I think is the root cause of the problem.
> >
> > I think that tipc_link_bc_ack_rcv needs an extra check to ensure that an
> > invalid BC ack value cannot be set.  I am defining invalid as being an
> > acked value that is greater than the current BC acked value + the link
> > window.
> >
> > Thanks,
> > John
> > 
> --
> > ___
> > tipc-discussion mailing list
> > tipc-discussion@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/tipc-discussion
>
--
___
tipc-discussion mailing list
tipc-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/tipc-discussion


[tipc-discussion] BC rcv link acked stuck after receiving a named with a BC ACK of 0

2016-08-23 Thread John THompson
Hi,

I am running TIPC 2.0 on Linux 4.7 on a cluster of Freescale QorIQ P2040
and Marvell Armada-XP processors.  There are 10 nodes in all.
When 2 of the nodes are removed, then rejoin the cluster we sometimes see
behaviour where the TIPC BC link gets stuck and eventually the backlog gets
full.  the 2 nodes that are joining have already connected together.

The problem occurs when the BC link sndnxt value is greater than 32k on one
of the nodes (call it NODE1) and 2 nodes begin to join.
When NODE1 detects the joining nodes, at some early point after they have
joined, NODE1 receives a NAMED publication with a BC ack of 0.  NODE1
immediately sets its BC acked to 0 and tries to ack packets off the
transmq.  No packets get removed as the new ack value doesn't match any of
the packets that need to be acked.

The problem doesn't recover because in tipc_link_bc_ack_rcv it ensures that
the new acked value is more than the old acked value.  When the values are
greater than 32k apart this means that 0 can indeed be greater than
40,000.  So when new packets are processed the new BC ack value is
considered less than the stored one (0).

This results in the BC transmq getting full and the backlog getting full,
thereby preventing communication over the BC link between nodes.

I am persisting in trying to work out why the NAMED publication has a BC
ack of 0, which I think is the root cause of the problem.

I think that tipc_link_bc_ack_rcv needs an extra check to ensure that an
invalid BC ack value cannot be set.  I am defining invalid as being an
acked value that is greater than the current BC acked value + the link
window.

Thanks,
John
--
___
tipc-discussion mailing list
tipc-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/tipc-discussion