Re: 9.0-RC1 panic in tcp_input: negative winow.

2011-12-29 Thread John Baldwin
On Sunday, December 25, 2011 11:01:33 am Коньков Евгений wrote:
 Здравствуйте, John.
 
 Вы писали 20 декабря 2011 г., 16:52:44:
 
 JB On Saturday, December 17, 2011 6:21:27 pm Pawel Jakub Dawidek wrote:
  On Mon, Dec 12, 2011 at 11:00:23AM -0500, John Baldwin wrote:
   An update.  I've sent Pawel a testing patch to see if my hypothesis is 
   correct
   (www.freebsd.org/~jhb/patches/tcp_negwin_test.patch).  If it is then I 
   intend
   to commit www.freebsd.org/~jhb/patches/tcp_negwin2.patch as the fix.
  
  Unfortunately it paniced today. Take a look at:
  
http://people.freebsd.org/~pjd/misc/tcp_panic.jpg
 
 JB Ok, the one use case I was worried about is happening regularly before 
 your
 JB panic, so that is good.  Can you use gdb to figure out which call to
 JB tcp_output() is actually panic'ing?  I wonder if it is this case:
 
 JB /*
 JB  * Return any desired output.
 JB  */
 JB if (needoutput || (tp-t_flags  TF_ACKNOW)) {
 JB (void) tcp_output(tp);
 JB /* XXX: Debug */
 JB KASSERT(SEQ_GEQ(tp-rcv_adv, tp-rcv_nxt),
 JB (tcp_input: negative window after ACK));
 
 JB And if 'needoutput' is true, but TF_ACKNOW is not set, and tcp_output() 
 decides
 JB to not do anything.  I've updated tcp_negwin_test.patch to not panic if 
 that call
 JB to tcp_output() doesn't actually send a packet.  Please re-test.
 
 
 # uname -a
 FreeBSD meta-up 9.0-PRERELEASE FreeBSD 9.0-PRERELEASE #4: Sat Dec 24 13:59:20 
 EET 2011 @:/usr/obj/usr/src/sys/KES_KERN_v10  i386
 
 rebooting once per day. Now I compile kernel with debug options.
 Can you advice me which and where I find debug info when it will
 reboting next time? so I can help to debug problem

Are you using the patch at the URL above (tcp_negwin_test.patch)?  If not,
can you try applying that patch and seeing if you still get any panics?

-- 
John Baldwin
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: 9.0-RC1 panic in tcp_input: negative winow.

2011-12-29 Thread Pawel Jakub Dawidek
On Thu, Dec 29, 2011 at 11:12:59AM -0500, John Baldwin wrote:
 On Sunday, December 25, 2011 11:01:33 am Коньков Евгений wrote:
  Здравствуйте, John.
  
  Вы писали 20 декабря 2011 г., 16:52:44:
  
  JB On Saturday, December 17, 2011 6:21:27 pm Pawel Jakub Dawidek wrote:
   On Mon, Dec 12, 2011 at 11:00:23AM -0500, John Baldwin wrote:
An update.  I've sent Pawel a testing patch to see if my hypothesis is 
correct
(www.freebsd.org/~jhb/patches/tcp_negwin_test.patch).  If it is then I 
intend
to commit www.freebsd.org/~jhb/patches/tcp_negwin2.patch as the fix.
   
   Unfortunately it paniced today. Take a look at:
   
 http://people.freebsd.org/~pjd/misc/tcp_panic.jpg
  
  JB Ok, the one use case I was worried about is happening regularly before 
  your
  JB panic, so that is good.  Can you use gdb to figure out which call to
  JB tcp_output() is actually panic'ing?  I wonder if it is this case:
  
  JB /*
  JB  * Return any desired output.
  JB  */
  JB if (needoutput || (tp-t_flags  TF_ACKNOW)) {
  JB (void) tcp_output(tp);
  JB /* XXX: Debug */
  JB KASSERT(SEQ_GEQ(tp-rcv_adv, tp-rcv_nxt),
  JB (tcp_input: negative window after ACK));
  
  JB And if 'needoutput' is true, but TF_ACKNOW is not set, and tcp_output() 
  decides
  JB to not do anything.  I've updated tcp_negwin_test.patch to not panic if 
  that call
  JB to tcp_output() doesn't actually send a packet.  Please re-test.
  
  
  # uname -a
  FreeBSD meta-up 9.0-PRERELEASE FreeBSD 9.0-PRERELEASE #4: Sat Dec 24 
  13:59:20 EET 2011 @:/usr/obj/usr/src/sys/KES_KERN_v10  i386
  
  rebooting once per day. Now I compile kernel with debug options.
  Can you advice me which and where I find debug info when it will
  reboting next time? so I can help to debug problem
 
 Are you using the patch at the URL above (tcp_negwin_test.patch)?  If not,
 can you try applying that patch and seeing if you still get any panics?

I applied 1.5 days ago, so far now panics and no other messages.
I modified the patch a bit to not panic, but print a message when panic
was suppose to happen. This box is too valuable for me to panic it too
often. Because there were no debug messages I understand that the
scenerio didn't happen yet and not that the problem is fixed, right?

-- 
Pawel Jakub Dawidek   http://www.wheelsystems.com
FreeBSD committer http://www.FreeBSD.org
Am I Evil? Yes, I Am! http://yomoli.com


pgp9eZdmwuxI5.pgp
Description: PGP signature


Re: 9.0-RC1 panic in tcp_input: negative winow.

2011-12-29 Thread John Baldwin
On Thursday, December 29, 2011 3:25:02 pm Pawel Jakub Dawidek wrote:
 On Thu, Dec 29, 2011 at 11:12:59AM -0500, John Baldwin wrote:
  On Sunday, December 25, 2011 11:01:33 am Коньков Евгений wrote:
   Здравствуйте, John.
   
   Вы писали 20 декабря 2011 г., 16:52:44:
   
   JB On Saturday, December 17, 2011 6:21:27 pm Pawel Jakub Dawidek wrote:
On Mon, Dec 12, 2011 at 11:00:23AM -0500, John Baldwin wrote:
 An update.  I've sent Pawel a testing patch to see if my hypothesis 
 is correct
 (www.freebsd.org/~jhb/patches/tcp_negwin_test.patch).  If it is then 
 I intend
 to commit www.freebsd.org/~jhb/patches/tcp_negwin2.patch as the fix.

Unfortunately it paniced today. Take a look at:

  http://people.freebsd.org/~pjd/misc/tcp_panic.jpg
   
   JB Ok, the one use case I was worried about is happening regularly 
   before your
   JB panic, so that is good.  Can you use gdb to figure out which call to
   JB tcp_output() is actually panic'ing?  I wonder if it is this case:
   
   JB /*
   JB  * Return any desired output.
   JB  */
   JB if (needoutput || (tp-t_flags  TF_ACKNOW)) {
   JB (void) tcp_output(tp);
   JB /* XXX: Debug */
   JB KASSERT(SEQ_GEQ(tp-rcv_adv, tp-rcv_nxt),
   JB (tcp_input: negative window after ACK));
   
   JB And if 'needoutput' is true, but TF_ACKNOW is not set, and 
   tcp_output() decides
   JB to not do anything.  I've updated tcp_negwin_test.patch to not panic 
   if that call
   JB to tcp_output() doesn't actually send a packet.  Please re-test.
   
   
   # uname -a
   FreeBSD meta-up 9.0-PRERELEASE FreeBSD 9.0-PRERELEASE #4: Sat Dec 24 
   13:59:20 EET 2011 @:/usr/obj/usr/src/sys/KES_KERN_v10  i386
   
   rebooting once per day. Now I compile kernel with debug options.
   Can you advice me which and where I find debug info when it will
   reboting next time? so I can help to debug problem
  
  Are you using the patch at the URL above (tcp_negwin_test.patch)?  If not,
  can you try applying that patch and seeing if you still get any panics?
 
 I applied 1.5 days ago, so far now panics and no other messages.
 I modified the patch a bit to not panic, but print a message when panic
 was suppose to happen. This box is too valuable for me to panic it too
 often. Because there were no debug messages I understand that the
 scenerio didn't happen yet and not that the problem is fixed, right?

Yes.  It would be best to see the messages logged to be safe.  Thanks.

-- 
John Baldwin
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re[2]: 9.0-RC1 panic in tcp_input: negative winow.

2011-12-25 Thread Коньков Евгений
Здравствуйте, John.

Вы писали 20 декабря 2011 г., 16:52:44:

JB On Saturday, December 17, 2011 6:21:27 pm Pawel Jakub Dawidek wrote:
 On Mon, Dec 12, 2011 at 11:00:23AM -0500, John Baldwin wrote:
  An update.  I've sent Pawel a testing patch to see if my hypothesis is 
  correct
  (www.freebsd.org/~jhb/patches/tcp_negwin_test.patch).  If it is then I 
  intend
  to commit www.freebsd.org/~jhb/patches/tcp_negwin2.patch as the fix.
 
 Unfortunately it paniced today. Take a look at:
 
   http://people.freebsd.org/~pjd/misc/tcp_panic.jpg

JB Ok, the one use case I was worried about is happening regularly before your
JB panic, so that is good.  Can you use gdb to figure out which call to
JB tcp_output() is actually panic'ing?  I wonder if it is this case:

JB /*
JB  * Return any desired output.
JB  */
JB if (needoutput || (tp-t_flags  TF_ACKNOW)) {
JB (void) tcp_output(tp);
JB /* XXX: Debug */
JB KASSERT(SEQ_GEQ(tp-rcv_adv, tp-rcv_nxt),
JB (tcp_input: negative window after ACK));

JB And if 'needoutput' is true, but TF_ACKNOW is not set, and tcp_output() 
decides
JB to not do anything.  I've updated tcp_negwin_test.patch to not panic if 
that call
JB to tcp_output() doesn't actually send a packet.  Please re-test.


# uname -a
FreeBSD meta-up 9.0-PRERELEASE FreeBSD 9.0-PRERELEASE #4: Sat Dec 24 13:59:20 
EET 2011 @:/usr/obj/usr/src/sys/KES_KERN_v10  i386

rebooting once per day. Now I compile kernel with debug options.
Can you advice me which and where I find debug info when it will
reboting next time? so I can help to debug problem

-- 
С уважением,
 Коньков  mailto:kes-...@yandex.ru

___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: 9.0-RC1 panic in tcp_input: negative winow.

2011-12-20 Thread John Baldwin
On Saturday, December 17, 2011 6:21:27 pm Pawel Jakub Dawidek wrote:
 On Mon, Dec 12, 2011 at 11:00:23AM -0500, John Baldwin wrote:
  An update.  I've sent Pawel a testing patch to see if my hypothesis is 
  correct
  (www.freebsd.org/~jhb/patches/tcp_negwin_test.patch).  If it is then I 
  intend
  to commit www.freebsd.org/~jhb/patches/tcp_negwin2.patch as the fix.
 
 Unfortunately it paniced today. Take a look at:
 
   http://people.freebsd.org/~pjd/misc/tcp_panic.jpg

Ok, the one use case I was worried about is happening regularly before your
panic, so that is good.  Can you use gdb to figure out which call to
tcp_output() is actually panic'ing?  I wonder if it is this case:

/*
 * Return any desired output.
 */
if (needoutput || (tp-t_flags  TF_ACKNOW)) {
(void) tcp_output(tp);
/* XXX: Debug */
KASSERT(SEQ_GEQ(tp-rcv_adv, tp-rcv_nxt),
(tcp_input: negative window after ACK));

And if 'needoutput' is true, but TF_ACKNOW is not set, and tcp_output() decides
to not do anything.  I've updated tcp_negwin_test.patch to not panic if that 
call
to tcp_output() doesn't actually send a packet.  Please re-test.

-- 
John Baldwin
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: 9.0-RC1 panic in tcp_input: negative winow.

2011-12-17 Thread Pawel Jakub Dawidek
On Mon, Dec 12, 2011 at 11:00:23AM -0500, John Baldwin wrote:
 An update.  I've sent Pawel a testing patch to see if my hypothesis is correct
 (www.freebsd.org/~jhb/patches/tcp_negwin_test.patch).  If it is then I intend
 to commit www.freebsd.org/~jhb/patches/tcp_negwin2.patch as the fix.

Unfortunately it paniced today. Take a look at:

http://people.freebsd.org/~pjd/misc/tcp_panic.jpg

-- 
Pawel Jakub Dawidek   http://www.wheelsystems.com
FreeBSD committer http://www.FreeBSD.org
Am I Evil? Yes, I Am! http://yomoli.com


pgpdhigg3fyKF.pgp
Description: PGP signature


Re: 9.0-RC1 panic in tcp_input: negative winow.

2011-12-14 Thread Pawel Jakub Dawidek
On Mon, Dec 12, 2011 at 11:00:23AM -0500, John Baldwin wrote:
 An update.  I've sent Pawel a testing patch to see if my hypothesis is correct
 (www.freebsd.org/~jhb/patches/tcp_negwin_test.patch).  If it is then I intend
 to commit www.freebsd.org/~jhb/patches/tcp_negwin2.patch as the fix.

Sorry for the delay, but I'm just rebooting the box that triggered the
panic with John's patch. I should know more in a day or two.

-- 
Pawel Jakub Dawidek   http://www.wheelsystems.com
FreeBSD committer http://www.FreeBSD.org
Am I Evil? Yes, I Am! http://yomoli.com


pgp4LFFZCMTh1.pgp
Description: PGP signature


Re: 9.0-RC1 panic in tcp_input: negative winow.

2011-12-12 Thread John Baldwin
On Monday, October 24, 2011 8:14:22 am John Baldwin wrote:
 On Sunday, October 23, 2011 11:58:28 am Pawel Jakub Dawidek wrote:
  On Sun, Oct 23, 2011 at 11:44:45AM +0300, Kostik Belousov wrote:
   On Sun, Oct 23, 2011 at 08:10:38AM +0200, Pawel Jakub Dawidek wrote:
My suggestion would be that if we won't be able to fix it before 9.0,
we should turn this assertion off, as the system seems to be able to
recover.
   
   Shipped kernels have all assertions turned off.
  
  Yes, I'm aware of that, but many people compile their production kernels
  with INVARIANTS/INVARIANT_SUPPORT to fail early instead of eg.
  corrupting data. I'd be fine in moving this under DIAGNOSTIC or changing
  it into a printf, so it will be visible.
 
 No, the kernel is corrupting things in other places when this is true, so
 if you are running with INVARIANTS, we want to know about it.   Specifically,
 in several places in TCP we assume that rcv_adv = rcv_nxt, and depend on
 being able to do 'rcv_adv - rcv_nxt'.
 
 In this case, it looks like the difference is consistently less than one 
 frame.  I suspect the other end of the connection is sending just beyond the 
 end of the advertised window (it probably assumes it is better to send a full 
 frame if it has that much pending data even though part of it is beyond the 
 window edge vs sending a truncated packet that just fills the window) and that
 that frame is accepted ok in the header prediction case and it's ACK is 
 delayed, but the next packet to arrive then trips over this assumption.
 
 Since 'win' is guaranteed to be non-negative and we explicitly cast
 'rcv_adv - rcv_nxt' to (int) in the following line that the assert is checking
 for:
 
   tp-rcv_wnd = imax(win, (int)(tp-rcv_adv - tp-rcv_nxt));
 
 I think we already handle this case ok and perhaps the assertion can just be
 removed?  Not sure if others feel that it warrants a comment to note that this
 is the case being handled.
 
 Also, I'm not sure if this case can leak into the timewait code?  If so, we 
 will need to fix this case:
 
   /*
* Recover last window size sent.
*/
   KASSERT(SEQ_GEQ(tp-rcv_adv, tp-rcv_nxt),
   (tcp_twstart negative window: tp %p rcv_nxt %u rcv_adv %u, tp,
   tp-rcv_nxt, tp-rcv_adv));
   tw-last_win = (tp-rcv_adv - tp-rcv_nxt)  tp-rcv_scale;
 
 So that it sets last_win to 0 instead of some really huge value.

An update.  I've sent Pawel a testing patch to see if my hypothesis is correct
(www.freebsd.org/~jhb/patches/tcp_negwin_test.patch).  If it is then I intend
to commit www.freebsd.org/~jhb/patches/tcp_negwin2.patch as the fix.

-- 
John Baldwin
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: 9.0-RC1 panic in tcp_input: negative winow.

2011-10-28 Thread John Baldwin
On Friday, October 28, 2011 1:46:07 am Pawel Jakub Dawidek wrote:
 On Fri, Oct 28, 2011 at 11:29:34AM +1100, Lawrence Stewart wrote:
  On 10/26/11 22:53, John Baldwin wrote:
   The assertion would be triggered when the next packet arrives (as I said
   above).  Try modifying your debugging output to also log if the ACK is
   delayed.  I suspect it is not delayed until the last one.  (Pushing out 
an
   ACK will reset rcv_adv to be beyond rcv_nxt in tcp_output(), so in the 
case
   of an immediate ACK, rcv_nxt  rcv_adv is only a transient condition all
   under a single lock invocation so never visible to other consumers of 
the
   protocol control block.)  If that is what you see, then that confirms 
what
   I guessed above and I will likely just remove the assertion in 
tcp_input()
   and patch the timewait code to handle this case.
  
  
  Pawel, have you been able to confirm John's hypothesis? [...]
 
 Yeah, sorry. I moved the debug to the points where we drop the t_inpcb
 lock and I still see rcv_nxt being greater than rcv_adv:
 
   tcp_do_segment:2970 negative window: tp 0xfe00685ee3d0 rcv_nxt 
1312878324 rcv_adv 1312878187

Yes, I still expect this.  What I want to see is if 'delack' is always true in 
this case.

 This is just before the INP_WUNLOCK(tp-t_inpcb) under 'check_delack'
 label. I see this a lot (it was logged 545 times for 11 different tp
 pointers during 24h period).
 
   tcp_do_segment:3009 negative window: tp 0xfe005cfc6000 rcv_nxt 
1442546453 rcv_adv 1442545722
 
 This is just before calling tcp_output(). This one was logged 65 times
 for 3 different tp pointers.
 I placed a debug also after tcp_output() call, but it is not logged, so
 once we return from tcp_output() everything is fine.

That is consistent with what I expect then, since in the delack case,
tcp_output() isn't called.

-- 
John Baldwin
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: 9.0-RC1 panic in tcp_input: negative winow.

2011-10-27 Thread Lawrence Stewart

On 10/26/11 22:53, John Baldwin wrote:

On Wednesday, October 26, 2011 3:54:31 am Pawel Jakub Dawidek wrote:

On Mon, Oct 24, 2011 at 08:14:22AM -0400, John Baldwin wrote:

On Sunday, October 23, 2011 11:58:28 am Pawel Jakub Dawidek wrote:

On Sun, Oct 23, 2011 at 11:44:45AM +0300, Kostik Belousov wrote:

On Sun, Oct 23, 2011 at 08:10:38AM +0200, Pawel Jakub Dawidek wrote:

My suggestion would be that if we won't be able to fix it before 9.0,
we should turn this assertion off, as the system seems to be able to
recover.


Shipped kernels have all assertions turned off.


Yes, I'm aware of that, but many people compile their production kernels
with INVARIANTS/INVARIANT_SUPPORT to fail early instead of eg.
corrupting data. I'd be fine in moving this under DIAGNOSTIC or changing
it into a printf, so it will be visible.


No, the kernel is corrupting things in other places when this is true, so
if you are running with INVARIANTS, we want to know about it.   Specifically,
in several places in TCP we assume that rcv_adv= rcv_nxt, and depend on
being able to do 'rcv_adv - rcv_nxt'.

In this case, it looks like the difference is consistently less than one
frame.  I suspect the other end of the connection is sending just beyond the
end of the advertised window (it probably assumes it is better to send a full
frame if it has that much pending data even though part of it is beyond the
window edge vs sending a truncated packet that just fills the window) and that
that frame is accepted ok in the header prediction case and it's ACK is
delayed, but the next packet to arrive then trips over this assumption.

Since 'win' is guaranteed to be non-negative and we explicitly cast
'rcv_adv - rcv_nxt' to (int) in the following line that the assert is checking
for:

tp-rcv_wnd = imax(win, (int)(tp-rcv_adv - tp-rcv_nxt));

I think we already handle this case ok and perhaps the assertion can just be
removed?  Not sure if others feel that it warrants a comment to note that this
is the case being handled.


I added debug to the places where rcv_adv and rcv_nxt are modified. Here
is what happens before the panic occurs:

tcp_do_segment:1722 negative window: tp 0xfe000dab1b70 rcv_nxt 4022361548 
rcv_adv 4022360100 diff -1448
tcp_do_segment:2847 negative window: tp 0xfe000dab1b70 rcv_nxt 4022362298 
rcv_adv 4022361548 diff -750
tcp_do_segment:1722 negative window: tp 0xfe000dab1b70 rcv_nxt 4022363746 
rcv_adv 4022362298 diff -1448
tcp_do_segment:2847 negative window: tp 0xfe000dab1b70 rcv_nxt 4022364836 
rcv_adv 4022363746 diff -1090
tcp_do_segment:1722 negative window: tp 0xfe000dab1b70 rcv_nxt 4022366284 
rcv_adv 4022364836 diff -1448
tcp_do_segment:1722 negative window: tp 0xfe000dab1b70 rcv_nxt 4022370628 
rcv_adv 4022369690 diff -938
tcp_do_segment:1722 negative window: tp 0xfe000dab1b70 rcv_nxt 4022379140 
rcv_adv 4022377692 diff -1448
tcp_do_segment:1722 negative window: tp 0xfe000dab1b70 rcv_nxt 4022387792 
rcv_adv 4022386344 diff -1448
tcp_do_segment:2847 negative window: tp 0xfe000dab1b70 rcv_nxt 4022388890 
rcv_adv 4022387792 diff -1098
tcp_do_segment:1722 negative window: tp 0xfe000dab1b70 rcv_nxt 4022390338 
rcv_adv 4022388890 diff -1448
tcp_do_segment:2847 negative window: tp 0xfe000dab1b70 rcv_nxt 4022394563 
rcv_adv 4022394342 diff -221
panic: tcp_input negative window: tp 0xfe000dab1b70 rcv_nxt 4022394563 
rcv_adv 4022394342 win=0 diff -221

I can send you the full log if you want, I've plenty of messages where
rcv_adv  rcv_nxt, not all of them trigger this assertion.


The assertion would be triggered when the next packet arrives (as I said
above).  Try modifying your debugging output to also log if the ACK is
delayed.  I suspect it is not delayed until the last one.  (Pushing out an
ACK will reset rcv_adv to be beyond rcv_nxt in tcp_output(), so in the case
of an immediate ACK, rcv_nxt  rcv_adv is only a transient condition all
under a single lock invocation so never visible to other consumers of the
protocol control block.)  If that is what you see, then that confirms what
I guessed above and I will likely just remove the assertion in tcp_input()
and patch the timewait code to handle this case.



Pawel, have you been able to confirm John's hypothesis? What I don't 
quite get is why we haven't had a lot more reports of this issue...


Cheers,
Lawrence
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: 9.0-RC1 panic in tcp_input: negative winow.

2011-10-27 Thread Pawel Jakub Dawidek
On Fri, Oct 28, 2011 at 11:29:34AM +1100, Lawrence Stewart wrote:
 On 10/26/11 22:53, John Baldwin wrote:
  The assertion would be triggered when the next packet arrives (as I said
  above).  Try modifying your debugging output to also log if the ACK is
  delayed.  I suspect it is not delayed until the last one.  (Pushing out an
  ACK will reset rcv_adv to be beyond rcv_nxt in tcp_output(), so in the case
  of an immediate ACK, rcv_nxt  rcv_adv is only a transient condition all
  under a single lock invocation so never visible to other consumers of the
  protocol control block.)  If that is what you see, then that confirms what
  I guessed above and I will likely just remove the assertion in tcp_input()
  and patch the timewait code to handle this case.
 
 
 Pawel, have you been able to confirm John's hypothesis? [...]

Yeah, sorry. I moved the debug to the points where we drop the t_inpcb
lock and I still see rcv_nxt being greater than rcv_adv:

tcp_do_segment:2970 negative window: tp 0xfe00685ee3d0 rcv_nxt 
1312878324 rcv_adv 1312878187

This is just before the INP_WUNLOCK(tp-t_inpcb) under 'check_delack'
label. I see this a lot (it was logged 545 times for 11 different tp
pointers during 24h period).

tcp_do_segment:3009 negative window: tp 0xfe005cfc6000 rcv_nxt 
1442546453 rcv_adv 1442545722

This is just before calling tcp_output(). This one was logged 65 times
for 3 different tp pointers.
I placed a debug also after tcp_output() call, but it is not logged, so
once we return from tcp_output() everything is fine.

The panic would be triggered 115 times for 5 different tp pointers
during that time.

I write 'tp pointers' as I'm not 100% sure if the same pointer always
represents the same connection or if it is reused.

 [...] What I don't 
 quite get is why we haven't had a lot more reports of this issue...

Maybe because my TCP/IP stack is heavly modified? ...not:)

No idea to be honest. Ask Ken to turn on INVARIANTS in 9.0-RC2 and we
will see:)

-- 
Pawel Jakub Dawidek   http://www.wheelsystems.com
FreeBSD committer http://www.FreeBSD.org
Am I Evil? Yes, I Am! http://yomoli.com


pgpN4xRhoFGYE.pgp
Description: PGP signature


Re: 9.0-RC1 panic in tcp_input: negative winow.

2011-10-26 Thread Pawel Jakub Dawidek
On Mon, Oct 24, 2011 at 08:14:22AM -0400, John Baldwin wrote:
 On Sunday, October 23, 2011 11:58:28 am Pawel Jakub Dawidek wrote:
  On Sun, Oct 23, 2011 at 11:44:45AM +0300, Kostik Belousov wrote:
   On Sun, Oct 23, 2011 at 08:10:38AM +0200, Pawel Jakub Dawidek wrote:
My suggestion would be that if we won't be able to fix it before 9.0,
we should turn this assertion off, as the system seems to be able to
recover.
   
   Shipped kernels have all assertions turned off.
  
  Yes, I'm aware of that, but many people compile their production kernels
  with INVARIANTS/INVARIANT_SUPPORT to fail early instead of eg.
  corrupting data. I'd be fine in moving this under DIAGNOSTIC or changing
  it into a printf, so it will be visible.
 
 No, the kernel is corrupting things in other places when this is true, so
 if you are running with INVARIANTS, we want to know about it.   Specifically,
 in several places in TCP we assume that rcv_adv = rcv_nxt, and depend on
 being able to do 'rcv_adv - rcv_nxt'.
 
 In this case, it looks like the difference is consistently less than one 
 frame.  I suspect the other end of the connection is sending just beyond the 
 end of the advertised window (it probably assumes it is better to send a full 
 frame if it has that much pending data even though part of it is beyond the 
 window edge vs sending a truncated packet that just fills the window) and that
 that frame is accepted ok in the header prediction case and it's ACK is 
 delayed, but the next packet to arrive then trips over this assumption.
 
 Since 'win' is guaranteed to be non-negative and we explicitly cast
 'rcv_adv - rcv_nxt' to (int) in the following line that the assert is checking
 for:
 
   tp-rcv_wnd = imax(win, (int)(tp-rcv_adv - tp-rcv_nxt));
 
 I think we already handle this case ok and perhaps the assertion can just be
 removed?  Not sure if others feel that it warrants a comment to note that this
 is the case being handled.

I added debug to the places where rcv_adv and rcv_nxt are modified. Here
is what happens before the panic occurs:

tcp_do_segment:1722 negative window: tp 0xfe000dab1b70 rcv_nxt 4022361548 
rcv_adv 4022360100 diff -1448
tcp_do_segment:2847 negative window: tp 0xfe000dab1b70 rcv_nxt 4022362298 
rcv_adv 4022361548 diff -750
tcp_do_segment:1722 negative window: tp 0xfe000dab1b70 rcv_nxt 4022363746 
rcv_adv 4022362298 diff -1448
tcp_do_segment:2847 negative window: tp 0xfe000dab1b70 rcv_nxt 4022364836 
rcv_adv 4022363746 diff -1090
tcp_do_segment:1722 negative window: tp 0xfe000dab1b70 rcv_nxt 4022366284 
rcv_adv 4022364836 diff -1448
tcp_do_segment:1722 negative window: tp 0xfe000dab1b70 rcv_nxt 4022370628 
rcv_adv 4022369690 diff -938
tcp_do_segment:1722 negative window: tp 0xfe000dab1b70 rcv_nxt 4022379140 
rcv_adv 4022377692 diff -1448
tcp_do_segment:1722 negative window: tp 0xfe000dab1b70 rcv_nxt 4022387792 
rcv_adv 4022386344 diff -1448
tcp_do_segment:2847 negative window: tp 0xfe000dab1b70 rcv_nxt 4022388890 
rcv_adv 4022387792 diff -1098
tcp_do_segment:1722 negative window: tp 0xfe000dab1b70 rcv_nxt 4022390338 
rcv_adv 4022388890 diff -1448
tcp_do_segment:2847 negative window: tp 0xfe000dab1b70 rcv_nxt 4022394563 
rcv_adv 4022394342 diff -221
panic: tcp_input negative window: tp 0xfe000dab1b70 rcv_nxt 4022394563 
rcv_adv 4022394342 win=0 diff -221

I can send you the full log if you want, I've plenty of messages where
rcv_adv  rcv_nxt, not all of them trigger this assertion.

-- 
Pawel Jakub Dawidek   http://www.wheelsystems.com
FreeBSD committer http://www.FreeBSD.org
Am I Evil? Yes, I Am! http://yomoli.com


pgpjTSToeUp9v.pgp
Description: PGP signature


Re: 9.0-RC1 panic in tcp_input: negative winow.

2011-10-26 Thread John Baldwin
On Wednesday, October 26, 2011 3:54:31 am Pawel Jakub Dawidek wrote:
 On Mon, Oct 24, 2011 at 08:14:22AM -0400, John Baldwin wrote:
  On Sunday, October 23, 2011 11:58:28 am Pawel Jakub Dawidek wrote:
   On Sun, Oct 23, 2011 at 11:44:45AM +0300, Kostik Belousov wrote:
On Sun, Oct 23, 2011 at 08:10:38AM +0200, Pawel Jakub Dawidek wrote:
 My suggestion would be that if we won't be able to fix it before 9.0,
 we should turn this assertion off, as the system seems to be able to
 recover.

Shipped kernels have all assertions turned off.
   
   Yes, I'm aware of that, but many people compile their production kernels
   with INVARIANTS/INVARIANT_SUPPORT to fail early instead of eg.
   corrupting data. I'd be fine in moving this under DIAGNOSTIC or changing
   it into a printf, so it will be visible.
  
  No, the kernel is corrupting things in other places when this is true, so
  if you are running with INVARIANTS, we want to know about it.   
  Specifically,
  in several places in TCP we assume that rcv_adv = rcv_nxt, and depend on
  being able to do 'rcv_adv - rcv_nxt'.
  
  In this case, it looks like the difference is consistently less than one 
  frame.  I suspect the other end of the connection is sending just beyond 
  the 
  end of the advertised window (it probably assumes it is better to send a 
  full 
  frame if it has that much pending data even though part of it is beyond the 
  window edge vs sending a truncated packet that just fills the window) and 
  that
  that frame is accepted ok in the header prediction case and it's ACK is 
  delayed, but the next packet to arrive then trips over this assumption.
  
  Since 'win' is guaranteed to be non-negative and we explicitly cast
  'rcv_adv - rcv_nxt' to (int) in the following line that the assert is 
  checking
  for:
  
  tp-rcv_wnd = imax(win, (int)(tp-rcv_adv - tp-rcv_nxt));
  
  I think we already handle this case ok and perhaps the assertion can just be
  removed?  Not sure if others feel that it warrants a comment to note that 
  this
  is the case being handled.
 
 I added debug to the places where rcv_adv and rcv_nxt are modified. Here
 is what happens before the panic occurs:
 
 tcp_do_segment:1722 negative window: tp 0xfe000dab1b70 rcv_nxt 4022361548 
 rcv_adv 4022360100 diff -1448
 tcp_do_segment:2847 negative window: tp 0xfe000dab1b70 rcv_nxt 4022362298 
 rcv_adv 4022361548 diff -750
 tcp_do_segment:1722 negative window: tp 0xfe000dab1b70 rcv_nxt 4022363746 
 rcv_adv 4022362298 diff -1448
 tcp_do_segment:2847 negative window: tp 0xfe000dab1b70 rcv_nxt 4022364836 
 rcv_adv 4022363746 diff -1090
 tcp_do_segment:1722 negative window: tp 0xfe000dab1b70 rcv_nxt 4022366284 
 rcv_adv 4022364836 diff -1448
 tcp_do_segment:1722 negative window: tp 0xfe000dab1b70 rcv_nxt 4022370628 
 rcv_adv 4022369690 diff -938
 tcp_do_segment:1722 negative window: tp 0xfe000dab1b70 rcv_nxt 4022379140 
 rcv_adv 4022377692 diff -1448
 tcp_do_segment:1722 negative window: tp 0xfe000dab1b70 rcv_nxt 4022387792 
 rcv_adv 4022386344 diff -1448
 tcp_do_segment:2847 negative window: tp 0xfe000dab1b70 rcv_nxt 4022388890 
 rcv_adv 4022387792 diff -1098
 tcp_do_segment:1722 negative window: tp 0xfe000dab1b70 rcv_nxt 4022390338 
 rcv_adv 4022388890 diff -1448
 tcp_do_segment:2847 negative window: tp 0xfe000dab1b70 rcv_nxt 4022394563 
 rcv_adv 4022394342 diff -221
 panic: tcp_input negative window: tp 0xfe000dab1b70 rcv_nxt 4022394563 
 rcv_adv 4022394342 win=0 diff -221
 
 I can send you the full log if you want, I've plenty of messages where
 rcv_adv  rcv_nxt, not all of them trigger this assertion.

The assertion would be triggered when the next packet arrives (as I said
above).  Try modifying your debugging output to also log if the ACK is
delayed.  I suspect it is not delayed until the last one.  (Pushing out an
ACK will reset rcv_adv to be beyond rcv_nxt in tcp_output(), so in the case
of an immediate ACK, rcv_nxt  rcv_adv is only a transient condition all
under a single lock invocation so never visible to other consumers of the
protocol control block.)  If that is what you see, then that confirms what
I guessed above and I will likely just remove the assertion in tcp_input()
and patch the timewait code to handle this case.

-- 
John Baldwin
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: 9.0-RC1 panic in tcp_input: negative winow.

2011-10-26 Thread Julian Elischer

On 10/26/11 12:54 AM, Pawel Jakub Dawidek wrote:

On Mon, Oct 24, 2011 at 08:14:22AM -0400, John Baldwin wrote:

On Sunday, October 23, 2011 11:58:28 am Pawel Jakub Dawidek wrote:

On Sun, Oct 23, 2011 at 11:44:45AM +0300, Kostik Belousov wrote:

On Sun, Oct 23, 2011 at 08:10:38AM +0200, Pawel Jakub Dawidek wrote:

My suggestion would be that if we won't be able to fix it before 9.0,
we should turn this assertion off, as the system seems to be able to
recover.

Shipped kernels have all assertions turned off.

Yes, I'm aware of that, but many people compile their production kernels
with INVARIANTS/INVARIANT_SUPPORT to fail early instead of eg.
corrupting data. I'd be fine in moving this under DIAGNOSTIC or changing
it into a printf, so it will be visible.

No, the kernel is corrupting things in other places when this is true, so
if you are running with INVARIANTS, we want to know about it.   Specifically,
in several places in TCP we assume that rcv_adv= rcv_nxt, and depend on
being able to do 'rcv_adv - rcv_nxt'.

In this case, it looks like the difference is consistently less than one
frame.  I suspect the other end of the connection is sending just beyond the
end of the advertised window (it probably assumes it is better to send a full
frame if it has that much pending data even though part of it is beyond the
window edge vs sending a truncated packet that just fills the window) and that
that frame is accepted ok in the header prediction case and it's ACK is
delayed, but the next packet to arrive then trips over this assumption.

Since 'win' is guaranteed to be non-negative and we explicitly cast
'rcv_adv - rcv_nxt' to (int) in the following line that the assert is checking
for:

tp-rcv_wnd = imax(win, (int)(tp-rcv_adv - tp-rcv_nxt));

I think we already handle this case ok and perhaps the assertion can just be
removed?  Not sure if others feel that it warrants a comment to note that this
is the case being handled.

I added debug to the places where rcv_adv and rcv_nxt are modified. Here
is what happens before the panic occurs:

tcp_do_segment:1722 negative window: tp 0xfe000dab1b70 rcv_nxt 4022361548 
rcv_adv 4022360100 diff -1448
tcp_do_segment:2847 negative window: tp 0xfe000dab1b70 rcv_nxt 4022362298 
rcv_adv 4022361548 diff -750
tcp_do_segment:1722 negative window: tp 0xfe000dab1b70 rcv_nxt 4022363746 
rcv_adv 4022362298 diff -1448
tcp_do_segment:2847 negative window: tp 0xfe000dab1b70 rcv_nxt 4022364836 
rcv_adv 4022363746 diff -1090
tcp_do_segment:1722 negative window: tp 0xfe000dab1b70 rcv_nxt 4022366284 
rcv_adv 4022364836 diff -1448
tcp_do_segment:1722 negative window: tp 0xfe000dab1b70 rcv_nxt 4022370628 
rcv_adv 4022369690 diff -938
tcp_do_segment:1722 negative window: tp 0xfe000dab1b70 rcv_nxt 4022379140 
rcv_adv 4022377692 diff -1448
tcp_do_segment:1722 negative window: tp 0xfe000dab1b70 rcv_nxt 4022387792 
rcv_adv 4022386344 diff -1448
tcp_do_segment:2847 negative window: tp 0xfe000dab1b70 rcv_nxt 4022388890 
rcv_adv 4022387792 diff -1098
tcp_do_segment:1722 negative window: tp 0xfe000dab1b70 rcv_nxt 4022390338 
rcv_adv 4022388890 diff -1448
tcp_do_segment:2847 negative window: tp 0xfe000dab1b70 rcv_nxt 4022394563 
rcv_adv 4022394342 diff -221
panic: tcp_input negative window: tp 0xfe000dab1b70 rcv_nxt 4022394563 
rcv_adv 4022394342 win=0 diff -221

I can send you the full log if you want, I've plenty of messages where
rcv_adv  rcv_nxt, not all of them trigger this assertion.


Might be a good place to use the new sifter tool.

___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: 9.0-RC1 panic in tcp_input: negative winow.

2011-10-24 Thread John Baldwin
On Sunday, October 23, 2011 11:58:28 am Pawel Jakub Dawidek wrote:
 On Sun, Oct 23, 2011 at 11:44:45AM +0300, Kostik Belousov wrote:
  On Sun, Oct 23, 2011 at 08:10:38AM +0200, Pawel Jakub Dawidek wrote:
   My suggestion would be that if we won't be able to fix it before 9.0,
   we should turn this assertion off, as the system seems to be able to
   recover.
  
  Shipped kernels have all assertions turned off.
 
 Yes, I'm aware of that, but many people compile their production kernels
 with INVARIANTS/INVARIANT_SUPPORT to fail early instead of eg.
 corrupting data. I'd be fine in moving this under DIAGNOSTIC or changing
 it into a printf, so it will be visible.

No, the kernel is corrupting things in other places when this is true, so
if you are running with INVARIANTS, we want to know about it.   Specifically,
in several places in TCP we assume that rcv_adv = rcv_nxt, and depend on
being able to do 'rcv_adv - rcv_nxt'.

In this case, it looks like the difference is consistently less than one 
frame.  I suspect the other end of the connection is sending just beyond the 
end of the advertised window (it probably assumes it is better to send a full 
frame if it has that much pending data even though part of it is beyond the 
window edge vs sending a truncated packet that just fills the window) and that
that frame is accepted ok in the header prediction case and it's ACK is 
delayed, but the next packet to arrive then trips over this assumption.

Since 'win' is guaranteed to be non-negative and we explicitly cast
'rcv_adv - rcv_nxt' to (int) in the following line that the assert is checking
for:

tp-rcv_wnd = imax(win, (int)(tp-rcv_adv - tp-rcv_nxt));

I think we already handle this case ok and perhaps the assertion can just be
removed?  Not sure if others feel that it warrants a comment to note that this
is the case being handled.

Also, I'm not sure if this case can leak into the timewait code?  If so, we 
will need to fix this case:

/*
 * Recover last window size sent.
 */
KASSERT(SEQ_GEQ(tp-rcv_adv, tp-rcv_nxt),
(tcp_twstart negative window: tp %p rcv_nxt %u rcv_adv %u, tp,
tp-rcv_nxt, tp-rcv_adv));
tw-last_win = (tp-rcv_adv - tp-rcv_nxt)  tp-rcv_scale;

So that it sets last_win to 0 instead of some really huge value.

-- 
John Baldwin
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: 9.0-RC1 panic in tcp_input: negative winow.

2011-10-23 Thread Pawel Jakub Dawidek
On Sun, Oct 23, 2011 at 12:35:15PM +1100, Lawrence Stewart wrote:
 On 10/22/11 19:49, Pawel Jakub Dawidek wrote:
  The panic message says:
 
  panic: tcp_input negative window: tp 0xfe007763e000 rcv_nxt 
  3718269252 rcv_adv 3718268291
 
  I only have picture of the backtrace:
 
  http://people.freebsd.org/~pjd/misc/panic_negative_window.jpg
 
 
 ewww that is not good. Can you give us any more information about the 
 machine and what it's doing? Is it terminating TCP connections from the 
 internet at large or only local LAN (i.e. is there likely to be packet 
 loss happening)? Are you doing TSO or LRO? Do you have any non-default 
 tuning in place?

It is my local file server. It is doing NFS and AFP over LAN and also
downloads files from the internet. It is triggered after few hours.
I changed the KASSERT() into printf() and added printing 'win' variable
and this is what got logged during the night:

05:16:24 tcp_input negative window: tp 0xfe0026772b70 rcv_nxt 1107827269 
rcv_adv 1107826256 win=242
05:16:29 tcp_input negative window: tp 0xfe0026772b70 rcv_nxt 1107833451 
rcv_adv 1107832977 win=880
05:16:41 tcp_input negative window: tp 0xfe0026772b70 rcv_nxt 1107849563 
rcv_adv 1107848860 win=639
05:20:02 tcp_input negative window: tp 0xfe0026772b70 rcv_nxt 1108108230 
rcv_adv 1108107331 win=567
05:24:30 tcp_input negative window: tp 0xfe0026772b70 rcv_nxt 1108433302 
rcv_adv 1108432272 win=974
05:24:46 tcp_input negative window: tp 0xfe0026772b70 rcv_nxt 1108450385 
rcv_adv 1108450060 win=751
05:26:44 tcp_input negative window: tp 0xfe0026772b70 rcv_nxt 1108574818 
rcv_adv 1108573851 win=71
05:28:03 tcp_input negative window: tp 0xfe0026772b70 rcv_nxt 1108654103 
rcv_adv 1108653166 win=0
05:28:43 tcp_input negative window: tp 0xfe0026772b70 rcv_nxt 1108692396 
rcv_adv 1108691451 win=0
05:30:06 tcp_input negative window: tp 0xfe0026772b70 rcv_nxt 1108781258 
rcv_adv 1108780372 win=235
05:35:05 tcp_input negative window: tp 0xfe0026772b70 rcv_nxt 1109067578 
rcv_adv 1109067335 win=663
05:37:03 tcp_input negative window: tp 0xfe0026772b70 rcv_nxt 1109180403 
rcv_adv 1109179411 win=0
05:41:03 tcp_input negative window: tp 0xfe0026772b70 rcv_nxt 1109428265 
rcv_adv 1109427375 win=170

And the systems seems to be fine.

I'm happy to test patches, but one round would take 24h.

My suggestion would be that if we won't be able to fix it before 9.0,
we should turn this assertion off, as the system seems to be able to
recover.

-- 
Pawel Jakub Dawidek   http://www.wheelsystems.com
FreeBSD committer http://www.FreeBSD.org
Am I Evil? Yes, I Am! http://yomoli.com


pgpcUNlslP3cT.pgp
Description: PGP signature


Re: 9.0-RC1 panic in tcp_input: negative winow.

2011-10-23 Thread Kostik Belousov
On Sun, Oct 23, 2011 at 08:10:38AM +0200, Pawel Jakub Dawidek wrote:
 On Sun, Oct 23, 2011 at 12:35:15PM +1100, Lawrence Stewart wrote:
  On 10/22/11 19:49, Pawel Jakub Dawidek wrote:
   The panic message says:
  
 panic: tcp_input negative window: tp 0xfe007763e000 rcv_nxt 
   3718269252 rcv_adv 3718268291
  
   I only have picture of the backtrace:
  
 http://people.freebsd.org/~pjd/misc/panic_negative_window.jpg
  
  
  ewww that is not good. Can you give us any more information about the 
  machine and what it's doing? Is it terminating TCP connections from the 
  internet at large or only local LAN (i.e. is there likely to be packet 
  loss happening)? Are you doing TSO or LRO? Do you have any non-default 
  tuning in place?
 
 It is my local file server. It is doing NFS and AFP over LAN and also
 downloads files from the internet. It is triggered after few hours.
 I changed the KASSERT() into printf() and added printing 'win' variable
 and this is what got logged during the night:
 
 05:16:24 tcp_input negative window: tp 0xfe0026772b70 rcv_nxt 1107827269 
 rcv_adv 1107826256 win=242
 05:16:29 tcp_input negative window: tp 0xfe0026772b70 rcv_nxt 1107833451 
 rcv_adv 1107832977 win=880
 05:16:41 tcp_input negative window: tp 0xfe0026772b70 rcv_nxt 1107849563 
 rcv_adv 1107848860 win=639
 05:20:02 tcp_input negative window: tp 0xfe0026772b70 rcv_nxt 1108108230 
 rcv_adv 1108107331 win=567
 05:24:30 tcp_input negative window: tp 0xfe0026772b70 rcv_nxt 1108433302 
 rcv_adv 1108432272 win=974
 05:24:46 tcp_input negative window: tp 0xfe0026772b70 rcv_nxt 1108450385 
 rcv_adv 1108450060 win=751
 05:26:44 tcp_input negative window: tp 0xfe0026772b70 rcv_nxt 1108574818 
 rcv_adv 1108573851 win=71
 05:28:03 tcp_input negative window: tp 0xfe0026772b70 rcv_nxt 1108654103 
 rcv_adv 1108653166 win=0
 05:28:43 tcp_input negative window: tp 0xfe0026772b70 rcv_nxt 1108692396 
 rcv_adv 1108691451 win=0
 05:30:06 tcp_input negative window: tp 0xfe0026772b70 rcv_nxt 1108781258 
 rcv_adv 1108780372 win=235
 05:35:05 tcp_input negative window: tp 0xfe0026772b70 rcv_nxt 1109067578 
 rcv_adv 1109067335 win=663
 05:37:03 tcp_input negative window: tp 0xfe0026772b70 rcv_nxt 1109180403 
 rcv_adv 1109179411 win=0
 05:41:03 tcp_input negative window: tp 0xfe0026772b70 rcv_nxt 1109428265 
 rcv_adv 1109427375 win=170
 
 And the systems seems to be fine.
 
 I'm happy to test patches, but one round would take 24h.
 
 My suggestion would be that if we won't be able to fix it before 9.0,
 we should turn this assertion off, as the system seems to be able to
 recover.

Shipped kernels have all assertions turned off.


pgpKtmVl4jzPD.pgp
Description: PGP signature


Re: 9.0-RC1 panic in tcp_input: negative winow.

2011-10-23 Thread Pawel Jakub Dawidek
On Sun, Oct 23, 2011 at 11:44:45AM +0300, Kostik Belousov wrote:
 On Sun, Oct 23, 2011 at 08:10:38AM +0200, Pawel Jakub Dawidek wrote:
  My suggestion would be that if we won't be able to fix it before 9.0,
  we should turn this assertion off, as the system seems to be able to
  recover.
 
 Shipped kernels have all assertions turned off.

Yes, I'm aware of that, but many people compile their production kernels
with INVARIANTS/INVARIANT_SUPPORT to fail early instead of eg.
corrupting data. I'd be fine in moving this under DIAGNOSTIC or changing
it into a printf, so it will be visible.

-- 
Pawel Jakub Dawidek   http://www.wheelsystems.com
FreeBSD committer http://www.FreeBSD.org
Am I Evil? Yes, I Am! http://yomoli.com


pgpe47qzpCChn.pgp
Description: PGP signature


9.0-RC1 panic in tcp_input: negative winow.

2011-10-22 Thread Pawel Jakub Dawidek
The panic message says:

panic: tcp_input negative window: tp 0xfe007763e000 rcv_nxt 
3718269252 rcv_adv 3718268291

I only have picture of the backtrace:

http://people.freebsd.org/~pjd/misc/panic_negative_window.jpg

-- 
Pawel Jakub Dawidek   http://www.wheelsystems.com
FreeBSD committer http://www.FreeBSD.org
Am I Evil? Yes, I Am! http://yomoli.com


pgpo3XZOvEVzL.pgp
Description: PGP signature


Re: 9.0-RC1 panic in tcp_input: negative winow.

2011-10-22 Thread Garrett Cooper

On Sat, 22 Oct 2011, Pawel Jakub Dawidek wrote:


The panic message says:

panic: tcp_input negative window: tp 0xfe007763e000 rcv_nxt 
3718269252 rcv_adv 3718268291

I only have picture of the backtrace:

http://people.freebsd.org/~pjd/misc/panic_negative_window.jpg


I've seen that issue once before in the r222XYZ days -- it's a problem 
that's been around since the TCP / IP code has been refactored in FreeBSD. 
Thanks,

-Garrett
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: 9.0-RC1 panic in tcp_input: negative winow.

2011-10-22 Thread Lawrence Stewart

On 10/22/11 19:49, Pawel Jakub Dawidek wrote:

The panic message says:

panic: tcp_input negative window: tp 0xfe007763e000 rcv_nxt 
3718269252 rcv_adv 3718268291

I only have picture of the backtrace:

http://people.freebsd.org/~pjd/misc/panic_negative_window.jpg



ewww that is not good. Can you give us any more information about the 
machine and what it's doing? Is it terminating TCP connections from the 
internet at large or only local LAN (i.e. is there likely to be packet 
loss happening)? Are you doing TSO or LRO? Do you have any non-default 
tuning in place?


Cheers,
Lawrence
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: 9.0-RC1 panic in tcp_input: negative winow.

2011-10-22 Thread Garrett Cooper

On Sun, 23 Oct 2011, Lawrence Stewart wrote:


On 10/22/11 19:49, Pawel Jakub Dawidek wrote:

The panic message says:

	panic: tcp_input negative window: tp 0xfe007763e000 rcv_nxt 
3718269252 rcv_adv 3718268291


I only have picture of the backtrace:

http://people.freebsd.org/~pjd/misc/panic_negative_window.jpg



ewww that is not good. Can you give us any more information about the machine 
and what it's doing? Is it terminating TCP connections from the internet at 
large or only local LAN (i.e. is there likely to be packet loss happening)? 
Are you doing TSO or LRO? Do you have any non-default tuning in place?


I can't speak for pjd@, but when this issue occured a couple months ago 
for me, the system I had run into the issue with was doing packet 
forwarding via ipfw from an internal NAT via ipfw to a corporate network.


The system was using bce(4) on the internal and external interface.

Thanks,
-Garrett
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org