On Tuesday 05 October 2004 14:16, Yann Dupont wrote:
> Hello there. I'm seeing something strange here.
> Not sure it's vserver related at all. Probably just 2.6 related, but
> before going to lkml I'd like to see if someone else is seeing
> these kinds of messages:
>
> I have one machine (dual Xeon, 2 GB RAM + QLogic FC & SAN), with 8
> vservers on it.
> Each vserver is using a dedicated EVMS volume on the SAN.
> One of these vservers is very busy (rsyncd master, where 100+
> servers sync against it every hour).
> This vserver uses some large partitions (300 GB+, with zillions of
> files in them).
>
> This was working fine with a 2.4 kernel.
>
> I have switched the host from 2.4 to 2.6, and I started to get these
> messages:
>
> TCP: Treason uncloaked! Peer 172.20.12.49:37066/873 shrinks window 723200794:723201746. Repaired.
> TCP: Treason uncloaked! Peer 172.20.12.49:37066/873 shrinks window 723200794:723201746. Repaired.
> TCP: Treason uncloaked! Peer 192.168.100.17:53343/873 shrinks window 2029005703:2029007151. Repaired.
> TCP: Treason uncloaked! Peer 192.168.100.17:53343/873 shrinks window 2029017287:2029018735. Repaired.
>
> Those IPs (clients) are other vservers on a 2.4.27 kernel... The only
> explanation I have seen is a broken TCP/IP stack on the client side.
> That seems not to be the case...
>
> More harmful:
>
> swapper: page allocation failure. order:0, mode:0x20
> [<c013a545>] __alloc_pages+0x1ab/0x317
> [<c013a6c9>] __get_free_pages+0x18/0x24
> [<c013d529>] kmem_getpages+0x1a/0xbe
> [<c013e108>] cache_grow+0x9e/0x127
> [<c013e304>] cache_alloc_refill+0x173/0x218
> [<c013e710>] __kmalloc+0x7c/0x83
> [<c030f574>] alloc_skb+0x32/0xc3
> [<c0286c02>] e1000_alloc_rx_buffers+0x3b/0xd5
> [<c028690d>] e1000_clean_rx_irq+0x192/0x44c
> [<c02948c4>] scsi_io_completion+0x135/0x3ee
> [<c02864f1>] e1000_clean+0x3e/0xb3
> [<c0314bc5>] net_rx_action+0x70/0xef
> [<c011d078>] __do_softirq+0xb4/0xc3
> [<c011d0b4>] do_softirq+0x2d/0x2f
> [<c0106633>] do_IRQ+0x105/0x11e
> [<c0104768>] common_interrupt+0x18/0x20
> [<c0101f7a>] default_idle+0x0/0x2c
> [<c0101fa3>] default_idle+0x29/0x2c
> [<c010200c>] cpu_idle+0x33/0x3c
> [<c049a7d0>] start_kernel+0x15b/0x176
> [<c049a303>] unknown_bootoption+0x0/0x144
> rsync: page allocation failure. order:0, mode:0x20
> [<c013a545>] __alloc_pages+0x1ab/0x317
> [<c011565c>] __wake_up+0x38/0x4e
> [<c013a6c9>] __get_free_pages+0x18/0x24
> [<c013d529>] kmem_getpages+0x1a/0xbe
> [<c013e108>] cache_grow+0x9e/0x127
> [<c013e304>] cache_alloc_refill+0x173/0x218
> [<c013e710>] __kmalloc+0x7c/0x83
> [<c030f574>] alloc_skb+0x32/0xc3
> [<c0286c02>] e1000_alloc_rx_buffers+0x3b/0xd5
> [<c028690d>] e1000_clean_rx_irq+0x192/0x44c
> [<c013a740>] __pagevec_free+0x17/0x1f
> [<c02864f1>] e1000_clean+0x3e/0xb3
> [<c0314bc5>] net_rx_action+0x70/0xef
> [<c011d078>] __do_softirq+0xb4/0xc3
> [<c011d0b4>] do_softirq+0x2d/0x2f
> [<c0106633>] do_IRQ+0x105/0x11e
> [<c0104768>] common_interrupt+0x18/0x20
> [<c011007b>] unknown_nmi_panic_callback+0x38/0x47
> [<c01408f3>] shrink_cache+0x109/0x388
> [<c012047d>] del_timer_sync+0x7d/0xb5
> [<c01204ca>] del_singleshot_timer_sync+0x15/0x23
> [<c0365d22>] schedule_timeout+0x6f/0xbb
> [<c0141105>] shrink_zone+0xa9/0xc0
> [<c0141170>] shrink_caches+0x54/0x56
> [<c0141229>] try_to_free_pages+0xb7/0x17f
> [<c013a58e>] __alloc_pages+0x1f4/0x317
> [<c030c073>] sock_aio_read+0xe2/0x13e
> [<c013a6c9>] __get_free_pages+0x18/0x24
> [<c0162e29>] __pollwait+0x80/0xc1
> [<c032ea66>] tcp_poll+0x1a/0x152
> [<c030c6d9>] sock_poll+0x12/0x14
> [<c01631a0>] do_select+0x25d/0x2b9
> [<c0162da9>] __pollwait+0x0/0xc1
> [<c01634af>] sys_select+0x29e/0x498
> [<c011c7da>] sys_time+0x16/0x50
> [<c0103d83>] syscall_call+0x7/0xb
>
> This was with 2.6.9-rc2 + VS for it (2.6.9-rc2-vs1.9.2.28.4)
>
> All this seems eepro1000 related, but I'm not sure. I saw that others
> have similar problems with eepro1000, and that doing
> echo 2048 > /proc/sys/vm/min_free_kbytes seems to reduce those
> problems. This is what I've done.
>
> This morning the server had crashed (after 14 days of uptime). I didn't
> get a chance to see the oops.
>
> So I compiled another kernel, with everything bleeding edge, to see
> whether that changes anything. This time:
> 2.6.9-rc3-bk4 + vs1.9.3-rc2
> the device mapper has all the latest patches,
> the eepro1000 driver has been changed to 5.4.11-NAPI (directly from
> Intel's page),
> the qlogic driver has been changed to 8.00.00b21-k
>
> .... And the results are the same ...
>
> I have no problems on a non-vserver-patched kernel, but that is on
> different hardware. So the question is:
> Is there a chance that allocations in the vserver code can affect
> this?
>
> Or do you think vserver is totally innocent in this case?
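
To judge whether raising min_free_kbytes really reduces these failures, it may help to count them in a saved kernel log before and after the change. The following is only a rough sketch and not something taken from this thread: the function name count_allocation_failures and the sample text are illustrative assumptions, and the pattern only covers the "page allocation failure" line format quoted above.

```python
import re
from collections import Counter

# Matches lines such as "rsync: page allocation failure. order:0, mode:0x20".
# \s* after "failure." tolerates the order/mode part being wrapped onto the next line.
FAILURE_RE = re.compile(
    r"(?P<comm>\S+): page allocation failure\.\s*"
    r"order:(?P<order>\d+), mode:(?P<mode>0x[0-9a-fA-F]+)"
)


def count_allocation_failures(log_text):
    """Return a Counter keyed by (process name, order, mode as int)."""
    counts = Counter()
    for match in FAILURE_RE.finditer(log_text):
        key = (
            match.group("comm"),
            int(match.group("order")),
            int(match.group("mode"), 16),
        )
        counts[key] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample; in practice, read the text from a saved kernel log.
    sample = (
        "swapper: page allocation failure. order:0, mode:0x20\n"
        "rsync: page allocation failure. order:0, mode:0x20\n"
    )
    for (comm, order, mode), n in sorted(count_allocation_failures(sample).items()):
        print(f"{comm}: {n} failure(s), order={order}, mode={mode:#x}")
```

Running it over logs captured before and after the sysctl change would give a crude comparison; it says nothing about the cause of the failures.
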
I've had a crash with vs1.9.2 and 2.6.8.1 after 3 days of uptime. It happened
while accessing an automounted NFS share and was a kernel NULL pointer
dereference at ....... I have a 3Com Vortex card. This did not happen with
stock 2.6.8.1.

I'm currently trying 1.9.3-rc2 + 2.6.9-rc3-bk3 to see if it happens again.
Hopefully the log output of the crash will make it to a remote logging host...

-- 
lg, Chris
_______________________________________________
Vserver mailing list
[EMAIL PROTECTED]
http://list.linux-vserver.org/mailman/listinfo/vserver
