I'm not sure this is vserver related at all. It's probably just 2.6 related, but before going to lkml I'd like to know whether anyone else is seeing
this kind of message:
I have one machine (dual Xeon, 2 GB RAM + QLogic FC & SAN) with 8 vservers on it.
Each vserver uses a dedicated EVMS volume on the SAN.
One of these vservers is very busy (an rsyncd master that 100+ servers sync against every hour).
This vserver uses some large partitions (300 GB+, with zillions of files in them).
This was working fine with a 2.4 kernel.
I switched the host from 2.4 to 2.6 and started getting these messages:
TCP: Treason uncloaked! Peer 172.20.12.49:37066/873 shrinks window 723200794:723201746. Repaired.
TCP: Treason uncloaked! Peer 172.20.12.49:37066/873 shrinks window 723200794:723201746. Repaired.
TCP: Treason uncloaked! Peer 192.168.100.17:53343/873 shrinks window 2029005703:2029007151. Repaired.
TCP: Treason uncloaked! Peer 192.168.100.17:53343/873 shrinks window 2029017287:2029018735. Repaired.
Those IPs (clients) are other vservers running a 2.4.27 kernel. The only explanation I have found is a broken TCP/IP stack on the client side,
which does not seem to be the case here.
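To see which clients trigger the message and how often, the log lines can be tallied per peer. The following is only a minimal sketch in Python: the function names are hypothetical, the regular expression is derived from the four lines quoted above, and the field meanings (first port is the peer's, second is the local one, the two numbers are sequence numbers bounding the unacknowledged data) are my assumption, not something confirmed in this thread.

```python
import re
from collections import Counter

# Layout assumed from the quoted log lines:
# "TCP: Treason uncloaked! Peer <ip>:<port>/<port> shrinks window <a>:<b>. Repaired."
TREASON_RE = re.compile(
    r"TCP: Treason uncloaked! Peer (\d+\.\d+\.\d+\.\d+):(\d+)/(\d+) "
    r"shrinks window (\d+):(\d+)\. Repaired\."
)


def parse_treason(line):
    """Return the fields of one 'Treason uncloaked' line, or None if it does not match."""
    m = TREASON_RE.search(line)
    if m is None:
        return None
    peer, peer_port, local_port, start, end = m.groups()
    start, end = int(start), int(end)
    return {
        "peer": peer,
        "peer_port": int(peer_port),    # assumption: remote port
        "local_port": int(local_port),  # assumption: local port (873 = rsync)
        "seq_start": start,
        "seq_end": end,
        # Sequence numbers are 32-bit and may wrap, so take the difference modulo 2**32.
        "span": (end - start) % 2**32,
    }


def count_by_peer(lines):
    """Count matching messages per peer IP address."""
    parsed = (parse_treason(line) for line in lines)
    return Counter(p["peer"] for p in parsed if p is not None)
```

Feeding it the output of dmesg (one message per line) gives a per-client count, which makes it easier to tell whether only a few 2.4.27 clients are involved or all of them.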
More harmful:
swapper: page allocation failure. order:0, mode:0x20
 [<c013a545>] __alloc_pages+0x1ab/0x317
 [<c013a6c9>] __get_free_pages+0x18/0x24
 [<c013d529>] kmem_getpages+0x1a/0xbe
 [<c013e108>] cache_grow+0x9e/0x127
 [<c013e304>] cache_alloc_refill+0x173/0x218
 [<c013e710>] __kmalloc+0x7c/0x83
 [<c030f574>] alloc_skb+0x32/0xc3
 [<c0286c02>] e1000_alloc_rx_buffers+0x3b/0xd5
 [<c028690d>] e1000_clean_rx_irq+0x192/0x44c
 [<c02948c4>] scsi_io_completion+0x135/0x3ee
 [<c02864f1>] e1000_clean+0x3e/0xb3
 [<c0314bc5>] net_rx_action+0x70/0xef
 [<c011d078>] __do_softirq+0xb4/0xc3
 [<c011d0b4>] do_softirq+0x2d/0x2f
 [<c0106633>] do_IRQ+0x105/0x11e
 [<c0104768>] common_interrupt+0x18/0x20
 [<c0101f7a>] default_idle+0x0/0x2c
 [<c0101fa3>] default_idle+0x29/0x2c
 [<c010200c>] cpu_idle+0x33/0x3c
 [<c049a7d0>] start_kernel+0x15b/0x176
 [<c049a303>] unknown_bootoption+0x0/0x144
rsync: page allocation failure. order:0, mode:0x20
 [<c013a545>] __alloc_pages+0x1ab/0x317
 [<c011565c>] __wake_up+0x38/0x4e
 [<c013a6c9>] __get_free_pages+0x18/0x24
 [<c013d529>] kmem_getpages+0x1a/0xbe
 [<c013e108>] cache_grow+0x9e/0x127
 [<c013e304>] cache_alloc_refill+0x173/0x218
 [<c013e710>] __kmalloc+0x7c/0x83
 [<c030f574>] alloc_skb+0x32/0xc3
 [<c0286c02>] e1000_alloc_rx_buffers+0x3b/0xd5
 [<c028690d>] e1000_clean_rx_irq+0x192/0x44c
 [<c013a740>] __pagevec_free+0x17/0x1f
 [<c02864f1>] e1000_clean+0x3e/0xb3
 [<c0314bc5>] net_rx_action+0x70/0xef
 [<c011d078>] __do_softirq+0xb4/0xc3
 [<c011d0b4>] do_softirq+0x2d/0x2f
 [<c0106633>] do_IRQ+0x105/0x11e
 [<c0104768>] common_interrupt+0x18/0x20
 [<c011007b>] unknown_nmi_panic_callback+0x38/0x47
 [<c01408f3>] shrink_cache+0x109/0x388
 [<c012047d>] del_timer_sync+0x7d/0xb5
 [<c01204ca>] del_singleshot_timer_sync+0x15/0x23
 [<c0365d22>] schedule_timeout+0x6f/0xbb
 [<c0141105>] shrink_zone+0xa9/0xc0
 [<c0141170>] shrink_caches+0x54/0x56
 [<c0141229>] try_to_free_pages+0xb7/0x17f
 [<c013a58e>] __alloc_pages+0x1f4/0x317
 [<c030c073>] sock_aio_read+0xe2/0x13e
 [<c013a6c9>] __get_free_pages+0x18/0x24
 [<c0162e29>] __pollwait+0x80/0xc1
 [<c032ea66>] tcp_poll+0x1a/0x152
 [<c030c6d9>] sock_poll+0x12/0x14
 [<c01631a0>] do_select+0x25d/0x2b9
 [<c0162da9>] __pollwait+0x0/0xc1
 [<c01634af>] sys_select+0x29e/0x498
 [<c011c7da>] sys_time+0x16/0x50
 [<c0103d83>] syscall_call+0x7/0xb
This was with 2.6.9-rc2 plus the vserver patch for it (2.6.9-rc2-vs1.9.2.28.4).
All this seems to be e1000 related, but I'm not sure. I saw that others have had similar problems with the e1000 driver,
and that doing echo 2048 > /proc/sys/vm/min_free_kbytes seems to reduce them. This is what I did.
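To judge whether raising min_free_kbytes really reduces the failures, one option is to count the "page allocation failure" headers per process in the kernel log before and after the change. The following is a minimal sketch in Python under stated assumptions: the function name is hypothetical, and the regular expression only covers the header format seen in the two traces quoted earlier.

```python
import re
from collections import Counter

# Header format assumed from the traces quoted earlier, e.g.
# "swapper: page allocation failure. order:0, mode:0x20"
ALLOC_FAIL_RE = re.compile(
    r"(\S+): page allocation failure\. order:(\d+), mode:(0x[0-9a-fA-F]+)"
)


def count_alloc_failures(log_text):
    """Count failures per (process name, order, mode).

    'order' is log2 of the number of contiguous pages requested,
    so order 0 means a single page. 'mode' is kept as a plain integer.
    """
    counts = Counter()
    for proc, order, mode in ALLOC_FAIL_RE.findall(log_text):
        counts[(proc, int(order), int(mode, 16))] += 1
    return counts
```

Running it over the kernel log from a day before and a day after the change gives two counters that can be compared directly.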
This morning the server had crashed (after 14 days of uptime). I didn't get a chance to see the oops.
So I compiled another kernel with all the bleeding-edge pieces to see if that changes anything.
This time:
2.6.9-rc3-bk4 + vs1.9.3-rc2
the device mapper has all the latest patches,
the e1000 driver has been updated to 5.4.11-NAPI (directly from Intel's site)
the QLogic driver has been updated to 8.00.00b21-k
... And the results are the same ...
I have no problems on a kernel without the vserver patch, but that is on different hardware. So the question is:
Is there a chance that allocations in the vserver code could affect this?
Or do you think vserver is totally innocent in this case?
Sincerely yours,
-- Yann Dupont, Cri de l'université de Nantes Tel: 02.51.12.53.91 - Fax: 02.51.12.58.60 - [EMAIL PROTECTED]
_______________________________________________ Vserver mailing list [EMAIL PROTECTED] http://list.linux-vserver.org/mailman/listinfo/vserver
