Hello there. I'm seeing something strange here.
I'm not sure it's vserver related at all. It is probably just 2.6 related, but before going to lkml I'd like to know whether anyone else is seeing
this kind of message:


I have one machine (dual Xeon, 2 GB RAM + QLogic FC & SAN) with 8 vservers on it.
Each vserver uses a dedicated EVMS volume on the SAN.
One of these vservers is very busy (an rsyncd master, which 100+ servers sync against every hour).
This vserver uses some large partitions (300 GB+, with zillions of files in them).


This was working fine with a 2.4 kernel.

I have switched the host from 2.4 to 2.6, and I started to get these messages:

TCP: Treason uncloaked! Peer 172.20.12.49:37066/873 shrinks window 723200794:723201746. Repaired.
TCP: Treason uncloaked! Peer 172.20.12.49:37066/873 shrinks window 723200794:723201746. Repaired.
TCP: Treason uncloaked! Peer 192.168.100.17:53343/873 shrinks window 2029005703:2029007151. Repaired.
TCP: Treason uncloaked! Peer 192.168.100.17:53343/873 shrinks window 2029017287:2029018735. Repaired.


Those IPs (the clients) are other vservers running a 2.4.27 kernel. The only explanation I have found is a broken TCP/IP stack on the client side,
but that seems not to be the case.
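To see which peers trigger the message and how often, the log lines can be tallied with a small script. This is only a sketch: the field names (`peer_port`, `local_port`, `span`) are labels chosen here, and reading the two trailing numbers as 32-bit sequence numbers bounding the affected data is an assumption, not something the log itself states.

```python
import re
from collections import Counter

# Pattern for the kernel message quoted above. Group names are labels
# chosen for this sketch, not kernel terminology.
TREASON_RE = re.compile(
    r"TCP: Treason uncloaked! Peer (?P<peer>\d+\.\d+\.\d+\.\d+):(?P<peer_port>\d+)"
    r"/(?P<local_port>\d+) shrinks window (?P<start>\d+):(?P<end>\d+)\. Repaired\."
)


def parse_treason(line):
    """Return the fields of one 'Treason uncloaked' line, or None if no match."""
    m = TREASON_RE.search(line)
    if m is None:
        return None
    start, end = int(m.group("start")), int(m.group("end"))
    return {
        "peer": m.group("peer"),
        "peer_port": int(m.group("peer_port")),
        "local_port": int(m.group("local_port")),
        # ASSUMPTION: both numbers are 32-bit sequence numbers, so the
        # distance between them is taken modulo 2**32 to survive wrap-around.
        "span": (end - start) % 2**32,
    }


def count_by_peer(lines):
    """Count matching log lines per peer address."""
    parsed = (parse_treason(line) for line in lines)
    return Counter(p["peer"] for p in parsed if p is not None)


SAMPLE = [
    "TCP: Treason uncloaked! Peer 172.20.12.49:37066/873 shrinks window 723200794:723201746. Repaired.",
    "TCP: Treason uncloaked! Peer 172.20.12.49:37066/873 shrinks window 723200794:723201746. Repaired.",
    "TCP: Treason uncloaked! Peer 192.168.100.17:53343/873 shrinks window 2029005703:2029007151. Repaired.",
    "TCP: Treason uncloaked! Peer 192.168.100.17:53343/873 shrinks window 2029017287:2029018735. Repaired.",
]

if __name__ == "__main__":
    for peer, hits in count_by_peer(SAMPLE).items():
        print(peer, hits)
```

Feeding it the output of dmesg or the kernel log file instead of `SAMPLE` would show whether the messages come from a handful of clients or from all of them.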


More harmful:


swapper: page allocation failure. order:0, mode:0x20
 [<c013a545>] __alloc_pages+0x1ab/0x317
 [<c013a6c9>] __get_free_pages+0x18/0x24
 [<c013d529>] kmem_getpages+0x1a/0xbe
 [<c013e108>] cache_grow+0x9e/0x127
 [<c013e304>] cache_alloc_refill+0x173/0x218
 [<c013e710>] __kmalloc+0x7c/0x83
 [<c030f574>] alloc_skb+0x32/0xc3
 [<c0286c02>] e1000_alloc_rx_buffers+0x3b/0xd5
 [<c028690d>] e1000_clean_rx_irq+0x192/0x44c
 [<c02948c4>] scsi_io_completion+0x135/0x3ee
 [<c02864f1>] e1000_clean+0x3e/0xb3
 [<c0314bc5>] net_rx_action+0x70/0xef
 [<c011d078>] __do_softirq+0xb4/0xc3
 [<c011d0b4>] do_softirq+0x2d/0x2f
 [<c0106633>] do_IRQ+0x105/0x11e
 [<c0104768>] common_interrupt+0x18/0x20
 [<c0101f7a>] default_idle+0x0/0x2c
 [<c0101fa3>] default_idle+0x29/0x2c
 [<c010200c>] cpu_idle+0x33/0x3c
 [<c049a7d0>] start_kernel+0x15b/0x176
 [<c049a303>] unknown_bootoption+0x0/0x144
rsync: page allocation failure. order:0, mode:0x20
 [<c013a545>] __alloc_pages+0x1ab/0x317
 [<c011565c>] __wake_up+0x38/0x4e
 [<c013a6c9>] __get_free_pages+0x18/0x24
 [<c013d529>] kmem_getpages+0x1a/0xbe
 [<c013e108>] cache_grow+0x9e/0x127
 [<c013e304>] cache_alloc_refill+0x173/0x218
 [<c013e710>] __kmalloc+0x7c/0x83
 [<c030f574>] alloc_skb+0x32/0xc3
 [<c0286c02>] e1000_alloc_rx_buffers+0x3b/0xd5
 [<c028690d>] e1000_clean_rx_irq+0x192/0x44c
 [<c013a740>] __pagevec_free+0x17/0x1f
 [<c02864f1>] e1000_clean+0x3e/0xb3
 [<c0314bc5>] net_rx_action+0x70/0xef
 [<c011d078>] __do_softirq+0xb4/0xc3
 [<c011d0b4>] do_softirq+0x2d/0x2f
 [<c0106633>] do_IRQ+0x105/0x11e
 [<c0104768>] common_interrupt+0x18/0x20
 [<c011007b>] unknown_nmi_panic_callback+0x38/0x47
 [<c01408f3>] shrink_cache+0x109/0x388
 [<c012047d>] del_timer_sync+0x7d/0xb5
 [<c01204ca>] del_singleshot_timer_sync+0x15/0x23
 [<c0365d22>] schedule_timeout+0x6f/0xbb
 [<c0141105>] shrink_zone+0xa9/0xc0
 [<c0141170>] shrink_caches+0x54/0x56
 [<c0141229>] try_to_free_pages+0xb7/0x17f
 [<c013a58e>] __alloc_pages+0x1f4/0x317
 [<c030c073>] sock_aio_read+0xe2/0x13e
 [<c013a6c9>] __get_free_pages+0x18/0x24
 [<c0162e29>] __pollwait+0x80/0xc1
 [<c032ea66>] tcp_poll+0x1a/0x152
 [<c030c6d9>] sock_poll+0x12/0x14
 [<c01631a0>] do_select+0x25d/0x2b9
 [<c0162da9>] __pollwait+0x0/0xc1
 [<c01634af>] sys_select+0x29e/0x498
 [<c011c7da>] sys_time+0x16/0x50
 [<c0103d83>] syscall_call+0x7/0xb
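For anyone comparing reports like these, a short script can split the log into one record per failure and pull out the header fields and the function names. This is a hedged sketch: the `GFP_FLAGS` table is an assumption recalled from 2.6-era `include/linux/gfp.h` and should be checked against the tree actually in use, and the record keys (`task`, `pages`, `can_sleep`, and so on) are names invented here.

```python
import re

HEADER_RE = re.compile(
    r"(\S+): page allocation failure\. order:(\d+), mode:(0x[0-9a-fA-F]+)"
)
# One frame looks like "[<c013a545>] __alloc_pages+0x1ab/0x317"; \s+ also
# tolerates a line break between the address and the symbol.
FRAME_RE = re.compile(r"\[<([0-9a-f]{8})>\]\s+(\w+)\+(0x[0-9a-f]+)/(0x[0-9a-f]+)")

# ASSUMPTION: bit values as recalled from a 2.6-era include/linux/gfp.h.
# Verify against your own kernel source before relying on them.
GFP_FLAGS = {0x10: "__GFP_WAIT", 0x20: "__GFP_HIGH", 0x40: "__GFP_IO", 0x80: "__GFP_FS"}


def parse_failures(text):
    """Return one dict per 'page allocation failure' report found in text."""
    headers = list(HEADER_RE.finditer(text))
    records = []
    for i, header in enumerate(headers):
        # The frames of a report run until the next report header (or the end).
        end = headers[i + 1].start() if i + 1 < len(headers) else len(text)
        body = text[header.end():end]
        mode = int(header.group(3), 16)
        records.append({
            "task": header.group(1),
            "pages": 1 << int(header.group(2)),  # order N means 2**N pages
            "mode": mode,
            "flags": [name for bit, name in sorted(GFP_FLAGS.items()) if mode & bit],
            "can_sleep": bool(mode & 0x10),  # relies on the assumed __GFP_WAIT bit
            "frames": [m.group(2) for m in FRAME_RE.finditer(body)],
        })
    return records


# Shortened excerpt of the traces above, kept flattened on purpose.
SAMPLE = (
    "swapper: page allocation failure. order:0, mode:0x20 "
    "[<c013a545>] __alloc_pages+0x1ab/0x317 [<c030f574>] alloc_skb+0x32/0xc3 "
    "[<c0286c02>] e1000_alloc_rx_buffers+0x3b/0xd5 "
    "rsync: page allocation failure. order:0, mode:0x20 "
    "[<c013a545>] __alloc_pages+0x1ab/0x317 [<c0162e29>] \n__pollwait+0x80/0xc1"
)

if __name__ == "__main__":
    for record in parse_failures(SAMPLE):
        print(record["task"], record["pages"], record["flags"], record["frames"])
```

Under the assumed flag table, both reports are single-page requests that are not allowed to sleep, which matches an allocation made from the network receive path in interrupt context.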


This was with 2.6.9-rc2 plus the matching vserver patch (2.6.9-rc2-vs1.9.2.28.4).

All this seems related to the e1000 (eepro1000) driver, but I'm not sure. I saw that others have had similar problems with it,
and that doing echo 2048 > /proc/sys/vm/min_free_kbytes seems to reduce them. This is what I've done.
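The same adjustment can be scripted so that the reserve is only ever raised, never lowered by accident. A minimal sketch: the function names are made up for illustration, the `path` parameter exists only so the logic can be tried on an ordinary file, and writing to the real sysctl file requires root.

```python
from pathlib import Path

MIN_FREE_PATH = Path("/proc/sys/vm/min_free_kbytes")


def read_min_free_kbytes(path=MIN_FREE_PATH):
    """Return the current value of the sysctl as an integer (in kB)."""
    return int(Path(path).read_text().strip())


def ensure_min_free_kbytes(target_kb, path=MIN_FREE_PATH):
    """Raise the reserve to target_kb if it is currently lower.

    Returns the value in effect afterwards. An existing higher value is
    left untouched. Writing the real file needs root privileges.
    """
    current = read_min_free_kbytes(path)
    if current >= target_kb:
        return current
    Path(path).write_text(f"{target_kb}\n")
    return target_kb


if __name__ == "__main__":
    # Read-only by default: just report the current setting if it exists.
    if MIN_FREE_PATH.exists():
        print("min_free_kbytes =", read_min_free_kbytes())
```

Calling `ensure_min_free_kbytes(2048)` as root would have the same effect as the echo command above. The setting does not survive a reboot unless it is also placed in the system's sysctl configuration.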


This morning the server had crashed (after 14 days of uptime). I didn't get a chance to see the oops.

So I compiled another kernel with all the bleeding-edge pieces, to see whether that changes anything.
This time:
2.6.9-rc3-bk4 + vs1.9.3-rc2
the device mapper has all the latest patches,
the e1000 (eepro1000) driver has been updated to 5.4.11-NAPI (directly from Intel's page),
the QLogic driver has been updated to 8.00.00b21-k.


... And the results are the same ...

I have no problems on a kernel without the vserver patch, but that is on different hardware. So the question is:
is there a chance that allocations in the vserver code can affect this?


Or do you think vserver is totally innocent in this case?

Sincerely yours,

--
Yann Dupont, Cri de l'université de Nantes
Tel: 02.51.12.53.91 - Fax: 02.51.12.58.60 - [EMAIL PROTECTED]

_______________________________________________
Vserver mailing list
[EMAIL PROTECTED]
http://list.linux-vserver.org/mailman/listinfo/vserver
