On Tuesday 05 October 2004 14:16, Yann Dupont wrote:
> Hello There. I'm seeing something strange here.
> Not sure it's vserver related at all. Probably just 2.6 related, but
> before going to lkml I'd like to know whether anyone else is seeing
> these kinds of messages:
> 
> I have one machine (dual Xeon, 2 GB RAM + QLogic FC & SAN) with 8
> vservers on it.
> Each vserver uses a dedicated EVMS volume on the SAN.
> One of these vservers is very busy (an rsyncd master, with 100+
> servers syncing against it every hour).
> This vserver uses some large partitions (300 GB+, holding zillions of
> files).
> 
> This was working fine with a 2.4 kernel.
> 
> I switched the host from 2.4 to 2.6 and started getting these
> messages:
> 
> TCP: Treason uncloaked! Peer 172.20.12.49:37066/873 shrinks window 
> 723200794:723201746. Repaired.
> TCP: Treason uncloaked! Peer 172.20.12.49:37066/873 shrinks window 
> 723200794:723201746. Repaired.
> TCP: Treason uncloaked! Peer 192.168.100.17:53343/873 shrinks window 
> 2029005703:2029007151. Repaired.
> TCP: Treason uncloaked! Peer 192.168.100.17:53343/873 shrinks window 
> 2029017287:2029018735. Repaired.
> 
> Those IPs (clients) are other vservers running a 2.4.27 kernel. The only
> explanation I have seen is a broken TCP/IP stack on the client side,
> but that seems not to be the case.
> 
> More harmful:
> 
> 
> swapper: page allocation failure. order:0, mode:0x20
>  [<c013a545>] __alloc_pages+0x1ab/0x317
>  [<c013a6c9>] __get_free_pages+0x18/0x24
>  [<c013d529>] kmem_getpages+0x1a/0xbe
>  [<c013e108>] cache_grow+0x9e/0x127
>  [<c013e304>] cache_alloc_refill+0x173/0x218
>  [<c013e710>] __kmalloc+0x7c/0x83
>  [<c030f574>] alloc_skb+0x32/0xc3
>  [<c0286c02>] e1000_alloc_rx_buffers+0x3b/0xd5
>  [<c028690d>] e1000_clean_rx_irq+0x192/0x44c
>  [<c02948c4>] scsi_io_completion+0x135/0x3ee
>  [<c02864f1>] e1000_clean+0x3e/0xb3
>  [<c0314bc5>] net_rx_action+0x70/0xef
>  [<c011d078>] __do_softirq+0xb4/0xc3
>  [<c011d0b4>] do_softirq+0x2d/0x2f
>  [<c0106633>] do_IRQ+0x105/0x11e
>  [<c0104768>] common_interrupt+0x18/0x20
>  [<c0101f7a>] default_idle+0x0/0x2c
>  [<c0101fa3>] default_idle+0x29/0x2c
>  [<c010200c>] cpu_idle+0x33/0x3c
>  [<c049a7d0>] start_kernel+0x15b/0x176
>  [<c049a303>] unknown_bootoption+0x0/0x144
> rsync: page allocation failure. order:0, mode:0x20
>  [<c013a545>] __alloc_pages+0x1ab/0x317
>  [<c011565c>] __wake_up+0x38/0x4e
>  [<c013a6c9>] __get_free_pages+0x18/0x24
>  [<c013d529>] kmem_getpages+0x1a/0xbe
>  [<c013e108>] cache_grow+0x9e/0x127
>  [<c013e304>] cache_alloc_refill+0x173/0x218
>  [<c013e710>] __kmalloc+0x7c/0x83
>  [<c030f574>] alloc_skb+0x32/0xc3
>  [<c0286c02>] e1000_alloc_rx_buffers+0x3b/0xd5
>  [<c028690d>] e1000_clean_rx_irq+0x192/0x44c
>  [<c013a740>] __pagevec_free+0x17/0x1f
>  [<c02864f1>] e1000_clean+0x3e/0xb3
>  [<c0314bc5>] net_rx_action+0x70/0xef
>  [<c011d078>] __do_softirq+0xb4/0xc3
>  [<c011d0b4>] do_softirq+0x2d/0x2f
>  [<c0106633>] do_IRQ+0x105/0x11e
>  [<c0104768>] common_interrupt+0x18/0x20
>  [<c011007b>] unknown_nmi_panic_callback+0x38/0x47
>  [<c01408f3>] shrink_cache+0x109/0x388
>  [<c012047d>] del_timer_sync+0x7d/0xb5
>  [<c01204ca>] del_singleshot_timer_sync+0x15/0x23
>  [<c0365d22>] schedule_timeout+0x6f/0xbb
>  [<c0141105>] shrink_zone+0xa9/0xc0
>  [<c0141170>] shrink_caches+0x54/0x56
>  [<c0141229>] try_to_free_pages+0xb7/0x17f
>  [<c013a58e>] __alloc_pages+0x1f4/0x317
>  [<c030c073>] sock_aio_read+0xe2/0x13e
>  [<c013a6c9>] __get_free_pages+0x18/0x24
>  [<c0162e29>] __pollwait+0x80/0xc1
>  [<c032ea66>] tcp_poll+0x1a/0x152
>  [<c030c6d9>] sock_poll+0x12/0x14
>  [<c01631a0>] do_select+0x25d/0x2b9
>  [<c0162da9>] __pollwait+0x0/0xc1
>  [<c01634af>] sys_select+0x29e/0x498
>  [<c011c7da>] sys_time+0x16/0x50
>  [<c0103d83>] syscall_call+0x7/0xb
> 
> 
> This was with 2.6.9-rc2 + VS for it (2.6.9-rc2-vs1.9.2.28.4)
> 
> All this seems e1000 (eepro1000) related, but I'm not sure. I saw that
> others have similar problems with that driver, and that doing
> echo 2048 > /proc/sys/vm/min_free_kbytes seems to reduce those
> problems. This is what I've done.
> 
> This morning the server had crashed (after 14 days of uptime). I didn't
> get a chance to see the oops.
> 
> So I compiled another kernel with all the bleeding-edge pieces, to see
> whether that changes anything.
> This time:
> 2.6.9-rc3-bk4 + vs1.9.3-rc2
> the device mapper has all the latest patches,
> the e1000 (eepro1000) driver has been updated to 5.4.11-NAPI (directly
> from the Intel page),
> the qlogic driver has been updated to 8.00.00b21-k
> 
> .... And the results are the same ...
> 
> I have no problems on a non-vserver-patched kernel, but that is on
> different hardware. So the question is:
> Is there a chance that allocations in the vserver code could affect
> this?
> 
> Or do you think vserver is totally innocent in this case?
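
A side note on reading the traces above: "order:0, mode:0x20" describes a single-page request whose only flag is the high-priority bit, which as far as I recall is what GFP_ATOMIC expands to in 2.6 kernels of this era. That fits the call chain shown (net_rx_action -> e1000_alloc_rx_buffers -> alloc_skb), which runs in softirq context and cannot sleep. The sketch below decodes such a line; the flag table is written from memory of include/linux/gfp.h and should be treated as an assumption to check against your own tree, and decode_failure is just an illustrative name.

```python
import re

# Flag bits as recalled from include/linux/gfp.h in 2.6.9-era kernels.
# ASSUMPTION: these values change between kernel versions; verify locally.
GFP_FLAGS = {
    0x01: "__GFP_DMA",
    0x02: "__GFP_HIGHMEM",
    0x10: "__GFP_WAIT",
    0x20: "__GFP_HIGH",
    0x40: "__GFP_IO",
    0x80: "__GFP_FS",
}

FAILURE_RE = re.compile(
    r"(?P<comm>\S+): page allocation failure\. "
    r"order:(?P<order>\d+), mode:(?P<mode>0x[0-9a-fA-F]+)"
)


def decode_failure(line):
    """Decode one 'page allocation failure' line, or return None."""
    m = FAILURE_RE.search(line)
    if m is None:
        return None
    order = int(m.group("order"))
    mode = int(m.group("mode"), 16)

    flags = []
    known_mask = 0
    for bit, name in sorted(GFP_FLAGS.items()):
        known_mask |= bit
        if mode & bit:
            flags.append(name)
    leftover = mode & ~known_mask
    if leftover:
        # Bits not in the table above are reported raw rather than guessed.
        flags.append(hex(leftover))

    return {
        "comm": m.group("comm"),
        "pages": 1 << order,               # order:N means 2**N contiguous pages
        "flags": flags,
        "may_sleep": bool(mode & 0x10),    # __GFP_WAIT absent => atomic context
    }
```

Under those assumptions, both failures in the quoted message decode to a one-page, non-sleeping request, which would be consistent with raising min_free_kbytes making them rarer.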

I've had a crash with vs1.9.2 and 2.6.8.1 after 3 days of uptime.
It happened while accessing an automounted NFS share and was a
kernel NULL pointer dereference at .......
I have a 3Com Vortex card.

This did not happen with stock 2.6.8.1.
I'm currently trying 1.9.3-rc2 + 2.6.9-rc3-bk3 to see if it happens
again.
Hopefully the log output of the crash will make it to a remote logging host...
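
If the messages do reach the logging host, a small script can tally them so that runs on different kernels are easier to compare. This is only a sketch: it assumes each kernel message sits on one line in the log (the mail client wrapped the ones quoted above), and summarize is an illustrative name, not an existing tool.

```python
import re
from collections import Counter

# Matches the peer address in "Treason uncloaked" messages.
TREASON_RE = re.compile(
    r"TCP: Treason uncloaked! Peer (\d+(?:\.\d+){3}):\d+/\d+ shrinks window"
)
# Matches the process name in "page allocation failure" messages.
ALLOC_RE = re.compile(r"(\S+): page allocation failure\.")


def summarize(lines):
    """Count 'Treason uncloaked' per peer IP and allocation failures per process."""
    peers = Counter()
    procs = Counter()
    for line in lines:
        m = TREASON_RE.search(line)
        if m:
            peers[m.group(1)] += 1
            continue
        m = ALLOC_RE.search(line)
        if m:
            procs[m.group(1)] += 1
    return peers, procs
```

It takes any iterable of lines, so an open log file can be passed in directly; where that file lives depends on the syslog setup of the logging host.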

-- 
lg, Chris

_______________________________________________
Vserver mailing list
[EMAIL PROTECTED]
http://list.linux-vserver.org/mailman/listinfo/vserver
