Re: 3.16.49 Oops, does not boot on two socket server
Hello,

just want to give a follow-up. I have tested this with 3.16.51 and the
problem still exists. It seems the 3.16.x tree is no longer usable for
two-socket servers :-(

Regards,
Holger

PS: here is the panic with 3.16.51:

smpboot: Total of 24 processors activated (95963.71 BogoMIPS)
[ cut here ]
WARNING: CPU: 0 PID: 1 at kernel/sched/core.c:5811 init_overlap_sched_group+0x114/0x120()
Modules linked in:
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.16.51-1.el6.x86_64 #1
Hardware name: HP ProLiant DL380p Gen8, BIOS P70 08/02/2014
 880fe96c7da8 815432dc 16b3 880fe96c7de8
 8104cc72 880fff803c00 880fe8d05650 881fe96ba3a8
 880fe96af540
Call Trace:
 [] dump_stack+0x4e/0x6a
 [] warn_slowpath_common+0x82/0xb0
 [] warn_slowpath_null+0x15/0x20
 [] init_overlap_sched_group+0x114/0x120
 [] build_overlap_sched_groups+0x134/0x1e0
 [] build_sched_domains+0x159/0x330
 [] sched_init_smp+0x65/0xf8
 [] kernel_init_freeable+0xb2/0x12d
 [] ? rest_init+0x80/0x80
 [] kernel_init+0x9/0xf0
 [] ret_from_fork+0x58/0x90
 [] ? rest_init+0x80/0x80
---[ end trace 207206398bdf8ddb ]---
BUG: unable to handle kernel paging request at 01024a7f
IP: [] init_overlap_sched_group+0xae/0x120
PGD 0
Oops: [#1] SMP
Modules linked in:
CPU: 0 PID: 1 Comm: swapper/0 Tainted: GW 3.16.51-1.el6.x86_64 #1
Hardware name: HP ProLiant DL380p Gen8, BIOS P70 08/02/2014
task: 880fe96d ti: 880fe96c4000 task.ti: 880fe96c4000
RIP: 0010:[] [] init_overlap_sched_group+0xae/0x120
RSP: :880fe96c7e08 EFLAGS: 00010246
RAX: 0100 RBX: 880fe8d05650 RCX: 0020
RDX: 00014a80 RSI: 0020 RDI: 0020
RBP: 880fe96c7e28 R08: 880fe96af558 R09:
R10: 0002 R11: 0001 R12: 881fe96ba3a8
R13: 880fe96af540 R14: R15: 881fe96ba3a8
FS: () GS:880fffc0() knlGS:
CS: 0010 DS: ES: CR0: 80050033
CR2: 01024a7f CR3: 01714000 CR4: 000407f0
Stack:
 880fe8d05650 880fe96c7ea8 81079b04 0011
 880fe96af540 cd68
Call Trace:
 [] build_overlap_sched_groups+0x134/0x1e0
 [] build_sched_domains+0x159/0x330
 [] sched_init_smp+0x65/0xf8
 [] kernel_init_freeable+0xb2/0x12d
 [] ? rest_init+0x80/0x80
 [] kernel_init+0x9/0xf0
 [] ret_from_fork+0x58/0x90
 [] ? rest_init+0x80/0x80
Code: 60 83 00 85 c0 74 70 49 8d 75 18 48 c7 c2 38 f9 8a 81 bf ff ff ff ff e8 31 f9 1f 00 49 8b 54 24 10 48 98 48 8b 04 c5 a0 fc 78 81 <48> 8b 14 10 b8 01 00 00 00 49 89 55 10 f0 0f c1 02 85 c0 75 0f
RIP [] init_overlap_sched_group+0xae/0x120
RSP
CR2: 01024a7f
---[ end trace 207206398bdf8ddc ]---
Kernel panic - not syncing: Attempted to kill init! exitcode=0x0009

On Wed, 18 Oct 2017, Holger Kiehl wrote:
> Hello,
>
> just tried to boot 3.16.49 on a 2 socket server and it fails with the
> following error:
>
> smpboot: Total of 24 processors activated (95818.36 BogoMIPS)
> [ cut here ]
> WARNING: CPU: 0 PID: 1 at kernel/sched/core.c:5811 init_overlap_sched_group+0x114/0x120()
> Modules linked in:
> CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.16.49-1.el6.x86_64 #1
> Hardware name: HP ProLiant DL380p Gen8, BIOS P70 08/02/2014
>  880bfd6d3da8 81542f1c 16b3 880bfd6d3de8
>  8104cd72 880c0f803c00 880bfcc69650 8817fd695ca8
>  880bfd6e2300
> Call Trace:
>  [] dump_stack+0x4e/0x6a
>  [] warn_slowpath_common+0x82/0xb0
>  [] warn_slowpath_null+0x15/0x20
>  [] init_overlap_sched_group+0x114/0x120
>  [] build_overlap_sched_groups+0x134/0x1e0
>  [] build_sched_domains+0x159/0x330
>  [] sched_init_smp+0x65/0xf8
>  [] kernel_init_freeable+0xb2/0x12d
>  [] ? rest_init+0x80/0x80
>  [] kernel_init+0x9/0xf0
>  [] ret_from_fork+0x58/0x90
>  [] ? rest_init+0x80/0x80
> ---[ end trace a491a27c866dd06e ]---
> BUG: unable to handle kernel paging request at 010247bf
> IP: [] init_overlap_sched_group+0xae/0x120
> PGD 0
> Oops: [#1] SMP
> Modules linked in:
> CPU: 0 PID: 1 Comm: swapper/0 Tainted: GW 3.16.49-1.el6.x86_64 #1
> Hardware name: HP ProLiant DL380p Gen8, BIOS P70 08/02/2014
> task: 8817fd6a8000 ti: 880bfd6d task.ti: 880bfd6d
> RIP: 0010:[] [] init_overlap_sched_group+0xae/0x120
> RSP: :880bfd6d3e08 EFLAGS: 00010246
> RAX: 0100 RBX: 880bfcc69650 RCX:
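In case it helps anyone to narrow this down: since 3.16.48 boots and 3.16.49 does not, a bisect over the stable tree should find the offending backport in a handful of builds. A minimal sketch, assuming a checkout of a tree that carries the v3.16.x stable tags and my attached .config:

    git bisect start v3.16.49 v3.16.48
    # at each step: build, install and boot the candidate kernel
    make olddefconfig && make -j24 && make modules_install install
    git bisect bad      # if it panics as above; otherwise: git bisect good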
3.16.49 Oops, does not boot on two socket server
Hello,

just tried to boot 3.16.49 on a 2 socket server and it fails with the
following error:

smpboot: Total of 24 processors activated (95818.36 BogoMIPS)
[ cut here ]
WARNING: CPU: 0 PID: 1 at kernel/sched/core.c:5811 init_overlap_sched_group+0x114/0x120()
Modules linked in:
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.16.49-1.el6.x86_64 #1
Hardware name: HP ProLiant DL380p Gen8, BIOS P70 08/02/2014
 880bfd6d3da8 81542f1c 16b3 880bfd6d3de8
 8104cd72 880c0f803c00 880bfcc69650 8817fd695ca8
 880bfd6e2300
Call Trace:
 [] dump_stack+0x4e/0x6a
 [] warn_slowpath_common+0x82/0xb0
 [] warn_slowpath_null+0x15/0x20
 [] init_overlap_sched_group+0x114/0x120
 [] build_overlap_sched_groups+0x134/0x1e0
 [] build_sched_domains+0x159/0x330
 [] sched_init_smp+0x65/0xf8
 [] kernel_init_freeable+0xb2/0x12d
 [] ? rest_init+0x80/0x80
 [] kernel_init+0x9/0xf0
 [] ret_from_fork+0x58/0x90
 [] ? rest_init+0x80/0x80
---[ end trace a491a27c866dd06e ]---
BUG: unable to handle kernel paging request at 010247bf
IP: [] init_overlap_sched_group+0xae/0x120
PGD 0
Oops: [#1] SMP
Modules linked in:
CPU: 0 PID: 1 Comm: swapper/0 Tainted: GW 3.16.49-1.el6.x86_64 #1
Hardware name: HP ProLiant DL380p Gen8, BIOS P70 08/02/2014
task: 8817fd6a8000 ti: 880bfd6d task.ti: 880bfd6d
RIP: 0010:[] [] init_overlap_sched_group+0xae/0x120
RSP: :880bfd6d3e08 EFLAGS: 00010246
RAX: 0100 RBX: 880bfcc69650 RCX: 0020
RDX: 000147c0 RSI: 0020 RDI: 0020
RBP: 880bfd6d3e28 R08: 880bfd6e2318 R09:
R10: 0002 R11: 0001 R12: 8817fd695ca8
R13: 880bfd6e2300 R14: R15: 8817fd695ca8
FS: () GS:880c0fc0() knlGS:
CS: 0010 DS: ES: CR0: 80050033
CR2: 010247bf CR3: 001714000 CR4: 000407f0
Stack:
 880bfcc69650 880bfd6d3ea8 81079974 0011
 880bfd6e2300 cac8
Call Trace:
 [] build_overlap_sched_groups+0x134/0x1e0
 [] build_sched_domains+0x159/0x330
 [] sched_init_smp+0x65/0xf8
 [] kernel_init_freeable+0xb2/0x12d
 [] ? rest_init+0x80/0x80
 [] kernel_init+0x9/0xf0
 [] ret_from_fork+0x58/0x90
 [] ? rest_init+0x80/0x80
Code: 61 83 00 85 c0 74 70 49 8d 75 18 48 c7 c2 38 f9 8a 81 bf ff ff ff ff e8 51 fa 1f 00 49 8b 54 24 10 48 98 48 8b 04 c5 a0 fc 78 81 <48> 8b 14 10 b8 01 00 00 00 49 89 55 10 f0 0f c1 02 85 c0 75 0f
RIP [] init_overlap_sched_group+0xae/0x120
RSP
CR2: 010247bf
---[ end trace a491a27c866dd06f ]---
Kernel panic - not syncing: Attempted to kill init! exitcode=0x0009
Rebooting in 5 seconds..

This happened on three different systems. On a similar system with just
one CPU in a socket it boots fine. The last kernel of this series I tried
was 3.16.48 and that worked fine.

Any idea what is wrong? In case it is useful I have attached my kernel
config.

Regards,
Holger

#
# Automatically generated file; DO NOT EDIT.
# Linux/x86 3.16.49 Kernel Configuration
#
CONFIG_64BIT=y
CONFIG_X86_64=y
CONFIG_X86=y
CONFIG_INSTRUCTION_DECODER=y
CONFIG_OUTPUT_FORMAT="elf64-x86-64"
CONFIG_ARCH_DEFCONFIG="arch/x86/configs/x86_64_defconfig"
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_LATENCYTOP_SUPPORT=y
CONFIG_MMU=y
CONFIG_NEED_DMA_MAP_STATE=y
CONFIG_NEED_SG_DMA_LENGTH=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_BUG_RELATIVE_POINTERS=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y
CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK=y
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
CONFIG_ARCH_WANT_HUGE_PMD_SHARE=y
CONFIG_ARCH_WANT_GENERAL_HUGETLB=y
CONFIG_ZONE_DMA32=y
CONFIG_AUDIT_ARCH=y
CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
CONFIG_HAVE_INTEL_TXT=y
CONFIG_X86_64_SMP=y
CONFIG_X86_HT=y
CONFIG_ARCH_HWEIGHT_CFLAGS="-fcall-saved-rdi -fcall-saved-rsi -fcall-saved-rdx -fcall-saved-rcx -fcall-saved-r8 -fcall-saved-r9 -fcall-saved-r10 -fcall-saved-r11"
CONFIG_ARCH_SUPPORTS_UPROBES=y
CONFIG_FIX_EARLYCON_MEM=y
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"
CONFIG_IRQ_WORK=y
CONFIG_BUILDTIME_EXTABLE_SORT=y

#
# General setup
#
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_CROSS_COMPILE=""
# CONFIG_COMPILE_TEST is not set
CONFIG_LOCALVERSION=""
# CONFIG_LOCALVERSION_AUTO is not se
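PS: since it only hits machines with two populated sockets, the NUMA topology that triggers the warning can be double-checked with standard tools; a quick sketch (numactl comes from the numactl package, names may differ per distribution):

    lscpu | grep -E '^(Socket|NUMA)'
    numactl --hardware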
Re: [PATCH] MD: make bio mergeable
On Thu, 28 Apr 2016, Shaohua Li wrote:
> On Thu, Apr 28, 2016 at 08:00:22PM +0000, Holger Kiehl wrote:
> > Hello,
> >
> > On Mon, 25 Apr 2016, Shaohua Li wrote:
> >
> > > blk_queue_split marks bio unmergeable, which makes sense for normal bio.
> > > But if dispatching the bio to underlayer disk, the blk_queue_split
> > > checks are invalid, hence it's possible the bio becomes mergeable.
> > >
> > > In the reported bug, this bug causes trim against raid0 performance slash
> > > https://bugzilla.kernel.org/show_bug.cgi?id=117051
> > >
> > This patch makes a huge difference. On a system with two Samsung 850 Pro
> > in a MD Raid0 setup the time for fstrim went down from ~30min to 18sec!
> >
> > However, on another system with two Intel P3700 1.6TB NVMe PCIe SSD's
> > also setup as one big MD Raid0, the patch does not make any difference
> > at all. fstrim takes more then 4 hours!
>
> Does the raid0 cross two partitions or two SSD?
>
Two SSDs. On the system where it works, the two Samsung 850 Pro SATA
SSDs are combined via partitions.

> can you post blktrace data in the bugzilloa, I'll track the bug there.
>
I did the blktrace on the two NVMe devices backing the md raid0,
/dev/nvme[01]n1, for 2 minutes and attached them to bug 117051 as a
tar.bz2 file:

https://bugzilla.kernel.org/show_bug.cgi?id=117051

Please just ask if I have forgotten anything. And many thanks for
looking at this and all the good work!

Regards,
Holger
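PS: for reference, a minimal sketch of how such traces can be gathered (assuming the blktrace package is installed; device and mount point names are just placeholders for this system):

    blktrace -d /dev/nvme0n1 -d /dev/nvme1n1 -w 120 &
    time fstrim -v /data        # /data stands for the raid0 mount point
    wait                        # let blktrace finish its 120 seconds
    blkparse nvme0n1 nvme1n1 > fstrim-trace.txt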
Re: [PATCH] MD: make bio mergeable
Hello,

On Mon, 25 Apr 2016, Shaohua Li wrote:
> blk_queue_split marks bio unmergeable, which makes sense for normal bio.
> But if dispatching the bio to underlayer disk, the blk_queue_split
> checks are invalid, hence it's possible the bio becomes mergeable.
>
> In the reported bug, this bug causes trim against raid0 performance slash
> https://bugzilla.kernel.org/show_bug.cgi?id=117051
>
This patch makes a huge difference. On a system with two Samsung 850 Pro
in a MD Raid0 setup the time for fstrim went down from ~30min to 18sec!

However, on another system with two Intel P3700 1.6TB NVMe PCIe SSD's
also setup as one big MD Raid0, the patch does not make any difference
at all. fstrim takes more than 4 hours!

Any idea what could be wrong?

Regards,
Holger

> Reported-by: Park Ju Hyung
> Fixes: 6ac45aeb6bca (block: avoid to merge splitted bio)
> Cc: sta...@vger.kernel.org (v4.3+)
> Cc: Ming Lei
> Cc: Jens Axboe
> Cc: Neil Brown
> Signed-off-by: Shaohua Li
> ---
>  drivers/md/md.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 194580f..14d3b37 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -284,6 +284,8 @@ static blk_qc_t md_make_request(struct request_queue *q, struct bio *bio)
>  	 * go away inside make_request
>  	 */
>  	sectors = bio_sectors(bio);
> +	/* bio could be mergeable after passing to underlayer */
> +	bio->bi_rw &= ~REQ_NOMERGE;
>  	mddev->pers->make_request(mddev, bio);
>
>  	cpu = part_stat_lock();
> --
> 2.8.0.rc2
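For anyone wanting to reproduce the measurement, the timing above was taken simply with fstrim; a sketch (the mount point is a placeholder for the filesystem on the raid0):

    time fstrim -v /data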
Re: Filesystem corruption MD (imsm) Raid0 via 2 SSD's + discard
On Thu, 21 May 2015, NeilBrown wrote:

On Thu, 21 May 2015 06:44:27 + (UTC) Holger Kiehl wrote:

On Thu, 21 May 2015, NeilBrown wrote:

On Thu, 21 May 2015 01:32:13 +0500 Roman Mamedov wrote:

On Wed, 20 May 2015 20:12:31 + (UTC) Holger Kiehl wrote:

The kernel I was running when I discovered the problem was 4.0.2 from kernel.org. However, after reinstalling from DVD I updated to Fedora's latest kernel, which was 3.19.? (I do not remember the last numbers). So that kernel seems also affected, but I assume it contains many 'fixes' from 4.0.x. As filesystem I use ext4, distribution is Fedora 21 and hardware is: Xeon E3-1275, 16GB ECC Ram. My system seems to be now running stable for some days with kernel.org kernel 4.0.3 and with discard DISABLED. But I am still unsure what could be the real cause.

It is a bug in the 4.0.2 kernel, fixed in 4.0.3.

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=785672
https://bbs.archlinux.org/viewtopic.php?id=197400
https://kernel.googlesource.com/pub/scm/linux/kernel/git/stable/linux-stable/+/d2dc317d564a46dfc683978a2e5a4f91434e9711

I suspect that is a different bug. I think this one is

https://bugzilla.kernel.org/show_bug.cgi?id=98501

Should there not be a big fat warning going around telling users to disable discard on Raid 0 until this is fixed? This breaks the filesystem completely and I believe there is absolutely no way one can get back the data.

Probably. Would you like to do that?

Is this fixed in 4.0.4? And which kernels are affected? There could be many people running systems that have not noticed this and don't know in what dangerous situation they are when they delete data.

The patch was only added to my tree today. I will send to Linus tomorrow so it should appear in the next -rc. Any -stable kernel released since mid-April probably has the bug. It was caused by commit 47d68979cc968535cb87f3e5f2e6a3533ea48fbd. Once the fix gets into Linus' tree, it should get into subsequent -stable releases. The fix is here:

http://git.neil.brown.name/?p=md.git;a=commitdiff;h=a81157768a00e8cf8a7b43b5ea5cac931262374f

commit id should remain unchanged.

I would like to confirm that with this patch and discard enabled, I no longer see any corruption. Many thanks for the quick fix!

Regards,
Holger
WARNING: Software Raid 0 on SSD's and discard corrupts data
Hello,

all users using a Software Raid 0 on SSD's with discard should disable discard if they use any recent kernel since mid-April 2015. The bug was introduced by commit 47d68979cc968535cb87f3e5f2e6a3533ea48fbd and the fix is not yet in Linus' tree. The fix can be found here:

http://git.neil.brown.name/?p=md.git;a=commitdiff;h=a81157768a00e8cf8a7b43b5ea5cac931262374f

Users should immediately remove the discard option from any mounted software Raid 0 filesystems. Any delete or modification of files can lead to random corruption of the filesystem. Use the remount option of the mount command to remove the discard option. Do not do it via editing /etc/fstab if your root filesystem is on a software Raid 0.

Regards,
Holger
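PS: a sketch of the remount, assuming ext4 and that /data is an affected mount point; repeat for every filesystem currently mounted with discard:

    mount -o remount,nodiscard /data
    grep discard /proc/mounts    # verify nothing is still mounted with discard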
Re: Filesystem corruption MD (imsm) Raid0 via 2 SSD's + discard
On Thu, 21 May 2015, NeilBrown wrote:

On Thu, 21 May 2015 01:32:13 +0500 Roman Mamedov wrote:

On Wed, 20 May 2015 20:12:31 + (UTC) Holger Kiehl wrote:

The kernel I was running when I discovered the problem was 4.0.2 from kernel.org. However, after reinstalling from DVD I updated to Fedora's latest kernel, which was 3.19.? (I do not remember the last numbers). So that kernel seems also affected, but I assume it contains many 'fixes' from 4.0.x. As filesystem I use ext4, distribution is Fedora 21 and hardware is: Xeon E3-1275, 16GB ECC Ram. My system seems to be now running stable for some days with kernel.org kernel 4.0.3 and with discard DISABLED. But I am still unsure what could be the real cause.

It is a bug in the 4.0.2 kernel, fixed in 4.0.3.

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=785672
https://bbs.archlinux.org/viewtopic.php?id=197400
https://kernel.googlesource.com/pub/scm/linux/kernel/git/stable/linux-stable/+/d2dc317d564a46dfc683978a2e5a4f91434e9711

I suspect that is a different bug. I think this one is

https://bugzilla.kernel.org/show_bug.cgi?id=98501

Should there not be a big fat warning going around telling users to disable discard on Raid 0 until this is fixed? This breaks the filesystem completely and I believe there is absolutely no way one can get back the data.

Is this fixed in 4.0.4? And which kernels are affected? There could be many people running systems that have not noticed this and don't know in what dangerous situation they are when they delete data.

Regards,
Holger
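One way to answer the "which kernels are affected" question is to ask git which releases contain the offending commit; a sketch, assuming a clone of the linux-stable tree:

    git tag --contains 47d68979cc968535cb87f3e5f2e6a3533ea48fbd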
Filesystem corruption MD (imsm) Raid0 via 2 SSD's + discard
Hello,

I had a terrible weekend recovering my home system. Always when files were deleted, some data got corrupted. At first I did not notice it, but when I rebooted the system would not come up again; systemd crashed with SIGSEGV and that was it. Booting from a USB stick I saw that some glibc lib had a different size from that in the original RPM. So all I did was reinstall that lib from the USB stick and everything was fine after rebooting from the Raid 0. But I then wanted to make sure that no other files were corrupted, so I checked and found more. So again I reinstalled those RPM's and rebooted. To my big surprise the system was again broken and failed to boot. I again tried to recover my system from the USB stick, but this time did not manage to recover the system. So I decided to reinstall the system completely from DVD.

Everything looked good until the moment when I had activated the discard option in /etc/fstab. After doing some more work (adding and removing things) I rebooted and again the system failed to boot. Booting from the USB stick I saw that /etc/fstab was all filled with NULLs. This gave me the clue that there must be some problem with discard (trim).

My system is using a software raid 0 IMSM (intel 'fake' raid) on two Samsung SSD 840 pro. A Windows system on the same disks (that is why I am using IMSM raid) was not affected by this problem. I have checked the ram with memtest86 and everything is ok.

The kernel I was running when I discovered the problem was 4.0.2 from kernel.org. However, after reinstalling from DVD I updated to Fedora's latest kernel, which was 3.19.? (I do not remember the last numbers). So that kernel seems also affected, but I assume it contains many 'fixes' from 4.0.x. As filesystem I use ext4, distribution is Fedora 21 and hardware is: Xeon E3-1275, 16GB ECC Ram.

My system seems to be now running stable for some days with kernel.org kernel 4.0.3 and with discard DISABLED. But I am still unsure what could be the real cause.

Regards,
Holger
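PS: for anyone in the same situation, corrupted packaged files can be found without comparing sizes by hand; a sketch using rpm's verify mode (run from rescue media, optionally with --root pointing at the installed system):

    rpm -Va | grep '^..5'    # a '5' in the third column marks a digest mismatch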
qlcnic very high TX values, as of 3.13.x
Hello,

upgrading from 3.10.x to the next stable series 3.14.x I noticed that ifconfig reports very high TX values. Taking the qlcnic source from 3.15.5 and compiling it under 3.14.12, the problem remains. Going backwards, always just copying the qlcnic source from the older kernels to the 3.14.12 tree, I noticed that the 3.12.x kernel was the last version that does not generate those high TX values. So the problem started with the qlcnic driver in 3.13.x. However, comparing 3.13.x and 3.14.x, the numbers go up much quicker in 3.14.x. In 3.14.x I get TX values in Terabytes very quickly after boot. I once even got Petabyte values!

Hardware is the following:

HP ProLiant DL380 G7
2 x Intel Xeon X5690 (24 cores with hyperthreading)
106 GByte Ram
1 x NC523SFP 10Gb 2-port Server Adapter Board Chip rev 0x54 (qlcnic)
1 x Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (ixgbe)

The qlcnic and ixgbe cards are bonded together in fault-tolerance (active-backup) mode. And even when I switch to the Intel card after I get crazy TX values on the qlcnic card, the TX values on this card still go up at a very quick rate. This only stops when I reset the card (reload the module). Also, there is no difference if I compile the driver in or use it as a module. There are no strange messages in /var/log/messages or dmesg.

Here the output with the 3.13.x driver in 3.14.12 when the system boots:

[ 18.229195] QLogic 1/10 GbE Converged/Intelligent Ethernet Driver v5.3.52
[ 18.229415] qlcnic :1a:00.0: 2048KB memory map
[ 18.854134] qlcnic :1a:00.0: Default minidump capture mask 0x1f
[ 19.602491] qlcnic :1a:00.0: FW dump enabled
[ 19.631257] qlcnic :1a:00.0: Supports FW dump capability
[ 19.667072] qlcnic :1a:00.0: Driver v5.3.52, firmware v4.14.26
[ 19.704279] qlcnic :1a:00.0: Set 4 Tx rings
[ 19.733001] qlcnic :1a:00.0: Set 4 SDS rings
[ 19.898808] qlcnic: 2c:27:d7:50:04:48: NC523SFP 10Gb 2-port Server Adapter Board Chip rev 0x54
[ 19.949325] qlcnic :1a:00.0: irq 129 for MSI/MSI-X
[ 19.949329] qlcnic :1a:00.0: irq 130 for MSI/MSI-X
[ 19.949333] qlcnic :1a:00.0: irq 131 for MSI/MSI-X
[ 19.949336] qlcnic :1a:00.0: irq 132 for MSI/MSI-X
[ 19.949340] qlcnic :1a:00.0: irq 133 for MSI/MSI-X
[ 19.949343] qlcnic :1a:00.0: irq 134 for MSI/MSI-X
[ 19.949347] qlcnic :1a:00.0: irq 135 for MSI/MSI-X
[ 19.949350] qlcnic :1a:00.0: irq 136 for MSI/MSI-X
[ 19.949369] qlcnic :1a:00.0: using msi-x interrupts
[ 19.982782] qlcnic :1a:00.0: Set 4 Tx queues
[ 20.055099] qlcnic :1a:00.0: eth2: XGbE port initialized
[ 20.090408] qlcnic :1a:00.1: 2048KB memory map
[ 20.179836] qlcnic :1a:00.1: Default minidump capture mask 0x1f
[ 20.217848] qlcnic :1a:00.1: FW dump enabled
[ 20.246979] qlcnic :1a:00.1: Supports FW dump capability
[ 20.282318] qlcnic :1a:00.1: Driver v5.3.52, firmware v4.14.26
[ 20.320238] qlcnic :1a:00.1: Set 4 Tx rings
[ 20.350038] qlcnic :1a:00.1: Set 4 SDS rings
[ 20.429714] qlcnic :1a:00.1: irq 137 for MSI/MSI-X
[ 20.429718] qlcnic :1a:00.1: irq 138 for MSI/MSI-X
[ 20.429722] qlcnic :1a:00.1: irq 139 for MSI/MSI-X
[ 20.429726] qlcnic :1a:00.1: irq 140 for MSI/MSI-X
[ 20.429729] qlcnic :1a:00.1: irq 141 for MSI/MSI-X
[ 20.429732] qlcnic :1a:00.1: irq 142 for MSI/MSI-X
[ 20.429736] qlcnic :1a:00.1: irq 143 for MSI/MSI-X
[ 20.429739] qlcnic :1a:00.1: irq 144 for MSI/MSI-X
[ 20.429757] qlcnic :1a:00.1: using msi-x interrupts
[ 20.458895] qlcnic :1a:00.1: Set 4 Tx queues
[ 20.486907] qlcnic :1a:00.1: eth3: XGbE port initialized

My kernel config can be downloaded here:

ftp://ftp.dwd.de/pub/afd/test/.config

Please, just ask if I need to provide more details and please CC me, since I am not on the list.

Thanks,
Holger
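PS: if it helps, the runaway counters can be sampled directly from sysfs to see how fast they grow; a sketch (eth2 is the qlcnic port from the log above):

    while sleep 1; do cat /sys/class/net/eth2/statistics/tx_bytes; done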
Kernel panic with 3.10.33 and possible hpwdt watchdog
Hello,

I use a plain kernel.org kernel 3.10.33 and when I do a HP ILO (proprietary embedded server management technology) reset of my ProLiant 380p server, the system hangs. Unfortunately I cannot do a serial trace, so I copied by hand everything I could read from the console:

[] ? vga_set_palette+0xd1/0x130
[] ? panic+0x18c/0x1c7
[] ? panic+0xf4/0x1c7
[] ? hpwdt_pretimeout+0xc5/0xd0 [hpwdt]
[] ? nmi_handle+0x59/0x80
[] ? default_do_nmi+0x12f/0x2a0
[] ? do_nmi+0x88/0xd0
[] ? end_repeat_nmi+0x1e/0x2e
[] ? intel_idle+0xb6/0x120
[] ? intel_idle+0xb6/0x120
[] ? intel_idle+0xb6/0x120
<>
[] ? cpuidle_enter_state+0x3d/0xd0
[] ? cpuidle_idle_call+0xba/0x140
[] ? __tick_nohz_idle_enter+0x8d/0x120
[] ? arch_cpu_idle+0x9/0x30
[] ? cpu_idle_loop+0x92/0x160
[] ? cpu_startup_entry+0x6b/0x70
[] ? start_kernel+0x3e2/0x3ed
[] ? repair_env_string+0x5e/0x5e
[] ? x86_64_start_kernel+0x12a/0x130
---[ end trace 2a7f5aee76758ec0 ]---
dmar: DRHD: handling fault status reg 2
dmar: DMAR:[DMA Read] Request device [01:00.2] fault addr e9000
DMAR:[fault reason 06] PTE Read access is not set

If I remove the hpwdt driver and then reset the HP ILO system, the system also hangs, but continuously, at an interval of approx. 2 seconds, writes the following to the console:

NMI: IOCK error (debug interrupt?) for reason 61 on CPU 0.
NMI: IOCK error (debug interrupt?) for reason 61 on CPU 0.
NMI: IOCK error (debug interrupt?) for reason 61 on CPU 0.
NMI: IOCK error (debug interrupt?) for reason 61 on CPU 0.
NMI: IOCK error (debug interrupt?) for reason 71 on CPU 0.
NMI: IOCK error (debug interrupt?) for reason 71 on CPU 0.
NMI: IOCK error (debug interrupt?) for reason 61 on CPU 0.
NMI: IOCK error (debug interrupt?) for reason 61 on CPU 0.
NMI: IOCK error (debug interrupt?) for reason 71 on CPU 0.
NMI: IOCK error (debug interrupt?) for reason 61 on CPU 0.
NMI: IOCK error (debug interrupt?) for reason 61 on CPU 0.
NMI: IOCK error (debug interrupt?) for reason 71 on CPU 0.
NMI: IOCK error (debug interrupt?) for reason 71 on CPU 0.
NMI: IOCK error (debug interrupt?) for reason 71 on CPU 0.
NMI: IOCK error (debug interrupt?) for reason 61 on CPU 0.
NMI: IOCK error (debug interrupt?) for reason 71 on CPU 0.
NMI: IOCK error (debug interrupt?) for reason 61 on CPU 0.
NMI: IOCK error (debug interrupt?) for reason 61 on CPU 0.

Also, setting nmi_watchdog=0 does not change anything. This does not happen when I take the default kernel of the distribution (Scientific Linux 6.5), 2.6.32-431.5.1.el6.x86_64.

The bad thing is that when the hpwdt driver is loaded, the watchdog does not reset the system, i.e. it hangs forever. And I cannot use the Intel TCO WatchDog Timer Driver since it is disabled in the BIOS.

Please, can someone give me a hint where the error could be and what I can do so I can continue to use the kernel.org kernel.

Many thanks in advance,
Holger

PS: Please CC me since I am not subscribed
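PS2: for now my workaround is simply to keep the driver out of the way; a sketch, assuming hpwdt is built as a module (the modprobe.d path may vary per distribution):

    rmmod hpwdt                                              # at runtime
    echo 'blacklist hpwdt' > /etc/modprobe.d/blacklist-hpwdt.conf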
Re: Need help in bug in isolate_migratepages_range
On Mon, 3 Feb 2014, David Rientjes wrote:

On Mon, 3 Feb 2014, Vlastimil Babka wrote:

It seems to come from balloon_page_movable() and its test page_count(page) == 1. Hmm, I think it might be because compound_head() == NULL here.

Holger, this looks like a race condition when allocating a compound page, did you only see it once or is it actually reproducible?

No, this only happened once. It is not reproducible; the system was running for four days without problems. And before this kernel, five years without any problems.

Thanks,
Holger
Re: Need help in bug in isolate_migratepages_range
On Mon, 3 Feb 2014, Michal Hocko wrote: On Mon 03-02-14 14:29:22, Holger Kiehl wrote: I have attached it. Please, tell me if you do not get the attachment. I hoped it would help me to get a closer compiled code to yours but I am probably using too different gcc. I have an old gcc, it is 4.4.1-2. Anyway I've tried to check whether I can hook on something and it seems that this is a race with thp merge/split or something like that. [...] Jan 31 13:07:43 asterix kernel: BUG: unable to handle kernel NULL pointer dereference at 001c Jan 31 13:07:43 asterix kernel: IP: [] isolate_migratepages_range+0x32d/0x653 Jan 31 13:07:43 asterix kernel: PGD 7d3074067 PUD 7d3073067 PMD 0 Jan 31 13:07:43 asterix kernel: Oops: [#1] SMP Jan 31 13:07:43 asterix kernel: Modules linked in: drbd lru_cache coretemp ipmi_devintf bonding nf_conntrack_ftp binfmt_misc usbhid i2c_i801 sg ehci_pci i2c_core ehci_hcd uhci_hcd i5000_edac i5k_amb ipmi_si ipmi_msghandler usbcore usb_common [last unloaded: microcode] Jan 31 13:07:43 asterix kernel: CPU: 5 PID: 14164 Comm: java Not tainted 3.12.9 #1 Jan 31 13:07:43 asterix kernel: Hardware name: FUJITSU SIEMENS PRIMERGY RX300 S4 /D2519, BIOS 4.06 Rev. 1.04.2519 07/30/2008 Jan 31 13:07:43 asterix kernel: task: 8807d30b08c0 ti: 8807d30b2000 task.ti: 8807d30b2000 Jan 31 13:07:43 asterix kernel: RIP: 0010:[] [] isolate_migratepages_range+0x32d/0x653 Jan 31 13:07:43 asterix kernel: RSP: :8807d30b3928 EFLAGS: 00010286 Jan 31 13:07:43 asterix kernel: RAX: RBX: 0020ec09 RCX: 0002 Jan 31 13:07:43 asterix kernel: RDX: 2c008000 RSI: 0004 RDI: 006c Jan 31 13:07:43 asterix kernel: RBP: 8807d30b39f8 R08: 88083fbde390 R09: 0001 Jan 31 13:07:43 asterix kernel: R10: R11: ea000733a000 R12: 8807d30b3a58 Jan 31 13:07:43 asterix kernel: R13: ea000733a1f8 R14: R15: 88083ffe1d80 Jan 31 13:07:43 asterix kernel: FS: 7f9d9e72f910() GS:88083fd4() knlGS: Jan 31 13:07:43 asterix kernel: CS: 0010 DS: ES: CR0: 8005003b Jan 31 13:07:43 asterix kernel: CR2: 001c CR3: 0007d307 CR4: 000407e0 Jan 31 13:07:43 asterix kernel: Stack: Jan 31 13:07:43 asterix kernel: 0009 88083ffe16c0 ea2e6af0 8807d30b3998 Jan 31 13:07:43 asterix kernel: 8807d30b2010 00ff8807d30b08c0 8807d30b08c0 0020f000 Jan 31 13:07:43 asterix kernel: 083b 000a 8807d30b3a68 Jan 31 13:07:43 asterix kernel: Call Trace: Jan 31 13:07:43 asterix kernel: [] ? lru_add_drain_cpu+0x25/0x97 Jan 31 13:07:43 asterix kernel: [] compact_zone+0x2b5/0x319 Jan 31 13:07:43 asterix kernel: [] ? put_super+0x20/0x2c Jan 31 13:07:43 asterix kernel: [] compact_zone_order+0xad/0xc4 Jan 31 13:07:43 asterix kernel: [] try_to_compact_pages+0x91/0xe8 Jan 31 13:07:43 asterix kernel: [] ? page_alloc_cpu_notify+0x3e/0x3e Jan 31 13:07:43 asterix kernel: [] __alloc_pages_direct_compact+0xae/0x195 Jan 31 13:07:43 asterix kernel: [] __alloc_pages_nodemask+0x772/0x7b5 Jan 31 13:07:43 asterix kernel: [] alloc_pages_vma+0xd6/0x101 Jan 31 13:07:43 asterix kernel: [] do_huge_pmd_anonymous_page+0x199/0x2ee Jan 31 13:07:43 asterix kernel: [] handle_mm_fault+0x1b7/0xceb Jan 31 13:07:43 asterix kernel: [] ? __dequeue_entity+0x2e/0x33 Jan 31 13:07:43 asterix kernel: [] __do_page_fault+0x3bd/0x3e4 Jan 31 13:07:43 asterix kernel: [] ? mprotect_fixup+0x1c9/0x1fb Jan 31 13:07:43 asterix kernel: [] ? vm_mmap_pgoff+0x6d/0x8f Jan 31 13:07:43 asterix kernel: [] ? 
SyS_futex+0x103/0x13d Jan 31 13:07:43 asterix kernel: [] do_page_fault+0x9/0xb Jan 31 13:07:43 asterix kernel: [] page_fault+0x22/0x30 Jan 31 13:07:43 asterix kernel: Code: 00 41 f7 45 00 ff ff ff 01 0f 85 43 02 00 00 41 8b 45 18 85 c0 0f 89 37 02 00 00 49 8b 55 00 4c 89 e8 66 85 d2 79 04 49 8b 45 30 <8b> 40 1c 83 f8 01 0f 85 1b 02 00 00 49 8b 55 08 30 c0 48 85 d2 Jan 31 13:07:43 asterix kernel: RIP [] isolate_migratepages_range+0x32d/0x653 Jan 31 13:07:43 asterix kernel: RSP Jan 31 13:07:43 asterix kernel: CR2: 001c Jan 31 13:07:43 asterix kernel: ---[ end trace fba75c5b0b9175ea ]--- This seems to match: 17027: 49 8b 17mov(%r15),%rdx # page->flags 1702a: 4c 89 f8mov%r15,%rax 1702d: 80 e6 80and$0x80,%dh # PageTail test 17030: 74 04 je 17036 17032: 49 8b 47 30 mov0x30(%r15),%rax # page = page->first_page 17036: 8b 40 1cmov0x1c(%rax),%eax <<< page->_count 17039: ff c8 dec%eax Which seems to be inlined comp
Need help in bug in isolate_migratepages_range
Hello,

today one of our systems got a kernel bug message. It kept on running, but more and more processes began to be stuck in D state (e.g. a simple w command would never return) and I eventually had to reboot. Here the full message:

Jan 31 13:07:43 asterix kernel: BUG: unable to handle kernel NULL pointer dereference at 001c
Jan 31 13:07:43 asterix kernel: IP: [] isolate_migratepages_range+0x32d/0x653
Jan 31 13:07:43 asterix kernel: PGD 7d3074067 PUD 7d3073067 PMD 0
Jan 31 13:07:43 asterix kernel: Oops: [#1] SMP
Jan 31 13:07:43 asterix kernel: Modules linked in: drbd lru_cache coretemp ipmi_devintf bonding nf_conntrack_ftp binfmt_misc usbhid i2c_i801 sg ehci_pci i2c_core ehci_hcd uhci_hcd i5000_edac i5k_amb ipmi_si ipmi_msghandler usbcore usb_common [last unloaded: microcode]
Jan 31 13:07:43 asterix kernel: CPU: 5 PID: 14164 Comm: java Not tainted 3.12.9 #1
Jan 31 13:07:43 asterix kernel: Hardware name: FUJITSU SIEMENS PRIMERGY RX300 S4 /D2519, BIOS 4.06 Rev. 1.04.2519 07/30/2008
Jan 31 13:07:43 asterix kernel: task: 8807d30b08c0 ti: 8807d30b2000 task.ti: 8807d30b2000
Jan 31 13:07:43 asterix kernel: RIP: 0010:[] [] isolate_migratepages_range+0x32d/0x653
Jan 31 13:07:43 asterix kernel: RSP: :8807d30b3928 EFLAGS: 00010286
Jan 31 13:07:43 asterix kernel: RAX: RBX: 0020ec09 RCX: 0002
Jan 31 13:07:43 asterix kernel: RDX: 2c008000 RSI: 0004 RDI: 006c
Jan 31 13:07:43 asterix kernel: RBP: 8807d30b39f8 R08: 88083fbde390 R09: 0001
Jan 31 13:07:43 asterix kernel: R10: R11: ea000733a000 R12: 8807d30b3a58
Jan 31 13:07:43 asterix kernel: R13: ea000733a1f8 R14: R15: 88083ffe1d80
Jan 31 13:07:43 asterix kernel: FS: 7f9d9e72f910() GS:88083fd4() knlGS:
Jan 31 13:07:43 asterix kernel: CS: 0010 DS: ES: CR0: 8005003b
Jan 31 13:07:43 asterix kernel: CR2: 001c CR3: 0007d307 CR4: 000407e0
Jan 31 13:07:43 asterix kernel: Stack:
Jan 31 13:07:43 asterix kernel: 0009 88083ffe16c0 ea2e6af0 8807d30b3998
Jan 31 13:07:43 asterix kernel: 8807d30b2010 00ff8807d30b08c0 8807d30b08c0 0020f000
Jan 31 13:07:43 asterix kernel: 083b 000a 8807d30b3a68
Jan 31 13:07:43 asterix kernel: Call Trace:
Jan 31 13:07:43 asterix kernel: [] ? lru_add_drain_cpu+0x25/0x97
Jan 31 13:07:43 asterix kernel: [] compact_zone+0x2b5/0x319
Jan 31 13:07:43 asterix kernel: [] ? put_super+0x20/0x2c
Jan 31 13:07:43 asterix kernel: [] compact_zone_order+0xad/0xc4
Jan 31 13:07:43 asterix kernel: [] try_to_compact_pages+0x91/0xe8
Jan 31 13:07:43 asterix kernel: [] ? page_alloc_cpu_notify+0x3e/0x3e
Jan 31 13:07:43 asterix kernel: [] __alloc_pages_direct_compact+0xae/0x195
Jan 31 13:07:43 asterix kernel: [] __alloc_pages_nodemask+0x772/0x7b5
Jan 31 13:07:43 asterix kernel: [] alloc_pages_vma+0xd6/0x101
Jan 31 13:07:43 asterix kernel: [] do_huge_pmd_anonymous_page+0x199/0x2ee
Jan 31 13:07:43 asterix kernel: [] handle_mm_fault+0x1b7/0xceb
Jan 31 13:07:43 asterix kernel: [] ? __dequeue_entity+0x2e/0x33
Jan 31 13:07:43 asterix kernel: [] __do_page_fault+0x3bd/0x3e4
Jan 31 13:07:43 asterix kernel: [] ? mprotect_fixup+0x1c9/0x1fb
Jan 31 13:07:43 asterix kernel: [] ? vm_mmap_pgoff+0x6d/0x8f
Jan 31 13:07:43 asterix kernel: [] ? SyS_futex+0x103/0x13d
Jan 31 13:07:43 asterix kernel: [] do_page_fault+0x9/0xb
Jan 31 13:07:43 asterix kernel: [] page_fault+0x22/0x30
Jan 31 13:07:43 asterix kernel: Code: 00 41 f7 45 00 ff ff ff 01 0f 85 43 02 00 00 41 8b 45 18 85 c0 0f 89 37 02 00 00 49 8b 55 00 4c 89 e8 66 85 d2 79 04 49 8b 45 30 <8b> 40 1c 83 f8 01 0f 85 1b 02 00 00 49 8b 55 08 30 c0 48 85 d2
Jan 31 13:07:43 asterix kernel: RIP [] isolate_migratepages_range+0x32d/0x653
Jan 31 13:07:43 asterix kernel: RSP
Jan 31 13:07:43 asterix kernel: CR2: 001c
Jan 31 13:07:43 asterix kernel: ---[ end trace fba75c5b0b9175ea ]---

Kernel is a plain kernel.org kernel 3.12.9 and it uses drbd to replicate data to another host. Any idea what the cause of this bug is? Could it be hardware? The system has been running now for five years without any problems.

Please CC me since I am not on the list. Many thanks in advance.

Regards,
Holger
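PS: since the trace goes through do_huge_pmd_anonymous_page, one thing I could try in the meantime (just a guess on my side, not a confirmed fix) is to keep transparent hugepages from triggering compaction at all:

    echo never > /sys/kernel/mm/transparent_hugepage/enabled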
RE: Problems with ixgbe driver
Hello,

first, thank you for the quick help!

On Fri, 14 Jun 2013, Tantilov, Emil S wrote:

-Original Message-
From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org] On Behalf Of Holger Kiehl
Sent: Friday, June 14, 2013 4:50 AM
To: e1000-de...@lists.sf.net
Cc: linux-kernel; net...@vger.kernel.org
Subject: Problems with ixgbe driver

Hello,

I have dual port 10Gb Intel network card on a 2 socket (Xeon X5690) with a total of 12 cores. Hyperthreading is enabled so there are 24 cores. The problem I have is that when other systems send large amount of data the network with the intel ixgbe driver gets very slow. Ping times go up from 0.2ms to appr. 60ms. Some FTP connections stall for more then 2 minutes. What is strange is that heatbeat is configured on the system with a serial connection to another node and kernel always reports

If the network slows down so much there should be some indication in dmesg. Like Tx hangs perhaps. Can you provide the output of dmesg and ethtool -S from the offending interface after the issue occurs?

No, there is absolutely no indication in dmesg or /var/log/messages. But here is the ethtool output when the ping times go up:

root@helena:~# ethtool -S eth6
NIC statistics:
 rx_packets: 4410779
 tx_packets: 8902514
 rx_bytes: 2014041824
 tx_bytes: 13199913202
 rx_errors: 0
 tx_errors: 0
 rx_dropped: 0
 tx_dropped: 0
 multicast: 4245
 collisions: 0
 rx_over_errors: 0
 rx_crc_errors: 0
 rx_frame_errors: 0
 rx_fifo_errors: 0
 rx_missed_errors: 28143
 tx_aborted_errors: 0
 tx_carrier_errors: 0
 tx_fifo_errors: 0
 tx_heartbeat_errors: 0
 rx_pkts_nic: 2401276937
 tx_pkts_nic: 3868619482
 rx_bytes_nic: 868282794731
 tx_bytes_nic: 5743382228649
 lsc_int: 4
 tx_busy: 0
 non_eop_descs: 743957
 broadcast: 1745556
 rx_no_buffer_count: 0
 tx_timeout_count: 0
 tx_restart_queue: 425
 rx_long_length_errors: 0
 rx_short_length_errors: 0
 tx_flow_control_xon: 171
 rx_flow_control_xon: 0
 tx_flow_control_xoff: 277
 rx_flow_control_xoff: 0
 rx_csum_offload_errors: 0
 alloc_rx_page_failed: 0
 alloc_rx_buff_failed: 0
 lro_aggregated: 0
 lro_flushed: 0
 rx_no_dma_resources: 0
 hw_rsc_aggregated: 1153374
 hw_rsc_flushed: 129169
 fdir_match: 2424508153
 fdir_miss: 1706029
 fdir_overflow: 33
 os2bmc_rx_by_bmc: 0
 os2bmc_tx_by_bmc: 0
 os2bmc_tx_by_host: 0
 os2bmc_rx_by_host: 0
 tx_queue_0_packets: 470182
 tx_queue_0_bytes: 690123121
 tx_queue_1_packets: 797784
 tx_queue_1_bytes: 1203968369
 tx_queue_2_packets: 648692
 tx_queue_2_bytes: 950171718
 tx_queue_3_packets: 647434
 tx_queue_3_bytes: 948647518
 tx_queue_4_packets: 263216
 tx_queue_4_bytes: 394806409
 tx_queue_5_packets: 426786
 tx_queue_5_bytes: 629387628
 tx_queue_6_packets: 253708
 tx_queue_6_bytes: 371774276
 tx_queue_7_packets: 544634
 tx_queue_7_bytes: 812223169
 tx_queue_8_packets: 279056
 tx_queue_8_bytes: 407792510
 tx_queue_9_packets: 735792
 tx_queue_9_bytes: 1092693961
 tx_queue_10_packets: 393576
 tx_queue_10_bytes: 583283986
 tx_queue_11_packets: 712565
 tx_queue_11_bytes: 1037740789
 tx_queue_12_packets: 264445
 tx_queue_12_bytes: 386010613
 tx_queue_13_packets: 246828
 tx_queue_13_bytes: 370387352
 tx_queue_14_packets: 191789
 tx_queue_14_bytes: 281160607
 tx_queue_15_packets: 384581
 tx_queue_15_bytes: 579890782
 tx_queue_16_packets: 175119
 tx_queue_16_bytes: 261312970
 tx_queue_17_packets: 151219
 tx_queue_17_bytes: 220259675
 tx_queue_18_packets: 467746
 tx_queue_18_bytes: 707472612
 tx_queue_19_packets: 30642
 tx_queue_19_bytes: 44896997
 tx_queue_20_packets: 157957
 tx_queue_20_bytes: 238772784
 tx_queue_21_packets: 287819
 tx_queue_21_bytes: 434965075
 tx_queue_22_packets: 269298
 tx_queue_22_bytes: 407637986
 tx_queue_23_packets: 102344
 tx_queue_23_bytes: 145542751
 rx_queue_0_packets: 219438
 rx_queue_0_bytes: 273936020
 rx_queue_1_packets: 398269
 rx_queue_1_bytes: 52080243
 rx_queue_2_packets: 285870
 rx_queue_2_bytes: 102299543
 rx_queue_3_packets: 347238
 rx_queue_3_bytes: 145830086
 rx_queue_4_packets: 118448
 rx_queue_4_bytes: 17515218
 rx_queue_5_packets: 228029
 rx_queue_5_bytes: 114142681
 rx_queue_6_packets: 94285
 rx_queue_6_bytes: 107618165
 rx_queue_7_packets: 289615
 rx_queue_7_bytes: 168428647
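To see whether the slowdown correlates with flow control or missed frames, the suspicious counters can be sampled repeatedly; a sketch:

    watch -n 2 "ethtool -S eth6 | grep -E 'missed|xoff|xon|restart_queue'"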
Problems with ixgbe driver
Hello,

I have a dual-port 10Gb Intel network card in a 2-socket system (Xeon X5690) with a total of 12 cores. Hyperthreading is enabled so there are 24 cores. The problem I have is that when other systems send large amounts of data, the network with the intel ixgbe driver gets very slow. Ping times go up from 0.2ms to approx. 60ms. Some FTP connections stall for more than 2 minutes. What is strange is that heartbeat is configured on the system with a serial connection to another node and the kernel always reports

ttyS0: 4 input overrun(s)

when a lot of data is sent and the ping time goes up. On the network there are three VLANs configured. The network is bonded (active-backup) together with another HP NC523SFP 10Gb 2-port Server Adapter. When I switch the network to this card the problem goes away. Also the ttyS0 input overruns disappear. Note also both network cards are connected to the same switch.

The system uses Scientific Linux 6.4 with a kernel.org kernel. I noticed this behavior with kernel 3.9.5 and 3.9.6-rc1. Before, I did not notice it because traffic always went over the HP NC523SFP qlcnic card.

In search of a solution to the problem I found a newer ixgbe driver 3.15.1 (3.9.6-rc1 has 3.11.33-k) and tried that. But it has the same problem. However, when I load the module as follows:

modprobe ixgbe RSS=8,8

the problem goes away. The kernel.org ixgbe driver does not offer this option. Why? It seems that both drivers have problems on systems with 24 cpu's. But I cannot believe that I am the only one who noticed this, since ixgbe is widely used. It would really be nice if one could set the RSS=8,8 option for the kernel.org ixgbe driver too. Or if someone could tell me where I can force the driver's Receive Side Scaling to 8, even if it means editing the source code.

Below I have added some additional information. Please CC me since I am not subscribed to any of these lists. And please do not hesitate to ask if more information is needed. Many thanks in advance.

Regards,
Holger

Loading ixgbe module 3.15.1 without any options:

2013-06-14T10:01:15.001506+00:00 helena kernel: [74474.075411] Intel(R) 10 Gigabit PCI Express Network Driver - version 3.15.1
2013-06-14T10:01:15.033866+00:00 helena kernel: [74474.116422] Copyright (c) 1999-2013 Intel Corporation.
2013-06-14T10:01:15.204956+00:00 helena kernel: [74474.319440] ixgbe :10:00.0: (PCI Express:5.0GT/s:Width x4) 90:e2:ba:2b:40:80
2013-06-14T10:01:15.317447+00:00 helena kernel: [74474.362568] ixgbe :10:00.0 eth6: MAC: 2, PHY: 15, SFP+: 5, PBA No: E68785-006
2013-06-14T10:01:15.317465+00:00 helena kernel: [74474.394068] bonding: bond0: Adding slave eth6.
2013-06-14T10:01:15.317468+00:00 helena kernel: [74474.431805] ixgbe :10:00.0 eth6: Enabled Features: RxQ: 24 TxQ: 24 FdirHash RSC
2013-06-14T10:01:15.519117+00:00 helena kernel: [74474.599206] 8021q: adding VLAN 0 to HW filter on device eth6
2013-06-14T10:01:15.592853+00:00 helena kernel: [74474.633370] bonding: bond0: enslaving eth6 as a backup interface with a down link.
2013-06-14T10:01:15.592864+00:00 helena kernel: [74474.666823] ixgbe :10:00.0 eth6: detected SFP+: 5
2013-06-14T10:01:15.634509+00:00 helena kernel: [74474.707900] ixgbe :10:00.0 eth6: Intel(R) 10 Gigabit Network Connection
2013-06-14T10:01:15.888030+00:00 helena kernel: [74474.917771] ixgbe :10:00.1: (PCI Express:5.0GT/s:Width x4) 90:e2:ba:2b:40:81
2013-06-14T10:01:15.888032+00:00 helena kernel: [74474.918516] ixgbe :10:00.0 eth6: NIC Link is Up 10 Gbps, Flow Control: RX/TX
2013-06-14T10:01:15.981283+00:00 helena kernel: [74475.001538] ixgbe :10:00.1 eth7: MAC: 2, PHY: 15, SFP+: 6, PBA No: E68785-006
2013-06-14T10:01:15.981293+00:00 helena kernel: [74475.006351] bonding: bond0: link status definitely up for interface eth6, 1 Mbps full duplex.
2013-06-14T10:01:16.025063+00:00 helena kernel: [74475.094633] ixgbe :10:00.1 eth7: Enabled Features: RxQ: 24 TxQ: 24 FdirHash RSC
2013-06-14T10:01:16.067357+00:00 helena kernel: [74475.138402] ixgbe :10:00.1 eth7: Intel(R) 10 Gigabit Network Connection

Loading ixgbe module 3.15.1 with RSS=8,8:

2013-06-14T10:04:24.790464+00:00 helena kernel: [74663.558702] Intel(R) 10 Gigabit PCI Express Network Driver - version 3.15.1
2013-06-14T10:04:24.790484+00:00 helena kernel: [74663.601435] Copyright (c) 1999-2013 Intel Corporation.
2013-06-14T10:04:24.853174+00:00 helena kernel: [74663.630652] ixgbe: Receive-Side Scaling (RSS) set to 8
2013-06-14T10:04:25.043310+00:00 helena kernel: [74663.813984] ixgbe :10:00.0: (PCI Express:5.0GT/s:Width x4) 90:e2:ba:2b:40:80
2013-06-14T10:04:25.113547+00:00 helena kernel: [74663.853937] ixgbe :10:00.0 eth6: MAC: 2, PHY: 15, SFP+: 5, PBA No: E68785-006
2013-06-14T10:04:25.113561+00:00 helena kernel: [74663.882910] bonding: bond0: Adding slave eth6.
2013-06-14T10:04:25.159260+00:00 helena kernel: [74663.924060] ixgbe :10:
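If the in-tree driver exposes ethtool channel control (I have not verified that the ixgbe in 3.9 does), the same effect as RSS=8,8 might be achievable without module options; a sketch:

    ethtool -l eth6              # show current queue/channel counts
    ethtool -L eth6 combined 8   # limit to 8 queues, if supported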
Re: Enabling hardlink restrictions to the Linux VFS in 3.6 by default
Hello Kees,

first, many thanks for trying to help!

On Thu, 25 Oct 2012, Kees Cook wrote:

Hi Holger,

On Thu, Oct 25, 2012 at 12:13:40PM +, Holger Kiehl wrote:

as of linux 3.6 hardlink restrictions to the Linux VFS have been enabled by default. This breaks the application AFD [1] of which I am the author.

Sorry this created a problem for you!

Internally it uses hardlinks to distribute files. The reason for hardlinks is that AFD can distribute one file to many destinations and for each distributing process it creates a directory with hardlinks to the original file. That way AFD itself never needs to copy the content of a file. Another nice feature about hardlinks was that there is no need for any logic in the code for AFD to know where the original file was; each distributing process could delete its hardlink and the last one would delete the real file. This way AFD could distribute files at rates of more than 2 files per second (in benchmarks). This has worked from the first linux kernel up to 3.5.7 and with solaris, hpux, aix, ftx, irix. As of 3.6 this does not work for files where AFD does not have write permissions. It was always sufficient to just have read permission on a file it wants to distribute.

Just to clarify, not even read access was needed for hardlinks:

$ whoami
kees
$ ls -l /etc/shadow
-r--r- 1 root shadow 3112 Oct 22 17:02 /etc/shadow
$ ln /etc/shadow /tmp/ohai
$ ls -l /tmp/ohai
-r--r- 2 root shadow 3112 Oct 22 17:02 ohai

Correct, but when AFD wants to distribute the file via for example FTP it must have read access on the file, because it needs to read the file when it wants to send it on a socket.

You mention "the last one would delete the real file". That would have required AFD to have write permission to the directory where the original file existed? Maybe there is something in your architecture that could take advantage of that? Directory group-write set-gid? I haven't taken a look at AFD's code.

Right, it must have write permission on the directory that is monitored by AFD. When it detects a file it moves (rename()) it to an internal directory where AFD works. So this step still works. But from there it creates hardlinks for each distributing job. But this no longer works if AFD does not have write access on the file itself. So even if set-gid is set, this would still not work if the file does not have write permission for the group.

The fix for the "at" daemon [2] mentioned in the commitdiff [3] cannot be used for AFD since it is not run with root privileges. Is there any other way I can "fix" my application? I currently can see no other way than doing it via:

echo 0 > /proc/sys/fs/protected_hardlinks

You said you have read access to these files, so perhaps you can make a copy when you have read but not write, and then all the subsequent duplication would be able to hardlink?

This is exactly what AFD tries to avoid. AFD is used on systems where it distributes Terabytes of data daily and if it would need to copy the file first, imagine the strain it imposes on those servers.

If you wanted to turn off the sysctl, you could have AFD ship files in /etc/sysctl.d/ (or your distro equivalent) to turn it off.

Yes, that could be done. However, I do not want, as the maintainer of one software package, to disable or enable anything in the kernel by default. I do not think the system administrators would like this.

I'm sure there are plenty of options available.

Sorry, I cannot see them. But please, if you or others have more ideas, I am certainly open to change AFD if it can be done efficiently.

Why is such a fundamental change to the linux kernel activated by default?

Based on about two years of testing in Ubuntu, the number of problems was vanishingly small, so the security benefit is seen to outweigh the downside.

Ubuntu is known to be very user friendly and mostly used by users on their laptops/pc's and is not so common in server environments such as Redhat, SLES, etc. So I question the statement "vanishingly small" when you enable it in those environments by default. And I think there is a real benefit in that one can do hardlinks on a file that one does not own, which I think was not seen by those that disable this feature now by default.

Would it not be better if it is the other way around, that the system administrator or distributions enable this?

Virtually all distributions would have turned this on by default, so it seemed better to many people to just make it the default in the kernel. Only unusual corner-cases would need it disabled.

So you too would say not all distributions would enable it by default. Would it then not be better for them to first try this and see if the number of problems is really "vanishingly small". And then if all distributions enable this by default one can do it in the kernel by default as well. Has it not always
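Coming back to the /etc/sysctl.d/ suggestion, the drop-in itself would be tiny; a sketch (the file name is chosen arbitrarily):

    # /etc/sysctl.d/50-afd.conf
    fs.protected_hardlinks = 0

loaded at boot, or applied immediately with sysctl -p /etc/sysctl.d/50-afd.conf.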
Enabling hardlink restrictions to the Linux VFS in 3.6 by default
Hello,

as of linux 3.6 hardlink restrictions to the Linux VFS have been enabled by default. This breaks the application AFD [1] of which I am the author. Internally it uses hardlinks to distribute files. The reason for hardlinks is that AFD can distribute one file to many destinations and for each distributing process it creates a directory with hardlinks to the original file. That way AFD itself never needs to copy the content of a file. Another nice feature about hardlinks was that there is no need for any logic in the code for AFD to know where the original file was; each distributing process could delete its hardlink and the last one would delete the real file. This way AFD could distribute files at rates of more than 2 files per second (in benchmarks). This has worked from the first linux kernel up to 3.5.7 and with solaris, hpux, aix, ftx, irix.

As of 3.6 this does not work for files where AFD does not have write permissions. It was always sufficient to just have read permission on a file it wants to distribute. The fix for the "at" daemon [2] mentioned in the commitdiff [3] cannot be used for AFD since it is not run with root privileges. Is there any other way I can "fix" my application? I currently can see no other way than doing it via:

echo 0 > /proc/sys/fs/protected_hardlinks

Why is such a fundamental change to the linux kernel activated by default? Would it not be better if it is the other way around, that the system administrator or distributions enable this?

Regards,
Holger

PS: Please CC me as I am not on the list.

[1] http://www.dwd.de/AFD
[2] http://anonscm.debian.org/gitweb/?p=collab-maint/at.git;a=commitdiff;h=f4114656c3a6c6f6070e315ffdf940a49eda3279
[3] https://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=800179c9b8a1e796e441674776d11cd4c05d61d7
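To illustrate what changed: with fs.protected_hardlinks=1, hardlinking a file the caller does not own requires read and write access, so AFD's read-only case now fails. A sketch of the failure (file and user names are made up):

    $ ls -l input/msg.txt
    -r--r--r-- 1 produser users 1234 Oct 25 12:00 input/msg.txt
    $ ln input/msg.txt work/job1/msg.txt    # as user 'afd', read access only
    ln: failed to create hard link 'work/job1/msg.txt' => 'input/msg.txt': Operation not permitted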
What happened to TRIM support for raid linear/0/1/10?
Hello,

I have been using the patches posted by Shaohua Li on 16th March 2012:

http://lkml.indiana.edu/hypermail/linux/kernel/1203.2/00048.html

for several months on a very busy file server (serving 9 million files with 5.3 TiB daily) without any problems. Is there any chance that these patches will go into the official kernel? Or what is the reason that these patches are not applied?

I have attached the patch set in one big patch for 3.5. Please do not use it since I am not sure if it is correct. Shaohua, could you please take a look if it is correct and maybe post a new one? Personally, I would think that TRIM support in MD would be a very good thing.

Regards,
Holger

diff -u --recursive --new-file linux-3.5.orig/drivers/md/linear.c linux-3.5/drivers/md/linear.c
--- linux-3.5.orig/drivers/md/linear.c 2012-07-21 20:58:29.0 +
+++ linux-3.5/drivers/md/linear.c 2012-07-27 06:53:39.507121434 +
@@ -138,6 +138,7 @@
 	struct linear_conf *conf;
 	struct md_rdev *rdev;
 	int i, cnt;
+	bool discard_supported = false;

 	conf = kzalloc (sizeof (*conf) + raid_disks*sizeof(struct dev_info), GFP_KERNEL);
@@ -171,6 +172,8 @@
 		conf->array_sectors += rdev->sectors;
 		cnt++;
+
+		if (blk_queue_discard(bdev_get_queue(rdev->bdev)))
+			discard_supported = true;
 	}
 	if (cnt != raid_disks) {
 		printk(KERN_ERR "md/linear:%s: not enough drives present. Aborting!\n",
@@ -178,6 +181,11 @@
 		goto out;
 	}

+	if (!discard_supported)
+		queue_flag_clear_unlocked(QUEUE_FLAG_DISCARD, mddev->queue);
+	else
+		queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, mddev->queue);
+
 	/*
 	 * Here we calculate the device offsets.
 	 */
@@ -326,6 +334,14 @@
 	bio->bi_sector = bio->bi_sector - start_sector + tmp_dev->rdev->data_offset;
 	rcu_read_unlock();
+
+	if (unlikely((bio->bi_rw & REQ_DISCARD) &&
+		     !blk_queue_discard(bdev_get_queue(bio->bi_bdev)))) {
+		/* Just ignore it */
+		bio_endio(bio, 0);
+		return;
+	}
+
 	generic_make_request(bio);
 }
diff -u --recursive --new-file linux-3.5.orig/drivers/md/raid0.c linux-3.5/drivers/md/raid0.c
--- linux-3.5.orig/drivers/md/raid0.c 2012-07-21 20:58:29.0 +
+++ linux-3.5/drivers/md/raid0.c 2012-07-27 06:53:39.507121434 +
@@ -88,6 +88,7 @@
 	char b[BDEVNAME_SIZE];
 	char b2[BDEVNAME_SIZE];
 	struct r0conf *conf = kzalloc(sizeof(*conf), GFP_KERNEL);
+	bool discard_supported = false;

 	if (!conf)
 		return -ENOMEM;
@@ -195,6 +196,9 @@
 		if (!smallest || (rdev1->sectors < smallest->sectors))
 			smallest = rdev1;
 		cnt++;
+
+		if (blk_queue_discard(bdev_get_queue(rdev1->bdev)))
+			discard_supported = true;
 	}
 	if (cnt != mddev->raid_disks) {
 		printk(KERN_ERR "md/raid0:%s: too few disks (%d of %d) - "
@@ -272,6 +276,11 @@
 	blk_queue_io_opt(mddev->queue, (mddev->chunk_sectors << 9) * mddev->raid_disks);

+	if (!discard_supported)
+		queue_flag_clear_unlocked(QUEUE_FLAG_DISCARD, mddev->queue);
+	else
+		queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, mddev->queue);
+
 	pr_debug("md/raid0:%s: done.\n", mdname(mddev));
 	*private_conf = conf;
@@ -422,6 +431,7 @@
 	if (md_check_no_bitmap(mddev))
 		return -EINVAL;
 	blk_queue_max_hw_sectors(mddev->queue, mddev->chunk_sectors);
+	blk_queue_max_discard_sectors(mddev->queue, mddev->chunk_sectors);

 	/* if private is not null, we are here after takeover */
 	if (mddev->private == NULL) {
@@ -509,7 +519,7 @@
 	sector_t sector = bio->bi_sector;
 	struct bio_pair *bp;
 	/* Sanity check -- queue functions should prevent this happening */
-	if (bio->bi_vcnt != 1 ||
+	if ((bio->bi_vcnt != 1 && bio->bi_vcnt != 0) ||
 	    bio->bi_idx != 0)
 		goto bad_map;
 	/* This is a one page bio that upper layers
@@ -535,6 +545,13 @@
 	bio->bi_sector = sector_offset + zone->dev_start + tmp_dev->data_offset;
+
+	if (unlikely((bio->bi_rw & REQ_DISCARD) &&
+		     !blk_queue_discard(bdev_get_queue(bio->bi_bdev)))) {
+		/* Just ignore it */
+		bio_endio(bio, 0);
+		return;
+	}
+
 	generic_make_request(bio);
 	return;
diff -u --recursive --new-file linux-3.5.orig/drivers/md/raid10.c linux-3.5/drivers/md/raid10.c
--- linux-3.5.orig/drivers/md/raid10.c 2012-07-21 20:58:29.0 +
+++ linux-3.5/drivers/md/raid10.c 2012-07-27
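To check whether a given md array actually advertises discard after applying something like this, the queue limits can be inspected directly; a sketch:

    cat /sys/block/md0/queue/discard_max_bytes    # 0 means no discard support
    lsblk -D /dev/md0                             # needs a recent util-linux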
Re: Where is the performance bottleneck?
On Wed, 31 Aug 2005, Holger Kiehl wrote: On Thu, 1 Sep 2005, Nick Piggin wrote: Holger Kiehl wrote: meminfo.dump: MemTotal: 8124172 kB MemFree: 23564 kB Buffers: 7825944 kB Cached: 19216 kB SwapCached: 0 kB Active: 25708 kB Inactive: 7835548 kB HighTotal: 0 kB HighFree:0 kB LowTotal: 8124172 kB LowFree: 23564 kB SwapTotal:15631160 kB SwapFree: 15631160 kB Dirty: 3145604 kB Hmm OK, dirty memory is pinned pretty much exactly on dirty_ratio so maybe I've just led you on a goose chase. You could echo 5 > /proc/sys/vm/dirty_background_ratio echo 10 > /proc/sys/vm/dirty_ratio To further reduce dirty memory in the system, however this is a long shot, so please continue your interaction with the other people in the thread first. Yes, this does make a difference, here the results of running dd if=/dev/full of=/dev/sd?1 bs=4M count=4883 on 8 disks at the same time: 34.273340 33.938829 33.598469 32.970575 32.841351 32.723988 31.559880 29.778112 That's 32.710568 MB/s on average per disk with your change and without it it was 24.958557 MB/s on average per disk. I will do more tests tomorrow. Just rechecked those numbers. Did a fresh boot and run the test several times. With defaults (dirty_background_ratio=10, dirty_ratio=40) I get for the dd write tests an average of 24.559491 MB/s (8 disks in parallel) per disk. With the suggested values (dirty_background_ratio=5, dirty_ratio=10) 32.390659 MB/s per disk. I then did a SW raid0 over all disks with the following command: mdadm -C /dev/md3 -l0 -n8 /dev/sd[cdefghij]1 (dirty_background_ratio=10, dirty_ratio=40) 223.955995 MB/s (dirty_background_ratio=5, dirty_ratio=10) 234.318936 MB/s So the differnece is not so big anymore. Something else I notice while doing the dd over 8 disks is the following (top just before they are finished): top - 08:39:11 up 2:03, 2 users, load average: 23.01, 21.48, 15.64 Tasks: 102 total, 2 running, 100 sleeping, 0 stopped, 0 zombie Cpu(s): 0.0% us, 17.7% sy, 0.0% ni, 0.0% id, 78.9% wa, 0.2% hi, 3.1% si Mem: 8124184k total, 8093068k used,31116k free, 7831348k buffers Swap: 15631160k total,13352k used, 15617808k free, 5524k cached PID USER PR NI VIRT RES SHR S %CPU %MEMTIME+ COMMAND 3423 root 18 0 55204 460 392 R 12.0 0.0 1:15.55 dd 3421 root 18 0 55204 464 392 D 11.3 0.0 1:17.36 dd 3418 root 18 0 55204 464 392 D 10.3 0.0 1:10.92 dd 3416 root 18 0 55200 464 392 D 10.0 0.0 1:09.20 dd 3420 root 18 0 55204 464 392 D 10.0 0.0 1:10.49 dd 3422 root 18 0 55200 460 392 D 9.3 0.0 1:13.58 dd 3417 root 18 0 55204 460 392 D 7.6 0.0 1:13.11 dd 158 root 15 0 000 D 1.3 0.0 1:12.61 kswapd3 159 root 15 0 000 D 1.3 0.0 1:08.75 kswapd2 160 root 15 0 000 D 1.0 0.0 1:07.11 kswapd1 3419 root 18 0 51096 552 476 D 1.0 0.0 1:17.15 dd 161 root 15 0 000 D 0.7 0.0 0:54.46 kswapd0 1 root 16 0 4876 372 332 S 0.0 0.0 0:01.15 init 2 root RT 0 000 S 0.0 0.0 0:00.00 migration/0 3 root 34 19 000 S 0.0 0.0 0:00.00 ksoftirqd/0 4 root RT 0 000 S 0.0 0.0 0:00.00 migration/1 5 root 34 19 000 S 0.0 0.0 0:00.00 ksoftirqd/1 6 root RT 0 000 S 0.0 0.0 0:00.00 migration/2 7 root 34 19 000 S 0.0 0.0 0:00.00 ksoftirqd/2 8 root RT 0 000 S 0.0 0.0 0:00.00 migration/3 9 root 34 19 000 S 0.0 0.0 0:00.00 ksoftirqd/3 A loadaverage of 23 for 8 dd's seems a bit high. Also why is kswapd working so hard? Is that correct. Please just tell me if there is anything else I can test or dumps that could be useful. 
Thanks, Holger
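Nick's observation above, that dirty memory sits pinned at dirty_ratio, can be checked by hand: Dirty at 3145604 kB out of a MemTotal of 8124172 kB is about 38.7% of RAM, right at the default dirty_ratio of 40. A small stand-alone checker in that spirit (an illustrative sketch, not a program from this thread) reads /proc/meminfo and prints the percentage:

/* dirtyratio.c - print Dirty as a percentage of MemTotal (illustrative sketch) */
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

int main(void)
{
	FILE *fp = fopen("/proc/meminfo", "r");
	char line[256];
	long memtotal = -1, dirty = -1;

	if (fp == NULL) {
		perror("fopen /proc/meminfo");
		return 1;
	}
	while (fgets(line, sizeof(line), fp) != NULL) {
		if (strncmp(line, "MemTotal:", 9) == 0)
			memtotal = strtol(line + 9, NULL, 10);
		else if (strncmp(line, "Dirty:", 6) == 0)
			dirty = strtol(line + 6, NULL, 10);
	}
	fclose(fp);
	if (memtotal <= 0 || dirty < 0) {
		fprintf(stderr, "could not parse /proc/meminfo\n");
		return 1;
	}
	printf("Dirty: %ld kB of %ld kB total = %.1f%%\n",
	       dirty, memtotal, 100.0 * dirty / memtotal);
	return 0;
}

Watching this value while the dd's run shows directly whether lowering dirty_background_ratio/dirty_ratio actually moves the pin point.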
Re: Where is the performance bottleneck?
On Thu, 1 Sep 2005, Nick Piggin wrote:

Holger Kiehl wrote:

meminfo.dump:
MemTotal:      8124172 kB
MemFree:         23564 kB
Buffers:       7825944 kB
Cached:          19216 kB
SwapCached:          0 kB
Active:          25708 kB
Inactive:      7835548 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:      8124172 kB
LowFree:         23564 kB
SwapTotal:    15631160 kB
SwapFree:     15631160 kB
Dirty:         3145604 kB

Hmm OK, dirty memory is pinned pretty much exactly on dirty_ratio, so maybe I've just led you on a goose chase. You could

echo 5 > /proc/sys/vm/dirty_background_ratio
echo 10 > /proc/sys/vm/dirty_ratio

to further reduce dirty memory in the system. However, this is a long shot, so please continue your interaction with the other people in the thread first.

Yes, this does make a difference. Here are the results of running dd if=/dev/full of=/dev/sd?1 bs=4M count=4883 on 8 disks at the same time:

34.273340 33.938829 33.598469 32.970575 32.841351 32.723988 31.559880 29.778112

That's 32.710568 MB/s on average per disk with your change; without it, it was 24.958557 MB/s on average per disk. I will do more tests tomorrow.

Thanks, Holger
Re: Where is the performance bottleneck?
On Wed, 31 Aug 2005, Dr. David Alan Gilbert wrote:

* Holger Kiehl ([EMAIL PROTECTED]) wrote:

On Wed, 31 Aug 2005, Jens Axboe wrote:

Full vmstat session can be found under:

Have you got iostat? iostat -x 10 might be interesting to see for a period while it is going.

The following is the result from all 8 disks at the same time with the command dd if=/dev/sd?1 of=/dev/null bs=256k count=78125. There is however one difference: here I had set /sys/block/sd?/queue/nr_requests to 4096.

avg-cpu: %user %nice  %sys %iowait %idle
          0.10  0.00 21.85   58.55 19.50

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.00 0.00 0.30 0.00 2.40 0.00 1.20 8.00 0.00 1.00 1.00 0.03
sdb 0.70 0.00 0.10 0.30 6.40 2.40 3.20 1.20 22.00 0.00 4.25 4.25 0.17
sdc 8276.90 0.00 267.10 0.00 68352.00 0.00 34176.00 0.00 255.90 1.95 7.29 3.74 100.02
sdd 9098.50 0.00 293.50 0.00 75136.00 0.00 37568.00 0.00 256.00 1.93 6.59 3.41 100.03
sde 10428.40 0.00 336.40 0.00 86118.40 0.00 43059.20 0.00 256.00 1.92 5.71 2.97 100.02
sdf 11314.90 0.00 365.10 0.00 93440.00 0.00 46720.00 0.00 255.93 1.92 5.26 2.74 99.98
sdg 7973.20 0.00 257.20 0.00 65843.20 0.00 32921.60 0.00 256.00 1.94 7.53 3.89 100.01
sdh 9436.30 0.00 304.70 0.00 77928.00 0.00 38964.00 0.00 255.75 1.93 6.35 3.28 100.01
sdi 10604.80 0.00 342.40 0.00 87577.60 0.00 43788.80 0.00 255.78 1.92 5.62 2.92 100.02
sdj 10914.30 0.00 352.20 0.00 90132.80 0.00 45066.40 0.00 255.91 1.91 5.43 2.84 100.00
md0 0.00 0.00 0.00 0.10 0.00 0.80 0.00 0.40 8.00 0.00 0.00 0.00 0.00
md2 0.00 0.00 0.80 0.00 6.40 0.00 3.20 0.00 8.00 0.00 0.00 0.00 0.00
md1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

avg-cpu: %user %nice  %sys %iowait %idle
          0.07  0.00 24.49   66.81  8.62

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.40 0.00 1.00 0.00 11.20 0.00 5.60 11.20 0.00 1.30 0.50 0.05
sdb 0.00 0.40 0.00 1.00 0.00 11.20 0.00 5.60 11.20 0.00 1.50 0.70 0.07
sdc 8161.90 0.00 263.70 0.00 67404.80 0.00 33702.40 0.00 255.61 1.95 7.38 3.79 100.02
sdd 9157.30 0.00 295.50 0.00 75622.40 0.00 37811.20 0.00 255.91 1.93 6.53 3.38 100.00
sde 10505.60 0.00 339.20 0.00 86758.40 0.00 43379.20 0.00 255.77 1.93 5.68 2.95 99.99
sdf 11212.50 0.00 361.90 0.00 92595.20 0.00 46297.60 0.00 255.86 1.91 5.28 2.76 100.00
sdg 7988.40 0.00 258.00 0.00 65971.20 0.00 32985.60 0.00 255.70 1.93 7.49 3.88 99.98
sdh 9436.20 0.00 304.40 0.00 77924.80 0.00 38962.40 0.00 255.99 1.92 6.32 3.28 99.99
sdi 10406.10 0.00 336.30 0.00 85939.20 0.00 42969.60 0.00 255.54 1.92 5.70 2.97 100.00
sdj 11027.00 0.00 356.00 0.00 91064.00 0.00 45532.00 0.00 255.80 1.92 5.40 2.81 99.96
md0 0.00 0.00 0.00 1.00 0.00 8.00 0.00 4.00 8.00 0.00 0.00 0.00 0.00
md2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

avg-cpu: %user %nice  %sys %iowait %idle
          0.08  0.00 22.23   60.44 17.25

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.00 0.00 0.30 0.00 2.40 0.00 1.20 8.00 0.00 1.00 1.00 0.03
sdb 0.00 0.00 0.00 0.30 0.00 2.40 0.00 1.20 8.00 0.00 0.67 0.67 0.02
sdc 8204.50 0.00 264.76 0.00 67754.15 0.00 33877.08 0.00 255.90 1.95 7.38 3.78 100.12
sdd 9166.47 0.00 295.90 0.00 75698.10 0.00 37849.05 0.00 255.83 1.94 6.55 3.38 100.12
sde 10534.93 0.00 339.94 0.00 86999.00 0.00 43499.50 0.00 255.92 1.93 5.67 2.95 100.12
sdf 11282.68 0.00 364.16 0.00 93174.77 0.00 46587.39 0.00 255.86 1.92 5.28 2.75 100.10
sdg 8114.61 0.00 261.76 0.00 67011.01 0.00 33505.51 0.00 256.00 1.95 7.44 3.82 100.11
sdh 9380.68 0.00 302.60 0.00 77466.27 0.00 38733.13 0.00 256.00 1.93 6.38
Re: Where is the performance bottleneck?
On Wed, 31 Aug 2005, Jens Axboe wrote:

On Wed, Aug 31 2005, Holger Kiehl wrote:

# ./oread /dev/sdX

and it will read 128k chunks direct from that device. Run on the same drives as above, reply with the vmstat info again.

Using kernel 2.6.12.5 again, here are the results: [snip]

Ok, reads as expected, like the buffered io but using less system time. And you are still 1/3 off the target data rate, hmmm... With the reads, how does the aggregate bandwidth look when you add 'clients'? Same as with writes, gradually decreasing per-device throughput?

I performed the following tests with this command: dd if=/dev/sd?1 of=/dev/null bs=256k count=78125

Single disk tests:
/dev/sdc1 74.954715 MB/s
/dev/sdg1 74.973417 MB/s

Following disks in parallel:

2 disks on same channel
/dev/sdc1 75.034191 MB/s
/dev/sdd1 74.984643 MB/s

3 disks on same channel
/dev/sdc1 75.027850 MB/s
/dev/sdd1 74.976583 MB/s
/dev/sde1 75.278276 MB/s

4 disks on same channel
/dev/sdc1 58.343166 MB/s
/dev/sdd1 62.993059 MB/s
/dev/sde1 66.940569 MB/s
/dev/sdf1 70.986072 MB/s

2 disks on different channels
/dev/sdc1 74.954715 MB/s
/dev/sdg1 74.973417 MB/s

4 disks on different channels
/dev/sdc1 74.959030 MB/s
/dev/sdd1 74.877703 MB/s
/dev/sdg1 75.009697 MB/s
/dev/sdh1 75.028138 MB/s

6 disks on different channels
/dev/sdc1 49.640743 MB/s
/dev/sdd1 55.935419 MB/s
/dev/sde1 58.795241 MB/s
/dev/sdg1 50.280864 MB/s
/dev/sdh1 54.210705 MB/s
/dev/sdi1 59.413176 MB/s

So this looks different from writing: only as of four disks does the performance begin to drop. I just noticed: did you want me to do these tests with the oread program?

Thanks, Holger
Re: Where is the performance bottleneck?
On Wed, 31 Aug 2005, Jens Axboe wrote:

On Wed, Aug 31 2005, Holger Kiehl wrote:

On Wed, 31 Aug 2005, Jens Axboe wrote:

Nothing sticks out here either. There's plenty of idle time. It smells like a driver issue. Can you try the same dd test, but read from the drives instead? Use a bigger blocksize here, 128 or 256k.

I used the following command reading from all 8 disks in parallel: dd if=/dev/sd?1 of=/dev/null bs=256k count=78125

Here vmstat output (I just cut something out in the middle):

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b swpd  free    buff   cache si so  bi     bo  in   cs   us sy id wa
 3  7 4348 42640 7799984  9612 0 0 322816 0 3532 4987 0 22  0 78
 1  7 4348 42136 7800624  9584 0 0 322176 0 3526 4987 0 23  4 74
 0  8 4348 39912 7802648  9668 0 0 322176 0 3525 4955 0 22 12 66
 1  7 4348 38912 7803700  9636 0 0 322432 0 3526 5078 0 23

Ok, so that's somewhat better than the writes but still off from what the individual drives can do in total. You might want to try the same with direct io, just to eliminate the costly user copy. I don't expect it to make much of a difference though, feels like the problem is elsewhere (driver, most likely).

Sorry, I don't know how to do this. Do you mean using a C program that sets some flag to do direct io, or how can I do that?

I've attached a little sample for you, just run ala

# ./oread /dev/sdX

and it will read 128k chunks direct from that device. Run on the same drives as above, reply with the vmstat info again.

Using kernel 2.6.12.5 again, here are the results:

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b swpd    free  buff  cache si so  bi     bo  in   cs  us sy  id wa
 0  0    0 8009648  4764 40592 0 0      0  0 1011   32 0 0 100  0
 0  0    0 8009648  4764 40592 0 0      0  0 1011   34 0 0 100  0
 0  0    0 8009648  4764 40592 0 0      0  0 1008   61 0 0 100  0
 0  0    0 8009648  4764 40592 0 0      0  0 1006   26 0 0 100  0
 0  8    0 8006372  4764 40592 0 0 120192  0 1944 1929 0 1  89 10
 2  8    0 8006372  4764 40592 0 0 319488  0 3502 4999 0 2  75 24
 0  8    0 8006372  4764 40592 0 0 319488  0 3506 4995 0 2  75 24
 0  8    0 8006372  4764 40592 0 0 319744  0 3504 4999 0 1  75 24
 0  8    0 8006372  4764 40592 0 0 319488  0 3507 5009 0 2  75 23
 0  8    0 8006372  4764 40592 0 0 319616  0 3506 5011 0 2  75 24
 0  8    0 8005124  4800 41100 0 0 319976  0 3536 4995 0 2  73 25
 0  8    0 8005124  4800 41100 0 0 323584  0 3534 5000 0 2  75 23
 0  8    0 8005124  4800 41100 0 0 323968  0 3540 5035 0 1  75 24
 0  8    0 8005124  4800 41100 0 0 319232  0 3506 4811 0 1  75 24
 0  8    0 8005504  4800 41100 0 0 317952  0 3498 4747 0 1  75 24
 0  8    0 8005504  4800 41100 0 0 318720  0 3495 4672 0 2  75 23
 1  8    0 8005504  4800 41100 0 0 318720  0 3509 4707 0 1  75 24
 0  8    0 8005504  4800 41100 0 0 318720  0 3499 4667 0 2  75 23
 0  8    0 8005504  4808 41092 0 0 318848 40 3509 4674 0 1  75 24
 0  8    0 8005380  4808 41092 0 0 318848  0 3497 4693 0 2  72 26
 0  8    0 8005380  4808 41092 0 0 318592  0 3500 4646 0 2  75 23
 0  8    0 8005380  4808 41092 0 0 318592  0 3495 4828 0 2  61 37
 0  8    0 8005380  4808 41092 0 0 318848  0 3499 4827 0 1  62 37
 1  8    0 8005380  4808 41092 0 0 318464  0 3495 4642 0 2  75 23
 0  8    0 8005380  4816 41084 0 0 318848 32 3511 4672 0 1  75 24
 0  8    0 8005380  4816 41084 0 0 320640  0 3512 4877 0 2  75 23
 0  8    0 8005380  4816 41084 0 0 322944  0 3533 5047 0 2  75 24
 0  8    0 8005380  4816 41084 0 0 322816  0 3531 5053 0 1  75 24
 0  8    0 8005380  4816 41084 0 0 322944  0 3531 5048 0 2  75 23
 0  8    0 8005380  4816 41084 0 0 322944  0 3529 5043 0 1  75 24
 0  0    0 8008360  4816 41084 0 0 266880  0 3112 4224 0 2  78 20
 0  0    0 8008360  4816 41084 0 0      0  0 1012   28 0 0 100  0

Holger
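Jens's oread attachment itself is not preserved in this archive. For reference, a minimal O_DIRECT reader in the same spirit would look roughly like this (an illustrative sketch, not the original program; the 128 KiB chunk size matches what Jens describes, and O_DIRECT requires an aligned buffer, hence posix_memalign()):

/* oread-like sketch: read 128 KiB chunks from a device with O_DIRECT */
#define _GNU_SOURCE		/* for O_DIRECT */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

#define CHUNK (128 * 1024)

int main(int argc, char *argv[])
{
	int fd;
	void *buf;
	ssize_t n;
	long long total = 0;

	if (argc != 2) {
		fprintf(stderr, "Usage: %s <device>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY | O_DIRECT);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* O_DIRECT needs an aligned buffer; 4096 covers common sector sizes */
	if (posix_memalign(&buf, 4096, CHUNK) != 0) {
		perror("posix_memalign");
		return 1;
	}
	while ((n = read(fd, buf, CHUNK)) > 0)
		total += n;
	if (n < 0)
		perror("read");
	printf("read %lld bytes\n", total);
	close(fd);
	free(buf);
	return 0;
}

Because the reads bypass the page cache, the copy_user overhead disappears from the profile, which is exactly what the second vmstat above shows (system time drops while bi stays around 320 MB/s aggregate).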
Re: Where is the performance bottleneck?
On Wed, 31 Aug 2005, Nick Piggin wrote:

Holger Kiehl wrote:

3236497 total                      1.4547
2507913 default_idle           52248.1875
 158752 shrink_zone               43.3275
 121584 copy_user_generic_c     3199.5789
  34271 __wake_up_bit            713.9792
  31131 __make_request            23.1629
  22096 scsi_request_fn           18.4133
  21915 rotate_reclaimable_page   80.5699
        ^
I don't think this function should be here. This indicates that lots of writeout is happening due to pages falling off the end of the LRU. There was a bug recently causing memory estimates to be wrong on Opterons that could cause this, I think. Can you send in 2 dumps of /proc/vmstat taken 10 seconds apart while you're writing at full speed (with 2.6.13 or the latest -git tree)?

I took 2.6.13; there were no git snapshots at www.kernel.org when I looked. With 2.6.13 I must load the Fusion MPT driver as a module. Compiled in, it does not detect the drives correctly; as a module there is no problem. Here is what I did:

#!/bin/bash
time dd if=/dev/full of=/dev/sdc1 bs=4M count=4883 &
time dd if=/dev/full of=/dev/sdd1 bs=4M count=4883 &
time dd if=/dev/full of=/dev/sde1 bs=4M count=4883 &
time dd if=/dev/full of=/dev/sdf1 bs=4M count=4883 &
time dd if=/dev/full of=/dev/sdg1 bs=4M count=4883 &
time dd if=/dev/full of=/dev/sdh1 bs=4M count=4883 &
time dd if=/dev/full of=/dev/sdi1 bs=4M count=4883 &
time dd if=/dev/full of=/dev/sdj1 bs=4M count=4883 &
sleep 20
cat /proc/vmstat > /root/vmstat-1.dump
sleep 10
cat /proc/vmstat > /root/vmstat-2.dump
cat /proc/zoneinfo > /root/zoneinfo.dump
cat /proc/meminfo > /root/meminfo.dump
exit 0

vmstat-1.dump:
nr_dirty 787282 nr_writeback 44317 nr_unstable 0 nr_page_table_pages 633 nr_mapped 6373 nr_slab 53030 pgpgin 263362 pgpgout 5260352 pswpin 0 pswpout 0 pgalloc_high 0 pgalloc_normal 2448628 pgalloc_dma 1041 pgfree 2457343 pgactivate 5775 pgdeactivate 2113 pgfault 465679 pgmajfault 321 pgrefill_high 0 pgrefill_normal 5940 pgrefill_dma 33 pgsteal_high 0 pgsteal_normal 148759 pgsteal_dma 0 pgscan_kswapd_high 0 pgscan_kswapd_normal 153813 pgscan_kswapd_dma 1089 pgscan_direct_high 0 pgscan_direct_normal 0 pgscan_direct_dma 0 pginodesteal 0 slabs_scanned 0 kswapd_steal 148759 kswapd_inodesteal 0 pageoutrun 5304 allocstall 0 pgrotated 0 nr_bounce 0

vmstat-2.dump:
nr_dirty 786397 nr_writeback 44233 nr_unstable 0 nr_page_table_pages 640 nr_mapped 6406 nr_slab 53027 pgpgin 263382 pgpgout 7835732 pswpin 0 pswpout 0 pgalloc_high 0 pgalloc_normal 3091687 pgalloc_dma 2420 pgfree 3101327 pgactivate 5817 pgdeactivate 2918 pgfault 466269 pgmajfault 322 pgrefill_high 0 pgrefill_normal 28265 pgrefill_dma 150 pgsteal_high 0 pgsteal_normal 789909 pgsteal_dma 1388 pgscan_kswapd_high 0 pgscan_kswapd_normal 904101 pgscan_kswapd_dma 4950 pgscan_direct_high 0 pgscan_direct_normal 0 pgscan_direct_dma 0 pginodesteal 0 slabs_scanned 1152 kswapd_steal 791297 kswapd_inodesteal 0 pageoutrun 28299 allocstall 0 pgrotated 562 nr_bounce 0

zoneinfo.dump:
Node 3, zone Normal
  pages free 899, min 726, low 907, high 1089, active 3996, inactive 490989, scanned 0 (a: 16 i: 0), spanned 524287, present 524287
  protection: (0, 0, 0)
  pagesets
    cpu: 0 pcp: 0  count: 2  low: 62  high: 186  batch: 31
    cpu: 0 pcp: 1  count: 0  low: 0   high: 62   batch: 31
    numa_hit: 10186 numa_miss: 3313 numa_foreign: 0 interleave_hit: 10136 local_node: 0 other_node: 13499
    cpu: 1 pcp: 0  count: 13  low: 62  high: 186  batch: 31
    cpu: 1 pcp: 1  count: 0   low: 0   high: 62   batch: 31
    numa_hit: 6559 numa_miss: 1668 numa_foreign: 0 interleave_hit: 6559 local_node: 0 other_node: 8227
    cpu: 2 pcp: 0  count: 84  low: 62  high: 186  batch: 31
    cpu: 2 pcp: 1  count: 0   low: 0
high: 62
Re: Where is the performance bottleneck?
On Wed, 31 Aug 2005, Jens Axboe wrote:

Nothing sticks out here either. There's plenty of idle time. It smells like a driver issue. Can you try the same dd test, but read from the drives instead? Use a bigger blocksize here, 128 or 256k.

I used the following command reading from all 8 disks in parallel: dd if=/dev/sd?1 of=/dev/null bs=256k count=78125

Here vmstat output (I just cut something out in the middle):

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b swpd  free    buff  cache si so  bi     bo  in   cs   us sy id wa
 3  7 4348 42640 7799984  9612 0 0 322816  0 3532 4987 0 22  0 78
 1  7 4348 42136 7800624  9584 0 0 322176  0 3526 4987 0 23  4 74
 0  8 4348 39912 7802648  9668 0 0 322176  0 3525 4955 0 22 12 66
 1  7 4348 38912 7803700  9636 0 0 322432  0 3526 5078 0 23  7 70
 2  6 4348 37552 7805120  9644 0 0 322432  0 3527 4908 0 23 12 64
 0  8 4348 41152 7801552  9608 0 0 322176  0 3524 5018 0 24  6 70
 1  7 4348 41644 7801044  9572 0 0 322560  0 3530 5175 0 23  0 76
 1  7 4348 37184 7805396  9640 0 0 322176  0 3525 4914 0 24 18 59
 3  7 4348 41704 7800376  9832 0 0 322176 20 3531 5080 0 23  4 73
 1  7 4348 40652 7801700  9732 0 0 323072  0 3533 5115 0 24 13 64
 1  7 4348 40284 7802224  9616 0 0 322560  0 3527 4967 0 23  1 76
 0  8 4348 40156 7802356  9688 0 0 322560  0 3528 5080 0 23  2 75
 6  8 4348 41896 7799984  9816 0 0 322176  0 3530 4945 0 24 20 57
 0  8 4348 39540 7803124  9600 0 0 322560  0 3529 4811 0 24 21 55
 1  7 4348 41520 7801084  9600 0 0 322560  0 3532 4843 0 23 22 55
 0  8 4348 40408 7802116  9588 0 0 322560  0 3527 5010 0 23  4 72
 0  8 4348 38172 7804300  9580 0 0 322176  0 3526 4992 0 24  7 69
 4  7 4348 42264 7799784  9812 0 0 322688  0 3529 5003 0 24  8 68
 1  7 4348 39908 7802520  9660 0 0 322700  0 3529 4963 0 24 14 62
 0  8 4348 37428 7805076  9620 0 0 322420  0 3528 4967 0 23 15 62
 0  8 4348 37056 7805348  9688 0 0 322048  0 3525 4982 0 24 26 50
 1  7 4348 37804 7804456  9696 0 0 322560  0 3528 5072 0 24 16 60
 0  8 4348 38416 7804084  9660 0 0 323200  0 3533 5081 0 24 23 53
 0  8 4348 40160 7802300  9676 0 0 323200 28 3543 5095 0 24 17 59
 1  7 4348 37928 7804612  9608 0 0 323072  0 3532 5175 0 24  7 68
 2  6 4348 38680 7803724  9612 0 0 322944  0 3531 4906 0 25 24 51
 1  7 4348 40408 7802192  9648 0 0 322048  0 3524 4947 0 24 19 57

Full vmstat session can be found under: ftp://ftp.dwd.de/pub/afd/linux_kernel_debug/vmstat-256k-read

And here the profile data:

2106577 total                       0.9469
1638177 default_idle            34128.6875
 179615 copy_user_generic_c      4726.7105
  27670 end_buffer_async_read     108.0859
  26055 shrink_zone                 7.
  23199 __make_request             17.2612
  17221 kmem_cache_free           153.7589
  11796 drop_buffers               52.6607
  11016 add_to_page_cache          52.9615
   9470 __wake_up_bit             197.2917
   8760 buffered_rmqueue           12.4432
   8646 find_get_page              90.0625
   8319 __do_page_cache_readahead  11.0625
   7976 kmem_cache_alloc          124.6250
   7463 scsi_request_fn             6.2192
   7208 try_to_free_buffers        40.9545
   6716 create_empty_buffers       41.9750
   6432 __end_that_request_first   11.8235
   6044 test_clear_page_dirty      25.1833
   5643 scsi_dispatch_cmd           9.7969
   5588 free_hot_cold_page         19.4028
   5479 submit_bh                  18.0230
   3903 __alloc_pages               3.2965
   3671 file_read_actor             9.9755
   3425 thread_return              14.2708
        generic_make_request        5.6301
   3294 bio_alloc_bioset            7.6250
   2868 bio_put                    44.8125
   2851 mpt_interrupt               2.8284
   2697 mempool_alloc               8.8717
   2642 block_read_full_page        3.9315
   2512 do_generic_mapping_read     2.1216
   2394 set_page_refs             149.6250
   2235 alloc_page_buffers          9.9777
   1992 __pagevec_lru_add           8.3000
   1859 __memset                    9.6823
   1791 page_waitqueue
Re: Where is the performance bottleneck?
On Wed, 31 Aug 2005, Vojtech Pavlik wrote:

On Tue, Aug 30, 2005 at 08:06:21PM +, Holger Kiehl wrote:

How does one determine the PCI-X bus speed?

Usually only the card (in your case the Symbios SCSI controller) can tell. If it does, it'll be most likely in 'dmesg'.

There is nothing in dmesg:

Fusion MPT base driver 3.01.20
Copyright (c) 1999-2004 LSI Logic Corporation
ACPI: PCI Interrupt :02:04.0[A] -> GSI 24 (level, low) -> IRQ 217
mptbase: Initiating ioc0 bringup
ioc0: 53C1030: Capabilities={Initiator,Target}
ACPI: PCI Interrupt :02:04.1[B] -> GSI 25 (level, low) -> IRQ 225
mptbase: Initiating ioc1 bringup
ioc1: 53C1030: Capabilities={Initiator,Target}
Fusion MPT SCSI Host driver 3.01.20

To find where the bottleneck is, I'd suggest trying without the filesystem at all, and just filling a large part of the block device using the 'dd' command. Also, trying without the RAID, and just running 4 (and 8) concurrent dd's to the separate drives could show whether it's the RAID that's slowing things down.

Ok, I did run the following dd command in different combinations: dd if=/dev/zero of=/dev/sd?1 bs=4k count=500

I think a bs of 4k is way too small and will cause huge CPU overhead. Can you try with something like 4M? Also, you can use /dev/full to avoid the pre-zeroing.

Ok, I now use the following command: dd if=/dev/full of=/dev/sd?1 bs=4M count=4883

Here the results for all 8 disks in parallel:

/dev/sdc1 24.957257 MB/s
/dev/sdd1 25.290177 MB/s
/dev/sde1 25.046711 MB/s
/dev/sdf1 26.369777 MB/s
/dev/sdg1 24.080695 MB/s
/dev/sdh1 25.008803 MB/s
/dev/sdi1 24.202202 MB/s
/dev/sdj1 24.712840 MB/s

A little bit faster, but not much.

Holger
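The dd test above is easy to reproduce in a few lines of C when finer control is wanted; a sketch that writes zero-filled 4 MiB blocks to a device and reports the resulting throughput (illustrative only, not a program from this thread; run one instance per disk, as with the dd's, and note that it overwrites the target device):

/* wtest.c - sequential 4 MiB writes to a device, reporting MB/s (sketch) */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <time.h>

#define BS   (4 * 1024 * 1024)	/* 4 MiB, as in the dd test */
#define CNT  4883		/* ~20 GB total, as in the dd test */

int main(int argc, char *argv[])
{
	int fd, i;
	char *buf;
	struct timespec t0, t1;
	double secs;

	if (argc != 2) {
		fprintf(stderr, "Usage: %s <device>   (DESTROYS data!)\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_WRONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	buf = calloc(1, BS);	/* zero-filled, like reading /dev/full */
	if (buf == NULL) {
		perror("calloc");
		return 1;
	}
	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < CNT; i++)
		if (write(fd, buf, BS) != BS) {
			perror("write");
			return 1;
		}
	fsync(fd);		/* include the time to flush dirty pages */
	clock_gettime(CLOCK_MONOTONIC, &t1);
	secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
	printf("%s: %.2f MB/s\n", argv[1], (double)BS * CNT / secs / 1e6);
	close(fd);
	free(buf);
	return 0;
}

The fsync() before the second timestamp matters: without it the measurement mostly reflects how fast dirty pages accumulate up to dirty_ratio, not how fast the disks actually write.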
Re: Where is the performance bottleneck?
On Wed, 31 Aug 2005, Jens Axboe wrote:

On Wed, Aug 31 2005, Vojtech Pavlik wrote:

On Tue, Aug 30, 2005 at 08:06:21PM +, Holger Kiehl wrote:

How does one determine the PCI-X bus speed?

Usually only the card (in your case the Symbios SCSI controller) can tell. If it does, it'll be most likely in 'dmesg'.

There is nothing in dmesg:

Fusion MPT base driver 3.01.20
Copyright (c) 1999-2004 LSI Logic Corporation
ACPI: PCI Interrupt :02:04.0[A] -> GSI 24 (level, low) -> IRQ 217
mptbase: Initiating ioc0 bringup
ioc0: 53C1030: Capabilities={Initiator,Target}
ACPI: PCI Interrupt :02:04.1[B] -> GSI 25 (level, low) -> IRQ 225
mptbase: Initiating ioc1 bringup
ioc1: 53C1030: Capabilities={Initiator,Target}
Fusion MPT SCSI Host driver 3.01.20

To find where the bottleneck is, I'd suggest trying without the filesystem at all, and just filling a large part of the block device using the 'dd' command. Also, trying without the RAID, and just running 4 (and 8) concurrent dd's to the separate drives could show whether it's the RAID that's slowing things down.

Ok, I did run the following dd command in different combinations: dd if=/dev/zero of=/dev/sd?1 bs=4k count=500

I think a bs of 4k is way too small and will cause huge CPU overhead. Can you try with something like 4M? Also, you can use /dev/full to avoid the pre-zeroing.

That was my initial thought as well, but since he's writing, the io side should look correct. I doubt 8 dd's writing 4k chunks will gobble that much CPU as to make this much difference. Holger, we need vmstat 1 info while the dd's are running. A simple profile would be nice as well: boot with profile=2 and do a readprofile -r; run tests; readprofile > foo and send the first 50 lines of foo to this list.

Here vmstat for 8 dd's still with 4k blocksize:

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b swpd  free    buff   cache si so bi   bo     in   cs   us sy id wa
 9  2 5244 38272 7738248 10400 0 0  3  11444 3902    4 0  5 75 20
 5 10 5244 30824 7747680  8684 0 0  0 265672 2582 1917 1 95  0  4
 2 12 5244 30948 7747248  8708 0 0  0 222620 2858  292 0 33  0 67
 4 10 5244 31072 7747516  8644 0 0  0 236400 3132  326 0 43  0 57
 2 12 5244 31320 7747792  8512 0 0  0 250204 3225  285 0 37  0 63
 1 13 5244 30948 7747412  8552 0 0 24 227600 3261  312 0 41  0 59
 2 12 5244 32684 7746124  8616 0 0  0 235392 3219  274 0 32  0 68
 1 13 5244 30948 7747940  8568 0 0  0 228020 3394  296 0 37  0 63
 0 14 5244 31196 7747680  8624 0 0  0 232932 3389  300 0 32  0 68
 3 12 5244 31072 7747904  8536 0 0  0 233096 3545  312 0 33  0 67
 1 13 5244 31072 7747852  8520 0 0  0 226992 3381  290 0 31  0 69
 1 13 5244 31196 7747704  8396 0 0  0 230112 3372  265 0 28  0 72
 0 14 5244 31072 7747928  8512 0 0  0 240652 3491  295 0 33  0 67
 3 13 5244 31072 7748104  8608 0 0  0 222944 3433  269 0 27  0 73
 1 13 5244 31072 7748000  8508 0 0  0 207944 3470  294 0 28  0 72
 0 14 5244 31072 7747980  8528 0 0  0 234608 3496  272 0 31  0 69
 2 12 5244 31196 7748148  8496 0 0  0 228760 3480  280 0 28  0 72
 0 14 5244 30948 7748568  8620 0 0  0 214372 3551  302 0 29  0 71
 1 13 5244 31072 7748392  8524 0 0  0 226732 3494  284 0 29  0 71
 0 14 5244 31072 7748004  8640 0 0  0 229628 3604  273 0 26  0 74
 1 13 5244 30948 7748392  8660 0 0  0 212868 3563  266 0 28  0 72
 1 13 5244 30948 7748600  8520 0 0  0 228244 3568  294 0 30  0 70
 1 13 5244 31196 7748228  8416 0 0  0 221692 3543  258 0 27  0 73
 1 13 5244 31072 7748192  8520 0 0  0 241040 3983  330 0 25  0 74
 1 13 5244 31196 7748288  8560 0 0  0 217108 3676  276 0 28  0 72
.
.
. This goes on up to the end.
.
.
 0  3 5244  825096 6949252 8596 0 0  0 241244 2683  223 0  7 71 22
 0  2 5244  825108 6949252 8596 0 0  0 229764 2683  214 0  7 73 20
 0  3 5244  826348 6949252 8596 0 0  0 116840 2046  450 0  4 71 26
 0  3 5244  826976 6949252 8596 0 0  0 141992 1887   97 0  4 73 23
 0  3 5244  827100 6949252 8596 0 0  0 137716 1871   93 0  4 70 26
 0  3 5244  827100 6949252 8596 0 0  0 137032 1894   96 0  4 75 21
 0  3 5244  827224 6949252 8596 0 0  0 131332 1860  288 0  4 73 23
 0  1 5244 1943732 5833756 8620 0 0  0  72404 1560  481 0 24 61 16
 0  2
Re: Where is the performance bottleneck?
On Mon, 29 Aug 2005, Vojtech Pavlik wrote:

On Mon, Aug 29, 2005 at 06:20:56PM +, Holger Kiehl wrote:

Hello

I have a system with the following setup:

Board is Tyan S4882 with AMD 8131 Chipset
4 Opterons 848 (2.2GHz)
8 GB DDR400 Ram (2GB for each CPU)
1 onboard Symbios Logic 53c1030 dual channel U320 controller
2 SATA disks put together as a SW Raid1 for system, swap and spares
8 SCSI U320 (15000 rpm) disks, where 4 disks (sdc, sdd, sde, sdf) are on one channel and the other four (sdg, sdh, sdi, sdj) on the other channel.

The U320 SCSI controller has a 64 bit PCI-X bus for itself; there is no other device on that bus. Unfortunately I was unable to determine at what speed it is running, here the output from lspci -vv:

How does one determine the PCI-X bus speed?

Usually only the card (in your case the Symbios SCSI controller) can tell. If it does, it'll be most likely in 'dmesg'.

There is nothing in dmesg:

Fusion MPT base driver 3.01.20
Copyright (c) 1999-2004 LSI Logic Corporation
ACPI: PCI Interrupt :02:04.0[A] -> GSI 24 (level, low) -> IRQ 217
mptbase: Initiating ioc0 bringup
ioc0: 53C1030: Capabilities={Initiator,Target}
ACPI: PCI Interrupt :02:04.1[B] -> GSI 25 (level, low) -> IRQ 225
mptbase: Initiating ioc1 bringup
ioc1: 53C1030: Capabilities={Initiator,Target}
Fusion MPT SCSI Host driver 3.01.20

Anyway, I thought with this system I would get theoretically 640 MB/s using both channels.

You can never use the full theoretical bandwidth of the channel for data. A lot of overhead remains for other signalling. Similarly for PCI.

I tested several software raid setups to get the best possible write speeds for this system. But testing shows that the absolute maximum I can reach with software raid is only approx. 270 MB/s for writing. Which is very disappointing.

I'd expect somewhat better (in the 300-400 MB/s range), but this is not too bad. To find where the bottleneck is, I'd suggest trying without the filesystem at all, and just filling a large part of the block device using the 'dd' command. Also, trying without the RAID, and just running 4 (and 8) concurrent dd's to the separate drives could show whether it's the RAID that's slowing things down.
Ok, I did run the following dd command in different combinations: dd if=/dev/zero of=/dev/sd?1 bs=4k count=500

Here the results:

Each disk alone
/dev/sdc1 59.094636 MB/s
/dev/sdd1 58.686592 MB/s
/dev/sde1 55.282807 MB/s
/dev/sdf1 62.271240 MB/s
/dev/sdg1 60.872891 MB/s
/dev/sdh1 62.252781 MB/s
/dev/sdi1 59.145637 MB/s
/dev/sdj1 60.921119 MB/s

sdc + sdd in parallel (2 disks on same channel)
/dev/sdc1 42.512287 MB/s
/dev/sdd1 43.118483 MB/s

sdc + sdg in parallel (2 disks on different channels)
/dev/sdc1 42.938186 MB/s
/dev/sdg1 43.934779 MB/s

sdc + sdd + sde in parallel (3 disks on same channel)
/dev/sdc1 35.043501 MB/s
/dev/sdd1 35.686878 MB/s
/dev/sde1 34.580457 MB/s

Similar results for three disks (sdg + sdh + sdi) on the other channel
/dev/sdg1 36.381137 MB/s
/dev/sdh1 37.541758 MB/s
/dev/sdi1 35.834920 MB/s

sdc + sdd + sde + sdf in parallel (4 disks on same channel)
/dev/sdc1 31.432914 MB/s
/dev/sdd1 32.058752 MB/s
/dev/sde1 31.393455 MB/s
/dev/sdf1 33.208165 MB/s

And here for the four disks on the other channel
/dev/sdg1 31.873028 MB/s
/dev/sdh1 33.277193 MB/s
/dev/sdi1 31.91 MB/s
/dev/sdj1 32.626744 MB/s

All 8 disks in parallel
/dev/sdc1 24.120545 MB/s
/dev/sdd1 24.419801 MB/s
/dev/sde1 24.296588 MB/s
/dev/sdf1 25.609548 MB/s
/dev/sdg1 24.572617 MB/s
/dev/sdh1 25.552590 MB/s
/dev/sdi1 24.575616 MB/s
/dev/sdj1 25.124165 MB/s

So from these results I may assume that md is not the cause of the problem. What comes as a big surprise is that I lose 25% performance with only two disks, each hanging on its own channel! With two disks on different channels there is no shared SCSI bus at all, yet the per-disk rate still drops from about 60 MB/s to about 43 MB/s, which points at something common to both channels rather than at the disks or md. Is this normal? I wonder if other people have the same problem with other controllers or the same one. What can I do next to find out if this is a kernel, driver or hardware problem?

Thanks, Holger
Re: Where is the performance bottleneck?
On Mon, 29 Aug 2005, Al Boldi wrote:

Holger Kiehl wrote:

Why do I only get 247 MB/s for writing and 227 MB/s for reading (from the bonnie++ results) for a Raid0 over 8 disks? I was expecting to get nearly three times those numbers if you take the numbers from the individual disks. What limit am I hitting here?

You may be hitting a 2.6 kernel bug, which has something to do with readahead; ask Jens Axboe about it! (see "[git patches] IDE update" thread)

Sadly, 2.6.13 did not fix it either. I did read that thread, but due to my limited understanding of kernel code I don't see the relation to my problem. But I am willing to try any patches to solve the problem.

Did you try 2.4.31?

No. Will give this a try if the problem is not found.

Thanks, Holger
Re: Where is the performance bottleneck?
On Mon, 29 Aug 2005, Mark Hahn wrote:

The U320 SCSI controller has a 64 bit PCI-X bus for itself, there is no other device on that bus. Unfortunately I was unable to determine at what speed it is running, here the output from lspci -vv:
...
Status: Bus=2 Dev=4 Func=0 64bit+ 133MHz+ SCD- USC-, DC=simple,

the "133MHz+" is a good sign. OTOH the latency (72) seems rather low - my understanding is that that would noticeably limit the size of burst transfers.

I have tried with 128 and 144, but the transfer rate is only a little bit higher, barely measurable. Or what values should I try?

Version 1.03        --Sequential Output--        --Sequential Input-  --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size  K/sec %CP  K/sec %CP  K/sec %CP K/sec %CP  K/sec %CP  /sec %CP
Raid0 (8 disk) 15744M 54406  96 247419  90 100752  25 60266  98 226651  29 830.2  1
Raid0s(4 disk) 15744M 54915  97 253642  89  73976  18 59445  97 198372  24 659.8  1
Raid0s(4 disk) 15744M 54866  97 268361  95  72852  17 59165  97 187183  22 666.3  1

you're obviously saturating something already with 2 disks. did you play with "blockdev --setra" settings?

Yes, I did play a little bit with it, but this only changed read performance; it made no measurable difference when writing.

Thanks, Holger
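blockdev --setra is essentially a thin wrapper around the BLKRASET ioctl, with the value given in 512-byte sectors. A small sketch of getting and setting a device's readahead directly (illustrative only, not from this thread):

/* ra.c - get/set block device readahead via ioctl (illustrative sketch) */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>		/* BLKRAGET, BLKRASET */

int main(int argc, char *argv[])
{
	int fd;
	long ra;

	if (argc < 2) {
		fprintf(stderr, "Usage: %s <device> [sectors]\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (ioctl(fd, BLKRAGET, &ra) < 0) {
		perror("BLKRAGET");
		return 1;
	}
	printf("current readahead: %ld sectors\n", ra);
	/* setting requires root, just as blockdev --setra does */
	if (argc == 3 && ioctl(fd, BLKRASET, strtoul(argv[2], NULL, 10)) < 0)
		perror("BLKRASET");
	close(fd);
	return 0;
}

This matches Holger's observation above: readahead only shapes the read path, so changing it cannot move the write numbers.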
Where is the performance bottleneck?
Hello

I have a system with the following setup:

Board is Tyan S4882 with AMD 8131 Chipset
4 Opterons 848 (2.2GHz)
8 GB DDR400 Ram (2GB for each CPU)
1 onboard Symbios Logic 53c1030 dual channel U320 controller
2 SATA disks put together as a SW Raid1 for system, swap and spares
8 SCSI U320 (15000 rpm) disks, where 4 disks (sdc, sdd, sde, sdf) are on one channel and the other four (sdg, sdh, sdi, sdj) on the other channel.

The U320 SCSI controller has a 64 bit PCI-X bus for itself; there is no other device on that bus. Unfortunately I was unable to determine at what speed it is running, here the output from lspci -vv:

02:04.0 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-
        Subsystem: LSI Logic / Symbios Logic: Unknown device 1000
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Step
        Status: Cap+ 66Mhz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort-

/*************************************/
/*      File Write Performance       */
/*      ====================---      */
/*************************************/
#include <stdio.h>     /* printf()                                 */
#include <string.h>    /* strcmp()                                 */
#include <stdlib.h>    /* exit(), atoi(), calloc(), free()         */
#include <unistd.h>    /* write(), sysconf(), close(), fsync()     */
#include <sys/times.h> /* times(), struct tms                      */
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <errno.h>
#include <stdarg.h>

#define MAXLINE           4096
#define BUFSIZE           512
#define DEFAULT_FILE_SIZE 31457280
#define TEST_FILE         "test.file"
#define FILE_MODE         (S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH)

static void err_doit(int, char *, va_list),
            err_quit(char *, ...),
            err_sys(char *, ...);

/*############################ main() ############################*/
int main(int argc, char *argv[])
{
   register int n, loops, rest;
   int          fd, oflag, blocksize = BUFSIZE;
   off_t        filesize = DEFAULT_FILE_SIZE;
   clock_t      start, end, syncend;
   long         clktck;
   char         *buf;
   struct tms   tmsdummy;

   if ((argc > 1) && (argc < 5))
   {
      filesize = (off_t)atoi(argv[1]) * 1024;
      if (argc == 3)
         blocksize = atoi(argv[2]);
      else if (argc == 4)
              err_quit("Usage: %s [filesize] [blocksize]", argv[0]);
   }
   else if (argc != 1)
           err_quit("Usage: %s [filesize] [blocksize]", argv[0]);

   if ((clktck = sysconf(_SC_CLK_TCK)) < 0)
      err_sys("sysconf error");

   /* If clktck = 0 it doesn't make sense to run the test. */
   if (clktck == 0)
   {
      (void)printf("0\n");
      exit(0);
   }

   if ((buf = calloc(blocksize, sizeof(char))) == NULL)
      err_sys("calloc error");
   for (n = 0; n < blocksize; n++)
      buf[n] = 'T';

   loops = filesize / blocksize;
   rest = filesize % blocksize;
   oflag = O_WRONLY | O_CREAT;
   if ((fd = open(TEST_FILE, oflag, FILE_MODE)) < 0)
      err_quit("Could not open %s", TEST_FILE);

   if ((start = times(&tmsdummy)) == -1)
      err_sys("Could not get start time");
   for (n = 0; n < loops; n++)
      if (write(fd, buf, blocksize) != blocksize)
         err_sys("write error");
   if (rest > 0)
      if (write(fd, buf, rest) != rest)
         err_sys("write error");
   if ((end = times(&tmsdummy)) == -1)
      err_sys("Could not get end time");
   (void)fsync(fd);
   if ((syncend = times(&tmsdummy)) == -1)
      err_sys("Could not get end time");
   (void)close(fd);
   free(buf);
   (void)printf("%f %f\n",
                (double)filesize / ((double)(end - start) / (double)clktck),
                (double)filesize / ((double)(syncend - start) / (double)clktck));
   exit(0);
}

static void err_sys(char *fmt, ...)
{
   va_list ap;

   va_start(ap, fmt);
   err_doit(1, fmt, ap);
   va_end(ap);
   exit(1);
}

static void err_quit(char *fmt, ...)
{
   va_list ap;

   va_start(ap, fmt);
   err_doit(0, fmt, ap);
   va_end(ap);
   exit(1);
}

static void err_doit(int errnoflag, char *fmt, va_list ap)
{
   int  errno_save;
   char buf[MAXLINE];

   errno_save = errno;
   (void)vsprintf(buf, fmt, ap);
   if (errnoflag)
      (void)sprintf(buf + strlen(buf), ": %s", strerror(errno_save));
   (void)strcat(buf, "\n");
   fflush(stdout);
   (void)fputs(buf, stderr);
   fflush(NULL); /* Flushes all stdio output streams */

   return;
}
RE: As of 2.6.13-rc1 Fusion-MPT very slow
On Mon, 8 Aug 2005, Moore, Eric Dean wrote:

On Sunday, August 07, 2005 8:30 AM, James Bottomley wrote:

On Sun, 2005-08-07 at 05:59 +, Holger Kiehl wrote:

Thanks, removing those it compiles fine. This patch also solves my problem, here the output of dmesg:

Well ... the transport class was supposed to help diagnose the problem rather than fix it. However, what it shows is that the original problem is in the fusion internal domain validation somewhere, but that we still don't know where... James

I was corresponding with Mr Holger Kiehl in private email. What I understood the problem to be was that when he compiled the drivers into the kernel, instead of as modules, we would get some drives negotiating as async narrow on the 2nd channel.

It's always the first channel that has the problem. There are four disks, and the first always negotiated as wide and has the full speed. Disks 2 to 4 are always narrow and give me only 2MB/s. On the 2nd channel everything is always ok; here all 4 disks have the full speed.

What I was trying to do was reproduce the issue here, and I was unable to. Has Mr Holger Kiehl tried compiling your patch with the drivers compiled statically into the kernel, instead of as modules?

It was compiled statically into the kernel.

Anyways - my last suggestion was that he change the scsi cable, and reset the parameters in the bios configuration utility. I don't believe that fixed it.

No. I exchanged cables, still always the same results. Also on a second system that has identical hardware, as soon as I put kernel 2.6.13-rc1 on it I get the same problem.

Here's my next suggestion. Recompile the driver with domain validation debugging enabled. Then send me the output of dmesg so I can analyze it. This brings us closer to the root of the problem, I think.

With domain validation debugging enabled, this problem is no longer reliably reproducible. I once even saw that only the fourth disk on the first channel had the slow performance. Booting several times gave me most of the time full speed for all four disks on the first channel. But the results were not stable. I then took out some unused drivers (hardware watchdog and IPMI) and the system would always come up with all four disks at full speed. I then removed domain validation debugging, but then the problem was there again. So I put in a msleep(2000) in ./drivers/block/elevator.c just after it prints out what elevator it used, and enabled domain validation debugging again. Booting with this kernel I managed to capture the debugging output with disks 2 to 4 having only 2MB/s. So I think there is some timing problem somewhere. I also have the output without the msleep(), that is, with all four disks having full speed on the first channel. Please tell me if this is of interest, then I will post it as well.
Thanks, Holger

---
Bootdata ok (command line is ro root=/dev/md0)
Linux version 2.6.13-rc5-git3 ([EMAIL PROTECTED]) (gcc version 4.0.1 20050727 (Red Hat 4.0.1-5)) #6 SMP Tue Aug 9 11:14:17 GMT 2005
BIOS-provided physical RAM map:
 BIOS-e820: - 0009a000 (usable)
 BIOS-e820: 0009a000 - 000a (reserved)
 BIOS-e820: 000d2000 - 0010 (reserved)
 BIOS-e820: 0010 - f7f7 (usable)
 BIOS-e820: f7f7 - f7f76000 (ACPI data)
 BIOS-e820: f7f76000 - f7f8 (ACPI NVS)
 BIOS-e820: f7f8 - f800 (reserved)
 BIOS-e820: fec0 - fec00400 (reserved)
 BIOS-e820: fee0 - fee01000 (reserved)
 BIOS-e820: fff8 - 0001 (reserved)
 BIOS-e820: 0001 - 0002 (usable)
ACPI: RSDP (v002 PTLTD ) @ 0x000f6a70
ACPI: XSDT (v001 PTLTD XSDT 0x0604 LTP 0x) @ 0xf7f72e3b
ACPI: FADT (v003 AMDHAMMER 0x0604 PTEC 0x000f4240) @ 0xf7f72f97
ACPI: SRAT (v001 AMDHAMMER 0x0604 AMD 0x0001) @ 0xf7f75904
ACPI: SSDT (v001 PTLTD POWERNOW 0x0604 LTP 0x0001) @ 0xf7f75a3c
ACPI: HPET (v001 AMDHAMMER 0x0604 PTEC 0x) @ 0xf7f75dac
ACPI: SSDT (v001 AMD-K8 AMD-ACPI 0x0604 AMD 0x0001) @ 0xf7f75de4
ACPI: SSDT (v001 AMD-K8 AMD-ACPI 0x0604 AMD 0x0001) @ 0xf7f75e81
ACPI: MADT (v001 PTLTD APIC 0x0604 LTP 0x) @ 0xf7f75f1e
ACPI: SPCR (v001 PTLTD $UCRTBL$ 0x0604 PTL 0x0001) @ 0xf7f75fb0
ACPI: DSDT (v001 AMD-K8 AMDACPI 0x0604 MSFT 0x010e) @ 0x
SRAT: PXM 0 -> APIC 0 -> CPU 0 -> Node 0
SRAT: PXM 1 -> APIC 1 -> CPU 1 -> Node 1
SRAT: PXM 2 -> APIC 2 -> CPU 2 -> Node 2
SRAT: PXM 3 -> APIC 3 -> CPU 3 -> Node 3
SRAT: Node 0 PXM 0 0-9
SRAT: Node 0 PXM 0 0-7fff
SRAT: Node 1 PXM 1 8000-f7ff
SRAT: Node 2 PXM 2 1-17fff
SRAT: Node 3 PX
RE: As of 2.6.13-rc1 Fusion-MPT very slow
On Sat, 6 Aug 2005, James Bottomley wrote:

On Sat, 2005-08-06 at 21:12 +, Holger Kiehl wrote:

drivers/message/fusion/mptspi.c:505: error: unknown field 'get_hold_mcs' specified in initializer
drivers/message/fusion/mptspi.c:505: warning: excess elements in struct initializer
drivers/message/fusion/mptspi.c:505: warning: (near initialization for 'mptspi_transport_functions')
drivers/message/fusion/mptspi.c:506: error: unknown field 'set_hold_mcs' specified in initializer
drivers/message/fusion/mptspi.c:506: warning: excess elements in struct initializer
drivers/message/fusion/mptspi.c:506: warning: (near initialization for 'mptspi_transport_functions')
drivers/message/fusion/mptspi.c:507: error: unknown field 'show_hold_mcs' specified in initializer
drivers/message/fusion/mptspi.c:507: warning: excess elements in struct initializer
drivers/message/fusion/mptspi.c:507: warning: (near initialization for 'mptspi_transport_functions')

This is actually because -mm is slightly behind the scsi-misc tree. It looks like the hold_mcs parameters haven't propagated into the -mm tree yet. You should be able to correct this by cutting these three lines:

	.get_hold_mcs	= mptspi_read_parameters,
	.set_hold_mcs	= mptspi_write_hold_mcs,
	.show_hold_mcs	= 1,

out of the code at lines 505-507. You'll get a warning about mptspi_write_hold_mcs() being defined but not used, which you can ignore.

Thanks, removing those it compiles fine. This patch also solves my problem, here the output of dmesg:

Fusion MPT base driver 3.03.02
Copyright (c) 1999-2005 LSI Logic Corporation
Fusion MPT SPI Host driver 3.03.02
ACPI: PCI Interrupt :02:04.0[A] -> GSI 24 (level, low) -> IRQ 217
mptbase: Initiating ioc0 bringup
ioc0: 53C1030: Capabilities={Initiator,Target}
scsi4 : ioc0: LSI53C1030, FwRev=01032700h, Ports=1, MaxQ=255, IRQ=217
Vendor: FUJITSU Model: MAS3735NP Rev: 0104
Type: Direct-Access ANSI SCSI revision: 03
target4:0:0: Beginning Domain Validation
target4:0:0: Ending Domain Validation
target4:0:0: FAST-160 WIDE SCSI 320.0 MB/s DT IU (6.25 ns, offset 127)
SCSI device sdc: 143552136 512-byte hdwr sectors (73499 MB)
SCSI device sdc: drive cache: write back
SCSI device sdc: 143552136 512-byte hdwr sectors (73499 MB)
SCSI device sdc: drive cache: write back
sdc: sdc1
Attached scsi disk sdc at scsi4, channel 0, id 0, lun 0
Vendor: FUJITSU Model: MAS3735NP Rev: 0104
Type: Direct-Access ANSI SCSI revision: 03
target4:0:1: Beginning Domain Validation
target4:0:1: Ending Domain Validation
target4:0:1: FAST-160 WIDE SCSI 320.0 MB/s DT IU (6.25 ns, offset 127)
SCSI device sdd: 143552136 512-byte hdwr sectors (73499 MB)
SCSI device sdd: drive cache: write back
SCSI device sdd: 143552136 512-byte hdwr sectors (73499 MB)
SCSI device sdd: drive cache: write back
sdd: sdd1
Attached scsi disk sdd at scsi4, channel 0, id 1, lun 0
Vendor: FUJITSU Model: MAS3735NP Rev: 0104
Type: Direct-Access ANSI SCSI revision: 03
target4:0:2: Beginning Domain Validation
target4:0:2: Ending Domain Validation
target4:0:2: FAST-160 WIDE SCSI 320.0 MB/s DT IU (6.25 ns, offset 127)
SCSI device sde: 143552136 512-byte hdwr sectors (73499 MB)
SCSI device sde: drive cache: write back
SCSI device sde: 143552136 512-byte hdwr sectors (73499 MB)
SCSI device sde: drive cache: write back
sde: sde1
Attached scsi disk sde at scsi4, channel 0, id 2, lun 0
Vendor: FUJITSU Model: MAS3735NP Rev: 0104
Type: Direct-Access ANSI SCSI revision: 03
target4:0:3: Beginning Domain Validation
target4:0:3: Ending Domain Validation
target4:0:3: FAST-160 WIDE SCSI 320.0 MB/s DT IU (6.25 ns, offset 127)
SCSI device sdf: 143552136 512-byte hdwr sectors (73499 MB)
SCSI device sdf: drive cache: write back
SCSI device sdf: 143552136 512-byte hdwr sectors (73499 MB)
SCSI device sdf: drive cache: write back
sdf: sdf1
Attached scsi disk sdf at scsi4, channel 0, id 3, lun 0
ACPI: PCI Interrupt :02:04.1[B] -> GSI 25 (level, low) -> IRQ 225
mptbase: Initiating ioc1 bringup
ioc1: 53C1030: Capabilities={Initiator,Target}
scsi5 : ioc1: LSI53C1030, FwRev=01032700h, Ports=1, MaxQ=255, IRQ=225
Vendor: FUJITSU Model: MAS3735NP Rev: 0104
Type: Direct-Access ANSI SCSI revision: 03
target5:0:0: Beginning Domain Validation
target5:0:0: Ending Domain Validation
target5:0:0: FAST-160 WIDE SCSI 320.0 MB/s DT IU (6.25 ns, offset 127)
SCSI device sdg: 143552136 512-byte hdwr sectors (73499 MB)
SCSI device sdg: drive cache: write back
SCSI device sdg: 143552136 512-byte hdwr sectors (73499 MB)
SCSI d
RE: As of 2.6.13-rc1 Fusion-MPT very slow
On Sat, 6 Aug 2005, James Bottomley wrote:

On Mon, 2005-08-01 at 15:40 +, Holger Kiehl wrote:

No I did not get it. Can you please send it to me or tell me where I can download it?

OK, since this has stalled, how about trying a different approach. If you apply the attached patch it will cause fusion to use the transport class domain validation. That should show us which parameters are causing the problem and exactly what the negotiations said. We can also tell you how to tweak the parameters. It should apply to any recent -mm (unless Andrew does a turn to pick up the fusion module rework).

I tried from 2.6.13-rc2-mm2 up to 2.6.13-rc4-mm1 and always get the following error when applying this patch:

CC drivers/message/fusion/mptbase.o
CC drivers/message/fusion/mptscsih.o
CC drivers/message/fusion/mptspi.o
drivers/message/fusion/mptspi.c: In function 'mptspi_target_alloc':
drivers/message/fusion/mptspi.c:113: error: invalid storage class for function 'mptspi_write_offset'
drivers/message/fusion/mptspi.c:114: error: invalid storage class for function 'mptspi_write_width'
drivers/message/fusion/mptspi.c:131: warning: implicit declaration of function 'mptspi_write_width'
drivers/message/fusion/mptspi.c: At top level:
drivers/message/fusion/mptspi.c:453: warning: conflicting types for 'mptspi_write_width'
drivers/message/fusion/mptspi.c:453: error: static declaration of 'mptspi_write_width' follows non-static declaration
drivers/message/fusion/mptspi.c:131: error: previous implicit declaration of 'mptspi_write_width' was here
drivers/message/fusion/mptspi.c:505: error: unknown field 'get_hold_mcs' specified in initializer
drivers/message/fusion/mptspi.c:505: warning: excess elements in struct initializer
drivers/message/fusion/mptspi.c:505: warning: (near initialization for 'mptspi_transport_functions')
drivers/message/fusion/mptspi.c:506: error: unknown field 'set_hold_mcs' specified in initializer
drivers/message/fusion/mptspi.c:506: warning: excess elements in struct initializer
drivers/message/fusion/mptspi.c:506: warning: (near initialization for 'mptspi_transport_functions')
drivers/message/fusion/mptspi.c:507: error: unknown field 'show_hold_mcs' specified in initializer
drivers/message/fusion/mptspi.c:507: warning: excess elements in struct initializer
drivers/message/fusion/mptspi.c:507: warning: (near initialization for 'mptspi_transport_functions')
make[3]: *** [drivers/message/fusion/mptspi.o] Error 1
make[2]: *** [drivers/message/fusion] Error 2
make[1]: *** [drivers/message] Error 2
make: *** [drivers] Error 2

The first errors I was able to resolve by placing the function prototype definitions (lines 113 and 114) outside the function. I am using gcc 4.0.1. But for the errors at line 505 onwards I don't know what to do. Should I take an earlier -mm release?

Thanks, Holger
RE: As of 2.6.13-rc1 Fusion-MPT very slow
No I did not get it. Can you please send it to me or tell me where I can download it? Thanks, Holger -- On Mon, 1 Aug 2005, Moore, Eric Dean wrote: I provided an application called getspeed as an attachment in the email I sent last Friday. Did you receive that, or do I need to resend? If possible, can run that application and send me the output. Regards, Eric Moore On Monday, August 01, 2005 4:16 AM, Holger Kiehl wrote: On Fri, 29 Jul 2005, Andrew Morton wrote: "Moore, Eric Dean" <[EMAIL PROTECTED]> wrote: Regarding the 1st issue, can you try this patch out. It maybe in the -mm branch. Andrew cc'd on this email can confirm. ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/ 2.6.13-rc3/2.6 .13-rc3-mm3/broken-out/mpt-fusion-dv-fixes.patch Yes, that's part of 2.6.13-rc3-mm3. The patch makes no difference. Still get the following results when fusion is compiled in: sdc 74MB/s sdd2MB/s sde2MB/s sdf2MB/s On second channel: sdg 74MB/s sdh 74MB/s sdi 74MB/s sdj 74MB/s The patch was applied to linux-2.6.13-rc4-git3. Here part of dmesg output: Fusion MPT base driver 3.03.02 Copyright (c) 1999-2005 LSI Logic Corporation Fusion MPT SPI Host driver 3.03.02 ACPI: PCI Interrupt :02:04.0[A] -> GSI 24 (level, low) -> IRQ 217 mptbase: Initiating ioc0 bringup ioc0: 53C1030: Capabilities={Initiator,Target} scsi4 : ioc0: LSI53C1030, FwRev=01032700h, Ports=1, MaxQ=255, IRQ=217 Vendor: FUJITSU Model: MAS3735NP Rev: 0104 Type: Direct-Access ANSI SCSI revision: 03 SCSI device sdc: 143552136 512-byte hdwr sectors (73499 MB) SCSI device sdc: drive cache: write back SCSI device sdc: 143552136 512-byte hdwr sectors (73499 MB) SCSI device sdc: drive cache: write back sdc: sdc1 Attached scsi disk sdc at scsi4, channel 0, id 0, lun 0 Vendor: FUJITSU Model: MAS3735NP Rev: 0104 Type: Direct-Access ANSI SCSI revision: 03 SCSI device sdd: 143552136 512-byte hdwr sectors (73499 MB) SCSI device sdd: drive cache: write back SCSI device sdd: 143552136 512-byte hdwr sectors (73499 MB) SCSI device sdd: drive cache: write back sdd: sdd1 Attached scsi disk sdd at scsi4, channel 0, id 1, lun 0 Vendor: FUJITSU Model: MAS3735NP Rev: 0104 Type: Direct-Access ANSI SCSI revision: 03 SCSI device sde: 143552136 512-byte hdwr sectors (73499 MB) SCSI device sde: drive cache: write back SCSI device sde: 143552136 512-byte hdwr sectors (73499 MB) SCSI device sde: drive cache: write back sde: sde1 Attached scsi disk sde at scsi4, channel 0, id 2, lun 0 Vendor: FUJITSU Model: MAS3735NP Rev: 0104 Type: Direct-Access ANSI SCSI revision: 03 SCSI device sdf: 143552136 512-byte hdwr sectors (73499 MB) SCSI device sdf: drive cache: write back SCSI device sdf: 143552136 512-byte hdwr sectors (73499 MB) SCSI device sdf: drive cache: write back sdf: sdf1 Attached scsi disk sdf at scsi4, channel 0, id 3, lun 0 ACPI: PCI Interrupt :02:04.1[B] -> GSI 25 (level, low) -> IRQ 225 mptbase: Initiating ioc1 bringup ioc1: 53C1030: Capabilities={Initiator,Target} scsi5 : ioc1: LSI53C1030, FwRev=01032700h, Ports=1, MaxQ=255, IRQ=225 Vendor: FUJITSU Model: MAS3735NP Rev: 0104 Type: Direct-Access ANSI SCSI revision: 03 SCSI device sdg: 143552136 512-byte hdwr sectors (73499 MB) SCSI device sdg: drive cache: write back SCSI device sdg: 143552136 512-byte hdwr sectors (73499 MB) SCSI device sdg: drive cache: write back sdg: sdg1 Attached scsi disk sdg at scsi5, channel 0, id 0, lun 0 Vendor: FUJITSU Model: MAS3735NP Rev: 0104 Type: Direct-Access ANSI SCSI revision: 03 SCSI device sdh: 143552136 512-byte hdwr sectors (73499 MB) SCSI device sdh: drive cache: 
write back SCSI device sdh: 143552136 512-byte hdwr sectors (73499 MB) SCSI device sdh: drive cache: write back sdh: sdh1 Attached scsi disk sdh at scsi5, channel 0, id 1, lun 0 Vendor: FUJITSU Model: MAS3735NP Rev: 0104 Type: Direct-Access ANSI SCSI revision: 03 SCSI device sdi: 143552136 512-byte hdwr sectors (73499 MB) SCSI device sdi: drive cache: write back SCSI device sdi: 143552136 512-byte hdwr sectors (73499 MB) SCSI device sdi: drive cache: write back sdi: sdi1 Attached scsi disk sdi at scsi5, channel 0, id 2, lun 0 Vendor: FUJITSU Model: MAS3735NP Rev: 0104 Type: Direct-Access ANSI SCSI revision: 03 SCSI device sdj: 143552136 512-byte hdwr sectors (73499 MB) SCSI device sdj: drive
Re: As of 2.6.13-rc1 Fusion-MPT very slow
On Fri, 29 Jul 2005, Andrew Morton wrote: "Moore, Eric Dean" <[EMAIL PROTECTED]> wrote: Regarding the 1st issue, can you try this patch out. It maybe in the -mm branch. Andrew cc'd on this email can confirm. ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.13-rc3/2.6 .13-rc3-mm3/broken-out/mpt-fusion-dv-fixes.patch Yes, that's part of 2.6.13-rc3-mm3. The patch makes no difference. Still get the following results when fusion is compiled in: sdc 74MB/s sdd2MB/s sde2MB/s sdf2MB/s On second channel: sdg 74MB/s sdh 74MB/s sdi 74MB/s sdj 74MB/s The patch was applied to linux-2.6.13-rc4-git3. Here part of dmesg output: Fusion MPT base driver 3.03.02 Copyright (c) 1999-2005 LSI Logic Corporation Fusion MPT SPI Host driver 3.03.02 ACPI: PCI Interrupt :02:04.0[A] -> GSI 24 (level, low) -> IRQ 217 mptbase: Initiating ioc0 bringup ioc0: 53C1030: Capabilities={Initiator,Target} scsi4 : ioc0: LSI53C1030, FwRev=01032700h, Ports=1, MaxQ=255, IRQ=217 Vendor: FUJITSU Model: MAS3735NP Rev: 0104 Type: Direct-Access ANSI SCSI revision: 03 SCSI device sdc: 143552136 512-byte hdwr sectors (73499 MB) SCSI device sdc: drive cache: write back SCSI device sdc: 143552136 512-byte hdwr sectors (73499 MB) SCSI device sdc: drive cache: write back sdc: sdc1 Attached scsi disk sdc at scsi4, channel 0, id 0, lun 0 Vendor: FUJITSU Model: MAS3735NP Rev: 0104 Type: Direct-Access ANSI SCSI revision: 03 SCSI device sdd: 143552136 512-byte hdwr sectors (73499 MB) SCSI device sdd: drive cache: write back SCSI device sdd: 143552136 512-byte hdwr sectors (73499 MB) SCSI device sdd: drive cache: write back sdd: sdd1 Attached scsi disk sdd at scsi4, channel 0, id 1, lun 0 Vendor: FUJITSU Model: MAS3735NP Rev: 0104 Type: Direct-Access ANSI SCSI revision: 03 SCSI device sde: 143552136 512-byte hdwr sectors (73499 MB) SCSI device sde: drive cache: write back SCSI device sde: 143552136 512-byte hdwr sectors (73499 MB) SCSI device sde: drive cache: write back sde: sde1 Attached scsi disk sde at scsi4, channel 0, id 2, lun 0 Vendor: FUJITSU Model: MAS3735NP Rev: 0104 Type: Direct-Access ANSI SCSI revision: 03 SCSI device sdf: 143552136 512-byte hdwr sectors (73499 MB) SCSI device sdf: drive cache: write back SCSI device sdf: 143552136 512-byte hdwr sectors (73499 MB) SCSI device sdf: drive cache: write back sdf: sdf1 Attached scsi disk sdf at scsi4, channel 0, id 3, lun 0 ACPI: PCI Interrupt :02:04.1[B] -> GSI 25 (level, low) -> IRQ 225 mptbase: Initiating ioc1 bringup ioc1: 53C1030: Capabilities={Initiator,Target} scsi5 : ioc1: LSI53C1030, FwRev=01032700h, Ports=1, MaxQ=255, IRQ=225 Vendor: FUJITSU Model: MAS3735NP Rev: 0104 Type: Direct-Access ANSI SCSI revision: 03 SCSI device sdg: 143552136 512-byte hdwr sectors (73499 MB) SCSI device sdg: drive cache: write back SCSI device sdg: 143552136 512-byte hdwr sectors (73499 MB) SCSI device sdg: drive cache: write back sdg: sdg1 Attached scsi disk sdg at scsi5, channel 0, id 0, lun 0 Vendor: FUJITSU Model: MAS3735NP Rev: 0104 Type: Direct-Access ANSI SCSI revision: 03 SCSI device sdh: 143552136 512-byte hdwr sectors (73499 MB) SCSI device sdh: drive cache: write back SCSI device sdh: 143552136 512-byte hdwr sectors (73499 MB) SCSI device sdh: drive cache: write back sdh: sdh1 Attached scsi disk sdh at scsi5, channel 0, id 1, lun 0 Vendor: FUJITSU Model: MAS3735NP Rev: 0104 Type: Direct-Access ANSI SCSI revision: 03 SCSI device sdi: 143552136 512-byte hdwr sectors (73499 MB) SCSI device sdi: drive cache: write back SCSI device sdi: 143552136 512-byte hdwr sectors (73499 MB) 
SCSI device sdi: drive cache: write back sdi: sdi1 Attached scsi disk sdi at scsi5, channel 0, id 2, lun 0 Vendor: FUJITSU Model: MAS3735NP Rev: 0104 Type: Direct-Access ANSI SCSI revision: 03 SCSI device sdj: 143552136 512-byte hdwr sectors (73499 MB) SCSI device sdj: drive cache: write back SCSI device sdj: 143552136 512-byte hdwr sectors (73499 MB) SCSI device sdj: drive cache: write back sdj: sdj1 Attached scsi disk sdj at scsi5, channel 0, id 3, lun 0 Anything else I can try or provide? Holger - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
As of 2.6.13-rc1 Fusion-MPT very slow
Hello

On a four CPU Opteron with Fusion-MPT compiled in, I get the following results (up to 2.6.13-rc3-git7) with hdparm on the first channel with four disks:

sdc 74 MB/s
sdd  2 MB/s
sde  2 MB/s
sdf  2 MB/s

On the second channel, also with the same type of disks:

sdg 74 MB/s
sdh 74 MB/s
sdi 74 MB/s
sdj 74 MB/s

All disks are of the same type. Compiling Fusion-MPT as a module for the same kernel, I get 74 MB/s for all eight disks. Taking kernel 2.6.12.2 and compiling it in, all eight disks give the expected performance of 74 MB/s.

When I exchange the two cables, putting the first cable on the second channel and the second cable on the first channel, always sdd, sde and sdf will only get approx. 2 MB/s with any 2.6.13-* kernels.

Another problem observed with 2.6.13-rc3-git7 and Fusion-MPT compiled in: when making an ext3 filesystem over those eight disks (software Raid10), mke2fs hangs for a very long time in D-state and /var/log/messages gets a lot of these messages:

mptscsih: ioc0: >> Attempting task abort! (sc=81014ead3ac0)
mptscsih: ioc0: >> Attempting task abort! (sc=81014ead38c0)
mptscsih: ioc0: >> Attempting task abort! (sc=81014ead36c0)
mptscsih: ioc0: >> Attempting task abort! (sc=81014ead34c0)
.
.
.

And finally, when I do a halt or powerdown just after all filesystems are unmounted, the fusion driver tells me that it puts the two controllers in power save mode. Then the kernel wants to flush the SCSI disks but hangs forever. This does not happen when doing a reboot.

Holger
RE: Fusion-MPT much faster as module
On Tue, 22 Mar 2005, Chen, Kenneth W wrote:

On Mon, 21 Mar 2005, Andrew Morton wrote:

Holger, this problem remains unresolved, does it not? Have you done any more experimentation? I must say that something funny seems to be happening here. I have two MPT-based Dell machines, neither of which is using a modular driver:

akpm:/usr/src/25> 0 hdparm -t /dev/sda
/dev/sda:
Timing buffered disk reads: 64 MB in 5.00 seconds = 12.80 MB/sec

Holger Kiehl wrote on Tuesday, March 22, 2005 12:31 AM

Got the same result when compiled in, always between 12 and 13 MB/s. As module it is approx. 75 MB/s.

Half guess, half with data to prove: it must be the variable driver_setup initialization. If compiled as built-in, driver_setup is initialized to zero for all of its member variables, which isn't the fastest setting. If compiled as module, it gets first class treatment with a shiny performance setting. Goofing around, this patch appears to be giving higher throughput.

Yes, that fixes it. Many thanks!

Holger
Re: Fusion-MPT much faster as module
On Mon, 21 Mar 2005, Andrew Morton wrote:

> Holger Kiehl <[EMAIL PROTECTED]> wrote:
> > Hello
> >
> > On a four CPU Opteron, compiling the Fusion-MPT as a module gives
> > much better performance than compiling it in; here some bonnie++
> > results:
> >
> > Version  1.03       --Sequential Output-- --Sequential Input- --Random-
> >                     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
> > Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
> > compiled in  15872M 38366  71  65602  22 18348   4 53276  84  57947   7 905.4   2
> > module       15872M 51246  96 204914  70 57236  14 59779  96 264171  33 923.0   2
> >
> > This happens with 2.6.10, 2.6.11 and 2.6.11-bk2. Controller is a
> > Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI.
> >
> > Why is there such a large difference?
>
> Holger, this problem remains unresolved, does it not? Have you done
> any more experimentation?

No. For now I just leave it as module.

> I must say that something funny seems to be happening here. I have two
> MPT-based Dell machines, neither of which is using a modular driver:
>
> akpm:/usr/src/25> 0 hdparm -t /dev/sda
> /dev/sda:
>  Timing buffered disk reads:  64 MB in  5.00 seconds =  12.80 MB/sec

Got the same result when compiled in, always between 12 and 13 MB/s. As module it is approx. 75 MB/s. Hope that LSI Logic will find the problem.

Another question I have: is there a way to see in what SCSI mode (320, 160, etc.) the Fusion-MPT is running? I could not find anything in /proc or dmesg. Adaptec has the following information in dmesg (and more in /proc):

(scsi1:A:0): 320.000MB/s transfers (160.000MHz DT|IU|QAS, 16bit)

Or has the Fusion-MPT some other tool to show this information?

Holger
Fusion-MPT much faster as module
Hello

On a four CPU Opteron, compiling the Fusion-MPT as a module gives much better performance than compiling it in; here some bonnie++ results:

Version  1.03       --Sequential Output-- --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
compiled in  15872M 38366  71  65602  22 18348   4 53276  84  57947   7 905.4   2
module       15872M 51246  96 204914  70 57236  14 59779  96 264171  33 923.0   2

This happens with 2.6.10, 2.6.11 and 2.6.11-bk2. Controller is a Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI.

Why is there such a large difference?

Holger
[PATCH/RFC] IPMI watchdog more verbose
Hello

This makes the IPMI watchdog more verbose during initialization. It prints the values of timeout and whether nowayout is set or not. Currently there is no way to see what these values are once initialized. Please check if this is the correct place to put the printk.

Holger

--- linux-2.6.10/drivers/char/ipmi/ipmi_watchdog.c.original	2005-02-21 10:02:38.289344538 +0000
+++ linux-2.6.10/drivers/char/ipmi/ipmi_watchdog.c	2005-02-21 10:10:38.925872976 +0000
@@ -944,9 +944,6 @@
 {
 	int rv;
 
-	printk(KERN_INFO PFX "driver version "
-	       IPMI_WATCHDOG_VERSION "\n");
-
 	if (strcmp(action, "reset") == 0) {
 		action_val = WDOG_TIMEOUT_RESET;
 	} else if (strcmp(action, "none") == 0) {
@@ -1031,6 +1028,9 @@
 	register_reboot_notifier(&wdog_reboot_notifier);
 	notifier_chain_register(&panic_notifier_list, &wdog_panic_notifier);
 
+	printk(KERN_INFO PFX "initialized (%s). timeout=%d sec (nowayout=%d)\n",
+	       IPMI_WATCHDOG_VERSION, timeout, nowayout);
+
 	return 0;
 }
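With this patch applied, the settings become visible at boot. Assuming PFX expands to "IPMI Watchdog: " and a 10 second timeout with nowayout off, the new message would look something like the following (the version string is just whatever IPMI_WATCHDOG_VERSION expands to):

IPMI Watchdog: initialized (<IPMI_WATCHDOG_VERSION>). timeout=10 sec (nowayout=0)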
What do these SCSI error messages mean?
Hello

I have a SW-Raid 5 running across 6 IBM DNES-309170W (one disk is hot spare) on an AIC-7890/1 Ultra2 SCSI host adapter (onboard) under 2.2.19. I use the aic driver that comes with 2.2.19, and Tagged Command Queueing is enabled and set to 24. This system was running for about 2 years without any problems, until one disk had medium errors and I had to exchange this disk with a DPSS-309170N of the same size. Another thing I did was to try 2.4.5 with the new aic driver, but I went back to 2.2.19. Since then I am getting the following errors in my syslog when the system is under heavy disk load:

scsi : aborting command due to timeout : pid 718083, scsi0, channel 0, id 1, lun 0
        Write (10) 00 00 c4 48 76 00 00 80 00
(scsi0:0:1:0) SCSISIGI 0x4, SEQADDR 0x61, SSTAT0 0x0, SSTAT1 0x2
(scsi0:0:1:0) SG_CACHEPTR 0x2c, SSTAT2 0x40, STCNT 0x5fc
scsi : aborting command due to timeout : pid 718084, scsi0, channel 0, id 1, lun 0
        Write (10) 00 00 c4 48 f6 00 00 80 00
scsi : aborting command due to timeout : pid 718085, scsi0, channel 0, id 1, lun 0
        Write (10) 00 00 c4 49 7e 00 00 30 00
scsi : aborting command due to timeout : pid 718086, scsi0, channel 0, id 2, lun 0
        Write (10) 00 00 c4 47 76 00 00 80 00
scsi : aborting command due to timeout : pid 718087, scsi0, channel 0, id 2, lun 0
        Write (10) 00 00 c4 47 f6 00 00 80 00
scsi : aborting command due to timeout : pid 718088, scsi0, channel 0, id 2, lun 0
        Write (10) 00 00 c4 48 76 00 00 80 00
scsi : aborting command due to timeout : pid 718089, scsi0, channel 0, id 2, lun 0
        Write (10) 00 00 c4 48 f6 00 00 80 00
scsi : aborting command due to timeout : pid 718090, scsi0, channel 0, id 2, lun 0
        Read (10) 00 00 c4 49 76 00 00 08 00
scsi : aborting command due to timeout : pid 718091, scsi0, channel 0, id 2, lun 0
        Write (10) 00 00 c4 49 7e 00 00 30 00
scsi : aborting command due to timeout : pid 718092, scsi0, channel 0, id 3, lun 0
        Write (10) 00 00 c4 47 76 00 00 80 00
scsi : aborting command due to timeout : pid 718093, scsi0, channel 0, id 3, lun 0
        Write (10) 00 00 c4 47 f6 00 00 80 00
scsi : aborting command due to timeout : pid 718094, scsi0, channel 0, id 3, lun 0
        Write (10) 00 00 c4 48 76 00 00 80 00
scsi : aborting command due to timeout : pid 718095, scsi0, channel 0, id 3, lun 0
        Write (10) 00 00 c4 48 f6 00 00 80 00
scsi : aborting command due to timeout : pid 718096, scsi0, channel 0, id 3, lun 0
        Write (10) 00 00 c4 49 7e 00 00 30 00
scsi : aborting command due to timeout : pid 718097, scsi0, channel 0, id 4, lun 0
        Write (10) 00 00 c4 47 76 00 00 80 00
scsi : aborting command due to timeout : pid 718098, scsi0, channel 0, id 4, lun 0
        Write (10) 00 00 c4 47 f6 00 00 80 00
scsi : aborting command due to timeout : pid 718099, scsi0, channel 0, id 4, lun 0
        Write (10) 00 00 c4 48 76 00 00 80 00
scsi : aborting command due to timeout : pid 718100, scsi0, channel 0, id 4, lun 0
        Write (10) 00 00 c4 48 f6 00 00 80 00
scsi : aborting command due to timeout : pid 718101, scsi0, channel 0, id 4, lun 0
        Write (10) 00 00 c4 49 7e 00 00 30 00
scsi : aborting command due to timeout : pid 718102, scsi0, channel 0, id 2, lun 0
        Read (10) 00 00 c4 49 ae 00 00 08 00
scsi : aborting command due to timeout : pid 718103, scsi0, channel 0, id 3, lun 0
        Read (10) 00 00 c3 76 86 00 00 20 00
scsi : aborting command due to timeout : pid 718104, scsi0, channel 0, id 0, lun 0
        Read (10) 00 00 28 6b 76 00 00 80 00
scsi : aborting command due to timeout : pid 718105, scsi0, channel 0, id 0, lun 0
        Read (10) 00 00 28 6b f6 00 00 80 00
scsi : aborting command due to timeout : pid 718106, scsi0, channel 0, id 1, lun 0
        Read (10) 00 00 28 6b 76 00 00 80 00
scsi : aborting command due to timeout : pid 718107, scsi0, channel 0, id 1, lun 0
        Read (10) 00 00 28 6b f6 00 00 40 00
scsi : aborting command due to timeout : pid 718108, scsi0, channel 0, id 2, lun 0
        Read (10) 00 00 28 6b 76 00 00 80 00
scsi : aborting command due to timeout : pid 718109, scsi0, channel 0, id 2, lun 0
        Read (10) 00 00 28 6c 36 00 00 40 00
scsi : aborting command due to timeout : pid 718110, scsi0, channel 0, id 3, lun 0
        Read (10) 00 00 28 6b 4e 00 00 68 00
scsi : aborting command due to timeout : pid 718111, scsi0, channel 0, id 3, lun 0
        Read (10) 00 00 28 6b f6 00 00 60 00
scsi : aborting command due to timeout : pid 718112, scsi0, channel 0, id 4, lun 0
        Read (10) 00 00 28 6b 36 00 00 40 00
scsi : aborting command due to timeout : pid 718113, scsi0, channel 0, id 4, lun 0
        Read (10) 00 00 28 6b b6 00 00 80 00
scsi : aborting command due to timeout : pid 718114, scsi0, channel 0, id 1, lun 0
        Read (10) 00 00 c2 ed 1e 00 00 08 00
SCSI host 0 abort (pid 718083) timed out - resetting
SCSI bus is being reset for host 0 channel 0.
(scsi0:0:0:0) Synchronous at 80.0 Mbyte/sec, offset 31.
(scsi0:0:1:0) Synchronous at 80.0 Mbyte/sec, offset 31.
(scsi0:0:3:0) Synchronous at 80.0 Mbyte/sec, offset 31.
(scsi0:0:4:0) Synchronous at 80.0 Mbyte/sec, offset 31.
(scsi0:0:2:0) Synchronous at 80.0 Mbyte/sec, offset 31.
VLAN in kernel?
Hello

Some time ago Ben Greear posted a patch to include VLAN support in the 2.4 kernel. I and many others have been using this patch with great success and without any problems for a very long time. What is the reason that this patch is not included in the kernel?

Thanks,
Holger
Re: [PATCH] - filesystem corruption on soft RAID5 in 2.4.0+
On Mon, 22 Jan 2001, Neil Brown wrote:
>
> There have been assorted reports of filesystem corruption on raid5 in
> 2.4.0, and I have finally got a patch - see below.
> I don't know if it addresses everybody's problems, but it fixed a very
> real problem that is very reproducible.
>
> The problem is that parity can be calculated wrongly when doing a
> read-modify-write update cycle. If you have a fully functional array,
> you won't notice this problem as the parity block is never used to
> return data. But if you have a degraded array, you will get corruption
> very quickly.
> So I think this will solve the reported corruption with ext2fs, as I
> think they were mostly on degraded arrays. I have no idea whether it
> will address the reiserfs problems as I don't think anybody reporting
> those problems described their array.
>
> In any case, please apply, and let me know of any further problems.
>

I did test this patch with 2.4.1-pre9 for about 16 hours and I no longer get the ext2 errors in syslog. Though I must say that neither of the two machines I tested had a degraded array (but both do have corruption without the patch).

During my last test, on one of the nodes a disk started to get "medium errors", however everything worked fine: the raid code removed the bad disk, started recalculating parity to set up the spare disk, and everything kept on running with no intervention and no errors in syslog. Very nice!

However, forcing a check with e2fsck -f still produces the following:

root@florix:~# e2fsck -f /dev/md2
e2fsck 1.19, 13-Jul-2000 for EXT2 FS 0.5b, 95/08/09
Pass 1: Checking inodes, blocks, and sizes
Special (device/socket/fifo) inode 3630145 has non-zero size.  Fix? yes
Special (device/socket/fifo) inode 3630156 has non-zero size.  Fix? yes
Special (device/socket/fifo) inode 3630176 has non-zero size.  Fix? yes
Special (device/socket/fifo) inode 3630184 has non-zero size.  Fix? yes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences: -3394 -3395 -3396 -3397 -3398 -3399 -3400 -3429
-3430 -3431 -3432 -3433 -3434 -3435 -3466 -3467 -3468 -3469 -3470 -3471
-3472 -3477 -3478 -3479 -3480 -3481 -3482 -3483 -3586 -3587 -3588 -3589
-3590 -3591 -3592 -3627 -3628 -3629 -3630 -3631 -3632 -3633 -3668 -3669
-3670 -3671 -3672 -3673 -3674 -3745 -3746 -3747 -3748 -3749 -3750 -3751
-3756 -3757 -3758 -3759 -3760 -3761 -3762 -3765 -3766 -3767 -3768 -3769
-3770 -3771 -3840 -3841 -3842 -3843 -3844 -3845 -3846  Fix? yes
Free blocks count wrong for group #0 (27874, counted=27951).  Fix? yes
Free blocks count wrong (7802000, counted=7802077).  Fix? yes

/dev/md2: ***** FILE SYSTEM WAS MODIFIED *****
/dev/md2: 7463/4006240 files (12.7% non-contiguous), 206243/8008320 blocks

Is this something I need to worry about? Yesterday I already reported that sometimes I only get the ones with "has non-zero size". What is the meaning of this?

Another thing I observed in the syslog is the following:

Jan 22 23:48:21 cube kernel: __alloc_pages: 2-order allocation failed.
Jan 22 23:48:42 cube last message repeated 32 times
Jan 22 23:49:54 cube last message repeated 48 times
Jan 22 23:58:09 cube kernel: __alloc_pages: 2-order allocation failed.
Jan 22 23:58:13 cube last message repeated 12 times
Jan 23 00:11:08 cube kernel: __alloc_pages: 2-order allocation failed.
Jan 23 00:11:10 cube last message repeated 43 times
Jan 23 00:19:35 cube kernel: __alloc_pages: 2-order allocation failed.
Jan 23 00:19:39 cube last message repeated 30 times
Jan 23 00:40:05 cube -- MARK --
Jan 23 00:53:36 cube kernel: __alloc_pages: 2-order allocation failed.
Jan 23 00:53:50 cube last message repeated 16 times

This happens under a very high load (120) and is probably not raid related. What is the meaning of this?

Thanks,
Holger
Re: [PATCH] - filesystem corruption on soft RAID5 in 2.4.0+
On Sun, 21 Jan 2001, Manfred Spraul wrote:
>
> I've attached Holger's testcase (ext2, SMP, raid5).
> Boot with "mem=64M" and run the attached script.
> The script creates and deletes 9 directories with 10.000 files in each
> dir. Neil, could you run it? I don't have a raid 5 array - SMP+ext2
> without raid5 is ok.
>
> Holger, what's your ext2 block size, and do you run with a degraded
> array?
>

No, I do not have a degraded array, and the blocksize of ext2 is 4096. Here is what /proc/mdstat looks like:

afdbench@florix:~/testdir$ cat /proc/mdstat
Personalities : [raid1] [raid5]
read_ahead 1024 sectors
md3 : active raid1 sdc1[1] sdb1[0]
      136448 blocks [2/2] [UU]
md4 : active raid1 sde1[1] sdd1[0]
      136448 blocks [2/2] [UU]
md0 : active raid1 sdf2[5] sde2[4] sdd2[3] sdc2[2] sdb2[1] sda2[0]
      24000 blocks [5/5] [UUUUU]
md1 : active raid5 sdf3[5] sde3[4] sdd3[3] sdc3[2] sdb3[1] sda3[0]
      3148288 blocks level 5, 64k chunk, algorithm 0 [5/5] [UUUUU]
md2 : active raid5 sdf4[5] sde4[4] sdd4[3] sdc4[2] sdb4[1] sda4[0]
      32033280 blocks level 5, 32k chunk, algorithm 0 [5/5] [UUUUU]
unused devices: <none>

What I do have is a spare disk, and I am running swap on raid1. However, my machine at home, which experiences the same problems, does not have swap on raid and is also not degraded.

I applied Neil's patch to 2.4.1-pre9 and reran the test, again with filesystem corruption. I then pressed the reset button, had all parity recalculated under 2.2.18, and rebooted again into 2.4.1-pre9 to rerun the test. Now I do not see any more filesystem corruption in syslog; however, forcing a check with e2fsck produces the following:

root@florix:~# e2fsck -f /dev/md2
e2fsck 1.19, 13-Jul-2000 for EXT2 FS 0.5b, 95/08/09
Pass 1: Checking inodes, blocks, and sizes
Special (device/socket/fifo) inode 3630145 has non-zero size.  Fix? yes
Special (device/socket/fifo) inode 3630156 has non-zero size.  Fix? yes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information

/dev/md2: ***** FILE SYSTEM WAS MODIFIED *****
/dev/md2: 20002/4006240 files (4.8% non-contiguous), 219556/8008320 blocks

Doing this three times, two of the runs reported the same inodes with non-zero size. One run went without any problem (the first time ever under 2.4.x). Now I am not sure if this still is filesystem corruption, and why the corruptions were so bad before the parity recalculation under 2.2.18. I do remember that the first time I ran 2.4.x with a much larger testset, it corrupted my system so badly that I had to push the reset button, and parity was recalculated under 2.4.1-pre3.

I will now run my other testset, but this always takes 8 hours. When it is done I will report back.

Holger
Re: Serious file system corruption with RAID5+SMP and kernels above 2.4.0
On Sat, 20 Jan 2001, Otto Meier wrote:
>
> Two days ago I tried new kernels on my SMP SW RAID5 system and
> experienced serious file system corruption with kernels 2.4.1-pre8,9
> as well as 2.4.0-ac8,9,10. The same error has been reported by other
> people on this list. With the 2.4.0 release everything runs fine. So I
> stepped back to it and have had no error since.
>

I just tried 2.4.0 and still get filesystem corruption. My system is also SMP and SW Raid5. So far I have tried 2.4.0, 2.4.1-pre3,8 and 2.4.0-ac10, and all corrupt my filesystem. 2.2.18 is ok. With the help of Manfred Spraul I can now reproduce this problem within 10 minutes.

Holger
Re: More filesystem corruption under 2.4.1-pre8 and SW Raid5
On Fri, 19 Jan 2001, Manfred Spraul wrote:
>
> I don't see a corruption - neither with 192MB ram nor with 48 MB ram.
> SMP, no SW Raid, ext2, but only 1024 byte/file and only 12500
> files/directory.
>
> > With 10000 I also had no problem, my next step was 50000.
>
> 10000 files need ~180MB, that fits into the cache.
> 50000 files need ~900MB, that doesn't fit into the cache.
>
> I'd try 10000 files, but now with "mem=64m"
>

You are right! I first tried with 20000 files and 256MB and it was ok. Then I tried with 10000 files and "mem=64m" and I get the corruption.

So if I conclude correctly: we both have SMP + ext2, you do not have SW raid and I do, so it's definitely a SW raid bug?

Holger
Re: More filesystem corruption under 2.4.1-pre8 and SW Raid5
On Fri, 19 Jan 2001, Manfred Spraul wrote:
>
> > Another thing I notice is that the responsiveness of the machine
> > decreases dramatically as the test progresses until it is nearly
> > useless. After the test is done everything is back to normal.
> > The same behavior was observed under 2.2.18.
>
> That's expected: ext2 performs linear searches through the directory,
> and with 50 000 entries that's very slow.
>

Would reiserfs be better, and does it now work with SW Raid5?

> I'm running a few quick tests, but I don't have a large enough spare
> partition (~ 1GB?) for a full test.
>
> How much main memory do you have, how large is your raid5 partition?
>

On the two machines I have tried, both have 256 MB of memory; one has an 8GB Raid5 partition and the other a 30GB one.

> Could you try to reproduce the problem with fewer files and less main
> memory?
>

I will try.

> I'm running your test with 48 MB ram, 12500 files, 9 processes in a
> 156 MB partition (swapoff, here is the test partition ;-).
> With 192MB Ram I don't see the corruption.
>

I am not sure if I understand you correctly: with 48MB you do get corruption and with 192MB not? And if you do see corruption, are you using SW Raid, SMP?

With 10000 I also had no problem; my next step was 50000.

Holger
More filesystem corruption under 2.4.1-pre8 and SW Raid5
Hello

Trying to find a quick way to reproduce the filesystem corruption I reported earlier, I have written a short program that simply creates a certain number of files in a given directory. If I start this program 9 times, each instance creating 50000 files (each 2048 bytes) in one of 9 different directories, and then delete these files again, I always get filesystem corruption. I admit that creating 50000 files in one directory is not something very common, but in my other test there are simply too many processes creating and deleting files, and it took too long to reproduce. My assumption is that something goes wrong somewhere as soon as a certain number of files have been created.

The tests were done on two different machines, both SMP, SW Raid 5 and ext2 filesystem. Under 2.4.1-pre3 and pre8 I always get filesystem corruption. This does NOT happen under 2.2.18. I don't know if this is due to a problem in the Raid 5 code, the ext2 filesystem or elsewhere in the kernel. Also, I do not currently have a system with 2.4.x without raid5.

For this reason I have attached two files (one C program and a script) with the code that corrupts my filesystem. To run it you need to issue the following commands:

cc -o fsd fsd.c
mkdir testdir
cp fsd start_fsd testdir
cd testdir
chmod 755 start_fsd
./start_fsd

Now you need to wait 3 or 4 hours and you should see some ext2 errors in your syslog.

WARNING: This corrupts your filesystem really badly! Sometimes only the files in the testdir are affected; however, I had cases where other files were also affected. The system sometimes behaves very strangely after the test: programs that have always worked just crash. Reconstruction with fsck does not always work properly; sometimes there are very strange files scattered over the whole filesystem afterwards. So be warned: do this on a test filesystem and reboot the machine after the test!

Another thing I notice is that the responsiveness of the machine decreases dramatically as the test progresses, until it is nearly useless. After the test is done everything is back to normal. The same behavior was observed under 2.2.18.
Holger

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <dirent.h>
#include <sys/types.h>
#include <sys/stat.h>

static void create_files(int, int, char *),
            delete_files(char *);

/* fsd */
int main(int argc, char *argv[])
{
   int  no_of_files, file_size;
   char dirname[1024];

   if (argc == 4)
   {
      no_of_files = atoi(argv[1]);
      file_size = atoi(argv[2]);
      (void)strcpy(dirname, argv[3]);
   }
   else
   {
      (void)fprintf(stderr, "Usage: %s <no_of_files> <file_size> <dirname>\n",
                    argv[0]);
      exit(1);
   }
   create_files(no_of_files, file_size, dirname);
   delete_files(dirname);
   exit(0);
}

/* create_files() */
static void create_files(int no_of_files, int file_size, char *dirname)
{
   int  i, fd;
   char *ptr;

   ptr = dirname + strlen(dirname);
   *ptr++ = '/';
   for (i = 0; i < no_of_files; i++)
   {
      (void)sprintf(ptr, "this_is_dummy_file_%d", i);
      if ((fd = open(dirname, O_CREAT|O_RDWR, S_IRUSR|S_IWUSR)) == -1)
      {
         (void)fprintf(stderr, "Failed to open() %s : %s\n",
                       dirname, strerror(errno));
         exit(1);
      }
      /* Seek to the last byte and write a single byte, so the file
         gets the requested size without writing all of its data. */
      if (lseek(fd, file_size - 1, SEEK_SET) == -1)
      {
         (void)fprintf(stderr, "Failed to lseek() %s : %s\n",
                       dirname, strerror(errno));
         exit(1);
      }
      if (write(fd, "", 1) != 1)
      {
         (void)fprintf(stderr, "Failed to write() to %s : %s\n",
                       dirname, strerror(errno));
         exit(1);
      }
      if (close(fd) == -1)
      {
         (void)fprintf(stderr, "Failed to close() %s : %s\n",
                       dirname, strerror(errno));
      }
   }
   ptr[-1] = 0;

   return;
}

/* delete_files() */
static void delete_files(char *dirname)
{
   char          *ptr;
   struct dirent *dirp;
   DIR           *dp;

   ptr = dirname + strlen(dirname);
   if ((dp = opendir(dirname)) == NULL)
   {
      (void)fprintf(stderr, "Failed to opendir() %s : %s\n",
                    dirname, strerror(errno));
      exit(1);
   }
   *ptr++ = '/';
   while ((dirp = readdir(dp)) != NULL)
   {
      if (dirp->d_name[0] != '.')
      {
         (void)strcpy(ptr, dirp->d_name);
         if (unlink(dirname) == -1)
         {
            (void)fprintf(stderr, "Failed to unlink() %s : %s\n",
                          dirname, strerror(errno));
            exit(1);
         }
      }
   }
   ptr[-1] = 0;
   if (closedir(dp) == -1)
   {
      (void)fprintf(stderr, "Failed to closedir() %s : %s\n",
                    dirname, strerror(errno));
   }

   return;
}
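The start_fsd script referred to above did not survive in the archive. Based only on the description (9 instances, each creating 50000 files of 2048 bytes in its own directory), an equivalent launcher would look roughly like the following; it is written in C here just to stay in one language, the original attachment was a shell script, and the directory names are made up:

/* start_fsd.c - hypothetical reconstruction of the lost launcher:
 * start 9 fsd processes in parallel, one directory each, then wait. */
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
   char dir[32];
   int  i;

   for (i = 0; i < 9; i++)
   {
      (void)sprintf(dir, "dir_%d", i);
      if ((mkdir(dir, S_IRWXU) == -1) && (errno != EEXIST))
      {
         perror(dir);
         exit(1);
      }
      switch (fork())
      {
         case -1:
            perror("fork");
            exit(1);
         case 0:
            (void)execl("./fsd", "fsd", "50000", "2048", dir, (char *)NULL);
            perror("execl");
            _exit(1);
      }
   }
   while (wait(NULL) > 0)   /* wait for all nine children to finish */
      ;
   return 0;
}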
PROBLEM: More filesystem corruption with 2.4.1-pre3 and SW raid5
Hello

Doing further tests I have experienced more filesystem corruption, this time on another node, but also with SMP and SW raid5. The machine has run the same test several times under 2.2.18, 2.2.17, 2.2.14 and 2.2.12 with no problems. This was the first time the test was run under 2.4.1, and it gave me filesystem corruption. I observed the same thing on my machine at home.

The test I am doing is copying/linking thousands of files around and deleting them again. The test starts off with 58 processes copying 600 files (SMALL), then 135 processes copy around 9000 files (MEDIUM), and in the last test 325 processes copy 80000 files (BIG). Each of the three tests (SMALL, MEDIUM, BIG) is further divided into one test where the files get transmitted via FTP (localhost) and another where the files are just being linked from one directory to another. And the corruption always starts when I come to the linking test. The link rate is about 2000 files/s.

Here follows some data on what syslog reported:

Jan 13 17:09:03 florix kernel: EXT2-fs warning (device md(9,2)): ext2_unlink: Deleting nonexistent file (1881249), 0
Jan 13 17:09:03 florix kernel: EXT2-fs warning (device md(9,2)): ext2_unlink: Deleting nonexistent file (1881250), 0
Jan 13 17:09:03 florix kernel: EXT2-fs warning (device md(9,2)): ext2_unlink: Deleting nonexistent file (1881251), 0
.
.
.
Jan 13 17:19:56 florix kernel: EXT2-fs error (device md(9,2)): ext2_free_blocks: bit already cleared for block 6688150
Jan 13 17:19:57 florix kernel: EXT2-fs warning (device md(9,2)): ext2_unlink: Deleting nonexistent file (3338561), 0
Jan 13 17:19:57 florix kernel: EXT2-fs warning (device md(9,2)): ext2_unlink: Deleting nonexistent file (3338562), 0
Jan 13 17:19:57 florix kernel: EXT2-fs warning (device md(9,2)): ext2_unlink: Deleting nonexistent file (3338563), 0
.
.
.
Jan 13 17:20:00 florix kernel: EXT2-fs warning (device md(9,2)): ext2_unlink: Deleting nonexistent file (3338647), 0
Jan 13 17:20:00 florix kernel: EXT2-fs error (device md(9,2)): ext2_free_blocks: bit already cleared for block 6688139
Jan 13 17:20:00 florix kernel: EXT2-fs error (device md(9,2)): ext2_free_blocks: bit already cleared for block 6688136
Jan 13 17:20:00 florix kernel: EXT2-fs error (device md(9,2)): ext2_free_blocks: bit already cleared for block 6688182
Jan 13 17:26:34 florix kernel: EXT2-fs warning (device md(9,2)): ext2_unlink: Deleting nonexistent file (3361022), 0
Jan 13 17:26:34 florix kernel: EXT2-fs warning (device md(9,2)): ext2_unlink: Deleting nonexistent file (3361023), 0
Jan 13 17:26:34 florix kernel: EXT2-fs warning (device md(9,2)): ext2_unlink: Deleting nonexistent file (3361024), 0
.
.
.
Jan 13 17:26:35 florix kernel: EXT2-fs warning (device md(9,2)): ext2_unlink: Deleting nonexistent file (3361023), 0
Jan 13 17:26:35 florix kernel: EXT2-fs warning (device md(9,2)): ext2_unlink: Deleting nonexistent file (3361024), 0
Jan 13 17:29:20 florix kernel: EXT2-fs error (device md(9,2)): ext2_free_blocks: bit already cleared for block 918960
Jan 13 17:29:20 florix kernel: EXT2-fs error (device md(9,2)): ext2_free_blocks: bit already cleared for block 918961
Jan 13 17:29:20 florix kernel: EXT2-fs error (device md(9,2)): ext2_free_blocks: bit already cleared for block 918962
.
.
.
Jan 13 17:30:57 florix kernel: EXT2-fs error (device md(9,2)): ext2_free_blocks: bit already cleared for block 3808052
Jan 13 17:30:57 florix kernel: EXT2-fs error (device md(9,2)): ext2_free_blocks: bit already cleared for block 3808053
Jan 13 17:30:57 florix kernel: EXT2-fs error (device md(9,2)): ext2_free_blocks: bit already cleared for block 3808054
Jan 13 17:32:56 florix kernel: EXT2-fs error (device md(9,2)): ext2_readdir: bad entry in directory #2894349: rec_len % 4 != 0 - offset=0, inode=270105152, rec_len=1397, name_len=39
Jan 13 17:32:56 florix kernel: EXT2-fs warning (device md(9,2)): empty_dir: bad directory (dir #2894349) - no `.' or `..'
Jan 13 17:37:22 florix kernel: EXT2-fs warning (device md(9,2)): ext2_unlink: Deleting nonexistent file (1940635), 0
Jan 13 17:37:22 florix kernel: EXT2-fs warning (device md(9,2)): ext2_unlink: Deleting nonexistent file (1940636), 0
Jan 13 17:37:22 florix kernel: EXT2-fs warning (device md(9,2)): ext2_unlink: Deleting nonexistent file (1940637), 0
Jan 13 17:37:22 florix kernel: EXT2-fs warning (device md(9,2)): ext2_unlink: Deleting nonexistent file (1940638), 0
.
.
.
Jan 13 19:34:27 florix kernel: EXT2-fs warning (device md(9,2)): ext2_unlink: Deleting nonexistent file (1933469), 0
Jan 13 19:34:27 florix kernel: EXT2-fs warning (device md(9,2)): ext2_unlink: Deleting nonexistent file (1933471), 0
Jan 13 19:34:27 florix kernel: EXT2-fs warning (device md(9,2)): ext2_unlink: Deleting nonexistent file (1933472), 0

At this point I was not able t
PROBLEM: Filesystem corruption with 2.4.1-pre3 and raid5
Hello

Doing some tests where lots of small files (and some large ones) get copied around, I experienced filesystem corruption with 2.4.1-pre3. The system has an ASUS P2B-DS (onboard Adaptec controller) with two P2-350s, 256MB (one module) PC-100 222 SDRAM with ECC, and 4 SCSI disks plus one IDE disk put together as one big SW Raid5 disk, SuSE 6.4 with the following:

Linux cube 2.4.1-pre3 #3 SMP Sun Jan 14 14:19:02 CET 2001 i686 unknown
Kernel modules         2.3.24
Gnu C                  2.95.2
Gnu Make               3.78.1
Binutils               2.9.5.0.24
Linux C Library        -rwxr-xr-x 1 root root 4061504 Mar 11  2000 /lib/libc.so.6
Dynamic linker         ldd (GNU libc) 2.1.3
Procps                 2.0.6
Mount                  2.10r
Net-tools              1.54
Kbd                    0.99
Sh-utils               2.0
Modules Loaded

I know my modutils are not up to date, but all relevant things (SCSI, filesystem, raid) were compiled in.

Here are some messages from syslog:

Jan 14 18:50:00 cube kernel: EXT2-fs warning (device md(9,1)): ext2_unlink: Deleting nonexistent file (613512), 0
Jan 14 18:56:19 cube kernel: EXT2-fs warning (device md(9,1)): ext2_unlink: Deleting nonexistent file (613533), 0
Jan 14 18:56:20 cube kernel: EXT2-fs warning (device md(9,1)): ext2_unlink: Deleting nonexistent file (613510), 0
Jan 14 18:57:14 cube kernel: attempt to access beyond end of device
Jan 14 18:57:14 cube kernel: 09:01: rw=1, want=1753106892, limit=8449536
Jan 14 18:57:14 cube kernel: attempt to access beyond end of device
Jan 14 18:57:14 cube kernel: 09:01: rw=1, want=1635361196, limit=8449536
.
.
.
Jan 14 18:57:14 cube kernel: attempt to access beyond end of device
Jan 14 18:57:14 cube kernel: 09:01: rw=1, want=127799040, limit=8449536
Jan 14 18:57:14 cube kernel: attempt to access beyond end of device
Jan 14 18:57:14 cube kernel: 09:01: rw=1, want=1004451972, limit=8449536
Jan 14 19:09:05 cube -- MARK --
Jan 14 19:29:05 cube -- MARK --
Jan 14 19:32:55 cube kernel: EXT2-fs warning (device md(9,1)): ext2_unlink: Deleting nonexistent file (145947), 0
Jan 14 19:32:55 cube kernel: EXT2-fs warning (device md(9,1)): ext2_unlink: Deleting nonexistent file (145948), 0
Jan 14 19:32:55 cube kernel: EXT2-fs warning (device md(9,1)): ext2_unlink: Deleting nonexistent file (145949), 0
.
.
.
Jan 14 19:33:18 cube kernel: EXT2-fs warning (device md(9,1)): ext2_unlink: Deleting nonexistent file (145945), 0
Jan 14 19:33:18 cube kernel: EXT2-fs warning (device md(9,1)): ext2_unlink: Deleting nonexistent file (145946), 0
Jan 14 19:49:06 cube -- MARK --
Jan 14 19:53:36 cube kernel: __alloc_pages: 2-order allocation failed.
Jan 14 19:53:39 cube last message repeated 8 times
Jan 14 20:09:06 cube -- MARK --
Jan 14 20:10:52 cube kernel: EXT2-fs error (device md(9,1)): ext2_readdir: bad entry in directory #929061: rec_len is smaller than minimal - offset=4056, inode=0, rec_len=0, name_len=0
Jan 14 20:10:52 cube kernel: EXT2-fs error (device md(9,1)): empty_dir: bad entry in directory #929061: rec_len is smaller than minimal - offset=4056, inode=0, rec_len=0, name_len=0
Jan 14 20:30:20 cube -- MARK --
Jan 14 20:50:24 cube -- MARK --
Jan 14 21:10:06 cube kernel: EXT2-fs error (device md(9,1)): ext2_free_blocks: bit already cleared for block 1402395
Jan 14 21:10:06 cube kernel: EXT2-fs error (device md(9,1)): ext2_free_blocks: bit already cleared for block 1438368
Jan 14 21:11:57 cube kernel: EXT2-fs error (device md(9,1)): ext2_free_blocks: bit already cleared for block 1439021
Jan 14 21:11:57 cube kernel: EXT2-fs error (device md(9,1)): ext2_free_blocks: bit already cleared for block 1435690
Jan 14 21:27:01 cube kernel: EXT2-fs warning (device md(9,1)): ext2_unlink: Deleting nonexistent file (698429), 0
.
.
.
Jan 14 21:27:03 cube kernel: EXT2-fs warning (device md(9,1)): ext2_unlink: Deleting nonexistent file (698429), 0
Jan 14 21:30:02 cube nscd: 175: cannot stat() file `/etc/group': No such file or directory
Jan 14 21:35:38 cube /usr/sbin/gpm[113]: oops() invoked from gpm.c(508)
Jan 14 21:35:38 cube /usr/sbin/gpm[113]: get_shift_state: Inappropriate ioctl for device

At this point I could still log into the system. I noticed after killing all processes with SysRq+I that something (I assume the kernel) was eating my memory:

ps aux
USER       PID %CPU %MEM  VSZ  RSS TTY STAT START  TIME COMMAND
root         1  0.0  0.0  344  200 ?   S    14:48  0:09 init
root         2  0.0  0.0    0    0 ?   SW   14:48  0:00 [keventd]
root         4  0.0  0.0    0    0 ?   SW   14:48  0:23 [kswapd]
root         5  0.0  0.0    0    0 ?   SW   14:48  0:03 [kreclaimd]
root         6  0.7  0.0    0    0 ?   SW   14:48  2:59 [bdflush]
root         7  0.3
Why is LINK_MAX so low?
Hello

Why is LINK_MAX in Linux only 127? The values for other operating systems are as follows:

solaris  32767
hpux     32767
irix     30000

In reality LINK_MAX for ext2 is 32000, so why is this constant only 127?

Please cc me since I am not on this list.

Thanks,
Holger
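As an aside, the limit that actually applies on a given filesystem can be queried at run time with the standard pathconf() interface instead of trusting the compile-time constant; a minimal sketch:

/* linkmax.c - print the compile-time LINK_MAX and the per-filesystem
 * limit reported by pathconf() for a given path (default: "."). */
#include <limits.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
   const char *path = (argc > 1) ? argv[1] : ".";
   long       lm;

#ifdef LINK_MAX
   (void)printf("compile-time LINK_MAX: %d\n", LINK_MAX);
#endif
   lm = pathconf(path, _PC_LINK_MAX);
   if (lm == -1)
   {
      perror("pathconf");
      return 1;
   }
   (void)printf("pathconf() LINK_MAX for %s: %ld\n", path, lm);
   return 0;
}

Run against a directory on ext2 this should report the filesystem's real limit (32000), regardless of what the header constant claims.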