Re: Frequent SATA errors / port timeouts in 2.6.18.3?

2006-12-14 Thread David Greaves
Patrik Jonsson wrote:
 Hi all,
 this may not be the best list for this question, but I figure that the
 number of disks connected to users here should be pretty big...
 
 I upgraded from 2.6.17-rc4 to 2.6.18.3 about a week ago, and I've since
 had 3 drives kicked out of my 10-drive RAID5 array. Previously, I had no
 kicks over almost a year. The kernel message is:
 
 ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
 ata7.00: (BMDMA stat 0x20)
 ata7.00: tag 0 cmd 0xc8 Emask 0x1 stat 0x41 err 0x4 (device error)
 ata7: EH complete

 Any ideas or thoughts would be appreciated,
SMART?

Read the manpage and then try running:
smartctl -d ata -S on /dev/...
and
smartctl -d ata -s on /dev/...

Then look at your smartd timing and see if it's related; possibly just do a
manual smartd poll.
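
If you want a quick one-off check first, something along these lines will dump
the health status, the attribute table and the error log, and kick off a short
self-test (the device name is only an example -- substitute your own; -d ata
is typically needed for SATA disks behind libata with smartctl of that era):

  smartctl -d ata -H /dev/sda
  smartctl -d ata -A /dev/sda
  smartctl -d ata -l error /dev/sda
  smartctl -d ata -t short /dev/sda

A smartd.conf line such as

  /dev/sda -d ata -a -s S/../.././02

would run a short self-test every night around 02:00 (the schedule regexp is
just an illustration -- see smartd.conf(5)).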

I've had smart/libata problems (well, no, just glitches) for about 2 years now,
but, as the irq handler occasionally says, "no one cared" ;)

It may well not be your problem but...

David


Re: disappointed with 3ware 9550sx

2006-12-14 Thread Rob Bray
 i want to say up front that i have several 3ware 7504 and 7508 cards
 which i am completely satisfied with.  i use them as JBOD, and they make
 stellar PATA controllers (not RAID controllers).  they're not perfect
 (they're slow), but they've been rock solid for years.

 not so the 9550sx.

 i've been a software raid devotee for years now.  i've never wanted to
 trust my data to hw raid, because i can't look under the covers and see
 what it's doing, and i'm at the mercy of the vendor when it comes to
 recovery situations.  so why did i even consider hw raid?  NVRAM.  i
 wanted the write performance of NVRAM.

 i debated between areca and 3ware, but given the areca driver wasn't in
 the kernel (it is now), the lack of smartmontools support for areca, and
 my experiences with the 7504/7508 i figured i'd stick with what i know.

 sure i am impressed with the hw raid i/o rates on the 9550sx, especially
 with the NVRAM.  but i am unimpressed with several failures which have
 occurred and which the evidence suggests are 3ware's fault (or which, even
 in the worst case, would not have caused problems with sw raid).

 my configuration has 7 disks:

 - 3x400GB WDC WD4000YR-01PLB0 firmware 01.06A01
 - 4x250GB WDC WD2500YD-01NVB1 firmware 10.02E01

 those disks and firmwares are on the 3ware drive compatibility list:
 http://www.3ware.com/products/pdf/Drive_compatibility_list_9550SX_9590SE_2006_09.pdf

 note that the compatibility list has a column "NCQ", which i read as an
 indication of whether or not the drive supports NCQ.  as supporting evidence
 for this i refer to footnote number 4, which is specifically used on some
 drives which MUST NOT have NCQ enabled.

 i had NCQ enabled on all 7 drives.  perhaps this is the source of some of
 my troubles, i'll grant 3ware that.

 initially i had the firmware from the 9.3.0.4 release on the 9550sx
 (3.04.00.005); it was the most recent at the time i installed the system.
 (and the appropriate driver in the kernel -- i think i was using 2.6.16.x
 at the time.)

 my first disappointment came when i tried to create a 3-way raid1 on the
 3x400 disks.  the 9550sx doesn't support it at all.  i had become so
 accustomed to using a 3-way raid1 with software raid that it didn't even
 occur to me to find out up front if the 3ware could support this.
 apparently this is so revolutionary an idea that 3ware support was
 completely baffled when i opened a ticket regarding it.  "why would you
 want that?  it will fail over to a spare disk automatically."

 still lured by the NVRAM i gave in and went with a 2-way mirror plus a
 spare.  (i prefer the 3-way mirror so i'm never without a redundant copy
 and don't have to rush to the colo with a replacement when a disk fails.)

 the 4x250GB were turned into a raid-10.

 install went fine, testing went fine, system was put into production.


 second disappointment:  within a couple weeks the 9550sx decided it
 didn't like one of the 400GB disks and knocked it out of the array.
 here's what the driver had to say about it:

 Sep  6 23:47:30 kernel: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x0009): Drive
 timeout detected:port=0.
 Sep  6 23:47:31 kernel: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x0002): Degraded
 unit:unit=0, port=0.
 Sep  6 23:48:46 kernel: 3w-9xxx: scsi0: AEN: INFO (0x04:0x000B): Rebuild
 started:unit=0.
 Sep  7 00:02:12 kernel: 3w-9xxx: scsi0: AEN: INFO (0x04:0x003B): Rebuild
 paused:unit=0.
 Sep  7 00:02:27 kernel: 3w-9xxx: scsi0: AEN: INFO (0x04:0x000B): Rebuild
 started:unit=0.
 Sep  7 09:32:19 kernel: 3w-9xxx: scsi0: AEN: INFO (0x04:0x0005): Rebuild
 completed:unit=0.

 the 9550sx could still communicate with the disk -- the SMART log
 had no indications of error.  i converted the drive to JBOD and read and
 overwrote the entire surface without a problem.  i ended up just
 converting the drive to the spare disk... but remained worried about
 why it could have been knocked out of the array.
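
 (for reference, a whole-surface read plus destructive write test like the one
 described above can be done with standard tools -- the device name below is
 illustrative, and badblocks -w destroys all data on the disk:

   dd if=/dev/sdX of=/dev/null bs=1M      # read every sector
   badblocks -svw /dev/sdX                # destructive write/read-back pass

 obviously only on a disk that has already been pulled out of the array.)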

 maybe this is a WD bug, maybe it's a 3ware bug, who knows.


 third disappointment:  for a large data copy i inserted a disk into the
 remaining spare slot on the 3ware.  now i'm familiar with 750[48] where
 i run everything as JBOD and never let 3ware raid touch it.  when i
 inserted this 8th disk i found i had to ask tw_cli to create a JBOD.
 the disappointment comes here:  it zeroed the MBR!  fortunately the disk
 had a single full-sized partition and i could recreate the partition
 table, but there's no sane reason to zero the MBR just because i asked
 for the disk to be treated as JBOD (and don't tell me it'll reduce
 customer support cases because people might reuse a bad partition table
 from a disk that previously held a raid -- i think it'll create even more
 problems than that explanation might solve).


 fourth disappointment:  heavy write traffic on one unit can affect
 other units even though they have separate spindles.  my educated
 guess is the 3ware does not share its cache fairly and the write
 traffic starves everything else.  i described this in a post here
 

Re: [RFC: 2.6 patch] simplify drivers/md/md.c:update_size()

2006-12-14 Thread Doug Ledford
On Fri, 2006-12-15 at 01:19 +0100, Adrian Bunk wrote:
 While looking at commit 8ddeeae51f2f197b4fafcba117ee8191b49d843e,
 I got the impression that this commit couldn't fix anything, since the 
 size variable can't be changed before fit gets used.
 
 Is there any big thinko, or is the patch below that slightly simplifies 
 update_size() semantically equivalent to the current code?

No, this patch is broken.  Where it fails is specifically the case where
you want to autofit the largest possible size, you have different size
devices, and the first device is not the smallest.  When you hit the
first device, you will set size, then as you repeat the ITERATE_RDEV
loop, when you hit the smaller device, size will be non-0 and you'll
then trigger the later if and return -ENOSPC.  In the case of autofit,
you have to preserve the fit variable instead of looking at size so you
know whether or not to modify the size when you hit a smaller device
later in the list.

 Signed-off-by: Adrian Bunk [EMAIL PROTECTED]
 
 ---
 
  drivers/md/md.c |3 +--
  1 file changed, 1 insertion(+), 2 deletions(-)
 
 --- linux-2.6.19-mm1/drivers/md/md.c.old        2006-12-15 00:57:05.000000000 +0100
 +++ linux-2.6.19-mm1/drivers/md/md.c     2006-12-15 00:57:42.000000000 +0100
 @@ -4039,57 +4039,56 @@
           * Generate a 128 bit UUID
           */
          get_random_bytes(mddev->uuid, 16);
  
          mddev->new_level = mddev->level;
          mddev->new_chunk = mddev->chunk_size;
          mddev->new_layout = mddev->layout;
          mddev->delta_disks = 0;
  
          mddev->dead = 0;
          return 0;
  }
  
  static int update_size(mddev_t *mddev, unsigned long size)
  {
          mdk_rdev_t * rdev;
          int rv;
          struct list_head *tmp;
 -        int fit = (size == 0);
  
          if (mddev->pers->resize == NULL)
                  return -EINVAL;
          /* The size is the amount of each device that is used.
           * This can only make sense for arrays with redundancy.
           * linear and raid0 always use whatever space is available
           * We can only consider changing the size if no resync
           * or reconstruction is happening, and if the new size
           * is acceptable. It must fit before the sb_offset or,
           * if that is < data_offset, it must fit before the
           * size of each device.
           * If size is zero, we find the largest size that fits.
           */
          if (mddev->sync_thread)
                  return -EBUSY;
          ITERATE_RDEV(mddev,rdev,tmp) {
                  sector_t avail;
                  avail = rdev->size * 2;
  
 -                if (fit && (size == 0 || size > avail/2))
 +                if (size == 0)
                          size = avail/2;
                  if (avail < ((sector_t)size << 1))
                          return -ENOSPC;
          }
          rv = mddev->pers->resize(mddev, (sector_t)size *2);
          if (!rv) {
                  struct block_device *bdev;
  
                  bdev = bdget_disk(mddev->gendisk, 0);
                  if (bdev) {
                          mutex_lock(&bdev->bd_inode->i_mutex);
                          i_size_write(bdev->bd_inode, (loff_t)mddev->array_size << 10);
                          mutex_unlock(&bdev->bd_inode->i_mutex);
                          bdput(bdev);
                  }
          }
          return rv;
  }
-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband




Re: [RFC: 2.6 patch] simplify drivers/md/md.c:update_size()

2006-12-14 Thread Adrian Bunk
On Thu, Dec 14, 2006 at 07:36:35PM -0500, Doug Ledford wrote:
 On Fri, 2006-12-15 at 01:19 +0100, Adrian Bunk wrote:
  While looking at commit 8ddeeae51f2f197b4fafcba117ee8191b49d843e,
  I got the impression that this commit couldn't fix anything, since the 
  size variable can't be changed before fit gets used.
  
  Is there any big thinko, or is the patch below that slightly simplifies 
  update_size() semantically equivalent to the current code?
 
 No, this patch is broken.  Where it fails is specifically the case where
 you want to autofit the largest possible size, you have different size
 devices, and the first device is not the smallest.  When you hit the
 first device, you will set size, then as you repeat the ITERATE_RDEV
 loop, when you hit the smaller device, size will be non-0 and you'll
 then trigger the later if and return -ENOSPC.  In the case of autofit,
 you have to preserve the fit variable instead of looking at size so you
 know whether or not to modify the size when you hit a smaller device
 later in the list.
...

OK, sorry, I've got my thinko:

ITERATE_RDEV() is a loop.

That's what I missed.

cu
Adrian

-- 

   "Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
   "Only a promise," Lao Er said.
   Pearl S. Buck - Dragon Seed



Re: [ANNOUNCE] RAIF: Redundant Array of Independent Filesystems

2006-12-14 Thread Nikolai Joukov
 Nikolai Joukov wrote:
  We have designed a new stackable file system that we called RAIF:
  Redundant Array of Independent Filesystems.

 Great!

  We have performed some benchmarking on a 3GHz PC with 2GB of RAM and U320
  SCSI disks.  Compared to the Linux RAID driver, RAIF has overheads of
  about 20-25% under the Postmark v1.5 benchmark in case of striping and
  replication.  In case of RAID4 and RAID5-like configurations, RAIF
  performed about two times *better* than software RAID and even better than
  an Adaptec 2120S RAID5 controller.

 I am not surprised.  RAID 4/5/6 performance is highly sensitive to the
 underlying hw, and thus needs a fair amount of fine tuning.

Nevertheless, performance is not the biggest advantage of RAIF.  For
read-biased workloads RAID is always slightly faster than RAIF.  The
biggest advantages of RAIF are flexible configurations (e.g., can combine
NFS and local file systems), per-file-type storage policies, and the fact
that files are stored as files on the lower file systems (which is
convenient).

  This is because RAIF is located above
  file system caches and can cache parity as normal data when needed.  We
  have more performance details in a technical report, if anyone is
  interested.

 Definitely interested.  Can you give a link?

The main focus of the paper is on a general OS profiling method and not
on RAIF.  However, it has some details about the RAIF benchmarking with
Postmark in Chapter 9:

  http://www.fsl.cs.sunysb.edu/docs/joukov-phdthesis/thesis.pdf

Figures 9.7 and 9.8 also show profiles of the Linux RAID5 and RAIF5
operation under the same Postmark workload.

Nikolai.
-
Nikolai Joukov, Ph.D.
Filesystems and Storage Laboratory
Stony Brook University


Re: [ANNOUNCE] RAIF: Redundant Array of Independent Filesystems

2006-12-14 Thread Charles Manning
On Friday 15 December 2006 10:01, Nikolai Joukov wrote:
  Nikolai Joukov wrote:
   We have designed a new stackable file system that we called RAIF:
   Redundant Array of Independent Filesystems.
 
  Great!

Yes, definitely...

I see the major benefit being in the mobile, industrial and embedded systems
arena. Perhaps this might come as a surprise to people, but a very large and
ever-growing number (perhaps even most) of Linux devices don't use block
devices for storage. Instead they use flash file systems or nfs, neither of
which uses local block devices.

It looks like RAIF gives a way to provide redundancy etc. on these devices.


 
   We have performed some benchmarking on a 3GHz PC with 2GB of RAM and
   U320 SCSI disks.  Compared to the Linux RAID driver, RAIF has overheads
   of about 20-25% under the Postmark v1.5 benchmark in case of striping
   and replication.  In case of RAID4 and RAID5-like configurations, RAIF
   performed about two times *better* than software RAID and even better
   than an Adaptec 2120S RAID5 controller.
 
  I am not surprised.  RAID 4/5/6 performance is highly sensitive to the
  underlying hw, and thus needs a fair amount of fine tuning.

 Nevertheless, performance is not the biggest advantage of RAIF.  For
 read-biased workloads RAID is always slightly faster than RAIF.  The
 biggest advantages of RAIF are flexible configurations (e.g., can combine
 NFS and local file systems), per-file-type storage policies, and the fact
 that files are stored as files on the lower file systems (which is
 convenient).

   This is because RAIF is located above
   file system caches and can cache parity as normal data when needed.  We
   have more performance details in a technical report, if anyone is
   interested.
 
  Definitely interested.  Can you give a link?

 The main focus of the paper is on a general OS profiling method and not
 on RAIF.  However, it has some details about the RAIF benchmarking with
 Postmark in Chapter 9:

   http://www.fsl.cs.sunysb.edu/docs/joukov-phdthesis/thesis.pdf

 Figures 9.7 and 9.8 also show profiles of the Linux RAID5 and RAIF5
 operation under the same Postmark workload.

 Nikolai.
 -
 Nikolai Joukov, Ph.D.
 Filesystems and Storage Laboratory
 Stony Brook University


Re: [ANNOUNCE] RAIF: Redundant Array of Independent Filesystems

2006-12-14 Thread berk walker



Nikolai Joukov wrote:


  http://www.fsl.cs.sunysb.edu/docs/joukov-phdthesis/thesis.pdf

Figures 9.7 and 9.8 also show profiles of the Linux RAID5 and RAIF5
operation under the same Postmark workload.

Nikolai.
-
Nikolai Joukov, Ph.D.
Filesystems and Storage Laboratory
Stony Brook University

  


Well, Congratulations, Doctor!!  [Must be nice to be exiled to Stony 
Brook!!  Oh, well, not I]


For some reason, I can not connect to the above link, but I may not need 
to.  Does [should] it contain a link/pointer to the underlying source 
code?  This concept sounds very interesting, and I am sure that many of 
us would like to look closer, and maybe even get a taste.



Here's hoping that source exists, and that it is available for us.

Thanks
b-

  


Re: [ANNOUNCE] RAIF: Redundant Array of Independent Filesystems

2006-12-14 Thread Nikolai Joukov
  We started the project in April 2004.  Right now I am using it as my
  /home/kolya file system at home.  We believe that at this stage RAIF is
  mature enough for others to try it out.  The code is available at:
 
  ftp://ftp.fsl.cs.sunysb.edu/pub/raif/
 
  The code requires no kernel patches and compiles for a wide range of
  kernels as a module.  The latest kernel we used it for is 2.6.13 and we
  are in the process of porting it to 2.6.19.
 
  We will be happy to hear back from you.

 When removing a file from the underlying branch, the oops below happens.
 Wouldn't it be possible to just fail the branch instead of oopsing?

This is a known problem of all Linux stackable file systems.  Users are
not supposed to change the file systems below mounted stackable file
systems (but they can read them).  One of the ways to enforce it is to use
overlay mounts.  For example, mount the lower file systems at
/raif/b0 ... /raif/bN and then mount RAIF at /raif.  Stackable file
systems recently started getting into the kernel and we hope that there
will be a better solution for this problem in the future.  Having said
that, you are right: failing the branch would be the right thing to do.
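
To make the overlay-mount idea concrete, a sketch might look like the
following (the lower mounts are ordinary mounts; the RAIF mount line and its
branch option are purely illustrative -- the real option syntax is whatever
the RAIF documentation specifies):

  mount /dev/sdb1 /raif/b0
  mount /dev/sdc1 /raif/b1
  mount -t nfs fileserver:/export /raif/b2
  # hypothetical branch option -- check the RAIF README for the actual syntax
  mount -t raif -o dirs=/raif/b0:/raif/b1:/raif/b2 none /raif

Once RAIF is mounted on /raif, the branch paths are covered by the RAIF mount
itself, so users can no longer modify the lower file systems directly through
those paths.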

Nikolai.
-
Nikolai Joukov, Ph.D.
Filesystems and Storage Laboratory
Stony Brook University


Re: [ANNOUNCE] RAIF: Redundant Array of Independent Filesystems

2006-12-14 Thread Nikolai Joukov
 Well, Congratulations, Doctor!!  [Must be nice to be exiled to Stony
 Brook!!  Oh, well, not I]

Long Island is a very nice place with lots of wineries and perfect sandy
beaches - no need for envy :-)

 Here's hoping that source exists, and that it is available for us.

I guess you are subscribed to the linux-raid list only.  Unfortunately, I
didn't CC my post to that list and one of the replies was CC'd there
without the link.  The original post is available here:

  http://marc.theaimsgroup.com/?l=linux-fsdevelm=116603282106036w=2

And the link to the sources is:

  ftp://ftp.fsl.cs.sunysb.edu/pub/raif/

Nikolai.
-
Nikolai Joukov, Ph.D.
Filesystems and Storage Laboratory
Stony Brook University


[RFC: 2.6 patch] simplify drivers/md/md.c:update_size()

2006-12-14 Thread Adrian Bunk
While looking at commit 8ddeeae51f2f197b4fafcba117ee8191b49d843e,
I got the impression that this commit couldn't fix anything, since the 
size variable can't be changed before fit gets used.

Is there any big thinko, or is the patch below that slightly simplifies 
update_size() semantically equivalent to the current code?

Signed-off-by: Adrian Bunk [EMAIL PROTECTED]

---

 drivers/md/md.c |3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

--- linux-2.6.19-mm1/drivers/md/md.c.old        2006-12-15 00:57:05.000000000 +0100
+++ linux-2.6.19-mm1/drivers/md/md.c    2006-12-15 00:57:42.000000000 +0100
@@ -4039,57 +4039,56 @@
          * Generate a 128 bit UUID
          */
         get_random_bytes(mddev->uuid, 16);
 
         mddev->new_level = mddev->level;
         mddev->new_chunk = mddev->chunk_size;
         mddev->new_layout = mddev->layout;
         mddev->delta_disks = 0;
 
         mddev->dead = 0;
         return 0;
 }
 
 static int update_size(mddev_t *mddev, unsigned long size)
 {
         mdk_rdev_t * rdev;
         int rv;
         struct list_head *tmp;
-        int fit = (size == 0);
 
         if (mddev->pers->resize == NULL)
                 return -EINVAL;
         /* The size is the amount of each device that is used.
          * This can only make sense for arrays with redundancy.
          * linear and raid0 always use whatever space is available
          * We can only consider changing the size if no resync
          * or reconstruction is happening, and if the new size
          * is acceptable. It must fit before the sb_offset or,
          * if that is < data_offset, it must fit before the
          * size of each device.
          * If size is zero, we find the largest size that fits.
          */
         if (mddev->sync_thread)
                 return -EBUSY;
         ITERATE_RDEV(mddev,rdev,tmp) {
                 sector_t avail;
                 avail = rdev->size * 2;
 
-                if (fit && (size == 0 || size > avail/2))
+                if (size == 0)
                         size = avail/2;
                 if (avail < ((sector_t)size << 1))
                         return -ENOSPC;
         }
         rv = mddev->pers->resize(mddev, (sector_t)size *2);
         if (!rv) {
                 struct block_device *bdev;
 
                 bdev = bdget_disk(mddev->gendisk, 0);
                 if (bdev) {
                         mutex_lock(&bdev->bd_inode->i_mutex);
                         i_size_write(bdev->bd_inode, (loff_t)mddev->array_size << 10);
                         mutex_unlock(&bdev->bd_inode->i_mutex);
                         bdput(bdev);
                 }
         }
         return rv;
 }