Re: [PATCH] Use new sb type

2008-02-11 Thread Bill Davidsen

David Greaves wrote:

Jan Engelhardt wrote:
  

Feel free to argue that the manpage is clear on this - but as we know, not
everyone reads the manpages in depth...
  

That is indeed suboptimal (but I would not care since I know the
implications of an SB at the front)



Neil cares even less and probably doesn't even need mdadm - heck he probably
just echoes the raw superblock into place via dd...

http://xkcd.com/378/
  


I don't know why this makes me think of APL...

--
Bill Davidsen <[EMAIL PROTECTED]>
 "Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over..." Otto von Bismark 



-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Use new sb type

2008-02-10 Thread Bill Davidsen

Jan Engelhardt wrote:

On Feb 10 2008 12:27, David Greaves wrote:
  

I do not see anything wrong with specifying the SB location as a metadata
version. Why shouldn't location be an element of the raid type?
It's fine the way it is IMHO. (Just the default is not :)
  

There was quite a discussion about it.

For me the main argument is that for most people seeing superblock versions
(even the manpage terminology is version and subversion) will correlate
incremental versions with improvement.
They will therefore see v1.2 as 'the latest and best'.
Feel free to argue that the manpage is clear on this - but as we know, not
everyone reads the manpages in depth...



That is indeed suboptimal (but I would not care since I know the
implications of an SB at the front);

Naming it "[EMAIL PROTECTED]" / "[EMAIL PROTECTED]" / "[EMAIL PROTECTED]" or so 
would address this.

  
We have already discussed names and Neil has expressed satisfaction with 
my earlier suggestion. Since "@" is sort of a semi-special character to 
the shell, I suspect we are better off avoiding it.
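For illustration, the location is selected today through the metadata version at
create time; a minimal sketch (device names are placeholders):

   mdadm --create /dev/md0 --metadata=1.0 -l 1 -n 2 /dev/sda1 /dev/sdb1   # superblock at the end of the device
   mdadm --create /dev/md0 --metadata=1.2 -l 1 -n 2 /dev/sda1 /dev/sdb1   # superblock 4K from the start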


--
Bill Davidsen <[EMAIL PROTECTED]>
 "Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over..." Otto von Bismark 



-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Any inexpensive hardware recommendations for PCI interface cards?

2008-02-08 Thread Bill Davidsen

Steve Fairbairn wrote:

Can anyone see any issues with what I'm trying to do?
  


No.


Are there any known issues with IT8212 cards (They worked as straight
disks on linux fine)?
  


No idea, don't have that card.


Is anyone using an array with disks on PCI interface cards?
  


Works. I've mixed PATA, SATA, onboard, PCI, and firewire (lack of 
controllers is the mother of invention). As long as the device under the 
raid works, the raid should work.



Is there an issue with mixing motherboard interfaces and PCI card based
ones?
  


Not that I've found.


Does anyone recommend any inexpensive (probably SATA-II) PCI interface
cards?
  


Not I. Large drives have cured me of FrankenRAID setups recently, 
other than to build little arrays out of USB devices for backup.


--
Bill Davidsen <[EMAIL PROTECTED]>
 "Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over..." Otto von Bismark 



-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Deleting mdadm RAID arrays

2008-02-08 Thread Bill Davidsen

Marcin Krol wrote:

Thursday 07 February 2008 22:35:45 Bill Davidsen wrote:
  

As you may remember, I have configured udev to associate /dev/d_* devices with
serial numbers (to keep them from changing depending on boot module loading 
sequence). 
  


  
Why do you care? 



Because /dev/sd* devices get swapped randomly depending on boot module insertion
sequence, as I explained earlier.

  

So there's no functional problem, just cosmetic?
If you are using UUID for all the arrays and mounts  
does this buy you anything? 



This is exactly what is not clear to me: what is it that identifies a drive/partition as part of 
the array? The /dev/sd name? The UUID in the superblock? /dev/d_n?


If it's UUID I should be safe regardless of /dev/sd* designation? Yes or no?

  

Yes, absolutely.
And more to the point, the first time a  
drive fails and you replace it, will it cause you a problem? Require 
maintaining the serial to name data manually?
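For illustration, the array UUID stored in each member's superblock is what ties a 
partition to an array, regardless of its /dev/sd* name; a quick way to see it 
(device names are placeholders):

   mdadm --examine /dev/sdb1 | grep UUID   # UUID recorded in the member's superblock
   mdadm --detail /dev/md0 | grep UUID     # UUID of the assembled array; its members carry the same value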



That's not the problem. I just want my array to be intact.

  
I miss the benefit of forcing this instead of just building the 
information at boot time and dropping it in a file.



I would prefer that, too - if it worked. I was getting both arrays messed 
up randomly on boot. "messed up" in the sense of arrays being composed

of different /dev/sd devices.

  
Different devices? Or just different names for the same devices? I 
assume just the names change, and I still don't see why you care... 
subtle beyond my understanding.
  
And I made *damn* sure I zeroed all the superblocks before reassembling 
the arrays. Yet it still shows the old partitions on those arrays!
  
  
As I noted before, you said you had these on whole devices before, did 
you zero the superblocks on the whole devices or the partitions? From 
what I read, it was the partitions.



I tried it both ways actually (rebuilt arrays a few times, just udev didn't want
to associate WD-serialnumber-part1 as /dev/d_1p1 as it was told, it still 
claimed
it was /dev/d_1). 
  


I'm not talking about building the array, but zeroing the superblocks. 
Did you use the partition name, /dev/sdb1, when you ran mdadm with 
"zero-super" or did you zero the whole device, /dev/sdb, which is what 
you were using when you first built the array with whole devices. If you 
didn't zero the superblock for the whole device it may explain why a 
superblock is still found.
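A sketch of the difference being asked about (device names are placeholders, and 
--zero-superblock only works while the device is not part of a running array):

   mdadm --zero-superblock /dev/sdb     # superblock left over from an array built on whole disks
   mdadm --zero-superblock /dev/sdb1    # superblock from an array built on partitions
   mdadm --examine /dev/sdb /dev/sdb1   # afterwards, neither should report an md superblock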


--
Bill Davidsen <[EMAIL PROTECTED]>
 "Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over..." Otto von Bismark 




-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: using update-initramfs: how to get new mdadm.conf into the /boot? Or is it XFS?

2008-02-07 Thread Bill Davidsen

Bill Davidsen wrote:

Moshe Yudkowsky wrote:

maximilian attems wrote:


error 15 is a *grub* error.

grub is known for its dislike of xfs, so with this whole setup use 
ext3

rerun grub-install and you should be fine.


I should mention that something *did* change. When attempting to use 
XFS, grub would give me a note about "18 partitions used" (I forget 
the exact language). This was different than I'd remembered; when I 
switched back to using reiserfs, grub reports using 19 partitions.


So there's something definitely interesting about XFS and booting.

As an additional note, if I use the grub boot-time commands to edit 
root to read, e.g., root=/dev/sda2 or root=/dev/sdb2, I get the same 
Error 15 error message.


It may be that grub is complaining about grub and reiserfs, but I 
suspect that it has a true complaint about the file system and what's 
on the partitions.


I think you have two choices, convert /boot to ext2 and be sure you 
are going down the best-tested code path, or fight and debug, read 
code, learn grub source, play with the init parts of the boot 
sequence, and then convert /boot to ext2 anyway. No matter how 
"better" something else might be, /boot has nothing I use except at 
boot, I don't need features or performance, I just want it to work.


Unless you are so frustrated you have entered "I am going to make this 
*work* if it takes forever" mode, I would try the easy solution first. 
Just my take on it.


Or you can get lucky and someone will have seen this before and hand you 
a solution...  ;-)


--
Bill Davidsen <[EMAIL PROTECTED]>
 "Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over..." Otto von Bismark 



-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Deleting mdadm RAID arrays

2008-02-07 Thread Bill Davidsen

Marcin Krol wrote:

Thursday 07 February 2008 03:36:31 Neil Brown wrote:

  

   8     0  390711384 sda
   8     1  390708801 sda1
   8    16  390711384 sdb
   8    17  390708801 sdb1
   8    32  390711384 sdc
   8    33  390708801 sdc1
   8    48  390710327 sdd
   8    49  390708801 sdd1
   8    64  390711384 sde
   8    65  390708801 sde1
   8    80  390711384 sdf
   8    81  390708801 sdf1
   3    64   78150744 hdb
   3    65    1951866 hdb1
   3    66    7815622 hdb2
   3    67    4883760 hdb3
   3    68          1 hdb4
   3    69     979933 hdb5
   3    70     979933 hdb6
   3    71   61536951 hdb7
   9     1  781417472 md1
   9     0  781417472 md0
  

So all the expected partitions are known to the kernel - good.



It's not good really!!

I can't trust /dev/sd* devices - they get swapped randomly depending 
on sequence of module loading!! I have two drivers, ahci for onboard

SATA controllers and sata_sil for additional controller.

Sometimes the system boots ahci first and sata_sil later, sometimes 
in reverse sequence. 

Then, sda becomes sdc, sdb becomes sdd, etc. 


That is exactly the problem: I cannot rely on the kernel's information about which
physical drive is which logical drive!

  

Then
  mdadm /dev/md0 -f /dev/d_1

will fail d_1, abort the recovery, and release d_1.

Then
  mdadm --zero-superblock /dev/d_1

should work.



Thanks, though I managed to fail the drives, remove them, zero superblocks 
and reassemble the arrays anyway. 

The problem I have now is that mdadm seems to be of 'two minds' when it comes 
to where it gets the info on which disk is what part of the array. 


As you may remember, I have configured udev to associate /dev/d_* devices with
serial numbers (to keep them from changing depending on boot module loading 
sequence). 

  
Why do you care? If you are using UUID for all the arrays and mounts 
does this buy you anything? And more to the point, the first time a 
drive fails and you replace it, will it cause you a problem? Require 
maintaining the serial to name data manually?


I miss the benefit of forcing this instead of just building the 
information at boot time and dropping it in a file.
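A minimal sketch of the "drop it in a file" approach (the output path is only an example):

   mdadm --examine --scan > /etc/mdadm.conf.new   # ARRAY lines with UUIDs, built from whatever is present at boot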


Now, when I swap two (random) drives in order to test if it keeps device names 
associated with serial numbers I get the following effect:


1. mdadm -Q --detail /dev/md* gives correct results before *and* after the 
swapping:

% mdadm -Q --detail /dev/md0
/dev/md0:
[...]
Number   Major   Minor   RaidDevice State
   0       8        1        0      active sync   /dev/d_1
   1       8       17        1      active sync   /dev/d_2
   2       8       81        2      active sync   /dev/d_3

% mdadm -Q --detail /dev/md1
/dev/md1:
[...]
Number   Major   Minor   RaidDevice State
   0       8       49        0      active sync   /dev/d_4
   1       8       65        1      active sync   /dev/d_5
   2       8       33        2      active sync   /dev/d_6


2. However, cat /proc/mdstat shows a different layout of the arrays!

BEFORE the swap:

% cat mdstat-16_51
Personalities : [raid6] [raid5] [raid4]
md1 : active raid5 sdb1[2] sdf1[0] sda1[1]
  781417472 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]

md0 : active raid5 sde1[2] sdc1[0] sdd1[1]
  781417472 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]

unused devices: 


AFTER the swap:

% cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md1 : active(auto-read-only) raid5 sdd1[0] sdc1[2] sde1[1]
  781417472 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]

md0 : active(auto-read-only) raid5 sda1[0] sdf1[2] sdb1[1]
  781417472 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]

unused devices: 

I have no idea now if the array is functioning (it keeps the drives
according to /dev/d_* devices and superblock info is unimportant)
or if my arrays fell apart because of that swapping. 

And I made *damn* sure I zeroed all the superblocks before reassembling 
the arrays. Yet it still shows the old partitions on those arrays!
  
As I noted before, you said you had these on whole devices before, did 
you zero the superblocks on the whole devices or the partitions? From 
what I read, it was the partitions.
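One way to check (a sketch, device names as in the listings above) is to compare the 
array state and UUIDs rather than the /dev/sd* names:

   mdadm --detail /dev/md0 | grep -E 'State|UUID'
   mdadm --detail /dev/md1 | grep -E 'State|UUID'
   mdadm --examine /dev/sda1 | grep UUID    # repeat per member; each should carry its array's UUID

If the state is clean/active and the UUIDs line up, the swap only changed names.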


--
Bill Davidsen <[EMAIL PROTECTED]>
 "Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over..." Otto von Bismark 




-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: using update-initramfs: how to get new mdadm.conf into the /boot? Or is it XFS?

2008-02-07 Thread Bill Davidsen

Moshe Yudkowsky wrote:

maximilian attems wrote:


error 15 is a *grub* error.

grub is known for its dislike of xfs, so with this whole setup use ext3
rerun grub-install and you should be fine.


I should mention that something *did* change. When attempting to use 
XFS, grub would give me a note about "18 partitions used" (I forget 
the exact language). This was different than I'd remembered; when I 
switched back to using reiserfs, grub reports using 19 partitions.


So there's something definitely interesting about XFS and booting.

As an additional note, if I use the grub boot-time commands to edit 
root to read, e.g., root=/dev/sda2 or root=/dev/sdb2, I get the same 
Error 15 error message.


It may be that grub is complaining about grub and reiserfs, but I 
suspect that it has a true complaint about the file system and what's 
on the partitions.


I think you have two choices, convert /boot to ext2 and be sure you are 
going down the best-tested code path, or fight and debug, read code, 
learn grub source, play with the init parts of the boot sequence, and 
then convert /boot to ext2 anyway. No matter how "better" something else 
might be, /boot has nothing I use except at boot, I don't need features 
or performance, I just want it to work.


Unless you are so frustrated you have entered "I am going to make this 
*work* if it takes forever" mode, I would try the easy solution first. 
Just my take on it.


--
Bill Davidsen <[EMAIL PROTECTED]>
 "Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over..." Otto von Bismark 



-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: mdadm 2.6.4 : How i can check out current status of reshaping ?

2008-02-07 Thread Bill Davidsen

Andreas-Sokov wrote:

Hello, Neil.

.
  

Possible you have bad memory, or a bad CPU, or you are overclocking
the CPU, or it is getting hot, or something.



It seems to me that all my problems started after I updated
mdadm.
This server worked normally (though not as soft-raid) for more than 2-3 years.
For the last 6 months it has worked as soft-raid. All was normal; I even successfully
added a 4th hdd into the raid5 (when it started it had 3 hdds), and the reshaping
went fine.

Yesterday I ran memtest86 on this server and 10 passes completed WITHOUT any
errors.
The temperature of the server is about 25 degrees Celsius.
No overclocking, everything set to default.

  
What did you find when you loaded the module with gdb as Neil suggested? 
If the code in the module doesn't match the code in memory you have a 
hardware error. memtest86 is a useful tool, but it is not a definitive 
test because it doesn't use all CPUs and do i/o at the same time to load 
the memory bus.



Really, I do not know what to do, because we need to grow our storage and we
cannot.
Unfortunately, at this moment mdadm does not help us with this, but we very much
want it to.
  


I would pull out half my memory and retest. If it still fails I would 
swap to the other half of memory. If that didn't show a change I would 
check that the code in the module is what Neil showed in his last 
message (I assume you already have), and then reseat all of the cables, etc.


I agree with Neil:

But you clearly have a hardware error.



  

NeilBrown
    




  



--
Bill Davidsen <[EMAIL PROTECTED]>
 "Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over..." Otto von Bismark 



-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: recommendations for stripe/chunk size

2008-02-06 Thread Bill Davidsen

Wolfgang Denk wrote:

In message <[EMAIL PROTECTED]> you wrote:
  

I actually  think the kernel should operate with block sizes
like this and not wth 4 kiB blocks. It is the readahead and the elevator
algorithms that save us from randomly reading 4 kb a time.

  
  

Exactly, and nothing saves you from a read-alter-rewrite cycle if the write is a partial chunk.



Indeed kernel page size is an important factor in such optimizations.
But you have to keep in mind that this is mostly efficient for (very)
large strictly sequential I/O operations only -  actual  file  system
traffic may be *very* different.

  
That was actually what I meant by page size, that of the file system 
rather than the memory, ie. the "block size" typically used for writes. 
Or multiples thereof, obviously.

We implemented the option to select kernel page sizes of  4,  16,  64
and  256  kB for some PowerPC systems (440SPe, to be precise). A nice
graphics of the effect can be found here:

https://www.amcc.com/MyAMCC/retrieveDocument/PowerPC/440SPe/RAIDinLinux_PB_0529a.pdf

  
I started that online and pulled a download to print, very neat stuff. 
Thanks for the link.

Best regards,

Wolfgang Denk

  



--
Bill Davidsen <[EMAIL PROTECTED]>
 "Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over..." Otto von Bismark 



-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid10 on three discs - few questions.

2008-02-06 Thread Bill Davidsen

Jon Nelson wrote:
On Feb 6, 2008 12:43 PM, Bill Davidsen <[EMAIL PROTECTED]> wrote:


Can you create a raid10 with one drive "missing" and add it later? I
know, I should try it when I get a machine free... but I'm being
lazy today.


Yes you can. With 3 drives, however, performance will be awful (at 
least with layout far, 2 copies).



Well, the question didn't include being fast. ;-)

But if he really wants to create the array now and be able to add to it 
later, it might still be useful, particularly if "later" is a small time 
like "when my other drive ships." Thanks for the input, I thought that 
was possible, but reading code isn't the same as testing.

IMO raid10,f2 is a great balance of speed and redundancy.
It's faster than raid5 for reading, about the same for writing. It's 
even potentially faster than raid0 for reading, actually.
With 3 disks one should be able to get 3.0 times the speed of one 
disk, or slightly more, and each stripe involves only *one* disk 
instead of 2 as it does with raid5.


I have used raid10 swap on 3 or more drives fairly often. Other than the 
Fedora rescue CD not using the space until I start it manually, I find 
it really fast, and helpful for huge image work.


--
Bill Davidsen <[EMAIL PROTECTED]>
 "Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over..." Otto von Bismark 



-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: recommendations for stripe/chunk size

2008-02-06 Thread Bill Davidsen

Keld Jørn Simonsen wrote:

Hi

I am looking at revising our howto. I see a number of places where a
chunk size of 32 kiB is recommended, and even recommendations on
maybe using sizes of 4 kiB. 

  
Depending on the raid level, a write smaller than the chunk size causes 
the chunk to be read, altered, and rewritten, vs. just written if the 
write is a multiple of chunk size. Many filesystems by default use a 4k 
page size and writes. I believe this is the reasoning behind the 
suggestion of small chunk sizes. Sequential vs. random and raid level 
are important here, there's no one size to work best in all cases.
My own take on that is that this really hurts performance. 
Normal disks have a rotation speed of between 5400 (laptop)

7200 (ide/sata) and 10000 (SCSI) rounds per minute, giving an average
spinning time for one round of 6 to 12 ms, and average latency of half
this, that is 3 to 6 ms. Then you need to add head movement which
is something like 2 to 20 ms - in total average seek time 5 to 26 ms,
averaging around 13-17 ms. 

  
Having a write that is not some multiple of chunk size would seem to require a 
read-alter-wait_for_disk_rotation-write, and for large sustained 
sequential i/o using multiple drives helps transfer. For small random 
i/o small chunks are good; I find little benefit to chunks over 256 or 
maybe 1024k.
in about 15 ms you can read on current SATA-II (300 MB/s) or ATA/133 
something like between 600 to 1200 kB, actual transfer rates of

80 MB/s on SATA-II and 40 MB/s on ATA/133. So to get some bang for the buck,
and transfer some data you should have something like 256/512 kiB
chunks. With a transfer rate of 50 MB/s and chunk sizes of 256 kiB
giving about a time of 20 ms per transaction
you should be able with random reads to transfer 12 MB/s  - my
actual figures is about 30 MB/s which is possibly because of the
elevator effect of the file system driver. With a size of 4 kb per chunk 
you should have a time of 15 ms per transaction, or 66 transactions per 
second, or a transfer rate of 250 kb/s. So 256 kb vs 4 kb speeds up
the transfer by a factor of 50. 

  
If you actually see anything like this your write caching and readahead 
aren't doing what they should!



I actually  think the kernel should operate with block sizes
like this and not wth 4 kiB blocks. It is the readahead and the elevator
algorithms that save us from randomly reading 4 kb a time.

  

Exactly, and nothing saves you from a read-alter-rewrite cycle if the write is a partial chunk.

I also see that there are some memory constraints on this.
Having maybe 1000 processes reading, as for my mirror service,
256 kib buffers would be acceptable, occupying 256 MB RAM.
That is reasonable, and I could even tolerate 512 MB ram used.
But going to 1 MiB buffers would be overdoing it for my configuration.

What would be the recommended chunk size for todays equipment?

  

I think usage is more important than hardware. My opinion only.
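If someone wants to measure rather than guess, a rough sketch (device names are 
placeholders, and the array must not yet hold data you care about):

   mdadm --create /dev/md3 --chunk=256 -l 5 -n 4 /dev/sd[abcd]5
   dd if=/dev/md3 of=/dev/null bs=1M count=2048 iflag=direct   # crude sequential read check
   # repeat with --chunk=64, --chunk=1024 and your real workload before settling on a value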


Best regards
Keld



--
Bill Davidsen <[EMAIL PROTECTED]>
 "Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over..." Otto von Bismark 




-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Deleting mdadm RAID arrays

2008-02-06 Thread Bill Davidsen

Neil Brown wrote:

On Tuesday February 5, [EMAIL PROTECTED] wrote:
  

% mdadm --zero-superblock /dev/sdb1
mdadm: Couldn't open /dev/sdb1 for write - not zeroing



That's weird.
Why can't it open it?

  
I suspect that (a) he's not root and has read-only access to the device 
(I have group read for certain groups, too). And since he had the arrays 
on raw devices, shouldn't he zero the superblocks using the whole device 
as well? Depending on what type of superblock it might not be found 
otherwise.


It sure can't hurt to zero all the superblocks of the whole devices and 
then check the partitions to see if they are present, then create the 
array again with --force and be really sure the superblock is present 
and sane.



Maybe you aren't running as root (The '%' prompt is suspicious).
Maybe the kernel has  been told to forget about the partitions of
/dev/sdb.
mdadm will sometimes tell it to do that, but only if you try to
assemble arrays out of whole components.

If that is the problem, then
   blockdev --rereadpt /dev/sdb

will fix it.

NeilBrown


--
Bill Davidsen <[EMAIL PROTECTED]>
 "Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over..." Otto von Bismark 



-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid1 or raid10 for /boot

2008-02-06 Thread Bill Davidsen

Keld Jørn Simonsen wrote:

I understand that lilo and grub only can boot partitions that look like
a normal single-drive partition. And then I understand that a plain
raid10 has a layout which is equivalent to raid1. Can such a raid10
partition be used with grub or lilo for booting?
And would there be any advantages in this, for example better disk
utilization in the raid10 driver compared with raid1?
  


I don't know about you, but my /boot sees zero use between boots; 
efficiency and performance improvements there strike me as a distinction without 
a difference, while adding complexity without benefit is always a bad idea.


I suggest that you avoid having a "learning experience" and stick with 
raid1.
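For example, a /boot mirror that any bootloader can read (a sketch; device names are placeholders):

   mdadm --create /dev/md0 --metadata=0.90 -l 1 -n 2 /dev/sda1 /dev/sdb1

With 0.90 (or 1.0) metadata the superblock sits at the end, so grub/lilo see what 
looks like a plain filesystem on each member.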


--
Bill Davidsen <[EMAIL PROTECTED]>
 "Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over..." Otto von Bismark 




-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid10 on three discs - few questions.

2008-02-06 Thread Bill Davidsen

Neil Brown wrote:

On Sunday February 3, [EMAIL PROTECTED] wrote:
  

Hi,

Maybe I'll buy three HDDs to put a raid10 on them. And get the total
capacity of 1.5 of a disc. 'man 4 md' indicates that this is possible
and should work.

I'm wondering - how a single disc failure is handled in such configuration?

1. does the array continue to work in a degraded state?



Yes.

  

2. after the failure I can disconnect faulty drive, connect a new one,
   start the computer, add disc to array and it will sync automatically?




Yes.

  

Question seems a bit obvious, but the configuration is, at least for
me, a bit unusual. This is why I'm asking. Anybody here tested such
configuration, has some experience?


3. Another thing - would raid10,far=2 work when three drives are used?
   Would it increase the read performance?



Yes.

  

4. Would it be possible to later '--grow' the array to use 4 discs in
   raid10 ? Even with far=2 ?




No.

Well if by "later" you mean "in five years", then maybe.  But the
code doesn't currently exist.
  


That's a reason to avoid raid10 for certain applications, then, and go 
with a more manual 1+0 or similar.


Can you create a raid10 with one drive "missing" and add it later? I 
know, I should try it when I get a machine free... but I'm being lazy today.
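Untested here, but the sketch would look like this (device names are placeholders):

   mdadm --create /dev/md0 -l 10 -n 3 -p f2 /dev/sda1 /dev/sdb1 missing
   mdadm /dev/md0 --add /dev/sdc1    # later, when the third drive arrives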


--
Bill Davidsen <[EMAIL PROTECTED]>
 "Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over..." Otto von Bismark 



-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid1 and raid 10 always writes all data to all disks?

2008-02-04 Thread Bill Davidsen

Keld Jørn Simonsen wrote:

On Sun, Feb 03, 2008 at 10:56:01AM -0500, Bill Davidsen wrote:
  

Keld Jørn Simonsen wrote:


I found a sentence in the HOWTO:

"raid1 and raid 10 always writes all data to all disks"

I think this is wrong for raid10.

eg

a raid10,f2 of 4 disks only writes to two of the disks -
not all 4 disks. Is that true?
 
  
I suspect that really should have read "all mirror copies," in the 
raid10 case.



OK, I changed the text to:

raid1 always writes all data to all disks.
  


Just to be really pedantic, you might say "devices" instead of disks, 
since many or most arrays are on partitions. Otherwise I like this, it's 
much clearer.

raid10 always writes all data to the number of copies that the raid holds.
For example on a raid10,f2 or raid10,o2 of 6 disks, the data will only
be written 2 times.

Best regards
Keld

  



--
Bill Davidsen <[EMAIL PROTECTED]>
 "Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over..." Otto von Bismark 




-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: draft howto on making raids for surviving a disk crash

2008-02-04 Thread Bill Davidsen

Keld Jørn Simonsen wrote:

On Sun, Feb 03, 2008 at 10:53:51AM -0500, Bill Davidsen wrote:
  

Keld Jørn Simonsen wrote:


This is intended for the linux raid howto. Please give comments.
It is not fully ready /keld

Howto prepare for a failing disk

6. /etc/mdadm.conf

Something here on /etc/mdadm.conf. What would be safe, allowing
a system to boot even if a disk has crashed?
 
  

Recommend "DEVICE partitions" be used



Thanks Bill for your suggestions, which I have incorporated in the text.

However, I do not understand what to do with the remark above.
Please explain.
  


The mdadm.conf file should contain the "DEVICE partitions" statement to 
identify all possible partitions regardless of name changes. See "man 
mdadm.conf" for more discussion. This protects against udev doing 
something innovative in device naming.
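A minimal example of what that part of mdadm.conf might look like (the UUID is a placeholder):

   DEVICE partitions
   ARRAY /dev/md0 UUID=xxxxxxxx:xxxxxxxx:xxxxxxxx:xxxxxxxx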


--
Bill Davidsen <[EMAIL PROTECTED]>
 "Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over..." Otto von Bismark 




-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid1 and raid 10 always writes all data to all disks?

2008-02-03 Thread Bill Davidsen

Keld Jørn Simonsen wrote:

 I found a sentence in the HOWTO:

"raid1 and raid 10 always writes all data to all disks"

I think this is wrong for raid10.

eg

a raid10,f2 of 4 disks only writes to two of the disks -
not all 4 disks. Is that true?
  


I suspect that really should have read "all mirror copies," in the 
raid10 case.


--
Bill Davidsen <[EMAIL PROTECTED]>
 "Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over..." Otto von Bismark 




-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: draft howto on making raids for surviving a disk crash

2008-02-03 Thread Bill Davidsen
s for the system, or vital jobs on the system. You can prevent 
the failing of the processes by having the swap partitions on a raid. The swap area

needed is normally relatively small compared to the overall disk space 
available,
so we recommend the faster raid types over the more space-economical ones. The 
raid10,f2
type seems to be the fastest here, other relevant raid types could be raid10,o2 
or raid1.

Given that you have created a raid array, you can just make the swap partition 
directly
on it:
 
   mdadm --create /dev/md2 --chunk=256 -R -l 10 -n 2 -p f2 /dev/sda3 /dev/sdb3

   sfdisk -c /dev/md 2 82
   mkswap /dev/md2

  
WARNING: some "recovery" CDs will not use raid10 as swap. This may be a 
problem on small memory systems, and the swap may need to be started and 
enabled manually.
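From a rescue environment that manual start would be roughly (device names as in 
the example above):

   mdadm --assemble /dev/md2 /dev/sda3 /dev/sdb3
   swapon /dev/md2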

Maybe something on /var and /tmp could go here.

5. The rest of the file systems.

Other file systems can also be protected against one failing disk.
Which technique to recommend depends on your purpose with the
disk space. You may mix the different raid types if you have different types
of use on the same server, eg a data base and servicing of large files
from the same server. (This is one of the advantages of software raid
over hardware raid: you may have different types of raids on
a disk with a software raid, where a hardware raid only may take one
type for the whole disk.)

Is disk capacity the main priority, and you have more than 2 drives,
then raid5 is recommended. Raid5 only uses 1 drive for securing the
data, while raid1 and raid10 use at least half the capacity.
For example with 4 drives, raid5 provides 75 % of the total disk
space as usable, while raid1 and raid10 at most (dependent on the number
of copies) give a 50 % usability of the disk space. This becomes even better
for raid5 with more disks, with 10 disks you only use 10 % for security.

Is speed your main priority, then raid10,f2, raid10,o2 or raid1 would give you
most speed during normal operation. This even works if you only have 2 drives.

Is speed with a failed disk a concern, then raid10,o2 could be the choice, as
raid10,f2 is somewhat slower in operation, when a disk has failed.


Examples:

   mdadm --create /dev/md3 --chunk=256 -R -l 10 -n 2 -p f2 /dev/sda5 /dev/sdb5
   mdadm --create /dev/md3 --chunk=256 -R -l 10 -n 2 -p o2 /dev/sd[ab]5
   mdadm --create /dev/md3 --chunk=256 -R -l  5 -n 4   /dev/sd[abcd]5

6. /etc/mdadm.conf

Something here on /etc/mdadm.conf. What would be safe, allowing
a system to boot even if a disk has crashed?
  


Recommend "DEVICE partitions" be used

7. Recommendation for the setup of larger servers.

Given a larger server setup, with more disks, it is possible to
survive more than one disk crash. The raid6 array type can be used
to be able to survive 2 disk crashes, at the expense of the space of 2 disks.
The /boot, root and swap partitions can be set up with more disks, eg a 
/boot partition made up from a raid1 of 3 disks, and root and swap partitions 
made up from raid10,f3 arrays. Given that raid6 cannot survive more than the chashes
  

TYPO: s/chashes/crashes/ and "failure" would be better

of 2 disks, the system disks need not be prepared for more than 2 craches
  

TYPO: s/craches/crashes/   or "disk failures"

either, and you can use the rest of the disk IO capacity to speed up the system.
  


--
Bill Davidsen <[EMAIL PROTECTED]>
 "Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over..." Otto von Bismark 




-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RAID 1 and grub

2008-02-03 Thread Bill Davidsen

Richard Scobie wrote:

A followup for the archives:

I found this document very useful:

http://lists.us.dell.com/pipermail/linux-poweredge/2003-July/008898.html

After modifying my grub.conf to refer to (hd0,0), reinstalling grub on 
hdc with:


grub> device (hd0) /dev/hdc

grub> root (hd0,0)

grub> setup (hd0)

and rebooting with the bios set to boot off hdc, everything burst back 
into life.


I shall now be checking all my Fedora/Centos RAID1 installs for grub 
installed on both drives.


Have you actually tested this by removing the first hd and booting? 
Depending on the BIOS I believe that the fallback drive will be called 
hdc by the BIOS but will be hdd in the system. That was with RHEL3, but 
worth testing.


--
Bill Davidsen <[EMAIL PROTECTED]>
 "Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over..." Otto von Bismark 



-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid problem: after every reboot /dev/sdb1 is removed?

2008-02-03 Thread Bill Davidsen

Berni wrote:

Hi

I created the raid arrays during install with the text-installer-cd. 
So first the raid array was created and then the system was installed on it. 



I don't have an extra /boot partition; it's on the root (/) partition, and the root 
is md0 in the raid. Every partition for ubuntu (also swap) is in the raid.

What exactly does rerunning grub mean? (To put both hdds into the mbr?) 
I can't find mkinitrd in ubuntu. I did an update-initramfs but it didn't help.


  
I think you need some ubuntu guru to help, I always create a small raid1 
for /boot and then use other arrays for whatever the system is doing. I 
don't know if ubuntu uses mkinitrd or what, but it clearly didn't get it 
right without a little help from you.
thanks 
  


How about some input, ubuntu users (or Debian, isn't ubuntu really Debian?).
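For what it's worth, on Debian-style systems the usual equivalents would be 
something like this (a sketch, not Ubuntu-specific advice):

   update-initramfs -u       # rebuild the initramfs so the raid modules and mdadm.conf are included
   grub-install /dev/sda     # put the boot loader on both disks,
   grub-install /dev/sdb     # so either one can boot alone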


On Sat, 02 Feb 2008 14:47:50 -0500
Bill Davidsen <[EMAIL PROTECTED]> wrote:

  

Berni wrote:


Hi!

I have the following problem with my softraid (raid 1). I'm running
Ubuntu 7.10 64bit with kernel 2.6.22-14-generic.

After every reboot my first boot partition in md0 is not in sync. One
of the disks (the sdb1) is removed. 
After a resync every partition is in sync. But after a reboot the
state is "removed". 

The disks are new and both seagate 250gb with exactly the same partition table. 

  
  
Did you create the raid arrays and then install on them? Or add them 
after the fact? I have seen this type of problem when the initrd doesn't 
start the array before pivotroot, usually because the raid capabilities 
aren't in the boot image. In that case rerunning grub and mkinitrd may help.


I run raid on Redhat distributions, and some Slackware, so I can't speak 
for Ubuntu from great experience, but that's what it sounds like. When 
you boot, is the /boot mounted on a degraded array or on the raw partition?




--
Bill Davidsen <[EMAIL PROTECTED]>
 "Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over..." Otto von Bismark 



-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: In this partition scheme, grub does not find md information?

2008-02-02 Thread Bill Davidsen

Bill Davidsen wrote:

Moshe Yudkowsky wrote:

Michael Tokarev wrote:



To return to that peformance question, since I have to create at 
least 2 md drives using different partitions, I wonder if it's 
smarter to create multiple md drives for better performance.


/dev/sd[abcd]1 -- RAID1, the /boot, /dev, /bin/, /sbin

/dev/sd[abcd]2 -- RAID5, most of the rest of the file system

/dev/sd[abcd]3 -- RAID10 o2, a drive that does a lot of downloading 
(writes)


I think the speed of downloads is so far below the capacity of an 
array that you won't notice, and hopefully you will use things you 
download more than once, so you still get more reads than writes.



For typical filesystem usage, raid5 works good for both reads
and (cached, delayed) writes.  It's workloads like databases
where raid5 performs badly.


Ah, very interesting. Is this true even for (dare I say it?) 
bittorrent downloads?


What do you have for bandwidth? Probably not more than a T3 (145Mbit) 
which will max out at ~15MB/s, far below the write performance of a 
single drive, much less an array (even raid5).
It has been pointed out that I have a double typo there, I meant OC3 not 
T3, and 155Mbit.  Still, the most someone is likely to have, even in a 
large company.  Still not a large chance of being faster than the disk 
in raid-10 mode.


--
Bill Davidsen <[EMAIL PROTECTED]>
 "Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over..." Otto von Bismark 



-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid problem: after every reboot /dev/sdb1 is removed?

2008-02-02 Thread Bill Davidsen

Berni wrote:

Hi!

I have the following problem with my softraid (raid 1). I'm running
Ubuntu 7.10 64bit with kernel 2.6.22-14-generic.

After every reboot my first boot partition in md0 is not in sync. One
of the disks (the sdb1) is removed. 
After a resync every partition is in sync. But after a reboot the
state is "removed". 

The disks are new and both seagate 250gb with exactly the same partition table. 

  
Did you create the raid arrays and then install on them? Or add them 
after the fact? I have seen this type of problem when the initrd doesn't 
start the array before pivotroot, usually because the raid capabilities 
aren't in the boot image. In that case rerunning grub and mkinitrd may help.


I run raid on Redhat distributions, and some Slackware, so I can't speak 
for Ubuntu from great experience, but that's what it sounds like. When 
you boot, is the /boot mounted on a degraded array or on the raw partition?
Here some config files: 

#cat /proc/mdstat 
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5]
[raid4] [raid10] 
md2 : active raid1 sda6[0] sdb6[1]

  117185984 blocks [2/2] [UU]
  
md1 : active raid1 sda5[0] sdb5[1]

  1951744 blocks [2/2] [UU]
  
md0 : active raid1 sda1[0]

  19534912 blocks [2/1] [U_]   <<<<<<<< this is the problem: looks like U_ after reboot
  
unused devices: 


#fdisk /dev/sda
  Device Boot  Start End  Blocks   Id  System
/dev/sda1               1        2432    19535008+  fd  Linux raid autodetect
/dev/sda2            2433       17264   119138040    5  Extended
/dev/sda3   *       17265       20451    25599577+   7  HPFS/NTFS
/dev/sda4           20452       30400    79915342+   7  HPFS/NTFS
/dev/sda5            2433        2675     1951866   fd  Linux raid autodetect
/dev/sda6            2676       17264   117186111   fd  Linux raid autodetect

#fdisk /dev/sdb
 Device Boot  Start End  Blocks   Id  System
/dev/sdb1               1        2432    19535008+  fd  Linux raid autodetect
/dev/sdb2            2433       17264   119138040    5  Extended
/dev/sdb3           17265       30400   105514920    7  HPFS/NTFS
/dev/sdb5            2433        2675     1951866   fd  Linux raid autodetect
/dev/sdb6            2676       17264   117186111   fd  Linux raid autodetect

# mount
/dev/md0 on / type reiserfs (rw,notail)
proc on /proc type proc (rw,noexec,nosuid,nodev)
/sys on /sys type sysfs (rw,noexec,nosuid,nodev)
varrun on /var/run type tmpfs (rw,noexec,nosuid,nodev,mode=0755)
varlock on /var/lock type tmpfs (rw,noexec,nosuid,nodev,mode=1777)
udev on /dev type tmpfs (rw,mode=0755)
devshm on /dev/shm type tmpfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
lrm on /lib/modules/2.6.22-14-generic/volatile type tmpfs (rw)
/dev/md2 on /home type reiserfs (rw)
securityfs on /sys/kernel/security type securityfs (rw)

Could anyone help me to solve this problem? 
thanks

greets
Berni
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

  



--
Bill Davidsen <[EMAIL PROTECTED]>
 "Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over..." Otto von Bismark 



-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: which raid level gives maximum overall speed? (raid-10,f2 vs. raid-0)

2008-01-31 Thread Bill Davidsen

Janek Kozicki wrote:

Hello,

Yes, I know that some levels give faster reading and slower writing, etc.

I want to talk here about a typical workstation usage: compiling
stuff (like kernel), editing openoffice docs, browsing web, reading
email (email: I have a maildir format, and in boost mailing list
directory I have 14000 files (posts), opening this directory takes
circa 10 seconds in sylpheed). Moreover, opening .pdf files, more
compiling of C++ stuff, etc...

  
In other words, like most systems, more reads than writes. And while 
writes can be (and usually are) cached and buffered, when you need the 
next bit of data the program waits for it, which is far more visible to the 
user. If this suggests tuning for acceptable write and max read speed, and 
setting the readahead higher than default, then you have reached the 
same conclusion as I did.
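For example, readahead on the md device can be checked and raised (values are in 
512-byte sectors; 4096 is just an illustration):

   blockdev --getra /dev/md0
   blockdev --setra 4096 /dev/md0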



I have a remote backup system configured (with rsnapshot), which does
backups two times a day. So I'm not afraid to lose all my data due to
disc failure. I want absolute speed.

Currently I have Raid-0, because I was thinking that this one is
fastest. But I also don't need twice the capacity. I could use Raid-1
as well, if it was faster.

Due to recent discussion about Raid-10,f2 I'm getting worried that
Raid-0 is not the fastest solution, but instead a Raid-10,f2 is
faster.

So how really is it, which level gives maximum overall speed?


I would like to make a benchmark, but currently, technically, I'm not
able to. I'll be able to do it next month, and then - as a result of
this discussion - I will switch to other level and post here
benchmark results.

How does overall performance change with the number of available drives?

Perhaps Raid-0 is best for 2 drives, while Raid-10 is best for 3, 4
and more drives?


best regards
  



--
Bill Davidsen <[EMAIL PROTECTED]>
 "Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over..." Otto von Bismark 



-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: In this partition scheme, grub does not find md information?

2008-01-31 Thread Bill Davidsen

Moshe Yudkowsky wrote:

Michael Tokarev wrote:


You only write to root (including /bin and /lib and so on) during
software (re)install and during some configuration work (writing
/etc/password and the like).  First is very infrequent, and both
needs only a few writes, -- so write speed isn't important.


Thanks, but I didn't make myself clear. The performance problem I'm 
concerned about was having different md drives accessing different 
partitions.


For example, I can partition the drives as follows:

/dev/sd[abcd]1 -- RAID1, /boot

/dev/sd[abcd]2 -- RAID5, the rest of the file system

I originally had asked, way back when, if having different md drives 
on different partitions of the *same* disk was a problem for 
perfomance --  or if, for some reason (e.g., threading) it was 
actually smarter to do it that way. The answer I received was from 
Iustin Pop, who said :


Iustin Pop wrote:

md code works better if it's only one array per physical drive,
because it keeps statistics per array (like last accessed sector,
etc.) and if you combine two arrays on the same drive these
statistics are not exactly true anymore


So if I use /boot on its own drive and it's only accessed at startup, 
the /boot will only be accessed that one time and afterwards won't 
cause problems for the drive statistics. However, if I use put /boot, 
/bin, and /sbin on this RAID1 drive, it will always be accessed and it 
might create a performance issue.




I always put /boot on a separate partition, just to run raid1 which I 
don't use elsewhere.


To return to that peformance question, since I have to create at least 
2 md drives using different partitions, I wonder if it's smarter to 
create multiple md drives for better performance.


/dev/sd[abcd]1 -- RAID1, the /boot, /dev, /bin/, /sbin

/dev/sd[abcd]2 -- RAID5, most of the rest of the file system

/dev/sd[abcd]3 -- RAID10 o2, a drive that does a lot of downloading 
(writes)


I think the speed of downloads is so far below the capacity of an array 
that you won't notice, and hopefully you will use things you download 
more than once, so you still get more reads than writes.



For typical filesystem usage, raid5 works good for both reads
and (cached, delayed) writes.  It's workloads like databases
where raid5 performs badly.


Ah, very interesting. Is this true even for (dare I say it?) 
bittorrent downloads?


What do you have for bandwidth? Probably not more than a T3 (145Mbit) 
which will max out at ~15MB/s, far below the write performance of a 
single drive, much less an array (even raid5).



What you do care about is your data integrity.  It's not really
interesting to reinstall a system or lose your data in case if
something goes wrong, and it's best to have recovery tools as
easily available as possible.  Plus, amount of space you need.


Sure, I understand. And backing up in case someone steals your server. 
But did you have something specific in mind when you wrote this? Don't 
all these configurations (RAID5 vs. RAID10) have equal recovery tools?


Or were you referring to the file system? Reiserfs and XFS both seem 
to have decent recovery tools. LVM is a little tempting because it 
allows for snapshots, but on the other hand I wonder if I'd find it 
useful.


If you are worried about performance, perhaps some reading of comments 
on LVM would be in order. I personally view it as a trade-off of 
performance for flexibility.



Also, placing /dev on a tmpfs helps a lot to minimize the number of writes
necessary for the root fs.

Another interesting idea. I'm not familiar with using tmpfs (no need,
until now); but I wonder how you create the devices you need when 
you're

doing a rescue.


When you start udev, your /dev will be on tmpfs.


Sure, that's what mount shows me right now -- using a standard Debian 
install. What did you suggest I change?






--
Bill Davidsen <[EMAIL PROTECTED]>
 "Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over..." Otto von Bismark 



-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: In this partition scheme, grub does not find md information?

2008-01-31 Thread Bill Davidsen

Moshe Yudkowsky wrote:

Keld Jørn Simonsen wrote:

On Tue, Jan 29, 2008 at 06:32:54PM -0600, Moshe Yudkowsky wrote:

Hmm, why would you put swap on a raid10? I would in a production
environment always put it on separate swap partitions, possibly a 
number,

given that a number of drives are available.
In a production server, however, I'd use swap on RAID in order to 
prevent server downtime if a disk fails -- a suddenly bad swap can 
easily (will absolutely?) cause the server to crash (even though you 
can boot the server up again afterwards on the surviving swap 
partitions).


I see. Which file system type would be good for this?
I normally use XFS but maybe another FS is better, given that swap is used
very randomly (read/write).

Will a bad swap crash the system?


Well, Peter says it will, and that's good enough for me. :-)

I've done unplanned research into this, it will crash the system, and if 
you're unlucky some part of what's needed for a graceful crash will be 
swapped out :-(
As for which file system: I would use fdisk to partition the md disk 
and then use mkswap on the partition to make it into a swap partition. 
It's a naive approach but I suspect it's almost certainly the correct 
one.


I generally dedicate a partition of each drive to swap, but the type is 
"raid array." Then I create a raid10 on that set of partitions and 
mkswap on the md device. While raid10 is fast and reliable, raid[56] 
have similar reliability and a higher usable space from any given 
configuration.


--
Bill Davidsen <[EMAIL PROTECTED]>
 "Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over..." Otto von Bismark 




-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: In this partition scheme, grub does not find md information?

2008-01-30 Thread Bill Davidsen

Peter Rabbitson wrote:

Keld Jørn Simonsen wrote:

On Tue, Jan 29, 2008 at 06:44:20PM -0500, Bill Davidsen wrote:

Depending on near/far choices, raid10 should be faster than raid5, 
with far read should be quite a bit faster. You can't boot off 
raid10, and if you put your swap on it many recovery CDs won't use 
it. But for general use and swap on a normally booted system it is 
quite fast.


Hmm, why would you put swap on a raid10? I would in a production
environment always put it on separate swap partitions, possibly a 
number,

given that a number of drives are available.



Because you want some redundancy for the swap as well. A swap 
partition/file becoming inaccessible is equivalent to yanking out a 
stick of memory out of your motherboard.


I can't say it better. Losing a swap area will make the system fail in 
one way or the other, in my systems typically expressed as a crash of 
varying severity. I use raid10 because it is the fastest reliable level 
I've found.


--
Bill Davidsen <[EMAIL PROTECTED]>
 "Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over..." Otto von Bismark 




-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: In this partition scheme, grub does not find md information?

2008-01-30 Thread Bill Davidsen

Moshe Yudkowsky wrote:

Bill Davidsen wrote:

According to man md(4), the o2 is likely to offer the best 
combination of read and write performance. Why would you consider f2 
instead?


f2 is faster for read, most systems spend more time reading than 
writing.


According to md(4), offset "should give similar read characteristics 
to 'far' if a suitably large chunk size is used, but without as much 
seeking for writes."


Is the man page not correct, conditionally true, or simply not 
understood by me (most likely case)?


I wonder what "suitably large" is...

My personal experience is that as chunk gets larger random write gets 
slower, sequential gets faster. I don't have numbers any more, but 
20-30% is sort of the limit of what I saw for any chunk size I consider 
reasonable. f2 is faster for sequential reading, tune your system to 
annoy you least. ;-)


--
Bill Davidsen <[EMAIL PROTECTED]>
 "Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over..." Otto von Bismark 



-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: In this partition scheme, grub does not find md information?

2008-01-29 Thread Bill Davidsen

Moshe Yudkowsky wrote:

Keld Jørn Simonsen wrote:

Based on your reports of better performance on RAID10 -- which are 
more significant than I'd expected -- I'll just go with RAID10. The 
only question now is if LVM is worth the performance hit or not.



I would be interested if you would experiment with this wrt boot time,
for example the difference between /root on a raid5, raid10,f2 and 
raid10,o2.


According to man md(4), the o2 is likely to offer the best combination 
of read and write performance. Why would you consider f2 instead?



f2 is faster for read, most systems spend more time reading than writing.

I'm unlikely to do any testing beyond running bonnie++ or something 
similar once it's installed.






--
Bill Davidsen <[EMAIL PROTECTED]>
 "Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over..." Otto von Bismark 




-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: In this partition scheme, grub does not find md information?

2008-01-29 Thread Bill Davidsen

Moshe Yudkowsky wrote:
I'd like to thank everyone who wrote in with comments and 
explanations. And in particular it's nice to see that I'm not the only 
one who's confused.


I'm going to convert back to the RAID 1 setup I had before for /boot, 
2 hot and 2 spare across four drives. No, that's wrong: 4 hot makes 
the most sense.


And given that RAID 10 doesn't seem to confer (for me, as far as I can 
tell) advantages in speed or reliability -- or the ability to mount 
just one surviving disk of a mirrored pair -- over RAID 5, I think 
I'll convert back to RAID 5, put in a hot spare, and do regular 
backups (as always). Oh, and use reiserfs with data=journal.


Depending on near/far choices, raid10 should be faster than raid5; with the 
far layout, read should be quite a bit faster. You can't boot off raid10, and if 
you put your swap on it many recovery CDs won't use it. But for general 
use, and for swap on a normally booted system, it is quite fast.
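
As a sketch (untested, device names invented), the usual split is a small 
RAID1 for /boot plus raid10 for everything else:

   mdadm --create /dev/md0 --metadata=0.90 --level=1 --raid-devices=4 \
         /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1        # /boot
   mdadm --create /dev/md1 --level=10 --layout=f2 --raid-devices=4 \
         /dev/sda2 /dev/sdb2 /dev/sdc2 /dev/sdd2        # everything else

The bootloader can read the RAID1 because each member looks like a plain 
filesystem.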

Comments back:

Peter Rabbitson wrote:

Maybe you are, depending on your settings, but this is beyond the 
point. No matter what 1+0 you have (linux, classic, or otherwise) you 
can not boot from it, as there is no way to see the underlying 
filesystem without the RAID layer.


Sir, thank you for this unequivocal comment. This comment clears up 
all my confusion. I had a wrong mental model of how file system maps 
work.


With the current state of affairs (available mainstream bootloaders) 
the rule is:

Block devices containing the kernel/initrd image _must_ be either:
* a regular block device (/sda1, /hda, /fd0, etc.)
* or a linux RAID 1 with the superblock at the end of the device 
(0.9 or 1.0)


Thanks even more: 1.0 it is.


This is how you find the actual raid version:

mdadm -D /dev/md[X] | grep Version

This will return a string of the form XX.YY.ZZ. Your superblock 
version is XX.YY.


Ah hah!

Mr. Tokarev wrote:

By the way, on all our systems I use small (256Mb for small-software 
systems,
sometimes 512M, but 1G should be sufficient) partition for a root 
filesystem

(/etc, /bin, /sbin, /lib, and /boot), and put it on a raid1 on all...
... doing [it]
this way, you always have all the tools necessary to repair a damaged 
system
even in case your raid didn't start, or you forgot where your root 
disk is

etc etc.


An excellent idea. I was going to put just /boot on the RAID 1, but 
there's no reason why I can't add a bit more room and put them all 
there. (Because I was having so much fun on the install, I'm using 4GB 
that I was going to use for swap space to mount a base install, and I'm 
working from there to build the RAID. Same idea.)


Hmmm... I wonder if this more expansive /bin, /sbin, and /lib causes 
hits on the RAID1 drive which ultimately degrade overall performance? 
/lib is hit only at boot time to load the kernel, I'll guess, but /bin 
includes such common tools as bash and grep.



Also, placing /dev on a tmpfs helps a lot to minimize the number of writes
necessary for root fs.


Another interesting idea. I'm not familiar with using tmpfs (no need, 
until now); but I wonder how you create the devices you need when 
you're doing a rescue.


Again, my thanks to everyone who responded and clarified.




--
Bill Davidsen <[EMAIL PROTECTED]>
 "Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over..." Otto von Bismark 



-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Use new sb type

2008-01-29 Thread Bill Davidsen

David Greaves wrote:

Jan Engelhardt wrote:
  

This makes 1.0 the default sb type for new arrays.




IIRC there was a discussion a while back on renaming mdadm options (google "Time
to  deprecate old RAID formats?") and the superblocks to emphasise the location
and data structure. Would it be good to introduce the new names at the same time
as changing the default format/on-disk-location?
  


Yes, I suggested some layout names, as did a few other people, and a few 
changes to separate metadata type and position were discussed. BUT 
changing the default layout, no matter how much "better" it seems, is trumped 
by "breaks existing setups and user practice." For all of the reasons 
something else might be preferable, 1.0 *works*.
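
And nobody has to wait on a default change to get a particular format, it 
can be asked for explicitly at create time, something like (device names 
made up):

   mdadm --create /dev/md0 --metadata=1.0 --level=1 --raid-devices=2 \
         /dev/sda1 /dev/sdb1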


--
Bill Davidsen <[EMAIL PROTECTED]>
 "Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over..." Otto von Bismark 



-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 2.6.24-rc6 reproducible raid5 hang

2008-01-29 Thread Bill Davidsen

Carlos Carvalho wrote:

Tim Southerwood ([EMAIL PROTECTED]) wrote on 28 January 2008 17:29:
 >Subtitle: Patch to mainline yet?
 >
 >Hi
 >
 >I don't see evidence of Neil's patch in 2.6.24, so I applied it by hand
 >on my server.

I applied all 4 pending patches to .24. It's been better than .22 and
.23... Unfortunately the bitmap and raid1 patches don't go into .22.16.


Neil, have these been sent up against 24-stable and 23-stable?

--
Bill Davidsen <[EMAIL PROTECTED]>
 "Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over..." Otto von Bismark 



-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: striping of a 4 drive raid10

2008-01-28 Thread Bill Davidsen

Keld Jørn Simonsen wrote:

On Mon, Jan 28, 2008 at 07:13:30AM +1100, Neil Brown wrote:
  

On Sunday January 27, [EMAIL PROTECTED] wrote:


Hi

I have tried to make a striping raid out of my new 4 x 1 TB
SATA-2 disks. I tried raid10,f2 in several ways:

1: md0 = raid10,f2 of sda1+sdb1, md1= raid10,f2 of sdc1+sdd1, md2 = raid0
of md0+md1

2: md0 = raid0 of sda1+sdb1, md1= raid0 of sdc1+sdd1, md2 = raid10,f2
of md0+md1

3: md0 = raid10,f2 of sda1+sdb1, md1= raid10,f2 of sdc1+sdd1, chunksize of 
md0 =md1 =128 KB,  md2 = raid0 of md0+md1 chunksize = 256 KB


4: md0 = raid0 of sda1+sdb1, md1= raid0 of sdc1+sdd1, chunksize
of md0 = md1 = 128 KB, md2 = raid10,f2 of md0+md1 chunksize = 256 KB

5: md0= raid10,f4 of sda1+sdb1+sdc1+sdd1
  

Try
  6: md0 = raid10,f2 of sda1+sdb1+sdc1+sdd1



That I already tried (and I wrongly stated that I used f4 instead of
f2). I twice had a throughput of about 300 MB/s, but since then I could
not reproduce the behaviour. Are there errors in this that have been
corrected in newer kernels?


  

Also try raid10,o2 with a largeish chunksize (256KB is probably big
enough).



I tried that too, but my mdadm did not allow me to use the o flag.

My kernel is 2.6.12  and mdadm is v1.12.0 - 14 June 2005.
Can I upgrade mdadm alone to a newer version, and if so, which one is
recommended?
  


I doubt that updating mdadm alone is going to help; the kernel is old and 
lacks a number of improvements from the last few years. I don't think you 
will see any major improvement without a kernel upgrade.


--
Bill Davidsen <[EMAIL PROTECTED]>
 "Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over..." Otto von Bismark 




-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: idle array consuming cpu ??!!

2008-01-23 Thread Bill Davidsen

Carlos Carvalho wrote:

Bill Davidsen ([EMAIL PROTECTED]) wrote on 22 January 2008 17:53:
 >Carlos Carvalho wrote:
 >> Neil Brown ([EMAIL PROTECTED]) wrote on 21 January 2008 12:15:
 >>  >On Sunday January 20, [EMAIL PROTECTED] wrote:
 >>  >> A raid6 array with a spare and bitmap is idle: not mounted and with no
 >>  >> IO to it or any of its disks (obviously), as shown by iostat. However
 >>  >> it's consuming cpu: since reboot it used about 11min in 24h, which is 
quite
 >>  >> a lot even for a busy array (the cpus are fast). The array was cleanly
 >>  >> shutdown so there's been no reconstruction/check or anything else.
 >>  >> 
 >>  >> How can this be? Kernel is 2.6.22.16 with the two patches for the

 >>  >> deadlock ("[PATCH 004 of 4] md: Fix an occasional deadlock in raid5 -
 >>  >> FIX") and the previous one.
 >>  >
 >>  >Maybe the bitmap code is waking up regularly to do nothing.
 >>  >
 >>  >Would you be happy to experiment?  Remove the bitmap with
 >>  >   mdadm --grow /dev/mdX --bitmap=none
 >>  >
 >>  >and see how that affects cpu usage?
 >>
 >> Confirmed, removing the bitmap stopped cpu consumption.
 >
 >Looks like quite a bit of CPU going into idle arrays here, too.

I don't mind the cpu time (in the machines where we use it here), what
worries me is that it shouldn't happen when the disks are completely
idle. Looks like there's a bug somewhere.


That's my feeling. I have one array with an internal bitmap and one with 
no bitmap, and the internal bitmap uses CPU even when the machine is 
idle. I have *not* tried an external bitmap.
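
If I get a chance I may run the same experiment here, roughly (array name 
made up, and I haven't verified the external-bitmap step):

   mdadm --grow /dev/md0 --bitmap=none             # drop the internal bitmap
   mdadm --grow /dev/md0 --bitmap=/var/md0.bitmap  # try a file-backed bitmap
   mdadm --grow /dev/md0 --bitmap=internal         # put the internal one back

and compare idle CPU use in each configuration.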


--
Bill Davidsen <[EMAIL PROTECTED]>
 "Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over..." Otto von Bismark 



-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: idle array consuming cpu ??!!

2008-01-22 Thread Bill Davidsen

Carlos Carvalho wrote:

Neil Brown ([EMAIL PROTECTED]) wrote on 21 January 2008 12:15:
 >On Sunday January 20, [EMAIL PROTECTED] wrote:
 >> A raid6 array with a spare and bitmap is idle: not mounted and with no
 >> IO to it or any of its disks (obviously), as shown by iostat. However
 >> it's consuming cpu: since reboot it used about 11min in 24h, which is quite
 >> a lot even for a busy array (the cpus are fast). The array was cleanly
 >> shutdown so there's been no reconstruction/check or anything else.
 >> 
 >> How can this be? Kernel is 2.6.22.16 with the two patches for the

 >> deadlock ("[PATCH 004 of 4] md: Fix an occasional deadlock in raid5 -
 >> FIX") and the previous one.
 >
 >Maybe the bitmap code is waking up regularly to do nothing.
 >
 >Would you be happy to experiment?  Remove the bitmap with
 >   mdadm --grow /dev/mdX --bitmap=none
 >
 >and see how that affects cpu usage?

Confirmed, removing the bitmap stopped cpu consumption.


Looks like quite a bit of CPU going into idle arrays here, too.

--
Bill Davidsen <[EMAIL PROTECTED]>
 "Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over..." Otto von Bismark 



-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: One Large md or Many Smaller md for Better Peformance?

2008-01-22 Thread Bill Davidsen

Moshe Yudkowsky wrote:

Carlos Carvalho wrote:


I use reiser3 and xfs. reiser3 is very good with many small files. A
simple test shows interactively perceptible results: removing large
files is faster with xfs, removing large directories (ex. the kernel
tree) is faster with reiser3.


My current main concern about XFS and reiser3 is writebacks. The 
default mode for ext3 is "journal," which in case of power failure is 
more robust than the writeback modes of XFS, reiser3, or JFS -- or so 
I'm given to understand.


On the other hand, I have a UPS and it should shut down gracefully 
regardless if there's a power failure. I wonder if I'm being too 
cautious?



No.

If you haven't actually *tested* the UPS failover code to be sure your 
system is talking to the UPS properly, and that the UPS is able to hold 
up power long enough for a shutdown after the system detects the 
problem, then you don't know if you actually have protection.  Even 
then, if you don't proactively replace batteries on schedule, etc, then 
you aren't as protected as you might wish to be.


And CPU fans fail, capacitors pop, power supplies fail, etc. These are 
things which have happened here in the last ten years. I also had a 
charging circuit in a UPS half-fail (from full wave rectifier to half). 
So the UPS would discharge until it ran out of power, then the system 
would fail hard. By the time I got on site and rebooted the UPS had 
trickle charged and would run the system. After replacing two 
"intermittent power supplies" in the system, the UPS was swapped on 
general principles and the real problem was isolated.


Shit happens, don't rely on graceful shutdowns (or recovery, have backups).

--
Bill Davidsen <[EMAIL PROTECTED]>
 "Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over..." Otto von Bismark 



-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: One Large md or Many Smaller md for Better Peformance?

2008-01-20 Thread Bill Davidsen

Moshe Yudkowsky wrote:
Question: with the same number of physical drives,  do I get better 
performance with one large md-based drive, or do I get better 
performance if I have several smaller md-based drives?


Situation: dual CPU, 4 drives (which I will set up as RAID-1 after 
being terrorized by the anti-RAID-5 polemics included in the Debian 
distro of mdadm).


I've two choices:

1. Allocate all the drive space into a single large partition, place 
into a single RAID array (either 10 or 1 + LVM, a separate question).


One partitionable RAID-10, perhaps, then partition as needed. Read the 
discussion here about performance of LVM and RAID. I personally don't do 
LVM unless I know I will have to have great flexibility of configuration 
and can give up performance to get it. Others report different results, 
so make up your own mind.
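
If you go the partitionable route, a rough sketch (untested, device names 
invented) looks like:

   mdadm --create /dev/md_d0 --auto=part --level=10 --raid-devices=4 \
         /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1
   fdisk /dev/md_d0    # then carve /dev/md_d0p1, /dev/md_d0p2, ... as needed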
2. Allocate each drive into several smaller partitions. Make each set 
of smaller partitions into a separate RAID 1 array and use separate 
RAID md drives for the various file systems.


Example use case:

While working other problems, I download a large torrent in the 
background. The torrent writes to its own, separate file system called 
/foo. If /foo is mounted on its own RAID 10 or 1-LVM array, will that 
help or hinder overall system responsiveness?


It would seem a "no brainer" that giving each major filesystem its own 
array would allow for better threading and responsiveness, but I'm 
picking up hints in various pieces of documentation that the 
performance can be counter-intuitive. I've even considered the 
possibility of giving /var and /usr separate RAID arrays (data vs. 
executables).


If an expert could chime in, I'd appreciate it a great deal.





--
Bill Davidsen <[EMAIL PROTECTED]>
 "Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over..." Otto von Bismark 



-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: how to create a degraded raid1 with only 1 of 2 drives ??

2008-01-20 Thread Bill Davidsen

Mitchell Laks wrote:

Hi mdadm raid gurus,

I wanted to make a raid1 array, but at the moment I have only 1 drive 
available. The other disk is
in the mail. I wanted to make a raid1 that i will use as a backup.

But I need to do the backup now, before the second drive comes.

So I did this.

formatted /dev/sda, creating /dev/sda1 with type fd.

then I tried to run
mdadm -C /dev/md0 --level=1 --raid-devices=1 /dev/sda1

but I got an error message
mdadm: 1 is an unusual number of drives for an array so it is probably a 
mistake. If you really
mean it you will need to specify --force before setting the number of drives

So then I tried

mdadm -C /dev/md0 --level=1 --force --raid-devices=1 /dev/sda1
mdadm: /dev/sda1 is too small:0K
mdadm: create aborted

now what does that mean?

fdisk -l /dev/sda  shows

device boot start   end blocks  Id  System
/dev/sda1   1   60801   488384001 fdlinux raid autodetect

so what do I do?


I need to back up my data.
If I simply format /dev/sda1 as an ext3 file system then I can't "add" the 
second drive later on.

How can I  set it up as a `degraded` raid1 array so I can later on add in the 
second drive and sync?
  


Specify two raid devices, with "missing" in place of the drive you don't have yet.
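
Something like this (untested, device name is yours):

   mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 missing

Then when the second disk arrives:

   mdadm /dev/md0 --add /dev/sdb1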

--
Bill Davidsen <[EMAIL PROTECTED]>
 "Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over..." Otto von Bismark 



-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Linux Software RAID 5 + XFS Multi-Benchmarks / 10 Raptors Again

2008-01-18 Thread Bill Davidsen

Justin Piszcz wrote:



On Thu, 17 Jan 2008, Al Boldi wrote:


Justin Piszcz wrote:

On Wed, 16 Jan 2008, Al Boldi wrote:

Also, can you retest using dd with different block-sizes?


I can do this, moment..


I know about oflag=direct but I choose to use dd with sync and measure the
total time it takes.
/usr/bin/time -f %E -o ~/$i=chunk.txt bash -c 'dd if=/dev/zero
of=/r1/bigfile bs=1M count=10240; sync'

So I was asked on the mailing list to test dd with various chunk sizes;
here is the length of time it took to write 10 GiB and sync for each chunk size:

4=chunk.txt:0:25.46
8=chunk.txt:0:25.63
16=chunk.txt:0:25.26
32=chunk.txt:0:25.08
64=chunk.txt:0:25.55
128=chunk.txt:0:25.26
256=chunk.txt:0:24.72
512=chunk.txt:0:24.71
1024=chunk.txt:0:25.40
2048=chunk.txt:0:25.71
4096=chunk.txt:0:27.18
8192=chunk.txt:0:29.00
16384=chunk.txt:0:31.43
32768=chunk.txt:0:50.11
65536=chunk.txt:2:20.80


What do you get with bs=512,1k,2k,4k,8k,16k...


Thanks!

--
Al

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html



root  4621  0.0  0.0  12404   760 pts/2D+   17:53   0:00 mdadm 
-S /dev/md3

root  4664  0.0  0.0   4264   728 pts/5S+   17:54   0:00 grep D

Tried to stop it when it was re-syncing, DEADLOCK :(

[  305.464904] md: md3 still in use.
[  314.595281] md: md_do_sync() got signal ... exiting

Anyhow, done testing, time to move data back on if I can kill the 
resync process w/out deadlock.


So does that indicate that there is still a deadlock issue, or that you 
don't have the latest patches installed?


--
Bill Davidsen <[EMAIL PROTECTED]>
 "Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over..." Otto von Bismark 



-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: md rotates RAID5 spare at boot

2008-01-10 Thread Bill Davidsen
 Sense: 00 3a 00 00
   [   40.949396] sd 7:0:1:0: [sdg] Write cache: enabled, read cache:
   enabled, doesn't support DPO or FUA
   [   40.949519] sd 7:0:1:0: [sdg] 488397168 512-byte hardware sectors
   (250059 MB)
   [   40.949588] sd 7:0:1:0: [sdg] Write Protect is off
   [   40.949640] sd 7:0:1:0: [sdg] Mode Sense: 00 3a 00 00
   [   40.949668] sd 7:0:1:0: [sdg] Write cache: enabled, read cache:
   enabled, doesn't support DPO or FUA
   [   40.949734]  sdg: sdg1 sdg2
   [   40.969827] sd 7:0:1:0: [sdg] Attached SCSI disk
   [   40.969926] sd 7:0:1:0: Attached scsi generic sg7 type 0

   [   41.206078] md: md0 stopped.
   [   41.206137] md: unbind
   [   41.206187] md: export_rdev(sdb1)
   [   41.206253] md: unbind
   [   41.206302] md: export_rdev(sdc1)
   [   41.206360] md: unbind
   [   41.206408] md: export_rdev(sda1)
   [   41.247389] md: bind
   [   41.247584] md: bind
   [   41.247787] md: bind
   [   41.247971] md: bind
   [   41.248151] md: bind
   [   41.248325] md: bind
   [   41.256718] raid5: device sde1 operational as raid disk 0
   [   41.256771] raid5: device sdc1 operational as raid disk 4
   [   41.256821] raid5: device sda1 operational as raid disk 3
   [   41.256870] raid5: device sdb1 operational as raid disk 2
   [   41.256919] raid5: device sdf1 operational as raid disk 1
   [   41.257426] raid5: allocated 5245kB for md0
   [   41.257476] raid5: raid level 5 set md0 active with 5 out of 5
   devices, algorithm 2
   [   41.257538] RAID5 conf printout:
   [   41.257584]  --- rd:5 wd:5
   [   41.257631]  disk 0, o:1, dev:sde1
   [   41.257677]  disk 1, o:1, dev:sdf1
   [   41.257724]  disk 2, o:1, dev:sdb1
   [   41.257771]  disk 3, o:1, dev:sda1
   [   41.257817]  disk 4, o:1, dev:sdc1

   [   41.257952] md: md1 stopped.
   [   41.258009] md: unbind
   [   41.258060] md: export_rdev(sdb2)
   [   41.258128] md: unbind
   [   41.258179] md: export_rdev(sda2)
   [   41.258248] md: unbind
   [   41.258306] md: export_rdev(sdc2)
   [   41.283067] md: bind
   [   41.283297] md: bind
   [   41.285235] md: bind
   [   41.306753] md: md1 stopped.
   [   41.306818] md: unbind
   [   41.306878] md: export_rdev(sdb2)
   [   41.306956] md: unbind
   [   41.307007] md: export_rdev(sda2)
   [   41.307075] md: unbind
   [   41.307130] md: export_rdev(sdc2)
   [   41.312250] md: bind
   [   41.312476] md: bind
   [   41.312711] md: bind
   [   41.312922] md: bind
   [   41.313138] md: bind
   [   41.313343] md: bind
   [   41.313452] md: md1: raid array is not clean -- starting
   background reconstruction
   [   41.322189] raid5: device sde2 operational as raid disk 0
   [   41.322243] raid5: device sdc2 operational as raid disk 4
   [   41.322292] raid5: device sdg2 operational as raid disk 3
   [   41.322342] raid5: device sdb2 operational as raid disk 2
   [   41.322391] raid5: device sdf2 operational as raid disk 1
   [   41.322823] raid5: allocated 5245kB for md1
   [   41.322872] raid5: raid level 5 set md1 active with 5 out of 5
   devices, algorithm 2
   [   41.322934] RAID5 conf printout:
   [   41.322980]  --- rd:5 wd:5
   [   41.323026]  disk 0, o:1, dev:sde2
   [   41.323073]  disk 1, o:1, dev:sdf2
   [   41.323119]  disk 2, o:1, dev:sdb2
   [   41.323165]  disk 3, o:1, dev:sdg2
   [   41.323212]  disk 4, o:1, dev:sdc2

   [   41.323316] md: resync of RAID array md1
   [   41.323364] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
   [   41.323415] md: using maximum available idle IO bandwidth (but
   not more than 20 KB/sec) for resync.
   [   41.323492] md: using 128k window, over a total of 231978496 
blocks.




-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html




--
Bill Davidsen <[EMAIL PROTECTED]>
 "Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over..." Otto von Bismark 



-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 2.6.24-rc6 reproducible raid5 hang

2008-01-10 Thread Bill Davidsen

Neil Brown wrote:

On Wednesday January 9, [EMAIL PROTECTED] wrote:
  

On Jan 9, 2008 5:09 PM, Neil Brown <[EMAIL PROTECTED]> wrote:


On Wednesday January 9, [EMAIL PROTECTED] wrote:

Can you test it please?
  

This passes my failure case.



Thanks!

  

Does it seem reasonable?
  

What do you think about limiting the number of stripes the submitting
thread handles to be equal to what it submitted?  If I'm a thread that
only submits 1 stripe's worth of work, should I get stuck handling the
rest of the cache?



Dunno
Someone has to do the work, and leaving it all to raid5d means that it
all gets done on one CPU.
I expect that most of the time the queue of ready stripes is empty so
make_request will mostly only handle its own stripes anyway.
The times that it handles other threads' stripes will probably balance
out with the times that other threads handle this thread's stripes.

So I'm inclined to leave it as "do as much work as is available to be
done" as that is simplest.  But I can probably be talked out of it
with a convincing argument
  


How about "it will perform better (defined as faster) during conditions 
of unusual i/o activity?" Is that a convincing argument to use your 
solution as offered? How about "complexity and maintainability are a 
zero-sum problem?"


--
Bill Davidsen <[EMAIL PROTECTED]>
 "Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over..." Otto von Bismark 



-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Raid 1, can't get the second disk added back in.

2008-01-08 Thread Bill Davidsen

Neil Brown wrote:

On Monday January 7, [EMAIL PROTECTED] wrote:
  
Problem is not raid, or at least not obviously raid related.  The 
problem is that the whole disk, /dev/hdb is unavailable. 



Maybe check /sys/block/hdb/holders ?  lsof /dev/hdb ?

good luck :-)

  
losetup -a may help; lsof doesn't seem to show files used in loop 
mounts. Yes, long shot...


--
Bill Davidsen <[EMAIL PROTECTED]>
 "Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over..." Otto von Bismark 



-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Last ditch plea on remote double raid5 disk failure

2007-12-31 Thread Bill Davidsen

Michael Tokarev wrote:

Neil Brown wrote:
  

On Monday December 31, [EMAIL PROTECTED] wrote:


I'm hoping that if I can get raid5 to continue despite the errors, I
can bring back up enough of the server to continue, a bit like the
remount-ro option in ext2/ext3.

If not, oh well...
  

Sorry, but it is "oh well".




And another thought around all this.  Linux sw raid definitely need
a way to proactively replace a (probably failing) drive, without removing
it from the array first.  Something like,
  mdadm --add /dev/md0 /dev/sdNEW --inplace /dev/sdFAILING
so that sdNEW will be a mirror of sdFAILING, and once the "recovery"
procedure finishes (which may use data from other drives in case of
I/O error reading sdFAILING - unlike described scenario of making a
superblock-less mirror of sdNEW and sdFAILING),
  mdadm --remove /dev/md0 /dev/sdFAILING,
which does not involve any further reconstructions anymore.
  


I really like that idea; it addresses the same problem as the various 
posts regarding creating little raid1 arrays of the old and new drive, etc.


I would like an option to keep a drive with bad sectors in an array if 
removing the drive would prevent the array from running (or starting). I 
don't think that should be the default, but there are times when some data 
is way better than none. I would think the options are fail the drive, 
set the array r/o, and return an error and keep going.


--
Bill Davidsen <[EMAIL PROTECTED]>
 "Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over..." Otto von Bismark 



-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: New XFS benchmarks using David Chinner's recommendations for XFS-based optimizations.

2007-12-31 Thread Bill Davidsen

Justin Piszcz wrote:

Dave's original e-mail:

# mkfs.xfs -f -l lazy-count=1,version=2,size=128m -i attr=2 -d 
agcount=4 

# mount -o logbsize=256k  



And if you don't care about filesystem corruption on power loss:



# mount -o logbsize=256k,nobarrier  



Those mkfs values (except for log size) will be the defaults in the next
release of xfsprogs.



Cheers,



Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group


-

I used his mkfs.xfs options verbatim but I use my own mount options:
noatime,nodiratime,logbufs=8,logbsize=26214

Here are the results, the results of 3 bonnie++ averaged together for 
each test:

http://home.comcast.net/~jpiszcz/xfs1/result.html

Thanks Dave, this looks nice--the more optimizations the better!

---

I also find it rather peculiar that in some of my (other) benchmarks 
my RAID 5 is just as fast as RAID 0 for extracting large 
(uncompressed) files:


RAID 5 (1024k CHUNK)
26.95user 6.72system 0:37.89elapsed 88%CPU (0avgtext+0avgdata 
0maxresident)k0inputs+0outputs (6major+526minor)pagefaults 0swaps


Compare with RAID 0 for the same operation:

(As with RAID5, it appears 256k-1024k, possibly up to 2048k, is the sweet spot.)

Why does mdadm still use 64k for the default chunk size?


Write performance with small files, I would think. There is some 
information in old posts, but I don't seem to find them as quickly as I 
would like.


And another quick question, would there be any benefit to use (if it 
were possible) a block size of > 4096 bytes with XFS (I assume only 
IA64/similar arch can support it), e.g. x86_64 cannot because the 
page_size is 4096.


[ 8265.407137] XFS: only pagesize (4096) or less will currently work.



--
Bill Davidsen <[EMAIL PROTECTED]>
 "Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over..." Otto von Bismark 



-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: On the subject of RAID-6 corruption recovery

2007-12-28 Thread Bill Davidsen

H. Peter Anvin wrote:
I got a private email a while ago from Thiemo Nagel claiming that some 
of the conclusions in my RAID-6 paper were incorrect.  This was 
combined with a "proof" which was plain wrong, and could easily be 
disproven using basic entropy accounting (i.e. how much information 
is around to play with.)


However, it did cause me to clarify the text portion a little bit.  In 
particular, *in practice* it may be possible to *probabilistically* 
detect multidisk corruption.  Probabilistic detection means that the 
detection is not guaranteed, but it can be taken advantage of 
opportunistically.


If this means that there can be no false positives for multidisk 
corruption but there may be false negatives, fine. If it means something else, 
please restate one more time.


--
Bill Davidsen <[EMAIL PROTECTED]>
 "Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over..." Otto von Bismark 



-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid5 grow reshaping speed is unchangeable

2007-12-27 Thread Bill Davidsen

Cody Yellan wrote:

I forgot the version information:

mdadm - v2.5.4 - 13 October 2006
kernel 2.6.18-53.el5 #1 SMP

Would anyone consider it unsafe to upgrade to the latest version of
mdadm on a production machine using Neil Brown's srpm?
  


I wouldn't expect any problems, although I don't think there will be a 
benefit, either. The problem with RHEL is that while it's stable in 
terms of bug fixes, it also doesn't get any performance benefits.


Why not wait until Neil gets back from the holiday and see if he has any 
words of advice?


--
Bill Davidsen <[EMAIL PROTECTED]>
 "Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over..." Otto von Bismark 



-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid5 grow reshaping speed is unchangeable

2007-12-27 Thread Bill Davidsen

Cody Yellan wrote:

I had a 4x500GB SATA2 array, md0.  I added one 500GB drive and
reshaping began at ~2500K/sec.  Changing
/proc/sys/dev/raid/speed_limit_m{in,ax} or
/sys/block/md0/md/sync_speed_m{in,ax} had no effect.  I shut down all
unnecessary services and the array is offline (not mounted).  I have
read that the throttling code is "fragile" (esp. with regard to
raid5) but does this make sense?  I will wait (in)patiently for it to
finish, but I do wonder why the configuration parameters have no
effect.  This is a dual quad 2GHz Xeon machine with 8GB of memory
running RHEL5.  Is this the maximum speed?
  


Something else is going on; I do better than that adding drives on USB!

I don't have a clue what the issue is, and I don't see anything in your 
information which looks unusual.
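
For the record, the knobs I would expect to matter are the ones you already 
tried, e.g.:

   echo 50000  > /proc/sys/dev/raid/speed_limit_min
   echo 200000 > /proc/sys/dev/raid/speed_limit_max
   cat /proc/mdstat    # watch the reported reshape speed

so if those do nothing, something else is throttling the reshape (another 
process hitting the disks, a slow controller, etc.).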


--
Bill Davidsen <[EMAIL PROTECTED]>
 "Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over..." Otto von Bismark 



-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid10 performance question

2007-12-26 Thread Bill Davidsen

Peter Grandi wrote:

On Tue, 25 Dec 2007 19:08:15 +,
[EMAIL PROTECTED] (Peter Grandi) said:



[ ... ]

  

It's the raid10,f2 *read* performance in degraded mode that is
strange - I get almost exactly 50% of the non-degraded mode
read performance. Why is that?
  


  

[ ... ] the mirror blocks have to be read from the inner
cylinders of the next disk, which are usually a lot slower
than the outer ones. [ ... ]



Just to be complete there is of course the other issue that
affect sustained writes too, which is extra seeks. If disk B
fails the situation becomes:

DISK
   A X C D

   1 X 3 4
   . . . .
   . . . .
   . . . .
   ---
   4 X 2 3   
   . . . .

   . . . .
   . . . .

Not only must block 2 be read from an inner cylinder, but to
read block 3 there must be a seek to an outer cylinder on the
same disk. Which is the same well known issue when doing
sustained writes with RAID10 'f2'.


I have often wondered why the elevator code doesn't do better on this 
sustained load, grouping the writes at the drive extremities so there 
would be lots of writes to nearby cylinders then a big seek and lots of 
writes near the next position. I tried bumping the stripe_cache, 
changing to alternate elevators, and just increasing the physical 
memory, and never saw any serious improvement beyond the speed with 
default settings.
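
For reference, the knobs I was poking at look like this (device names made 
up, values just examples):

   echo 4096     > /sys/block/md0/md/stripe_cache_size   # bigger stripe cache
   echo deadline > /sys/block/sda/queue/scheduler        # alternate elevator
   blockdev --setra 8192 /dev/md0                        # more readahead (in 512-byte sectors)

None of them changed the picture much for me.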


--
Bill Davidsen <[EMAIL PROTECTED]>
 "Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over..." Otto von Bismark 



-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Raid over 48 disks

2007-12-25 Thread Bill Davidsen

Peter Grandi wrote:

On Wed, 19 Dec 2007 07:28:20 +1100, Neil Brown
<[EMAIL PROTECTED]> said:



[ ... what to do with 48 drive Sun Thumpers ... ]

neilb> I wouldn't create a raid5 or raid6 on all 48 devices.
neilb> RAID5 only survives a single device failure and with that
neilb> many devices, the chance of a second failure before you
neilb> recover becomes appreciable.

That's just one of the many problems, other are:

* If a drive fails, rebuild traffic is going to hit hard, with
  reading in parallel 47 blocks to compute a new 48th.

* With a parity strip length of 48 it will be that much harder
  to avoid read-modify before write, as it will be avoidable
  only for writes of at least 48 blocks aligned on 48 block
  boundaries. And reading 47 blocks to write one is going to be
  quite painful.

[ ... ]

neilb> RAID10 would be a good option if you are happy wit 24
neilb> drives worth of space. [ ... ]

That sounds like the only feasible option (except for the 3
drive case in most cases). Parity RAID does not scale much
beyond 3-4 drives.

neilb> Alternately, 8 6drive RAID5s or 6 8raid RAID6s, and use
neilb> RAID0 to combine them together. This would give you
neilb> adequate reliability and performance and still a large
neilb> amount of storage space.

That sounds optimistic to me: the reason to do a RAID50 of
8x(5+1) can only be to have a single filesystem, else one could
have 8 distinct filesystems each with a subtree of the whole.
With a single filesystem the failure of any one of the 8 RAID5
components of the RAID0 will cause the loss of the whole lot.

So in the 47+1 case a loss of any two drives would lead to
complete loss; in the 8x(5+1) case only a loss of two drives in
the same RAID5 will.

It does not sound like a great improvement to me (especially
considering the thoroughly inane practice of building arrays out
of disks of the same make and model taken out of the same box).
  


Quality control just isn't that good that "same box" makes a big 
difference, assuming that you have an appropriate number of hot spares 
online. Note that I said "big difference," is there some clustering of 
failures? Some, but damn little. A few years ago I was working with 
multiple 6TB machines and 20+ 1TB machines, all using small, fast, 
drives in RAID5E. I can't remember a case where a drive failed before 
rebuild was complete, and only one or two where there was a failure to 
degraded mode before the hot spare was replaced.


That said, RAID5E typically can rebuild a lot faster than a typical hot 
spare as a unit drive, at least for any given impact on performance. 
This undoubtedly reduced our exposure time.

There are also modest improvements in the RMW strip size and in
the cost of a rebuild after a single drive loss. Probably the
reduction in the RMW strip size is the best improvement.

Anyhow, let's assume 0.5TB drives; with a 47+1 we get a single
23.5TB filesystem, and with 8*(5+1) we get a 20TB filesystem.
With current filesystem technology either size is worrying, for
example as to time needed for an 'fsck'.
  


Given that someone is putting a typical filesystem full of small files 
on a big raid, I agree. But fsck with large files is pretty fast on a 
given filesystem (200GB files on a 6TB ext3, for instance), due to the 
small number of inodes in play. While the bitmap resolution is a factor, 
it's pretty linear; fsck with lots of files gets really slow. And let's 
face it, the objective of raid is to avoid doing that fsck in the first 
place ;-)


--
Bill Davidsen <[EMAIL PROTECTED]>
 "Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over..." Otto von Bismark 



-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid10: unfair disk load?

2007-12-25 Thread Bill Davidsen

Richard Scobie wrote:

Jon Nelson wrote:


My own tests on identical hardware (same mobo, disks, partitions,
everything) and same software, with the only difference being how
mdadm is invoked (the only changes here being level and possibly
layout) show that raid0 is about 15% faster on reads than the very
fast raid10, f2 layout. raid10,f2 is approx. 50% of the write speed of
raid0.


This more or less matches my testing.


Have you tested a stacked RAID 10 made up of 2 drive RAID1 arrays, 
striped together into a RAID0.


That is not raid10, that's raid1+0. See man md.


I have found this configuration to offer very good performance, at the 
cost of slightly more complexity.


It does; raid0 can be striped over many configurations, raid[156] being 
the most common.


--
Bill Davidsen <[EMAIL PROTECTED]>
 "Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over..." Otto von Bismark 



-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?

2007-12-25 Thread Bill Davidsen

Robin Hill wrote:

On Wed Dec 19, 2007 at 09:50:16AM -0500, Justin Piszcz wrote:

  

The (up to) 30% percent figure is mentioned here:
http://insights.oetiker.ch/linux/raidoptimization.html



That looks to be referring to partitioning a RAID device - this'll only
apply to hardware RAID or partitionable software RAID, not to the normal
use case.  When you're creating an array out of standard partitions then
you know the array stripe size will align with the disks (there's no way
it cannot), and you can set the filesystem stripe size to align as well
(XFS will do this automatically).

I've actually done tests on this with hardware RAID to try to find the
correct partition offset, but wasn't able to see any difference (using
bonnie++ and moving the partition start by one sector at a time).

  

# fdisk -l /dev/sdc

Disk /dev/sdc: 150.0 GB, 150039945216 bytes
255 heads, 63 sectors/track, 18241 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x5667c24a

   Device Boot  Start End  Blocks   Id  System
/dev/sdc1   1   18241   146520801   fd  Linux raid 
autodetect




This looks to be a normal disk - the partition offsets shouldn't be
relevant here (barring any knowledge of the actual physical disk layout
anyway, and block remapping may well make that rather irrelevant).
  
The issue I'm thinking about is hardware sector size, which on modern 
drives may be larger than 512b and therefore entails a read-alter-rewrite 
(RAR) cycle when writing a 512b block. With larger writes, if the 
alignment is poor and the write size is some multiple of 512, it's 
possible to have an RAR at each end of the write. The only hope of 
controlling the alignment is to write to a raw device, or to use a 
filesystem which can be configured to have blocks which are a multiple 
of the sector size, to do all i/o in whole blocks, and to start each file on 
a block boundary. That may be possible with ext[234] set up properly.
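
A minimal sketch of the filesystem end of it, assuming a 4k block and a 64k 
chunk (device name made up):

   mkfs.ext3 -b 4096 -E stride=16 /dev/md0   # stride = chunk / block = 64k / 4k

That lines the filesystem allocation up with the array, though it does 
nothing by itself about the drive's internal sector size.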


Why this is important: the physical layout of the drive is useful, but 
for a large write the drive will have to make some number of steps from 
one cylinder to another. By carefully choosing the starting point, the 
best improvement will be to eliminate 2 track-to-track seek times, one 
at the start and one at the end. If the writes are small only one t2t 
saving is possible.


Now consider a RAR process. The drive is spinning typically at 7200 rpm, 
or 8.333 ms/rev. A read might take .5 rev on average, and a RAR will 
take 1.5 rev, because it takes a full revolution after the original data 
is read before the altered data can be rewritten. Larger sectors give 
more capacity, but reduced performance for write. And doing small writes 
can result in paying the RAR penalty on every write. So there may be a 
measurable benefit to getting that alignment right at the drive level.


--
Bill Davidsen <[EMAIL PROTECTED]>
 "Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over..." Otto von Bismark 



-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Raid over 48 disks

2007-12-20 Thread Bill Davidsen

Thiemo Nagel wrote:

Bill Davidsen wrote:

                 16k read            64k write
   chunk size   RAID 5   RAID 6    RAID 5   RAID 6
   128k          492      497       268      270
   256k          615      530       288      270
   512k          625      607       230      174
   1024k         650      620       170       75
  


What is your stripe cache size?


I didn't fiddle with the default when I did these tests.

Now (with 256k chunk size) I had

# cat stripe_cache_size
256

but increasing that to 1024 didn't show a noticeable improvement for 
reading.  Still around 550MB/s.


You can use blockdev to raise the readahead, either on the drives or the 
array. That may make a difference; I use 4-8MB on the drives, and more on 
the array, depending on how I use it.
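
Something like (device names made up; --setra counts 512-byte sectors):

   blockdev --setra 8192  /dev/sda   # 4MB readahead on a member drive
   blockdev --setra 65536 /dev/md2   # 32MB on the array
   blockdev --getra /dev/md2         # check what you got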


Kind regards,

Thiemo




--
Bill Davidsen <[EMAIL PROTECTED]>
 "Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over..." Otto von Bismark 



-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?

2007-12-20 Thread Bill Davidsen

Justin Piszcz wrote:



On Wed, 19 Dec 2007, Bill Davidsen wrote:

I'm going to try another approach, I'll describe it when I get 
results (or not).


http://home.comcast.net/~jpiszcz/align_vs_noalign/

Hardly any difference whatsoever; only the per-char read/write is any 
faster..?


Am I misreading what you are doing here... you have the underlying data 
on the actual hardware devices 64k aligned by using either the whole 
device or starting a partition on a 64k boundary? I'm dubious that you 
will see a difference any other way, after all the translations take place.


I'm trying to create a raid array using loop devices created with the 
"offset" parameter, but I suspect that I will wind up doing a test after 
just repartitioning the drives, painful as that will be.


Average of 3 runs taken:

$ cat align/*log|grep ,
p63,8G,57683,94,86479,13,55242,8,63495,98,147647,11,434.8,0,16:10:16/64,1334210,10,330,2,120,1,3978,10,312,2 

p63,8G,57973,95,76702,11,50830,7,62291,99,136477,10,388.3,0,16:10:16/64,1252548,6,296,1,115,1,7927,20,373,2 

p63,8G,57758,95,80847,12,52144,8,63874,98,144747,11,443.4,0,16:10:16/64,1242445,6,303,1,117,1,6767,17,359,2 



$ cat noalign/*log|grep ,
p63,8G,57641,94,85494,12,55669,8,63802,98,146925,11,434.8,0,16:10:16/64,1353180,8,314,1,117,1,8684,22,283,2 

p63,8G,57705,94,85929,12,56708,8,63855,99,143437,11,436.2,0,16:10:16/64,12211519,29,297,1,113,1,3218,8,325,2 

p63,8G,57783,94,78226,11,48580,7,63487,98,137721,10,438.7,0,16:10:16/64,1243229,8,307,1,120,1,4247,11,313,2 






--
Bill Davidsen <[EMAIL PROTECTED]>
 "Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over..." Otto von Bismark 



-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: help diagnosing bad disk

2007-12-20 Thread Bill Davidsen

Jon Sabo wrote:

I think I got it now.  Thanks for your help!
  


Now just make our holiday cheer complete by waiting until the resync is 
complete and rebooting to be sure that everything is *really* back as it 
should be.  ;-)
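
Something like

   watch cat /proc/mdstat

will tell you when it's done.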

[EMAIL PROTECTED]:/home/illsci# mdadm --detail /dev/md0
/dev/md0:
Version : 00.90.03
  Creation Time : Mon Jul 30 21:47:14 2007
 Raid Level : raid1
 Array Size : 1951744 (1906.32 MiB 1998.59 MB)
Device Size : 1951744 (1906.32 MiB 1998.59 MB)
   Raid Devices : 2
  Total Devices : 1
Preferred Minor : 0
Persistence : Superblock is persistent

Update Time : Wed Dec 19 14:15:31 2007
  State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0

   UUID : 157f716c:0e7aebca:c20741f6:bb6099c9
 Events : 0.48

Number   Major   Minor   RaidDevice State
   0   810  active sync   /dev/sda1
   1   001  removed
[EMAIL PROTECTED]:/home/illsci# mdadm --detail /dev/md1
/dev/md1:
Version : 00.90.03
  Creation Time : Mon Jul 30 21:47:47 2007
 Raid Level : raid1
 Array Size : 974808064 (929.65 GiB 998.20 GB)
Device Size : 974808064 (929.65 GiB 998.20 GB)
   Raid Devices : 2
  Total Devices : 1
Preferred Minor : 1
Persistence : Superblock is persistent

Update Time : Wed Dec 19 14:19:06 2007
  State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0

   UUID : 156a030e:9a6f8eb3:9b0c439e:d718e744
 Events : 0.1498998

Number   Major   Minor   RaidDevice State
   0   000  removed
   1   8   181  active sync   /dev/sdb2
[EMAIL PROTECTED]:/home/illsci# mdadm /dev/md0 -a /dev/sdb1
mdadm: re-added /dev/sdb1
[EMAIL PROTECTED]:/home/illsci# mdadm /dev/md1 -a /dev/sda2
mdadm: re-added /dev/sda2
[EMAIL PROTECTED]:/home/illsci# cat /proc/mdstat
Personalities : [multipath] [raid1]
md1 : active raid1 sda2[2] sdb2[1]
  974808064 blocks [2/1] [_U]
resync=DELAYED

md0 : active raid1 sdb1[2] sda1[0]
  1951744 blocks [2/1] [U_]
  [=>...]  recovery = 86.6% (1693504/1951744)
finish=0.0min speed=80643K/sec

unused devices: <none>
[EMAIL PROTECTED]:/home/illsci# cat /proc/mdstat
Personalities : [multipath] [raid1]
md1 : active raid1 sda2[2] sdb2[1]
  974808064 blocks [2/1] [_U]
  [>]  recovery =  0.0% (86848/974808064)
finish=186.9min speed=86848K/sec

md0 : active raid1 sdb1[1] sda1[0]
  1951744 blocks [2/2] [UU]

unused devices: <none>
  

--
Bill Davidsen <[EMAIL PROTECTED]>
 "Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over..." Otto von Bismark 



-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?

2007-12-19 Thread Bill Davidsen

Justin Piszcz wrote:



On Wed, 19 Dec 2007, Bill Davidsen wrote:


Justin Piszcz wrote:



On Wed, 19 Dec 2007, Mattias Wadenstein wrote:


On Wed, 19 Dec 2007, Justin Piszcz wrote:


--

Now to my setup / question:

# fdisk -l /dev/sdc

Disk /dev/sdc: 150.0 GB, 150039945216 bytes
255 heads, 63 sectors/track, 18241 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x5667c24a

  Device Boot  Start End  Blocks   Id  System
/dev/sdc1   1   18241   146520801   fd  Linux raid 
autodetect


---

If I use 10-disk RAID5 with 1024 KiB stripe, what would be the 
correct start and end size if I wanted to make sure the RAID5 was 
stripe aligned?


Or is there a better way to do this, does parted handle this 
situation better?


From that setup it seems simple, scrap the partition table and use 
the 
disk device for raid. This is what we do for all data storage disks 
(hw raid) and sw raid members.


/Mattias Wadenstein



Is there any downside to doing that?  I remember when I had to take 
my machine apart for a BIOS downgrade when I plugged in the sata 
devices again I did not plug them back in the same order, everything 
worked of course but when I ran LILO it said it was not part of the 
RAID set, because /dev/sda had become /dev/sdg and overwrote the MBR 
on the disk; if I had not used partitions here, I'd have lost one (or 
more) of the drives due to a bad LILO run?


As other posts have detailed, putting the partition on a 64k aligned 
boundary can address the performance problems. However, a poor choice 
of chunk size, cache_buffer size, or just random i/o in small sizes 
can eat up a lot of the benefit.


I don't think you need to give up your partitions to get the benefit 
of alignment.


--
Bill Davidsen <[EMAIL PROTECTED]>
"Woe unto the statesman who makes war without a reason that will still
be valid when the war is over..." Otto von Bismark


Hrmm..

I am doing a benchmark now with:

6 x 400GB (SATA) / 256 KiB stripe with unaligned vs. aligned raid setup.

unligned, just fdisk /dev/sdc, mkpartition, fd raid.
 aligned, fdisk, expert, start at 512 as the off-set

Per a Microsoft KB:

Example of alignment calculations in kilobytes for a 256-KB stripe 
unit size:

(63 * .5) / 256 = 0.123046875
(64 * .5) / 256 = 0.125
(128 * .5) / 256 = 0.25
(256 * .5) / 256 = 0.5
(512 * .5) / 256 = 1
These examples shows that the partition is not aligned correctly for a 
256-KB stripe unit size until the partition is created by using an 
offset of 512 sectors (512 bytes per sector).


So I should start at 512 for a 256k chunk size.

I ran bonnie++ three consecutive times and took the average for the 
unaligned, rebuilding the RAID5 now and then I will re-execute the 
test 3 additional times and take the average of that.


I'm going to try another approach, I'll describe it when I get results 
(or not).


--
Bill Davidsen <[EMAIL PROTECTED]>
 "Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over..." Otto von Bismark 



-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: help diagnosing bad disk

2007-12-19 Thread Bill Davidsen

Jon Sabo wrote:

So I was trying to copy over some Indiana Jones wav files and it
wasn't going my way.  I noticed that my software raid device showed:

/dev/md1 on / type ext3 (rw,errors=remount-ro)

Is this saying that it was remounted, read only because it found a
problem with the md1 meta device?  That's what it looks like it's
saying but I can still write to /.

mdadm --detail showed:

[EMAIL PROTECTED]:/home/illsci# mdadm --detail /dev/md0
/dev/md0:
Version : 00.90.03
  Creation Time : Mon Jul 30 21:47:14 2007
 Raid Level : raid1
 Array Size : 1951744 ( 1906.32 MiB 1998.59 MB)
Device Size : 1951744 (1906.32 MiB 1998.59 MB)
   Raid Devices : 2
  Total Devices : 1
Preferred Minor : 0
Persistence : Superblock is persistent

Update Time : Wed Dec 19 12:59:56 2007
  State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0

   UUID : 157f716c:0e7aebca:c20741f6:bb6099c9
 Events : 0.28

 Number   Major   Minor   RaidDevice State
   0   810  active sync   /dev/sda1
   1   001  removed

[EMAIL PROTECTED]:/home/illsci# mdadm --detail /dev/md1
 /dev/md1:
Version : 00.90.03
  Creation Time : Mon Jul 30 21:47:47 2007
 Raid Level : raid1
 Array Size : 974808064 (929.65 GiB 998.20 GB)
Device Size : 974808064 (929.65 GiB 998.20 GB)
Raid Devices : 2
  Total Devices : 1
Preferred Minor : 1
Persistence : Superblock is persistent

Update Time : Wed Dec 19 13:14:53 2007
  State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0

   UUID : 156a030e:9a6f8eb3:9b0c439e:d718e744
 Events : 0.1990

Number   Major   Minor   RaidDevice State
   0   820  active sync   /dev/sda2
   1   001  removed


I have two 1 terabyte sata drives in this box.  From what I was
reading wouldn't it show an F for the failed drive?  I thought I would
see that /dev/sdb1 and /dev/sdb2 were failed and it would show an F.
What is this saying and how do you know that its /dev/sdb and not some
other drive?  It shows removed and that the state is clean, degraded.
Is that something you can recover from with out returning this disk
and putting in a new one to add to the raid1 array?
  


You can try adding the partitions back to your array, but I suspect 
something bad has happened to your sdb drive, since it's failed out of 
both arrays. You can use dmesg to look for any additional information.
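
Something along these lines (assuming smartmontools is installed):

   dmesg | grep -i sdb          # look for reset/error messages
   smartctl -a /dev/sdb         # check the drive's own error log
   mdadm /dev/md0 -a /dev/sdb1  # try re-adding the partitions
   mdadm /dev/md1 -a /dev/sdb2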


Justin gave you the rest of the info you need to investigate, I'll not 
repeat it. ;-)


--
Bill Davidsen <[EMAIL PROTECTED]>
 "Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over..." Otto von Bismark 



-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?

2007-12-19 Thread Bill Davidsen

Justin Piszcz wrote:



On Wed, 19 Dec 2007, Mattias Wadenstein wrote:


On Wed, 19 Dec 2007, Justin Piszcz wrote:


--

Now to my setup / question:

# fdisk -l /dev/sdc

Disk /dev/sdc: 150.0 GB, 150039945216 bytes
255 heads, 63 sectors/track, 18241 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x5667c24a

  Device Boot  Start End  Blocks   Id  System
/dev/sdc1   1   18241   146520801   fd  Linux raid 
autodetect


---

If I use 10-disk RAID5 with 1024 KiB stripe, what would be the 
correct start and end size if I wanted to make sure the RAID5 was 
stripe aligned?


Or is there a better way to do this, does parted handle this 
situation better?


From that setup it seems simple, scrap the partition table and use the 
disk device for raid. This is what we do for all data storage disks 
(hw raid) and sw raid members.


/Mattias Wadenstein



Is there any downside to doing that?  I remember when I had to take my 
machine apart for a BIOS downgrade when I plugged in the sata devices 
again I did not plug them back in the same order, everything worked of 
course but when I ran LILO it said it was not part of the RAID set, 
because /dev/sda had become /dev/sdg and overwrote the MBR on the 
disk; if I had not used partitions here, I'd have lost one (or more) of 
the drives due to a bad LILO run?


As other posts have detailed, putting the partition on a 64k aligned 
boundary can address the performance problems. However, a poor choice of 
chunk size, cache_buffer size, or just random i/o in small sizes can eat 
up a lot of the benefit.


I don't think you need to give up your partitions to get the benefit of 
alignment.


--
Bill Davidsen <[EMAIL PROTECTED]>
 "Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over..." Otto von Bismark 



-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ERROR] scsi.c: In function 'scsi_get_serial_number_page'

2007-12-19 Thread Bill Davidsen

Thierry Iceta wrote:

Hi

I would like to use raidtools-1.00.3 on the Rhel5 distribution,
but I got this error.
Could you tell me if a new version is available or if a patch exists
to use raidtools on Rhel5


raidtools is old and unmaintained. Use mdadm.
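
The mdadm equivalent of a raidtab setup is a one-liner; something like this 
(device names are only an example):

# mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
# mdadm --detail --scan >> /etc/mdadm.conf

mdadm is shipped with RHEL5, so there should be nothing extra to install.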

--
Bill Davidsen <[EMAIL PROTECTED]>
 "Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over..." Otto von Bismark 



-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Raid over 48 disks

2007-12-19 Thread Bill Davidsen

Mattias Wadenstein wrote:

On Wed, 19 Dec 2007, Neil Brown wrote:


On Tuesday December 18, [EMAIL PROTECTED] wrote:

We're investigating the possibility of running Linux (RHEL) on top of
Sun's X4500 Thumper box:

http://www.sun.com/servers/x64/x4500/

Basically, it's a server with 48 SATA hard drives. No hardware RAID.
It's designed for Sun's ZFS filesystem.

So... we're curious how Linux will handle such a beast. Has anyone run
MD software RAID over so many disks? Then piled LVM/ext3 on top of
that? Any suggestions?


There are those that have run Linux MD RAID on thumpers before. I 
vaguely recall some driver issues (unrelated to MD) that made it less 
suitable than solaris, but that might be fixed in recent kernels.



Alternately, 8 6-drive RAID5s or 6 8-drive RAID6s, and use RAID0 to
combine them together.  This would give you adequate reliability and
performance and still a large amount of storage space.


My personal suggestion would be 5 9-disk raid6s, one raid1 root mirror 
and one hot spare. Then raid0, lvm, or separate filesystem on those 5 
raidsets for data, depending on your needs.


Other than thinking raid-10 is better than raid-1 for performance, I like it.
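
Just to make the layering concrete, the sort of thing being suggested looks 
roughly like this (drive names and grouping are only illustrative, untested):

# mdadm --create /dev/md1 --level=6 --raid-devices=9 /dev/sd[b-j]1
  ... repeat for the other four groups of nine ...
# mdadm --create /dev/md6 --level=0 --raid-devices=5 /dev/md[1-5]

or put LVM on top of the five raid6 sets instead of the final raid0 if you 
want more flexibility in carving up the space.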


You get almost as much data space as with the 6 8-disk raid6s, and 
have a separate pair of disks for all the small updates (logging, 
metadata, etc), so this makes a lot of sense if most of the data is 
bulk file access.


--
Bill Davidsen <[EMAIL PROTECTED]>
 "Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over..." Otto von Bismark 



-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Raid over 48 disks

2007-12-19 Thread Bill Davidsen

Thiemo Nagel wrote:

Performance of the raw device is fair:
# dd if=/dev/md2 of=/dev/zero bs=128k count=64k
8589934592 bytes (8.6 GB) copied, 15.6071 seconds, 550 MB/s

Somewhat less through ext3 (created with -E stride=64):
# dd if=largetestfile of=/dev/zero bs=128k count=64k
8589934592 bytes (8.6 GB) copied, 26.4103 seconds, 325 MB/s


Quite slow?

10 disks (raptors) raid 5 on regular sata controllers:

# dd if=/dev/md3 of=/dev/zero bs=128k count=64k
8589934592 bytes (8.6 GB) copied, 10.718 seconds, 801 MB/s

# dd if=bigfile of=/dev/zero bs=128k count=64k
3640379392 bytes (3.6 GB) copied, 6.58454 seconds, 553 MB/s


Interesting.  Any ideas what could be the reason?  How much do you get 
from a single drive?  -- The Samsung HD501LJ that I'm using gives 
~84MB/s when reading from the beginning of the disk.


With RAID 5 I'm getting slightly better results (though I really 
wonder why, since naively I would expect identical read performance) 
but that only accounts for a small part of the difference:


chunk     16k read           64k write
size      RAID 5   RAID 6    RAID 5   RAID 6
128k      492      497       268      270
256k      615      530       288      270
512k      625      607       230      174
1024k     650      620       170      75


What is your stripe cache size?
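
For reference, it lives in sysfs and can be bumped at runtime (md3 is just 
an example name):

# cat /sys/block/md3/md/stripe_cache_size
# echo 8192 > /sys/block/md3/md/stripe_cache_size

Larger values cost memory, roughly stripe_cache_size * 4KiB * number of 
drives, but often help large sequential writes quite a bit.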

--
Bill Davidsen <[EMAIL PROTECTED]>
 "Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over..." Otto von Bismark 



-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: mdadm break / restore soft mirror

2007-12-13 Thread Bill Davidsen

Brett Maton wrote:
Hi, 

  Question for you guys. 

  A brief history: 
  RHEL 4 AS 
  I have a partition with way too many small files on it (usually around a couple of million) that needs to be backed up; standard


  methods mean that a restore is impossibly slow due to the sheer volume of files. 
  Solution, raw backup /restore of the device.  However the partition is permanently being accessed. 


  Proposed solution is to use a software raid mirror.  Before the backup starts, 
break the soft mirror, unmount and back up the partition,

  then restore the soft mirror and let it resync / rebuild itself. 

  Would the above intentional break/fix of the mirror cause any problems? 
  


Probably. If by "accessed" you mean read-only, you can do this, but if 
the data is changing you have a serious problem: data already on the 
disk plus data still queued in memory may leave the on-disk copy as an 
inconsistent data set. Short of stopping the updates at a known valid 
state, I haven't seen a method of backing up a set of changing files 
which really works in all cases.


DM has some snapshot capabilities, but in fact they have the same 
limitation: the data on a partition can be backed up, but unless you can 
ensure that the data is in a consistent state when it's frozen, your 
backup will have some small possibility of failure. Database programs 
have ways to freeze their data to do backups, but if an application 
doesn't have a means to force the data on the disk into a valid state, 
you will only get a "pretty good" backup.


I suggest looking at things like rsync, which will not solve the 
changing data problem, but may do the backup quickly enough to be as 
useful as what you propose. Of course a full backup is likely to take a 
long time however you do it.
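
If you do go the split-mirror route, the mechanical part is simple enough 
(md device and partition names are made up, and this assumes an old-style 
0.90 superblock at the end of the partition so the detached half is 
directly mountable):

# mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1
  ... mount /dev/sdb1 read-only somewhere and run the backup from it ...
# mdadm /dev/md0 --add /dev/sdb1

The re-add triggers a full resync unless you have a write-intent bitmap on 
the array. But none of that addresses the consistency question above; you 
still have to quiesce the application first.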


--
Bill Davidsen <[EMAIL PROTECTED]>
 "Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over..." Otto von Bismark 



-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Kernel 2.6.23.9 / P35 Chipset + WD 750GB Drives (reset port)

2007-12-13 Thread Bill Davidsen

Tejun Heo wrote:

Bill Davidsen wrote:
  

Jan Engelhardt wrote:


On Dec 1 2007 06:26, Justin Piszcz wrote:
  

I ran the following:

dd if=/dev/zero of=/dev/sdc
dd if=/dev/zero of=/dev/sdd
dd if=/dev/zero of=/dev/sde

(as it is always a very good idea to do this with any new disk)


Why would you care about what's on the disk? fdisk, mkfs and
the day-to-day operation will overwrite it _anyway_.

(If you think the disk is not empty, you should look at it
and copy off all usable warez beforehand :-)

  

Do you not test your drive for minimum functionality before using them?



I personally don't.

  

Also, if you have the tools to check for relocated sectors before and
after doing this, that's a good idea as well. S.M.A.R.T is your friend.
And when writing /dev/zero to a drive, if it craps out you have less
emotional attachment to the data.



Writing all zero isn't too useful tho.  Drive failing reallocation on
write is catastrophic failure.  It means that the drive wanna relocate
but can't because it used up all its extra space which usually indicates
something else is seriously wrong with the drive.  The drive will have
to go to the trash can.  This is all serious and bad but the catch is
that in such cases the problem usually stands like a sore thumb so
either vendor doesn't ship such drive or you'll find the failure very
early.  I personally haven't seen any such failure yet.  Maybe I'm lucky.
  


The problem is usually not with what the vendor ships, but what the 
carrier delivers. Bad handling does happen, "drop ship" can have several 
meanings, and I have received shipments with the "G sensor" in the case 
triggered. Zero is a handy source of data, but the important thing is to 
look at the relocated sector count before and after the write. If there 
are a lot of bad sectors initially, the drive is probably a poor choice 
for anything critical.
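
smartctl (from the smartmontools package) makes the before/after comparison 
easy; something along these lines:

# smartctl -A /dev/sdc | grep -i -e reallocated -e pending

before and after the dd run. A climbing Reallocated_Sector_Ct or 
Current_Pending_Sector count is the warning sign to watch for.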

Most data loss occurs when the drive fails to read what it thought it
wrote successfully, rather than the opposite - so reading and dumping the
whole disk to /dev/null periodically is probably much better than writing
zeros, as it allows the drive to find a deteriorating sector early, while
it's still readable, and relocate it.  But then again I think it's overkill.

Writing zeros to sectors is more useful as cure rather than prevention.
 If your drive fails to read a sector, write whatever value to the
sector.  The drive will forget about the data on the damaged sector and
reallocate and write new data to it.  Of course, you lose data which was
originally on the sector.

I personally think it's enough to just throw in an extra disk and make
it RAID0 or 5 and rebuild the array if read fails on one of the disks.
If write fails or read fail continues, replace the disk.  Of course, if
you wanna be extra cautious, good for you.  :-)

  

--
Bill Davidsen <[EMAIL PROTECTED]>
 "Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over..." Otto von Bismark 



-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: unable to remove failed drive

2007-12-10 Thread Bill Davidsen

Jeff Breidenbach wrote:

... and all access to array hangs indefinitely, resulting in unkillable zombie
processes. Have to hard reboot the machine. Any thoughts on the matter?

===

# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sde1[6](F) sdg1[1] sdb1[4] sdd1[3] sdc1[2]
  488383936 blocks [6/4] [_UUUU_]

unused devices: <none>

# mdadm --fail /dev/md1 /dev/sde1
mdadm: set /dev/sde1 faulty in /dev/md1

# mdadm --remove /dev/md1 /dev/sde1
mdadm: hot remove failed for /dev/sde1: Device or resource busy

# mdadm -D /dev/md1
/dev/md1:
Version : 00.90.03
  Creation Time : Sun Dec 25 16:12:34 2005
 Raid Level : raid1
 Array Size : 488383936 (465.76 GiB 500.11 GB)
Device Size : 488383936 (465.76 GiB 500.11 GB)
   Raid Devices : 6
  Total Devices : 5
Preferred Minor : 1
Persistence : Superblock is persistent

Update Time : Fri Dec  7 11:37:46 2007
  State : active, degraded
 Active Devices : 4
Working Devices : 4
 Failed Devices : 1
  Spare Devices : 0

   UUID : f3ee6aa3:2f1d5767:f443dfc0:c23e80af
 Events : 0.22331500

Number   Major   Minor   RaidDevice State
   0   00-  removed
   1   8   971  active sync   /dev/sdg1
   2   8   332  active sync   /dev/sdc1
   3   8   493  active sync   /dev/sdd1
   4   8   174  active sync   /dev/sdb1
   5   00-  removed

   6   8   650  faulty   /dev/sde1

  
This is without doubt really messed up! You have four active devices, 
four working devices, five total devices, and six(!) raid devices. And 
at the end of the output seven(!!) devices, four active, two removed, 
and one faulty. I wouldn't even be able to make a guess how you got to 
this point, but I would guess that some system administration was involved.


If this is an array you can live without and still have a working system 
I do have a thought, however. If you can unmount everything on this 
device and then stop it, you may be able to assemble (-A) it with just 
the four working drives. If that succeeds you may be able to remove 
sde1, although I suspect that the two removed drives shown are really 
caused by partial removal of sde1 in the past. Either that or you have 
a serious problem with reliability...
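
The sort of thing I mean, using the device names from your -D output, after 
unmounting whatever is on md1 (and note mdadm may want --run to start an 
array with members missing):

# mdadm --stop /dev/md1
# mdadm --assemble --run /dev/md1 /dev/sdg1 /dev/sdc1 /dev/sdd1 /dev/sdb1

Once sde1 is no longer part of the running array you could zero its 
superblock (mdadm --zero-superblock /dev/sde1) so nothing tries to pull it 
back in, but only after you're sure you won't need it for recovery.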


I'm sure others will have some ideas on this, if it were mine a backup 
would be my first order of business.


--
Bill Davidsen <[EMAIL PROTECTED]>
 "Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over..." Otto von Bismark 



-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Reading takes 100% precedence over writes for mdadm+raid5?

2007-12-06 Thread Bill Davidsen

Justin Piszcz wrote:
root      2206     1  4 Dec02 ?        00:10:37 dd if=/dev/zero of=1.out bs=1M
root      2207     1  4 Dec02 ?        00:10:38 dd if=/dev/zero of=2.out bs=1M
root      2208     1  4 Dec02 ?        00:10:35 dd if=/dev/zero of=3.out bs=1M
root      2209     1  4 Dec02 ?        00:10:45 dd if=/dev/zero of=4.out bs=1M
root      2210     1  4 Dec02 ?        00:10:35 dd if=/dev/zero of=5.out bs=1M
root      2211     1  4 Dec02 ?        00:10:35 dd if=/dev/zero of=6.out bs=1M
root      2212     1  4 Dec02 ?        00:10:30 dd if=/dev/zero of=7.out bs=1M
root      2213     1  4 Dec02 ?        00:10:42 dd if=/dev/zero of=8.out bs=1M
root      2214     1  4 Dec02 ?        00:10:35 dd if=/dev/zero of=9.out bs=1M
root      2215     1  4 Dec02 ?        00:10:37 dd if=/dev/zero of=10.out bs=1M
root      3080 24.6  0.0  10356  1672 ?    D    01:22   5:51 dd if=/dev/md3 of=/dev/null bs=1M


I was curious: the 10 dd's (which are writing to the RAID 5) run fine, 
no issues, then suddenly they all go into D-state and the single read 
gets 100% priority?


Is this normal?


I'm jumping back to the start of this thread, because after reading all 
the discussion I noticed that you are mixing apples and oranges here. 
Your write programs are going to files in the filesystem, and your read 
is going against the raw device. That may explain why you see something 
I haven't noticed doing all filesystem i/o.


I am going to do a large rsync to another filesystem in the next two 
days, I will turn on some measurements when I do. But if you are just 
investigating this behavior, perhaps you could retry with a single read 
from a file rather than the device.
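
In other words something closer to apples-to-apples (the file name is made up):

# dd if=/some/large/file/on/md3 of=/dev/null bs=1M

so the reader and the writers both go through the filesystem and page cache 
rather than one of them hitting the raw md device.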


   [...snip...]

--
Bill Davidsen <[EMAIL PROTECTED]>
 "Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over..." Otto von Bismark 



-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid6 check/repair

2007-12-05 Thread Bill Davidsen

Peter Grandi wrote:

[ ... on RAID1, ... RAID6 error recovery ... ]

tn> The use case for the proposed 'repair' would be occasional,
tn> low-frequency corruption, for which many sources can be
tn> imagined:

tn> Any piece of hardware has a certain failure rate, which may
tn> depend on things like age, temperature, stability of
tn> operating voltage, cosmic rays, etc. but also on variations
tn> in the production process.  Therefore, hardware may suffer
tn> from infrequent glitches, which are seldom enough, to be
tn> impossible to trace back to a particular piece of equipment.
tn> It would be nice to recover gracefully from that.

What has this got to do with RAID6 or RAID in general? I have
been following this discussion with a sense of bewilderment as I
have started to suspect that parts of it are based on a very
large misunderstanding.

tn> Kernel bugs or just plain administrator mistakes are another
tn> thing.

The biggest administrator mistakes are lack of end-to-end checking
and backups. Those that don't have them wish their storage systems
could detect and recover from arbitrary and otherwise undetected
errors (but see below for bad news on silent corruptions).

tn> But also the case of power-loss during writing that you have
tn> mentioned could profit from that 'repair': With heterogeneous
tn> hardware, blocks may be written in unpredictable order, so
tn> that in more cases graceful recovery would be possible with
tn> 'repair' compared to just recalculating parity.

Redundant RAID levels are designed to recover only from _reported_
errors that identify precisely where the error is. Recovering from
random block writing is something that seems to me to be quite
outside the scope of a low level virtual storage device layer.

ms> I just want to give another suggestion. It may or may not be
ms> possible to repair inconsistent arrays but in either way some
ms> code there MUST at least warn the administrator that
ms> something (may) went wrong.

tn> Agreed.

That sounds instead quite extraordinary to me because it is not
clear how to define ''inconsistency'' in the general case never
mind detect it reliably, and never mind knowing when it is found
how to determine which are the good data bits and which are the
bad.

Now I am starting to think that this discussion is based on the
curious assumption that storage subsystems should solve the so
called ''byzantine generals'' problem, that is to operate reliably
in the presence of unreliable communications and storage.
  
I had missed that. In fact, after rereading most of the thread I *still* 
miss that, so perhaps it's not there. What the OP proposed was that, in 
the case where there is incorrect data on exactly one chunk in a raid-6 
stripe, the incorrect chunk be identified and rewritten with correct 
data. This is based on the assumptions that (a) this case can be 
identified, (b) the correct data value for the chunk can be calculated, 
(c) this only adds processing or i/o overhead when an error condition is 
identified by the existing code, and (d) this can be done without 
significant additional i/o other than rewriting the corrected data.


Given these assumptions the reasons for not adding this logic would seem 
to be (a) one of the assumptions is wrong, (b) it would take a huge 
effort to code or maintain, or (c) it's wrong for raid to fix errors 
other than hardware, even if it could do so. Although I've looked at the 
logic in metadata form, and the code for doing the check now, I realize 
that the assumptions could be wrong, and invite enlightenment. But 
Thiemo posted metacode which I find appears correct, so I don't think 
it's a huge job to code, and since it is in a code path which currently 
always hides an error, it's hard to understand how added code could make 
things worse than they are.


I can actually see the philosophical argument about doing only disk 
errors in raid code, but at least it should be a clear decision made for 
that reason, and not hidden by arguments that this happens rarely. Given 
the state of current hardware, I think virtually all errors happen 
rarely; the problem is that all of them happen occasionally (ref. 
Murphy's Law). We have a tool (check) which finds these problems, why 
not a tool to fix them?
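
For anyone following along, the knobs I'm talking about are the sysfs ones 
(md0 as an example):

# echo check > /sys/block/md0/md/sync_action
# cat /sys/block/md0/md/mismatch_cnt
# echo repair > /sys/block/md0/md/sync_action

Today "repair" simply recalculates P and Q from the data; the whole point 
of this thread is that in the single-bad-chunk case raid6 has enough 
information to do better than that.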


BTW: if this can be done in a user program, mdadm, rather than by code 
in the kernel, that might well make everyone happy. Okay, realistically 
"less unhappy."


--
Bill Davidsen <[EMAIL PROTECTED]>
 "Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over..." Otto von Bismark 



-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Kernel 2.6.23.9 / P35 Chipset + WD 750GB Drives (reset port)

2007-12-01 Thread Bill Davidsen

Jan Engelhardt wrote:

On Dec 1 2007 06:26, Justin Piszcz wrote:

I ran the following:

dd if=/dev/zero of=/dev/sdc
dd if=/dev/zero of=/dev/sdd
dd if=/dev/zero of=/dev/sde

(as it is always a very good idea to do this with any new disk)


Why would you care about what's on the disk? fdisk, mkfs and
the day-to-day operation will overwrite it _anyway_.

(If you think the disk is not empty, you should look at it
and copy off all usable warez beforehand :-)

Do you not test your drive for minimum functionality before using them? 
Also, if you have the tools to check for relocated sectors before and 
after doing this, that's a good idea as well. S.M.A.R.T is your friend. 
And when writing /dev/zero to a drive, if it craps out you have less 
emotional attachment to the data.


--
Bill Davidsen <[EMAIL PROTECTED]>
  "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Abysmal write performance on HW RAID5

2007-11-29 Thread Bill Davidsen
changes and rerun?


So, the RAID5 device has a huge queue of write requests with an average wait
time of more than 2 seconds @ 100% utilization?  Or is this a bug in iostat?

At this point, I'm all ears...I don't even know where to start.  Is ext2 not
a good format for volumes of this size?  Then how to explain the block
device xfer rate being so bad, too?  Is it that I have one drive in the
array that's a different brand?  Or that it has a different cache size?

Anyone have any ideas?
  
You will get more and maybe better, but this is a start just to see if 
the problem responds to obvious changes.


--
Bill Davidsen <[EMAIL PROTECTED]>
 "Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over..." Otto von Bismark 



-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid6 check/repair

2007-11-29 Thread Bill Davidsen

Neil Brown wrote:

On Thursday November 22, [EMAIL PROTECTED] wrote:
  

Dear Neil,

thank you very much for your detailed answer.

Neil Brown wrote:


While it is possible to use the RAID6 P+Q information to deduce which
data block is wrong if it is known that either 0 or 1 datablocks is 
wrong, it is *not* possible to deduce which block or blocks are wrong

if it is possible that more than 1 data block is wrong.
  

If I'm not mistaken, this is only partly correct.  Using P+Q redundancy,
it *is* possible, to distinguish three cases:
a) exactly zero bad blocks
b) exactly one bad block
c) more than one bad block

Of course, it is only possible to recover from b), but one *can* tell,
whether the situation is a) or b) or c) and act accordingly.



It would seem that either you or Peter Anvin is mistaken.

On page 9 of 
  http://www.kernel.org/pub/linux/kernel/people/hpa/raid6.pdf

at the end of section 4 it says:

  Finally, as a word of caution it should be noted that RAID-6 by
  itself cannot even detect, never mind recover from, dual-disk
  corruption. If two disks are corrupt in the same byte positions,
  the above algorithm will in general introduce additional data
  corruption by corrupting a third drive.

  

The point that I'm trying to make is, that there does exist a specific
case, in which recovery is possible, and that implementing recovery for
that case will not hurt in any way.



Assuming that it true (maybe hpa got it wrong) what specific
conditions would lead to one drive having corrupt data, and would
correcting it on an occasional 'repair' pass be an appropriate
response?

Does the value justify the cost of extra code complexity?

  
RAID is not designed to protect against bad RAM, bad cables, chipset 
bugs, driver bugs, etc.  It is only designed to protect against drive 
failure, where the drive failure is apparent.  i.e. a read must 
return either the same data that was last written, or a failure 
indication. Anything else is beyond the design parameters for RAID.
  

I'm taking a more pragmatic approach here.  In my opinion, RAID should
"just protect my data", against drive failure, yes, of course, but if it
can help me in case of occasional data corruption, I'd happily take
that, too, especially if it doesn't cost extra... ;-)



Everything costs extra.  Code uses bytes of memory, requires
maintenance, and possibly introduced new bugs.  I'm not convinced the
failure mode that you are considering actually happens with a
meaningful frequency.
  


People accept the hardware and performance costs of raid-6 in return for 
the better security of their data. If I run a check and find that I have 
an error, right now I have to treat that the same way as an 
unrecoverable failure, because the "repair" function doesn't fix the 
data, it just makes the symptom go away by redoing the p and q values.


This makes the naive user think the problem is solved, when in fact 
it's now worse, he has corrupt data with no indication of a problem. The 
fact that (most) people who read this list are advanced enough to 
understand the issue does not protect the majority of users from their 
ignorance. If that sounds elitist, many of the people on this list are 
the elite, and even knowing that you need to learn and understand more 
is a big plus in my book. It's the people who run repair and assume the 
problem is fixed who get hurt by the current behavior.


If you won't fix the recoverable case by recovering, then maybe for 
raid-6 you could print an error message like

 can't recover data, fix parity and hide the problem (y/N)?
or require a --force flag, and at least give a heads up to the people 
who just picked the "most reliable raid level" because they're trying to 
do it right, but need a clue that they have a real and serious problem, 
and just a "repair" can't fix it.


Recovering a filesystem full of "just files" is pretty easy; that's what 
backups with CRCs are for. But a large database recovery often takes 
hours to restore the data and replay journal files. I personally consider 
it the job of the kernel to do recovery when it is possible; absent that, 
I would like the tools to tell me clearly that I have a problem and what it is.


--
Bill Davidsen <[EMAIL PROTECTED]>
 "Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over..." Otto von Bismark 



-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid6 check/repair

2007-11-26 Thread Bill Davidsen

Thiemo Nagel wrote:

Dear Neil,

thank you very much for your detailed answer.

Neil Brown wrote:

While it is possible to use the RAID6 P+Q information to deduce which
data block is wrong if it is known that either 0 or 1 datablocks is 
wrong, it is *not* possible to deduce which block or blocks are wrong

if it is possible that more than 1 data block is wrong.


If I'm not mistaken, this is only partly correct.  Using P+Q redundancy,
it *is* possible, to distinguish three cases:
a) exactly zero bad blocks
b) exactly one bad block
c) more than one bad block

Of course, it is only possible to recover from b), but one *can* tell,
whether the situation is a) or b) or c) and act accordingly.
I was waiting for a response before saying "me too," but that's exactly 
the case: there is a class of failures, other than power failure or total 
device failure, which results in just the "one identifiable bad sector" 
situation. Given that the data needs to be read to realize that it is bad, 
why not go the extra inch and fix it properly instead of redoing the p+q, 
which just makes the problem invisible rather than fixing it.


Obviously this is a subset of all the things which can go wrong, but I 
suspect it's a sizable subset.


--
Bill Davidsen <[EMAIL PROTECTED]>
 "Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over..." Otto von Bismark 



-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: HELP! New disks being dropped from RAID 6 array on every reboot

2007-11-24 Thread Bill Davidsen

Joshua Johnson wrote:

Greetings, long time listener, first time caller.

I recently replaced a disk in my existing 8 disk RAID 6 array.
Previously, all disks were PATA drives connected to the motherboard
IDE and 3 promise Ultra 100/133 controllers.  I replaced one of the
Promise controllers with a Via 64xx based controller, which has 2 SATA
ports and one PATA port.  I connected a new SATA drive to the new
card, partitioned the drive and added it to the array.  After 5 or 6
hours the resyncing process finished and the array showed up complete.
 Upon rebooting I discovered that the new drive had not been added to
the array when it was assembled on boot.   I resynced it and tried
again -- still would not persist after a reboot.  I moved one of the
existing PATA drives to the new controller (so I could have the slot
for network), rebooted and rebuilt the array.  Now when I reboot BOTH
disks are missing from the array (sda and sdb).  Upon examining the
disks it appears they think they are part of the array, but for some
reason they are not being added when the array is being assembled.
For example, this is a disk on the new controller which was not added
to the array after rebooting:

# mdadm --examine /dev/sda1
/dev/sda1:
  Magic : a92b4efc
Version : 00.90.03
   UUID : 63ee7d14:a0ac6a6e:aef6fe14:50e047a5
  Creation Time : Thu Sep 21 23:52:19 2006
 Raid Level : raid6
Device Size : 191157248 (182.30 GiB 195.75 GB)
 Array Size : 1146943488 (1093.81 GiB 1174.47 GB)
   Raid Devices : 8
  Total Devices : 8
Preferred Minor : 0

Update Time : Fri Nov 23 10:22:57 2007
  State : clean
 Active Devices : 8
Working Devices : 8
 Failed Devices : 0
  Spare Devices : 0
   Checksum : 50df590e - correct
 Events : 0.96419878

 Chunk Size : 256K

  Number   Major   Minor   RaidDevice State
this 6   816  active sync   /dev/sda1

   0 0   320  active sync   /dev/hda2
   1 1  5721  active sync   /dev/hdk2
   2 2  3322  active sync   /dev/hde2
   3 3  3423  active sync   /dev/hdg2
   4 4  2224  active sync   /dev/hdc2
   5 5  5625  active sync   /dev/hdi2
   6 6   816  active sync   /dev/sda1
   7 7   8   177  active sync   /dev/sdb1


Everything there seems to be correct and current up to the last
shutdown.  But the disk is not being added on boot.  Examining a disk
that is currently running in the array shows:

# mdadm --examine /dev/hdc2
/dev/hdc2:
  Magic : a92b4efc
Version : 00.90.03
   UUID : 63ee7d14:a0ac6a6e:aef6fe14:50e047a5
  Creation Time : Thu Sep 21 23:52:19 2006
 Raid Level : raid6
Device Size : 191157248 (182.30 GiB 195.75 GB)
 Array Size : 1146943488 (1093.81 GiB 1174.47 GB)
   Raid Devices : 8
  Total Devices : 6
Preferred Minor : 0

Update Time : Fri Nov 23 10:23:52 2007
  State : clean
 Active Devices : 6
Working Devices : 6
 Failed Devices : 2
  Spare Devices : 0
   Checksum : 50df5934 - correct
 Events : 0.96419880

 Chunk Size : 256K

  Number   Major   Minor   RaidDevice State
this 4  2224  active sync   /dev/hdc2

   0 0   320  active sync   /dev/hda2
   1 1  5721  active sync   /dev/hdk2
   2 2  3322  active sync   /dev/hde2
   3 3  3423  active sync   /dev/hdg2
   4 4  2224  active sync   /dev/hdc2
   5 5  5625  active sync   /dev/hdi2
   6 6   006  faulty removed
   7 7   007  faulty removed


Here is my /etc/mdadm/mdadm.conf:

DEVICE partitions
PROGRAM /bin/echo
MAILADDR 
ARRAY /dev/md0 level=raid6 num-devices=8
UUID=63ee7d14:a0ac6a6e:aef6fe14:50e047a5


Can anyone see anything that is glaringly wrong here?  Has anybody
experienced similar behavior?  I am running Debian using kernel
2.6.23.8.  All partitions are set to type 0xFD and it appears the
superblocks on the sd* disks were written, why wouldn't they be added
to the array on boot?  Any help is greatly appreciated!


Does that match what's in the init files used at boot? By any chance 
does the information there explicitly list partitions by name? If you 
change to "PARTITIONS" in /etc/mdadm.conf it won't bite you until you 
change the detected partitions so they no longer match what was correct 
at install time.
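
On Debian the thing I'd check first is the copy of mdadm.conf inside the 
initramfs, which may still predate the new controller; something along 
these lines (paths are from memory, adjust as needed):

# zcat /boot/initrd.img-$(uname -r) | cpio -t | grep mdadm
# update-initramfs -u

so that the boot-time assembly sees the same DEVICE and ARRAY lines you 
pasted above.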


--
bill davidsen <[EMAIL PROTECTED]>
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: md RAID 10 on Linux 2.6.20?

2007-11-23 Thread Bill Davidsen

[EMAIL PROTECTED] wrote:

Hi all,

I am running a home-grown Linux 2.6.20.11 SMP 64-bit build, and I am 
wondering if there is indeed a RAID 10 "personality" defined in md 
that can be implemented using mdadm. If so, is it available in 
2.6.20.11, or is it in a later kernel version? In the past, to create 
RAID 10, I created RAID 1's and a RAID 0, so an 8 drive RAID 10 would 
actually consist of 5 md devices (four RAID 1's and one RAID 0). But 
if I could just use RAID 10 natively, and simply create one RAID 10, 
that would of course be better both in terms of management and 
probably performance I would guess. Is this possible?


Yes, and you are correct on the performance. Read the man page section 
on "near" and "far" copies of data carefully, and some back posts to 
this list. Most of us find that using far copies generates slightly 
slower write performance and significantly better read performance.
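
Creating it natively is just one command; for example with eight drives and 
the far layout (device names are placeholders):

# mdadm --create /dev/md0 --level=10 --layout=f2 --raid-devices=8 /dev/sd[b-i]1

versus --layout=n2 for the default near layout.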


--
bill davidsen <[EMAIL PROTECTED]>
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid5 failure + libata irq: nobody cared

2007-11-19 Thread Bill Davidsen

Vincze, Tamas wrote:

Hi,

Last night a drive failed in my RAID5 array and it was kicked
out of the array, continuing with 3 drives as expected.

However a few minutes later this was logged:

irq 18: nobody cared (try booting with the "irqpoll" option)
Call Trace:  {__report_bad_irq+48}
   {note_interrupt+433} 
{__do_IRQ+191}


IRQ 18 belongs to the SATA controller where all 4 drives are connected.


The troubling thing is that the controller was still in use, and there 
should have been handling for the "nobody cared" interrupt. It sounds as 
if the failed drive didn't get marked to be ignored, or logged then 
ignored. I'd love to know what generated the IRQ in the first place.


Nothing more was logged, probably because the interrupt got disabled,
making it impossible to talk to the drives anymore. It's bad because
I ended up with a dirty degraded array the second time this year.

How would a RAID-6 handle a crash when a drive is missing?
Would that also lead to possible silent corruptions?
Or is the only option to avoid silent corruptions is a battery
backed hardware controller?


Kernel is 2.6.16-1.2133_FC5


There have been a lot of improvements in raid since then.


Here's the full log:

Nov 16 00:43:10 p4 kernel: ata1: command 0xea timeout, stat 0xd0 
host_stat 0x0

Nov 16 00:43:10 p4 kernel: ata1: status=0xd0 { Busy }
Nov 16 00:43:10 p4 kernel: ATA: abnormal status 0xD0 on port 0xE407
Nov 16 00:43:10 p4 last message repeated 2 times
Nov 16 01:30:06 p4 kernel: ata1: command 0xea timeout, stat 0xd0 
host_stat 0x0

Nov 16 01:30:06 p4 kernel: ata1: status=0xd0 { Busy }
Nov 16 01:30:06 p4 kernel: ATA: abnormal status 0xD0 on port 0xE407
Nov 16 01:30:06 p4 last message repeated 2 times
Nov 16 01:34:13 p4 kernel: ata1: command 0xea timeout, stat 0xd0 
host_stat 0x0

Nov 16 01:34:13 p4 kernel: ata1: status=0xd0 { Busy }
Nov 16 01:34:13 p4 kernel: ATA: abnormal status 0xD0 on port 0xE407
Nov 16 01:34:13 p4 last message repeated 2 times
Nov 16 01:35:13 p4 kernel: ata1: command 0x35 timeout, stat 0xd0 
host_stat 0x61

Nov 16 01:35:13 p4 kernel: ata1: status=0xd0 { Busy }
Nov 16 01:35:13 p4 kernel: sd 0:0:0:0: SCSI error: return code = 
0x802

Nov 16 01:35:13 p4 kernel: sda: Current: sense key: Aborted Command
Nov 16 01:35:13 p4 kernel: Additional sense: Scsi parity error
Nov 16 01:35:13 p4 kernel: end_request: I/O error, dev sda, sector 
781015848

Nov 16 01:35:43 p4 kernel: ATA: abnormal status 0xD0 on port 0xE407
Nov 16 01:35:44 p4 last message repeated 2 times
Nov 16 01:35:44 p4 kernel: ata1: command 0xea timeout, stat 0xd0 
host_stat 0x0

Nov 16 01:35:44 p4 kernel: ata1: status=0xd0 { Busy }
Nov 16 01:35:44 p4 kernel: raid5: Disk failure on sda3, disabling 
device. Operation continuing on 3 devices

Nov 16 01:35:44 p4 kernel: ATA: abnormal status 0xD0 on port 0xE407
Nov 16 01:35:44 p4 kernel: RAID5 conf printout:
Nov 16 01:35:44 p4 kernel:  --- rd:4 wd:3 fd:1
Nov 16 01:35:44 p4 kernel:  disk 0, o:0, dev:sda3
Nov 16 01:35:44 p4 kernel:  disk 1, o:1, dev:sdc3
Nov 16 01:35:44 p4 kernel:  disk 2, o:1, dev:sdb3
Nov 16 01:35:44 p4 kernel:  disk 3, o:1, dev:sdd3
Nov 16 01:35:44 p4 kernel: RAID5 conf printout:
Nov 16 01:35:44 p4 kernel:  --- rd:4 wd:3 fd:1
Nov 16 01:35:44 p4 kernel:  disk 1, o:1, dev:sdc3
Nov 16 01:35:44 p4 kernel:  disk 2, o:1, dev:sdb3
Nov 16 01:35:44 p4 kernel:  disk 3, o:1, dev:sdd3
Nov 16 01:37:36 p4 kernel: irq 18: nobody cared (try booting with the 
"irqpoll" option)

Nov 16 01:37:36 p4 kernel:
Nov 16 01:37:36 p4 kernel: Call Trace:  
{__report_bad_irq+48}


Nov 16 01:37:36 p4 kernel:
{note_interrupt+433} {__do_IRQ+191}



-Tamas
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html




--
bill davidsen <[EMAIL PROTECTED]>
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: PROBLEM: raid5 hangs

2007-11-14 Thread Bill Davidsen

Justin Piszcz wrote:
This is a known bug in 2.6.23 and should be fixed in 2.6.23.2 if the 
RAID5 bio* patches are applied.


Note below he's running 2.6.22.3, which doesn't have the bug unless 
-STABLE added it, so it should not really be in 2.6.22.anything. I assume 
you're talking about the endless write or bio issue?


Justin.

On Wed, 14 Nov 2007, Peter Magnusson wrote:


Hey.

[1.] One line summary of the problem:

raid5 hangs and use 100% cpu

[2.] Full description of the problem/report:

I had used 2.6.18 for 284 days or so until my power supply died; no 
problems whatsoever during that time. After that forced reboot I made 
these changes: put in 2 GB more memory so I have 3 GB instead of 1 GB, 
and since two disks in the raid5 had developed bad blocks and I didn't 
trust them anymore, I bought new disks (I managed to save the raid5). 
I have 6x300 GB in a raid5. Two of the new disks are 320 GB, so I 
created a small raid1 as well. The raid5 is encrypted with 
aes-cbc-plain, the raid1 with aes-cbc-essiv:sha256.


I compiled linux-2.6.22.3 and started to use that. I used the same .config 
as the default FC5 one; I think I just selected the P4 CPU and the 
preemptible kernel type.


After 11 or 12 days the computer froze. I wasn't home when it happened and
couldn't fix it for about 3 days. All I could do was reboot it, as it wasn't 
possible to log in remotely or on the console. It did respond to ping, 
however.

After the reboot it rebuilt the raid5.

Then it happened again after approximately the same time, 11 or 12 days. I 
noticed that the process md1_raid5 used 100% cpu all the time. After the 
reboot it rebuilt the raid5.

I compiled linux-2.6.23.

And then... it happened again, after about the same time as before.
md1_raid5 used 100% cpu. I also noticed that I wasn't able to save
anything in my home directory; it froze during the save. I could read from 
it, however. My home directory isn't on the raid5, but it is encrypted. 
It's not on any disk that has to do with the raid. This problem didn't 
happen when I used 2.6.18. Currently I use 2.6.18 as I kind of need the 
computer to stay stable.

After the reboot it rebuilt the raid5.


--
bill davidsen <[EMAIL PROTECTED]>
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Proposal: non-striping RAID4

2007-11-14 Thread Bill Davidsen

James Lee wrote:

From a quick search through this mailing list, it looks like I can
answer my own question regarding RAID1 --> RAID5 conversion.  Instead
of creating a RAID1 array for the partitions on the two biggest
drives, it should just create a 2-drive RAID5 (which is identical, but
can be expanded as with any other RAID5 array).

So it looks like this should work I guess.


I believe what you want to create might be a three drive raid-5 with one 
failed drive. That way you can just add a drive when you want.


 mdadm -C -c32 -l5 -n3 -amd /dev/md7 /dev/loop[12] missing

Then you can add another drive:

 mdadm --add /dev/md7 /dev/loop3

The output are at the end of this message.

But in general I think it would be really great to be able to have a 
format which would do raid-5 or raid-6 over all the available parts of 
multiple drives, and since there's some similar logic for raid-10 over a 
selection of drives it is clearly possible. But in terms of the benefit 
to be gained, unless it falls out of the existing code and someone feels 
the desire to do it, I can't see much joy in ever having such a thing.


The feature I would really like to have is raid5e, distributed spare so 
head motion is spread over all drives. Don't have time to look at that 
one, either, but it really helps performance under load with small arrays.


--
bill davidsen <[EMAIL PROTECTED]>
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Building a new raid6 with bitmap does not clear bits during resync

2007-11-14 Thread Bill Davidsen

Neil Brown wrote:

On Monday November 12, [EMAIL PROTECTED] wrote:
  

Neil Brown wrote:


However there is value in regularly updating the bitmap, so add code
to periodically pause while all pending sync requests complete, then
update the bitmap.  Doing this only every few seconds (the same as the
bitmap update time) does not noticeably affect resync performance.
  
  
I wonder if a minimum time and minimum number of stripes would be 
better. If a resync is going slowly because it's going over a slow link 
to iSCSI, nbd, or a box of cheap drives fed off a single USB port, just 
writing the updated bitmap may represent as much data as has been 
resynced in the time slice.


Not a suggestion, but a request for your thoughts on that.



Thanks for your thoughts.
Choosing how often to update the bitmap during a sync is certainly not
trivial.   In different situations, different requirements might rule.

I chose to base it on time, and particularly on the time we already
have for "how soon to write back clean bits to the bitmap" because it
is fairly easy to users to understand the implications (if I set the
time to 30 seconds, then I might have to repeat 30second of resync)
and it is already configurable (via the "--delay" option to --create
--bitmap).
  


Sounds right, that part of it is pretty user friendly.

Presumably if someone has a very slow system and wanted to use
bitmaps, they would set --delay relatively large to reduce the cost
and still provide significant benefits.  This would effect both normal
clean-bit writeback and during-resync clean-bit-writeback.

Hope that clarifies my approach.
  


Easy to implement and understand is always a strong point, and a user 
can make an informed decision. Thanks for the discussion.
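
For anyone wanting to play with it, the delay is set when the bitmap is 
added (md0 and 30 seconds are just examples):

# mdadm --grow /dev/md0 --bitmap=internal --delay=30

and as far as I know you would remove the bitmap (--bitmap=none) and re-add 
it to change the delay on an existing bitmap.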


--
bill davidsen <[EMAIL PROTECTED]>
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Building a new raid6 with bitmap does not clear bits during resync

2007-11-12 Thread Bill Davidsen

Neil Brown wrote:

On Thursday November 8, [EMAIL PROTECTED] wrote:
  

Hi,

I have created a new raid6:

md0 : active raid6 sdb1[0] sdl1[5] sdj1[4] sdh1[3] sdf1[2] sdd1[1]
  6834868224 blocks level 6, 512k chunk, algorithm 2 [6/6] [UU]
  [>]  resync = 21.5% (368216964/1708717056) 
finish=448.5min speed=49808K/sec
  bitmap: 204/204 pages [816KB], 4096KB chunk

The raid is totally idle, not mounted and nothing.

So why does the "bitmap: 204/204" not sink? I would expect it to clear
bits as it resyncs so it should count slowly down to 0. As a side
effect of the bitmap being all dirty the resync will restart from the
beginning when the system is hard reset. As you can imagine that is
pretty annoying.

On the other hand on a clean shutdown it seems the bitmap gets updated
before stopping the array:

md3 : active raid6 sdc1[0] sdm1[5] sdk1[4] sdi1[3] sdg1[2] sde1[1]
  6834868224 blocks level 6, 512k chunk, algorithm 2 [6/6] [UU]
  [===>.]  resync = 38.4% (656155264/1708717056) 
finish=17846.4min speed=982K/sec
  bitmap: 187/204 pages [748KB], 4096KB chunk

Consequently the rebuild did restart and is already further along.




Thanks for the report.

  

Any ideas why that is so?



Yes.  The following patch should explain (a bit tersely) why this was
so, and should also fix it so it will no longer be so.  Test reports
always welcome.

NeilBrown

Status: ok

Update md bitmap during resync.

Currently and md array with a write-intent bitmap does not updated
that bitmap to reflect successful partial resync.  Rather the entire
bitmap is updated when the resync completes.

This is because there is no guarantee that resync requests will
complete in order, and tracking each request individually is
unnecessarily burdensome.

However there is value in regularly updating the bitmap, so add code
to periodically pause while all pending sync requests complete, then
update the bitmap.  Doing this only every few seconds (the same as the
bitmap update time) does not noticeably affect resync performance.
  


I wonder if a minimum time and minimum number of stripes would be 
better. If a resync is going slowly because it's going over a slow link 
to iSCSI, nbd, or a box of cheap drives fed off a single USB port, just 
writing the updated bitmap may represent as much data as has been 
resynced in the time slice.


Not a suggestion, but a request for your thoughts on that.

--
bill davidsen <[EMAIL PROTECTED]>
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Proposal: non-striping RAID4

2007-11-11 Thread Bill Davidsen
 cost of maintaining it is justified by the benefit, 
but not my decision. If you were to set up such a thing using FUSE, 
keeping it out of the kernel but still providing the functionality, it 
might be worth doing. On the other hand, setting up the partitions and 
creating the arrays could probably be done by a perl script which would 
take only a few hours to write.

PS: In case it wasn't clear, the attached code is simply the code the
author has released under GPL - it's intended just for reference, not
as proposed code for review.
  
Much as I generally like adding functionality, I *really* can't see much 
in this idea. It seems to me to be in the "clever but not useful" category.


--
bill davidsen <[EMAIL PROTECTED]>
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Raid5 assemble after dual sata port failure

2007-11-11 Thread Bill Davidsen

David Greaves wrote:

Chris Eddington wrote:
  

Yes, there is some kind of media error message in dmesg, below.  It is
not random, it happens at exactly the same moments in each xfs_repair -n
run.
Nov 11 09:48:25 altair kernel: [37043.300691]  res
51/40:00:01:00:00/00:00:00:00:00/e1 Emask 0x9 (media error)
Nov 11 09:48:25 altair kernel: [37043.304326] ata4.00: ata_hpa_resize 1:
sectors = 976773168, hpa_sectors = 976773168
Nov 11 09:48:25 altair kernel: [37043.307672] ata4.00: ata_hpa_resize 1:
sectors = 976773168, hpa_sectors = 976773168



I'm not sure what an ata_hpa_resize error is...
  


HPA = Hardware Protected Area.

By any chance is this disk partitioned such that the partition size 
includes the HPA? If it does, this sounds at least familiar; this 
mailing list post may get you started: 
http://osdir.com/ml/linux.ataraid/2005-09/msg2.html


In any case, run "fdisk -l" and look at the claimed total disk size and 
the end point of the last partition. The HPA is not included in the 
reported "disk size", so no partition should be extending into it.

It probably explains the problems you've been having with the raid not 'just
recovering' though.

I saw this:
http://www.linuxquestions.org/questions/linux-kernel-70/sata-issues-568894/
  


May be the same thing. Let us know what fdisk reports.


What does smartctl say about your drive?

IMO the spare drive is no longer useful for data recovery - you may want to use
ddrescue to try and copy this drive to the spare drive.

David
PS Don't get the ddrescue parameters the wrong way round if you go that route...
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

  



--
bill davidsen <[EMAIL PROTECTED]>
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Software raid - controller options

2007-11-08 Thread Bill Davidsen

Lyle Schlueter wrote:

Hello,

I just started looking into software raid with linux a few weeks ago. I
am outgrowing the commercial NAS product that I bought a while back.
I've been learning as much as I can, suscribing to this mailing list,
reading man pages, experimenting with loopback devices setting up and
expanding test arrays. 


I have a few questions now that I'm sure someone here will be able to
enlighten me about.
First, I want to run a 12 drive raid 6; honestly, would I be better off
going with true hardware raid like the areca ARC-1231ML vs software
raid? I would prefer software raid just for the sheer cost savings. But
what kind of processing power would it take to match or exceed a mid to
high-level hardware controller?

I haven't seen much, if any, discussion of this, but how many drives are
people putting into software arrays? And how are you going about it?
Motherboards seem to max out around 6-8 SATA ports. Do you just add SATA
controllers? Looking around on newegg (and some googling) 2-port SATA
controllers are pretty easy to find, but once you get to 4 ports the
cards all seem to include some sort of built in *raid* functionality.
Are there any 4+ port PCI-e SATA controllers cards? 
  


Depending on your needs for transfer rate vs. capacity, newegg has at 
least one external enclosure which holds (from memory) 8 drives, and 
brings the data down on a single SATA connector. If you need lots of 
data online but not a high transfer rate, this might be useful. I was 
offered the enclosure as a "deal" with a DVD burner, don't know what 
they were thinking there. Ordered the DVD Tues, arrived Wed, but I don't 
need the hot swap case.

Are there any specific chipsets/brands of motherboards or controller
cards that you software raid veterans prefer?

Thank you for your time and any info you are able to give me!

Lyle

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

  



--
bill davidsen <[EMAIL PROTECTED]>
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 2.6.23.1: mdadm/raid5 hung/d-state

2007-11-08 Thread Bill Davidsen

Jeff Lessem wrote:

Dan Williams wrote:
> The following patch, also attached, cleans up cases where the code 
looks

> at sh->ops.pending when it should be looking at the consistent
> stack-based snapshot of the operations flags.

I tried this patch (against a stock 2.6.23), and it did not work for
me.  Not only did I/O to the effected RAID5 & XFS partition stop, but
also I/O to all other disks.  I was not able to capture any debugging
information, but I should be able to do that tomorrow when I can hook
a serial console to the machine.


That can't be good! This is worrisome because Joel is giddy with joy 
because it fixes his iSCSI problems. I was going to try it with nbd, but 
perhaps I'll wait a week or so and see if others have more information. 
Applying patches before a holiday weekend is a good way to avoid time 
off. :-(


I'm not sure if my problem is identical to these others, as mine only
seems to manifest with RAID5+XFS.  The RAID rebuilds with no problem,
and I've not had any problems with RAID5+ext3.


Hopefully it's not the raid which is the issue.

--
bill davidsen <[EMAIL PROTECTED]>
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Was: [RFC PATCH 2.6.23.1] md: add dm-raid1 read balancing

2007-11-08 Thread Bill Davidsen

Goswin von Brederlow wrote:

Konstantin Sharlaimov <[EMAIL PROTECTED]> writes:

  

On Wed, 2007-11-07 at 10:15 +0100, Goswin von Brederlow wrote:


I wonder if there shouldn't be a way to turn this off (or if there
already is one).

Or more generaly an option to say what is "near". Specifically I would
like to teach the raid1 layer that I have 2 external raid boxes with a
16k chunk size. So read/write within a 16k chunk will be the same disk
but the next 16k are a different disk and "near" doesn't apply
anymore.
  

Currently there is no way to turn this feature off (this is only a
"request for comments" patch), but I'm planning to make it configurable
via sysfs and module parameters.

Thanks for suggestion for the "near" definition. What do you think about
adding the "chunk_size" parameter (with the default value of 1 chunk = 1
sector). Setting it to 32 will make all reads within 16k chunk to be
considered "near" (with zero distance) so they will go to the same disk.

Max distance will also be configurable (after this distance the "read"
operation is considered "far" and will go to randomly chosen disk)

Regards,
Konstantin



Maybe you need more parameter:

chunk_size- size of a continious chunk on the (multi disk) device
stripe_size   - size of a stripe of chunks spanning all disks
rotation_size - size of multiple stripes before parity rotates to a
new disk (sign gives direction of rotation)
near_size - size that is considered to be near on a disk

I would give all sizes in blocks of 512 bytes or bytes.
  


Why? Would there ever be a time when there  would be a significant (or 
any) benefit from a size other than a multiple of chunk size? If you 
give the rest of the sizes in multiples of chunk size it invites less 
human math.

Default would be:
  


Default would be "zero" to indicate that the raid system should figure 
out what to use, allowing the value of "one" to actually mean what it 
says. It also invites use of zero for the rest of the calculated sizes, 
indicating the raid subsystem should select values. With coming SSD 
hardware you may actually want one to mean one.

chunk_size= 1 (block)
stripe_size   = 1 (block)
rotation_size = 0 (no rotation)
near_size = 256

That would reflect that you have all chunks continious on a normal
disk and read/writes are done in 128K chunks.

For raid 1 on raid 0:

chunk_size  = raid chunk size
stripe_size = num disks * chunk_size
rotation_size = 0
near_size = 256

For raid 1 on raid 5:

chunk_size  = raid chunk size
stripe_size = (num disks - 1) * chunk_size
rotation_size = (num disks - 1) * chunk_size  (?)
near_size = 256

and so on.

MfG
Goswin
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

  



--
bill davidsen <[EMAIL PROTECTED]>
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Was: [RFC PATCH 2.6.23.1] md: add dm-raid1 read balancing

2007-11-08 Thread Bill Davidsen

Rik van Riel wrote:

On Thu, 08 Nov 2007 17:28:37 +0100
Goswin von Brederlow <[EMAIL PROTECTED]> wrote:

  

Maybe you need more parameter:



Generally a bad idea, unless you can come up with sane defaults (which
do not need tuning 99% of the time) or you can derive these parameters
automatically from the RAID configuration (unlikely with RAID 1?).

Before turning Konstantin's nice and simple code into something complex,
it would be good to see benchmark results show that the complexity is
actually needed.

  
I was about to post a question about the benefit of all this logic; I'll 
trim before posting...


--
bill davidsen <[EMAIL PROTECTED]>
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: switching root fs '/' to boot from RAID1 with grub

2007-11-03 Thread Bill Davidsen

H. Peter Anvin wrote:

Bill Davidsen wrote:


Depends how "bad" the drive is.  Just to align the thread on this -  
If the boot sector is bad - the bios on newer boxes will skip to the 
next one.  But if it is "good", and you boot into garbage - - could 
be Windows.. does it crash?


Right, if the drive is dead almost every BIOS will fail over, if the 
read gets a CRC or similar most recent BIOS will fail over, but if an 
error-free read returns bad data, how can the BIOS know.




Unfortunately the Linux boot format doesn't contain any sort of 
integrity check.  Otherwise the bootloader could catch this kind of 
error and throw a failure, letting the next disk boot (or another 
kernel.)


I don't understand your point; unless there's a Linux bootloader in the 
BIOS it will boot whatever 512 bytes are in sector 0. So if that's crap 
it doesn't matter what it would do if the sector were valid, because some 
other bytes came off the drive instead. Maybe the garbage is Windows, 
since there seems to be an option in Windows to check the boot sector on 
boot and rewrite it if it isn't the WinXP one. One of my offspring has 
that problem: dual boot system, and every time he boots Windows he has to 
boot from rescue media and reinstall grub.


I think he could install grub in the partition, make that the active 
partition, and the boot would work, but he tried it and only FAT or VFAT 
type partitions seem to boot, active or not.
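
In case anyone wants to try the same thing, the steps I have in mind look 
roughly like this (a sketch only; (hd0,1) is a placeholder for whichever 
partition actually holds /boot):

   grub> root (hd0,1)
   grub> setup (hd0,1)    # stage1 goes into the partition boot sector, not the MBR

then mark that partition active/bootable from fdisk with the 'a' command. 
Whether anything will then chainload it reliably is another matter, which 
seems to be exactly what he ran into.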


--
bill davidsen <[EMAIL PROTECTED]>
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: switching root fs '/' to boot from RAID1 with grub

2007-11-03 Thread Bill Davidsen

berk walker wrote:

H. Peter Anvin wrote:

Doug Ledford wrote:


device /dev/sda (hd0)
root (hd0,0)
install --stage2=/boot/grub/stage2 /boot/grub/stage1 (hd0) 
/boot/grub/e2fs_stage1_5 p /boot/grub/stage2 /boot/grub/menu.lst

device /dev/hdc (hd0)
root (hd0,0)
install --stage2=/boot/grub/stage2 /boot/grub/stage1 (hd0) 
/boot/grub/e2fs_stage1_5 p /boot/grub/stage2 /boot/grub/menu.lst


That will install grub on the master boot record of hdc and sda, and in
both cases grub will look to whatever drive it is running on for the
files to boot instead of going to a specific drive.



No, it won't... it'll look for the first drive in the system (BIOS 
drive 80h).  This means that if the BIOS can see the bad drive, but 
it doesn't work, you're still screwed.


-hpa
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Depends how "bad" the drive is.  Just to align the thread on this -  
If the boot sector is bad - the bios on newer boxes will skip to the 
next one.  But if it is "good", and you boot into garbage - - could be 
Windows.. does it crash?


Right, if the drive is dead almost every BIOS will fail over, if the 
read gets a CRC or similar most recent BIOS will fail over, but if an 
error-free read returns bad data, how can the BIOS know.


--
bill davidsen <[EMAIL PROTECTED]>
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Superblocks

2007-11-03 Thread Bill Davidsen

Greg Cormier wrote:

Any reason 0.9 is the default? Should I be worried about using 1.0
superblocks? And can I "upgrade" my array from 0.9 to 1.0 superblocks?
  


Do understand that Neil may have other reasons... but mainly the 0.9 
format is the default because it is most widely supported and allows you 
to use new mdadm versions on old distributions (I still have one FC1 
machine!). As for changing metadata on an existing array, I really can't 
offer any help.
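
If you just want the newer format on new arrays without changing the 
default, it can be requested per array at creation time (a sketch; the 
device names are placeholders):

   mdadm --create /dev/md0 --metadata=1.0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1

and "mdadm --examine" on a member, or "mdadm --detail" on the array, will 
confirm which format you actually got.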

Thanks,
Greg

On 11/1/07, Neil Brown <[EMAIL PROTECTED]> wrote:
  

On Tuesday October 30, [EMAIL PROTECTED] wrote:


Which is the default type of superblock? 0.90 or 1.0?
  

The default default is 0.90.
However a local device can be set in mdadm.conf with e.g.
   CREATE metadata=1.0

NeilBrown
    



--
bill davidsen <[EMAIL PROTECTED]>
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: doesm mdadm try to use fastest HDD ?

2007-11-02 Thread Bill Davidsen

Janek Kozicki wrote:

Hello,

My three HHDs have following speeds:

  hda - speed 70 MB/sec
  hdc - speed 27 MB/sec
  sda - speed 60 MB/sec

They create a raid1 /dev/md0 and raid5 /dev/md1 arrays. I wanted to
ask if mdadm is trying to pick the fastest HDD during operation?

Maybe I can "tell" which HDD is preferred?
  


If you are doing raid-1 between hdc and some faster drive, you could try 
using write-mostly and see how that works for you. For raid-5, it's 
faster to read the data off the slow drive than to reconstruct it with 
multiple reads to multiple other, faster drives.
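
A rough sketch of the write-mostly suggestion (assuming hdc holds the 
slow half of the mirror and sda the fast one; adjust the names for your 
layout):

   mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 --write-mostly /dev/hdc1

Devices flagged that way show up with a (W) after them in /proc/mdstat, 
so it is easy to check that the flag took effect.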

This came to my mind when I saw this:

  # mdadm --query --detail /dev/md1 | grep Prefer
 
  Preferred Minor : 1


And also in the manual:

  -W, --write-mostly [...] "can be useful if mirroring over a slow link."


many thanks for all your help!
  

I have two thoughts on this:
1 - if performance is critical, replace the slow drive
2 - for most things you do, I would expect seek to be more important 
than transfer rate


--
bill davidsen <[EMAIL PROTECTED]>
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: stride / stripe alignment on LVM ?

2007-11-02 Thread Bill Davidsen

Neil Brown wrote:

On Thursday November 1, [EMAIL PROTECTED] wrote:
  

Hello,

I have raid5 /dev/md1, --chunk=128 --metadata=1.1. On it I have
created LVM volume called 'raid5', and finally a logical volume
'backup'.

Then I formatted it with command:

   mkfs.ext3 -b 4096 -E stride=32 -E resize=550292480 /dev/raid5/backup

And because LVM is putting its own metadata on /dev/md1, the ext3
partition is shifted by some (unknown for me) amount of bytes from
the beginning of /dev/md1.

I was wondering, how big is the shift, and would it hurt the
performance/safety if the `ext3 stride=32` didn't align perfectly
with the physical stripes on HDD?



It is probably better to ask this question on an ext3 list as people
there might know exactly what 'stride' does.

I *think* it causes the inode tables to be offset in different
block-groups so that they are not all on the same drive.  If that is
the case, then an offset causes by LVM isn't going to make any
difference at all.
  


Actually, I think that all of the performance evil Doug was mentioning 
will apply to LVM as well. If things are poorly aligned they will be 
handled poorly: a stripe-sized write will not land on a stripe boundary, 
but will overlap chunks and cause the data from all affected chunks to be 
read back for a new raid-5 parity calculation.


So I would expect alignment to make a very large performance difference; 
even if it works when misaligned, it will do so slowly.
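
For what it's worth, the stride arithmetic itself is trivial; a sketch 
with the numbers from this thread (128k chunk, 4k ext3 blocks, N-disk 
raid5):

   stride      = chunk / block   = 128k / 4k = 32 blocks
   full stripe = (N - 1) * chunk = (N - 1) * 128k

so any offset LVM adds that is not a multiple of the chunk size pushes 
every "stripe aligned" write across a chunk boundary. If your lvm2 tools 
know the pe_start field, "pvs -o+pe_start" should show how far the first 
extent sits from the start of /dev/md1.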


--
bill davidsen <[EMAIL PROTECTED]>
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Implementing low level timeouts within MD

2007-11-02 Thread Bill Davidsen

Alberto Alonso wrote:

On Thu, 2007-11-01 at 15:16 -0400, Doug Ledford wrote:
  

Not in the older kernel versions you were running, no.



These "old versions" (specially the RHEL) are supposed to be
the official versions supported by Redhat and the hardware 
vendors, as they were very specific as to what versions of 
Linux were supported.


So the vendors of the failing drives claimed that these kernels were 
supported? That's great, most vendors don't even consider Linux 
supported. What response did you get when you reported the problem to 
Redhat on your RHEL support contract? Did they agree that this hardware, 
and its use for software raid, was supported and intended?



 Of all people, I would think you would
appreciate that. Sorry if I sound frustrated and upset, but 
it is clearly a result of what "supported and tested" really 
means in this case. I don't want to go into a discussion of

commercial distros, which are "supported", as this is neither the
time nor the place but I don't want to open the door to the
excuse of "its an old kernel", it wasn't when it got installed.
  
The problem is in the time travel module. It didn't properly cope with 
future hardware, and since you have very long uptimes, I'm reasonably 
sure you haven't updated the kernel to get fixes installed.


--
bill davidsen <[EMAIL PROTECTED]>
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Superblocks

2007-11-02 Thread Bill Davidsen

Neil Brown wrote:

On Tuesday October 30, [EMAIL PROTECTED] wrote:
  

Which is the default type of superblock? 0.90 or 1.0?



The default default is 0.90.
However a local device can be set in mdadm.conf with e.g.
   CREATE metadata=1.0

  


If you change to 1.start, 1.end, 1.4k names for clarity, they need to be 
accepted here, as well.


--
bill davidsen <[EMAIL PROTECTED]>
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Time to deprecate old RAID formats?

2007-11-02 Thread Bill Davidsen

Neil Brown wrote:

On Friday October 26, [EMAIL PROTECTED] wrote:
  
Perhaps you could have called them 1.start, 1.end, and 1.4k in the 
beginning? Isn't hindsight wonderful?





Those names seem good to me.  I wonder if it is safe to generate them
in "-Eb" output

  
If you agree that they are better, using them in the obvious places 
would be better now than later. Are you going to put them in the 
metadata options as well? Let me know; I have a look at the 
documentation on my list for next week, and could include some text.

Maybe the key confusion here is between "version" numbers and
"revision" numbers.
When you have multiple versions, there is no implicit assumption that
one is better than another. "Here is my version of what happened, now
let's hear yours".
When you have multiple revisions, you do assume ongoing improvement.

v1.0  v1.1 and v1.2 are different version of the v1 superblock, which
itself is a revision of the v0...
  


As with kernel releases, people assume that the first number means *big* 
changes and the second incremental change.


--
bill davidsen <[EMAIL PROTECTED]>
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Requesting migrate device options for raid5/6

2007-11-01 Thread Bill Davidsen

Goswin von Brederlow wrote:

Hi,

I would welcome if someone could work on a new feature for raid5/6
that would allow replacing a disk in a raid5/6 with a new one without
having to degrade the array.

Consider the following situation:

raid5 md0 : sda sdb sdc

Now sda gives a "SMART - failure iminent" warning and you want to
repalce it with sdd.

% mdadm --fail /dev/md0 /dev/sda
% mdadm --remove /dev/md0 /dev/sda
% mdadm --add /dev/md0 /dev/sdd

Further consider that drive sdb will give an I/O error during resync
of the array or fail completly. The array is in degraded mode so you
experience data loss.

  

That's a two drive failure, so you will lose data.

But that is completely avoidable and some hardware raids support disk
migration too. Loosely speaking the kernel should do the following:

  
No, it's not "completely avoidable" because you have described sda as 
ready to fail and sdb as "will give an I/O error", so if both happen at 
once you will lose data because you have no valid copy. That said, some 
of what you describe below is possible and would *reduce* the probability 
of failure. But if sdb is going to have i/o errors, you really need to 
replace two drives :-(

See below for some thoughts.

raid5 md0 : sda sdb sdc
-> create internal raid1 or dm-mirror
raid1 mdT : sda
raid5 md0 : mdT sdb sdc
-> hot add sdd to mdT
raid1 mdT : sda sdd
raid5 md0 : mdT sdb sdc
-> resync and then drop sda
raid1 mdT : sdd
raid5 md0 : mdT sdb sdc
-> remove internal mirror
raid5 md0 : sdd sdb sdc 



Thoughts?
  


If there were a "migrate" option, it might work something like this: 
given a migrate from sda to sdd, as you noted a raid1 between sda and 
sdd needs to be created, and obviously all chunks of sdd need to be 
marked as needing rebuild; in addition sda needs to be made read-only, 
to minimize the i/o and to prevent any errors which might come from a 
failed write, like failed sector relocates, etc. Also, if valid data for 
a chunk is on sdd, no read would be done to sda. I think there's relevant 
code in the "write-mostly" bits that would help keep i/o to sda to a 
minimum: no writes, and only mandatory reads when no valid copy of a 
chunk is on sdd yet. This is similar to recovery to a spare, save that 
most data will be valid on the failing drive and doesn't need to be 
recreated; only unreadable data must be done the slow way.


Care is needed for sda as well, so that if sdd fails during the migrate a 
last-chance attempt to bring sda back to useful content can be made; I'm 
paranoid that way.


Assuming the migrate works correctly, sda is removed from the array, and 
the superblock should be marked to reflect that. Now sdd is a part of 
the array, and assemble, at least using UUID, should work.


I personally think that a migrate capability would be vastly useful, 
both for handling failing drives and just moving data to a better place. 
As you point out, the user commands are not *quite* as robust as an 
internal implementation could be, and are complex enough to invite user 
error. I certainly always write down steps before doing migrate, and if 
possible do it with the system booted from a rescue media.
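
One thing that exists today and at least narrows the exposure window is a 
write-intent bitmap, so that if a second drive merely glitches and drops 
out during the rebuild it can be re-added with a quick partial resync 
rather than a full one (a sketch; it assumes a kernel and mdadm recent 
enough to add a bitmap to a live array, and it does nothing for a drive 
that is truly dead):

   mdadm --grow /dev/md0 --bitmap=internal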


--
bill davidsen <[EMAIL PROTECTED]>
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Implementing low level timeouts within MD

2007-11-01 Thread Bill Davidsen

Alberto Alonso wrote:

On Mon, 2007-10-29 at 13:22 -0400, Doug Ledford wrote:

  

What kernels were these under?




Yes, these 3 were all SATA. The kernels (in the same order as above) 
are:


* 2.4.21-4.ELsmp #1 (Basically RHEL v3)
* 2.6.18-4-686 #1 SMP on a Fedora Core release 2
* 2.6.17.13 (compiled from vanilla sources)
  


*Old* kernels. If you are going to build your own kernel, get a new one!

The RocketRAID was configured for all drives as legacy/normal and
software RAID5 across all drives. I wasn't using hardware raid on
the last described system when it crashed.
  

--

bill davidsen <[EMAIL PROTECTED]>
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Implementing low level timeouts within MD

2007-11-01 Thread Bill Davidsen

Alberto Alonso wrote:

On Tue, 2007-10-30 at 13:39 -0400, Doug Ledford wrote:
  

Really, you've only been bitten by three so far.  Serverworks PATA
(which I tend to agree with the other person, I would probably chock



3 types of bugs is too many, it basically affected all my customers
with  multi-terabyte arrays. Heck, we can also oversimplify things and 
say that it is really just one type and define everything as kernel type

problems (or as some other kernel used to say... general protection
error).

I am sorry for not having hundreds of RAID servers from which to draw
statistical analysis. As I have clearly stated in the past I am trying
to come up with a list of known combinations that work. I think my
data points are worth something to some people, specially those 
considering SATA drives and software RAID for their file servers. If

you don't consider them important for you that's fine, but please don't
belittle them just because they don't match your needs.

  

this up to Serverworks, not PATA), USB storage, and SATA (the SATA stack
is arranged similar to the SCSI stack with a core library that all the
drivers use, and then hardware dependent driver modules...I suspect that
since you got bit on three different hardware versions that you were in
fact hitting a core library bug, but that's just a suspicion and I could
well be wrong).  What you haven't tried is any of the SCSI/SAS/FC stuff,
and generally that's what I've always used and had good things to say
about.  I've only used SATA for my home systems or workstations, not any
production servers.



The USB array was never meant to be a full production system, just to 
buy some time until the budget was allocated to buy a real array. Having

said that, the raid code is written to withstand the USB disks getting
disconnected as far as the driver reports it properly. Since it doesn't,
I consider it another case that shows when not to use software RAID
thinking that it will work.

As for SCSI I think it is a greatly proved and reliable technology, I've
dealt with it extensively and have always had great results. I now deal
with it mostly on non Linux based systems. But I don't think it is
affordable to most SMBs that need multi-terabyte arrays.
  
Actually, SCSI can fail as well. Until recently I was running servers 
with multi-TB arrays, and regularly, several times a year, a drive would 
fail and glitch the SCSI bus such that the next i/o to another drive 
would fail. And I've had SATA drives fail cleanly on small machines, so 
neither is an "always" config.


--
bill davidsen <[EMAIL PROTECTED]>
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Raid-10 mount at startup always has problem

2007-11-01 Thread Bill Davidsen

Daniel L. Miller wrote:

Doug Ledford wrote:

Nah.  Even if we had concluded that udev was to blame here, I'm not
entirely certain that we hadn't left Daniel with the impression that we
suspected it versus blamed it, so reiterating it doesn't hurt.  And I'm
sure no one has given him a fix for the problem (although Neil did
request a change that will give debug output, but not solve the
problem), so not dropping it entirely would seem appropriate as well.
  
I've opened a bug report on Ubuntu's Launchpad.net.  Scott James 
Remnant asked me to cc him on Neil's incremental reference - we'll see 
what happens from here.


Thanks for the help guys.  At the moment, I've changed my mdadm.conf 
to explicitly list the drives, instead of the auto=partition 
parameter.  We'll see what happens on the next reboot.


I don't know if it means anything, but I'm using a self-compiled 
2.6.22 kernel - with initrd.  At least I THINK I'm using initrd - I 
have an image, but I don't see an initrd line in my grub config.  
Hmm... I'm going to add a stanza that includes the initrd and see what 
happens also.



What did that do?
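
For reference, the explicit form would be something like this in 
mdadm.conf (a sketch; the level, count and devices are placeholders for 
Daniel's actual layout, and identifying the array by UUID= is generally 
safer than devices= since device names can move around):

   DEVICE /dev/sd[abcd]1
   ARRAY /dev/md0 level=raid10 num-devices=4 devices=/dev/sda1,/dev/sdb1,/dev/sdc1,/dev/sdd1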

--
bill davidsen <[EMAIL PROTECTED]>
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Raid-10 mount at startup always has problem

2007-10-29 Thread Bill Davidsen

Luca Berra wrote:

On Sun, Oct 28, 2007 at 08:21:34PM -0400, Bill Davidsen wrote:

Because you didn't stripe align the partition, your bad.
  
Align to /what/ stripe? Hardware (CHS is fiction), software (of the RAID 

the real stripe (track) size of the storage, you must read the manual
and/or bug technical support for that info.


That's my point: there *is* no "real stripe (track) size of the storage", 
because modern drives use zone bit recording, and sectors per track 
depends on the track and changes within a partition. See

 http://www.dewassoc.com/kbase/hard_drives/hard_disk_sector_structures.htm
 http://www.storagereview.com/guide2000/ref/hdd/op/mediaTracks.html
you're about to create), or ??? I don't notice my FC6 or FC7 install 
programs using any special partition location to start, I have only 
run (tried to run) FC8-test3 for the live CD, so I can't say what it 
might do. CentOS4 didn't do anything obvious, either, so unless I 
really misunderstand your position at redhat, that would be your 
bad.  ;-)


If you mean start a partition on a pseudo-CHS boundary, fdisk seems 
to use what it thinks are cylinders for that.
Yes, fdisk will create partition at sector 63 (due to CHS being 
braindead,

other than fictional: 63 sectors-per-track)
most arrays use 64 or 128 spt, and array cache are aligned accordingly.
So 63 is almost always the wrong choice.


As the above links show, there's no right choice.


for the default choice you must consider what spt your array uses, iirc
(this is from memory, so double check these figures)
IBM 64 spt (i think)
EMC DMX 64
EMC CX 128???
HDS (and HP XP) except OPEN-V 96
HDS (and HP XP) OPEN-V 128
HP EVA 4/6/8 with XCS 5.x state that no alignment is needed even if i
never found a technical explanation about that.
previous HP EVA versions did (maybe 64).
you might then want to consider how data is laid out on the storage, but
i believe the storage cache is enough to deal with that issue.

Please note that "0" is always well aligned.

Note to people who is now wondering WTH i am talking about.

consider a storage with 64 spt, an io size of 4k and partition starting
at sector 63.
first io request will require two ios from the storage (1 for sector 63,
and one for sectors 64 to 70)
the next 7 io (71-78,79-86,87-94,95-102,103-110,111-118,119-126) will be
on the same track
the 8th will again require to be split, and so on.
this causes the storage to do 1 unnecessary io every 8. YMMV.
No one makes drives with fixed spt any more. Your assumptions are a 
decade out of date.


--
bill davidsen <[EMAIL PROTECTED]>
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Time to deprecate old RAID formats?

2007-10-28 Thread Bill Davidsen

Doug Ledford wrote:

On Fri, 2007-10-26 at 14:41 -0400, Doug Ledford wrote:
  

Actually, after doing some research, here's what I've found:


I should note that both the lvm code and raid code are simplistic at the
moment.  For example, the raid5 mapping only supports the default raid5
layout.  If you use any other layout, game over.  Getting it to work
with version 1.1 or 1.2 superblocks probably wouldn't be that hard, but
getting it to the point where it handles all the relevant setups
properly would require a reasonable amount of coding.

  
My first thought is that after the /boot partition is read (assuming you 
use one) restrictions go away. Performance of /boot is not much of an 
issue, for me at least, but more complex setups are sometimes needed for 
the rest of the system.
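
A sketch of the layout I have in mind (device names are placeholders): 
keep /boot on a small raid1 with metadata where the boot loader expects 
to find a plain filesystem, and let everything the boot loader never 
reads use whatever format suits it:

   mdadm --create /dev/md0 --metadata=0.90 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1    # /boot
   mdadm --create /dev/md1 --metadata=1.1 --level=5 --raid-devices=3 /dev/sda2 /dev/sdb2 /dev/sdc2    # the rest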


Thanks for the research.

--
bill davidsen <[EMAIL PROTECTED]>
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Time to deprecate old RAID formats?

2007-10-28 Thread Bill Davidsen

Doug Ledford wrote:

On Sat, 2007-10-27 at 11:20 -0400, Bill Davidsen wrote:
  

* When using lilo to boot from a raid device, it automatically installs
itself to the mbr, not to the partition.  This can not be changed.  Only
0.90 and 1.0 superblock types are supported because lilo doesn't
understand the offset to the beginning of the fs otherwise.
  
  
I'm reasonably sure that's wrong; I used to set up dual boot machines by 
putting LILO in the partition and making that the boot partition. By 
changing the active partition flag I could just have the machine boot 
Windows, to keep people from getting confused.



Yeah, someone else pointed this out too.  The original patch to lilo
*did* do as I suggest, so they must have improved on the patch later.

  

* When using grub to boot from a raid device, only 0.90 and 1.0
superblocks are supported[1] (because grub is ignorant of the raid and
it requires the fs to start at the start of the partition).  You can use
either MBR or partition based installs of grub.  However, partition
based installs require that all bootable partitions be in exactly the
same logical block address across all devices.  This limitation can be
an extremely hazardous limitation in the event a drive dies and you have
to replace it with a new drive as newer drives may not share the older
drive's geometry and will require starting your boot partition in an odd
location to make the logical block addresses match.

* When using grub2, there is supposedly already support for raid/lvm
devices.  However, I do not know if this includes version 1.0, 1.1, or
1.2 superblocks.  I intend to find that out today.  If you tell grub2 to
install to an md device, it searches out all constituent devices and
installs to the MBR on each device[2].  This can't be changed (at least
right now, probably not ever though).
  
  
That sounds like a good reason to avoid grub2, frankly. Software which 
decides that it knows what to do better than the user isn't my 
preference. If I wanted software which forces me to do things "their way" 
I'd be running Windows.



It's not really all that unreasonable of a restriction.  Most people
aren't aware that when you put a boot sector at the beginning of a
partition, you only have 512 bytes of space, so the boot loader that you
put there is basically nothing more than code to read the remainder of
the boot loader from the file system space.  Now, traditionally, most
boot loaders have had to hard code the block addresses of certain key
components into these second stage boot loaders.  If a user isn't aware
of the fact that the boot loader does this at install time (or at kernel
selection update time in the case of lilo), then they aren't aware that
the files must reside at exactly the same logical block address on all
devices.  Without that knowledge, they can easily create an unbootable
setup by having the various boot partitions in slightly different
locations on the disks.  And intelligent partition editors like parted
can compound the problem because as they insulate the user from having
to pick which partition number is used for what partition, etc., they
can end up placing the various boot partitions in different areas of
different drives.  The requirement above is a means of making sure that
users aren't surprised by a non-working setup.  The whole element of
least surprise thing.  Of course, if they keep that requirement, then I
would expect it to be well documented so that people know this going
into putting the boot loader in place, but I would argue that this is at
least better than finding out when a drive dies that your system isn't
bootable.

  

So, given the above situations, really, superblock format 1.2 is likely
to never be needed.  None of the shipping boot loaders work with 1.2
regardless, and the boot loader under development won't install to the
partition in the event of an md device and therefore doesn't need that
4k buffer that 1.2 provides.
  
  

Sounds right, although it may have other uses for clever people.


[1] Grub won't work with either 1.1 or 1.2 superblocks at the moment.  A
person could probably hack it to work, but since grub development has
stopped in preference to the still under development grub2, they won't
take the patches upstream unless they are bug fixes, not new features.
  
  
If the patches were available, "doesn't work with existing raid formats" 
would probably qualify as a bug.



Possibly.  I'm a bit overbooked on other work at the moment, but I may
try to squeeze in some work on grub/grub2 to support version 1.1 or 1.2
superblocks.

  

[2] There are two ways to install to a master boot record.  The first is
to use the first 512 bytes *only* and hardcode the location of the
remainder of the boot loader into those 512 bytes.  The second way is to
use the free space between the

Re: Raid-10 mount at startup always has problem

2007-10-28 Thread Bill Davidsen

Doug Ledford wrote:

On Fri, 2007-10-26 at 11:15 +0200, Luca Berra wrote:
  

On Thu, Oct 25, 2007 at 02:40:06AM -0400, Doug Ledford wrote:


The partition table is the single, (mostly) universally recognized
arbiter of what possible data might be on the disk.  Having a partition
table may not make mdadm recognize the md superblock any better, but it
keeps all that other stuff from even trying to access data that it
doesn't have a need to access and prevents random luck from turning your
day bad.
  

on a pc maybe, but that is 20 years old design.



So?  Unix is 35+ year old design, I suppose you want to switch to Vista
then?

  

partition table design is limited because it is still based on C/H/S,
which do not exist anymore.
Put a partition table on a big storage, say a DMX, and enjoy a 20%
performance decrease.



Because you didn't stripe align the partition, your bad.
  
Align to /what/ stripe? Hardware (CHS is fiction), software (of the RAID 
you're about to create), or ??? I don't notice my FC6 or FC7 install 
programs using any special partition location to start, I have only run 
(tried to run) FC8-test3 for the live CD, so I can't say what it might 
do. CentOS4 didn't do anything obvious, either, so unless I really 
misunderstand your position at redhat, that would be your bad.  ;-)


If you mean start a partition on a pseudo-CHS boundary, fdisk seems to 
use what it thinks are cylinders for that.


Please clarify what alignment provides a performance benefit.

--
bill davidsen <[EMAIL PROTECTED]>
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Software RAID when it works and when it doesn't

2007-10-27 Thread Bill Davidsen

Alberto Alonso wrote:

On Fri, 2007-10-26 at 18:12 +0200, Goswin von Brederlow wrote:

  

Depending on the hardware you can still access a different disk while
another one is reseting. But since there is no timeout in md it won't
try to use any other disk while one is stuck.

That is exactly what I miss.

MfG
Goswin
-



That is exactly what I've been talking about. Can md implement
timeouts and not just leave it to the drivers?

I can't believe it but last night another array hit the dust when
1 of the 12 drives went bad. This year is just a nightmare for
me. It brought all the network down until I was able to mark it
failed and reboot to remove it from the array.
  


I'm not sure what kind of drives and drivers you use, but I certainly 
have drives go bad and get marked as failed, both on old PATA drives and 
newer SATA. All the SCSI I currently use is on IBM hardware RAID 
(ServeRAID), so I can only assume that a failure would be noted.
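
For the record, the manual kick Alberto describes is just this (sdX1 
standing in for whichever member is stuck):

   mdadm /dev/md0 --fail /dev/sdX1
   mdadm /dev/md0 --remove /dev/sdX1

The complaint, as I read it, is that md never reaches that point on its 
own because the driver underneath never returns an error.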


--
bill davidsen <[EMAIL PROTECTED]>
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Time to deprecate old RAID formats?

2007-10-27 Thread Bill Davidsen

Doug Ledford wrote:

On Fri, 2007-10-26 at 10:18 -0400, Bill Davidsen wrote:
  
[___snip___]
  



Actually, after doing some research, here's what I've found:

* When using lilo to boot from a raid device, it automatically installs
itself to the mbr, not to the partition.  This can not be changed.  Only
0.90 and 1.0 superblock types are supported because lilo doesn't
understand the offset to the beginning of the fs otherwise.
  


I'm reasonably sure that's wrong; I used to set up dual boot machines by 
putting LILO in the partition and making that the boot partition. By 
changing the active partition flag I could just have the machine boot 
Windows, to keep people from getting confused.

* When using grub to boot from a raid device, only 0.90 and 1.0
superblocks are supported[1] (because grub is ignorant of the raid and
it requires the fs to start at the start of the partition).  You can use
either MBR or partition based installs of grub.  However, partition
based installs require that all bootable partitions be in exactly the
same logical block address across all devices.  This limitation can be
an extremely hazardous limitation in the event a drive dies and you have
to replace it with a new drive as newer drives may not share the older
drive's geometry and will require starting your boot partition in an odd
location to make the logical block addresses match.

* When using grub2, there is supposedly already support for raid/lvm
devices.  However, I do not know if this includes version 1.0, 1.1, or
1.2 superblocks.  I intend to find that out today.  If you tell grub2 to
install to an md device, it searches out all constituent devices and
installs to the MBR on each device[2].  This can't be changed (at least
right now, probably not ever though).
  


That sounds like a good reason to avoid grub2, frankly. Software which 
decides that it knows what to do better than the user isn't my 
preference. If I wanted software which forces me to do things "their way" 
I'd be running Windows.

So, given the above situations, really, superblock format 1.2 is likely
to never be needed.  None of the shipping boot loaders work with 1.2
regardless, and the boot loader under development won't install to the
partition in the event of an md device and therefore doesn't need that
4k buffer that 1.2 provides.
  


Sounds right, although it may have other uses for clever people.

[1] Grub won't work with either 1.1 or 1.2 superblocks at the moment.  A
person could probably hack it to work, but since grub development has
stopped in preference to the still under development grub2, they won't
take the patches upstream unless they are bug fixes, not new features.
  


If the patches were available, "doesn't work with existing raid formats" 
would probably qualify as a bug.

[2] There are two ways to install to a master boot record.  The first is
to use the first 512 bytes *only* and hardcode the location of the
remainder of the boot loader into those 512 bytes.  The second way is to
use the free space between the MBR and the start of the first partition
to embed the remainder of the boot loader.  When you point grub2 at an
md device, they automatically only use the second method of boot loader
installation.  This gives them the freedom to be able to modify the
second stage boot loader on a boot disk by boot disk basis.  The
downside to this is that they need lots of room after the MBR and before
the first partition in order to put their core.img file in place.  I
*think*, and I'll know for sure later today, that the core.img file is
generated during grub install from the list of optional modules you
specify during setup.  Eg., the pc module gives partition table support,
the lvm module lvm support, etc.  You list the modules you need, and
grub then builds a core.img out of all those modules.  The normal amount
of space between the MBR and the first partition is (sectors_per_track -
1).  For standard disk geometries, that basically leaves 254 sectors, or
127k of space.  This might not be enough for your particular needs if
you have a complex boot environment.  In that case, you would need to
bump at least the starting track of your first partition to make room
for your boot loader.  Unfortunately, how is a person to know how much
room their setup needs until after they've installed and it's too late
to bump the partition table start?  They can't.  So, that's another
thing I think I will check out today, what the maximum size of grub2
might be with all modules included, and what a common size might be.

  
Based on your description, it sounds as if the grub2 authors may not have 
given adequate thought to what users other than themselves might need 
(that may be a premature conclusion). I have multiple installs on several 
of my machines, and I assume that grub2 for 32-bit and 64-bit will 
differ. Thanks for the research.
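
One quick sanity check on the embedding-space question (a sketch, not 
distro specific): list the partition table in sectors and see where 
partition 1 actually starts.

   fdisk -lu /dev/sda    # -u reports start/end in sectors
   # if partition 1 starts at sector S, the space available between the MBR
   # and the filesystem for embedding a boot loader is (S - 1) * 512 bytes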


Re: Time to deprecate old RAID formats?

2007-10-26 Thread Bill Davidsen

Neil Brown wrote:

On Thursday October 25, [EMAIL PROTECTED] wrote:
  

I didn't get a reply to my suggestion of separating the data and location...



No. Sorry.

  

ie not talking about superblock versions 0.9, 1.0, 1.1, 1.2 etc but a data
format (0.9 vs 1.0) and a location (end,start,offset4k)?

This would certainly make things a lot clearer to new (and old!) users:

mdadm --create /dev/md0 --metadata 1.0 --meta-location offset4k
or
mdadm --create /dev/md0 --metadata 1.0 --meta-location start
or
mdadm --create /dev/md0 --metadata 1.0 --meta-location end



I'm happy to support synonyms.  How about

   --metadata 1-end
   --metadata 1-start

??
  
Offset? Do you like "1-offset4k" or maybe "1-start4k" or even 
"1-start+4k" for that? The last is most intuitive but I don't know how 
you feel about the + in there.
  

resulting in:
mdadm --detail /dev/md0

/dev/md0:
Version : 01.0
  Metadata-locn : End-of-device



It already lists the superblock location as a sector offset, but I
don't have a problem with reporting:

  Version : 1.0 (metadata at end of device)
  Version : 1.1 (metadata at start of device)

Would that help?

  

Same comments on the reporting, "metadata at block 4k" or something.
  

  Creation Time : Fri Aug  4 23:05:02 2006
 Raid Level : raid0

You provide rational defaults for mortals and this approach allows people like
Doug to do wacky HA things explicitly.

I'm not sure you need any changes to the kernel code - probably just the docs
and mdadm.



True.

  

It is conceivable that I could change the default, though that would
require a decision as to what the new default would be.  I think it
would have to be 1.0 or it would cause too much confusion.


A newer default would be nice.
  

I also suspect that a *lot* of people will assume that the highest superblock
version is the best and should be used for new installs etc.



Grumble... why can't people expect what I want them to expect?

  
I confess that I thought 1.x was a series of solutions reflecting your 
evolving opinion on what was best, so maybe in retrospect you made a 
non-intuitive choice of nomenclature. Or bluntly, you picked confusing 
names for this and confused people. If 1.0 meant start, 1.1 meant 4k, 
and 1.2 meant end, at least it would be easy to remember for people who 
only create a new array a few times a year, or once in the lifetime of a 
new computer.

So if you make 1.0 the default then how many users will try 'the bleeding edge'
and use 1.2? So then you have 1.3 which is the same as 1.0? H? So to quote
from an old Soap: "Confused, you  will be..."



Perhaps you could have called them 1.start, 1.end, and 1.4k in the 
beginning? Isn't hindsight wonderful?


--
bill davidsen <[EMAIL PROTECTED]>
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

