Re: RAIDframe: what if a disc fails during copyback

2020-10-30 Thread Greg Oster

On 10/30/20 1:54 PM, Edgar Fuß wrote:

>> it locks out all other non-copyback IO in order to finish the job!
> Oops!
>
>> Locking out all other IO is very poor... but if it's a small enough RAID set
>> you might be able to get away with the downtime for the copyback...
> Certainly not.
>
>> You shouldn't need to reboot for this... the 'failing spared disk' and
>> 'reconstruct to previous second disk' should work fine without reboot.
> I still don't get this. What I have is:
>
> Components:
>    /dev/sd5a: spared
>    /dev/sd6a: optimal
> Spares:
>    /dev/sd7a: used_spare
>
> So what am I supposed to do from here?


If you really want to get /dev/sd5a in use again, you can do:

 raidctl -f /dev/sd7a raidX
 raidctl -vR /dev/sd5a raidX

to do the fail of sd7a and rebuild of sd5a.  But unless you have a 
strong need to use sd5a I would do nothing and leave things as-is.  If 
you reboot at this point /dev/sd7a would show up as the first component 
and be marked as 'optimal'.
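
A minimal dry-run sketch of the sequence above, assuming the set is raid0
(a placeholder for raidX) and the component names from the status listing.
The run wrapper and the swap_back function are hypothetical helpers, not
part of raidctl, and the raidctl -s calls are added only to show the
component status before and after.  Between failing the spare and the end
of the reconstruction RAIDframe treats the set as degraded, so check the
names carefully before setting DRY_RUN=0.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands instead of running them.
# RAID and the component names are placeholders; adjust before real use.
RAID=raid0
DRY_RUN=1

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

swap_back() {
    spare=$1      # used spare that currently holds the data
    original=$2   # repaired original component
    run raidctl -s "$RAID"                # component status before the change
    run raidctl -f "$spare" "$RAID"       # fail the used spare
    run raidctl -vR "$original" "$RAID"   # reconstruct in place onto the original
    run raidctl -s "$RAID"                # component status afterwards
}

swap_back /dev/sd7a /dev/sd5a
```

While a real reconstruction is running, raidctl -S raidX reports its
progress.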


Later...

Greg Oster


Re: RAIDframe: what if a disc fails during copyback

2020-10-30 Thread Edgar Fuß
> it locks out all other non-copyback IO in order to finish the job!
Oops!

> Locking out all other IO is very poor... but if it's a small enough RAID set
> you might be able to get away with the downtime for the copyback...
Certainly not.

> You shouldn't need to reboot for this... the 'failing spared disk' and
> 'reconstruct to previous second disk' should work fine without reboot.
I still don't get this. What I have is:

Components:
   /dev/sd5a: spared
   /dev/sd6a: optimal
Spares:
   /dev/sd7a: used_spare

So what am I supposed to do from here?


Re: RAIDframe: what if a disc fails during copyback

2020-10-30 Thread Greg Oster

On 10/30/20 4:25 AM, Edgar Fuß wrote:

> Thanks for the detailed answer.
>
>> it's still there, and it does work,
> That's reassuring to know.
>
>> but it's not at all performant or system-friendly.
> Just how bad is it?

It's been probably over a decade since I last tried it, but as I recall 
it locks out all other non-copyback IO in order to finish the job!

>> If you want the components labelled nicely, give the system a reboot
> Re-booting our file server is something I like to avoid.

You'll like copyback even less then -- I'd say once you're done 
reconstructing, just leave it, or reconstruct again to the 'repaired 
original' as I suggested...

>> and behaves very poorly.
> Depending on how poorly, I could probably live with it (the RAID in question
> is the small system one, not the large user data one).

Locking out all other IO is very poor... but if it's a small enough RAID 
set you might be able to get away with the downtime for the copyback...

>> In your case, what I'd do is just fail the spare, and initiate a reconstruct
>> to the original failed component.  (You still have the data on the spare if
>> something goes bad with the original good component.)
> Hm, I guess I would need to re-boot and intervene manually in that case.
> Just using the slow copyback looks preferable if it doesn't take more than
> a day.

You shouldn't need to reboot for this... the 'failing spared disk' and 
'reconstruct to previous second disk' should work fine without reboot. 
(IIRC I've used a '3rd component' to make the primary/secondary 
components swap places.. just to test that, of course :) )

> Probably I need to test this on another machine before.
> I guess there's no way to initiate a reconstruction to a spare and failing
> the specified component only /after/ the reconstruction has completed,
> not before?

No, there's not, unfortunately. :(

Later...

Greg Oster


Re: RAIDframe: what if a disc fails during copyback

2020-10-30 Thread Edgar Fuß
Thanks for the detailed answer.

> it's still there, and it does work, 
That's reassuring to know.

> but it's not at all performant or system-friendly.
Just how bad is it?

> If you want the components labelled nicely, give the system a reboot
Re-booting our file server is something I like to avoid.

> and behaves very poorly.
Depending on how poorly, I could probably live with it (the RAID in question 
is the small system one, not the large user data one).

> In your case, what I'd do is just fail the spare, and initiate a reconstruct
> to the original failed component.  (You still have the data on the spare if
> something goes bad with the original good component.)
Hm, I guess I would need to re-boot and intervene manually in that case.
Just using the slow copyback looks preferable if it doesn't take more than 
a day.

Probably I need to test this on another machine before.
I guess there's no way to initiate a reconstruction to a spare and failing 
the specified component only /after/ the reconstruction has completed, 
not before?


Re: RAIDframe: what if a disc fails during copyback

2020-10-29 Thread oster

On 10/29/20 11:33 AM, Edgar Fuß wrote:

> (I could probably direct this question to oster@ instead of tech-kern@)
>
> In a RAIDframe RAID-1, a disc failed and I reconstructed on a spare.
> Now I want to replace the failed component (actually by the same disc,
> which needed a firmware update) and want to copyback to it.
> How will RAIDframe behave if, during the copyback:
> 1. The replaced component fails

This is a NOP, as it's already failed, and only the non-failed component 
will be considered.

> 2. The spare fails

You're back to considering only the non-failed component.

> 3. The other, non-replaced component fails?

Since the rebuild is done, you can use the rebuilt component to continue on.

> Specifically: Is there any scenario (other than more than one disc failing)
> that will put the RAID into a non-redundant state? I guess 3. may?



2 or 3 will put you back to non-redundant.
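
The three answers above can be summarised as a small lookup.  This is only
a sketch that encodes the replies in this message; it does not query
RAIDframe, and the function name and the labels replaced, spare and other
are made up for illustration.

```shell
#!/bin/sh
# Labels: replaced = component being copied back to,
#         spare    = the used spare,
#         other    = the original component that never failed.
state_after_failure() {
    case "$1" in
        replaced) echo "unchanged: target was already marked failed" ;;
        spare)    echo "non-redundant: only the other component is left" ;;
        other)    echo "non-redundant: continue on the rebuilt component" ;;
        *)        echo "unknown component: $1" >&2; return 1 ;;
    esac
}

for c in replaced spare other; do
    printf '%-8s -> %s\n' "$c" "$(state_after_failure "$c")"
done
```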

However: You really don't want to be using copyback.  It's still 
there, and it does work, but it's not at all performant or 
system-friendly.  Just put in the new component, reconstruct to it, and 
then call it a day.  If you want the components labelled nicely, give 
the system a reboot (ya, not ideal, but that's where we are).


Basically the copyback code doesn't have the same IO structure as a 
reconstruct, and behaves very poorly.  Copyback should really be just 
ripped out or at least ignored.


In your case, what I'd do is just fail the spare, and initiate a 
reconstruct to the original failed component.  (You still have the data 
on the spare if something goes bad with the original good component.)


Later...

Greg Oster


Re: RAIDframe: what if a disc fails during copyback

2020-10-29 Thread Brian Buhrow
Hello.  In my experience, the copyback feature never worked.  I
found I had to reboot, turning the hot spare C into component C, add the
replaced B as a new hot spare, reconstruct to it, and reboot again to get
everything back into its proper place.  I forget the exact problem I ran
into, but I think it had something to do with not being able to add another
hot spare when one was in use, or the system not recognizing the replaced
component B as a valid thing to copy back to.  If you search the archives,
I'm sure you can find the exchange between Greg and me on the topic.
The result of that conversation was, as I remember it, something like,
yes, it's broken and if you'd like to fix it, be my guest.
So, I'd be curious to know if you can do the copyback without having
to reboot and, once done, how things work.

-thanks
-Brian

On Oct 29,  7:37pm, Edgar Fuß wrote:
} Subject: Re: RAIDframe: what if a disc fails during copyback
} There still seems to be confusion on what I did.
} 
} Let A and B be the two original components, C a spare (in the cupboard) 
} and B' be B with the new firmware.
} 
} I start with A and B as the two components of a RAID-1.
} Now B fails. I have a degraded RAID with A alone.
} I plug in C, run scsictl scsibus0 scan all all to detect it, add it as a 
} hot spare (raidctl -a C) and initiate a reconstruction (raidctl -F B).
} Now I'm redundant again with A and C. Since I didn't re-boot, RAIDframe 
} knows that B has failed and C is a used spare.
} I now actually un-plug B, plug it into another machine, do some testing 
} (verifying that it may reset on writes), install new firmware, do further 
} testing (verifying it now doesn't reset on writes) and am about to 
} re-plug it into the original server (which won't notice it ever disappeared 
} or that B has turned into B'---as far as this question is concerned, 
} I could have done all this in the original server).
} What I'm now intending to do is to raidctl -B (with A, B' and C installed, 
} of course). After that, I intend to raidctl -r C, then 
} scsictl scsibus0 detach C and finally un-plug C and put it back into the 
} cupboard again.
} 
} My question was about 1. B', 2. C or 3. A failing during the copyback.
} 
} > there was a crop of bad Seagate 500GB disks for a while and they had 
} > a tendency to fail en masse at the same time.
} My working hypothesis for some five years has been that all Seagate discs 
} are bad and bound to fail. We had a series of SATA 250G (the example above 
} is about SAS 146K) drives that failed the same way (dozens of them), 
} got most of them replaced on warranty and had the replacements failing 
} the same way again.
>-- End of excerpt from Edgar Fuß




Re: RAIDframe: what if a disc fails during copyback

2020-10-29 Thread Mouse
>> So you have drives A, B, and C.  A and B were live.  Let's say B is
>> the one that failed.  You reconstructed onto C and have been running
>> with A and C.
> Yes.

>> Now you have a new B [...].
> Yes.

>> So, you'd pull C, replace it with B
> No.  I don't pull C. I re-add B (I have lots of empty slots).

Well, I meant "pull" in the sense of "remove from the RAID".   Whether
or not that means "disconnect from the system" is semi-irrelevant.
But

>> and initiate a reconstruct
> No, a copyback (raidctl -B).

This is an aspect of RAIDframe I don't know much about.  I'm guessing
here, but I would guess that this copies from C to B.  In that case,
for failures during the copyback:

If A fails, you have a copy on C, with no redundancy until the copy to
B finishes.  I don't know whether it is capable of realizing, after the
copy is done, that it could run on B and C; unlike most RAID levels,
for RAID 1 there's no reason in principle it couldn't - but I suspect
that RAIDframe isn't set up to do so.

If B fails, well, I assume you have to start the copyback over.  If
that's not the worst that happens, I would call it a major bug in
RAIDframe.

If C fails, you have to be running non-redundant, off A alone, because
the copy to B isn't finished.  In principle, it could finish copying
from A instead, but I suspect it's not capable of switching
mid-copyback.  My guess is you have to reconstruct from A onto B.

/~\ The ASCII Mouse
\ / Ribbon Campaign
 X  Against HTML  mo...@rodents-montreal.org
/ \ Email!   7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B


Re: RAIDframe: what if a disc fails during copyback

2020-10-29 Thread Edgar Fuß
There still seems to be confusion on what I did.

Let A and B be the two original components, C a spare (in the cupboard) 
and B' be B with the new firmware.

I start with A and B as the two components of a RAID-1.
Now B fails. I have a degraded RAID with A alone.
I plug in C, run scsictl scsibus0 scan all all to detect it, add it as a 
hot spare (raidctl -a C) and initiate a reconstruction (raidctl -F B).
Now I'm redundant again with A and C. Since I didn't re-boot, RAIDframe 
knows that B has failed and C is a used spare.
I now actually un-plug B, plug it into another machine, do some testing 
(verifying that it may reset on writes), install new firmware, do further 
testing (verifying it now doesn't reset on writes) and am about to 
re-plug it into the original server (which won't notice it ever disappeared 
or that B has turned into B'---as far as this question is concerned, 
I could have done all this in the original server).
What I'm now intending to do is to raidctl -B (with A, B' and C installed, 
of course). After that, I intend to raidctl -r C, then 
scsictl scsibus0 detach C and finally un-plug C and put it back into the 
cupboard again.
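
The plan above, written out as a dry-run command list.  This is only a
sketch: nothing is executed, each step is just printed, and the RAID
device, the component names and the SCSI target/LUN of C are placeholders
that are not given in this message.  Other replies in this thread advise
against the copyback step.

```shell
#!/bin/sh
# Dry-run sketch: every step is printed, none is executed.
# All names below are hypothetical placeholders.
RAID=raid0          # RAID device
B=/dev/sd5a         # failed component, later re-plugged as B'
C=/dev/sd7a         # disc from the cupboard, added as hot spare
C_TARGET=7          # SCSI target of C on scsibus0
C_LUN=0             # LUN of C

step() { echo "+ $*"; }

plan() {
    step scsictl scsibus0 scan all all                  # make the kernel see C
    step raidctl -a "$C" "$RAID"                        # add C as a hot spare
    step raidctl -F "$B" "$RAID"                        # fail B, reconstruct onto C
    # ... B is pulled, gets new firmware, and returns as B' ...
    step raidctl -B "$RAID"                             # copyback from the spare to B'
    step raidctl -r "$C" "$RAID"                        # remove C from the spares
    step scsictl scsibus0 detach "$C_TARGET" "$C_LUN"   # detach C before unplugging
}

plan
```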

My question was about 1. B', 2. C or 3. A failing during the copyback.

> there was a crop of bad Seagate 500GB disks for a while and they had 
> a tendency to fail en masse at the same time.
My working hypothesis for some five years has been that all Seagate discs 
are bad and bound to fail. We had a series of SATA 250G (the example above 
is about SAS 146K) drives that failed the same way (dozens of them), 
got most of them replaced on warranty and had the replacements failing 
the same way again.


Re: RAIDframe: what if a disc fails during copyback

2020-10-29 Thread Brian Buhrow
Hello.  Note that RAIDframe's notion of a hot spare is somewhat
different than other software raid systems in that once you reboot after
copying to a hot spare, that hot spare becomes just another component in
the raid set.  In other words, it loses its hot spare designation and you
should treat it as you would any other component.  That means that raidctl
-R, which reconstructs in place onto an existing component, can be used to
replace the spare with the original disk now that you have it repaired.

Assuming the original component ('A' in Mouse's example) is still good,
if 'B' fails during the reconstruction, you're left with a single-component
RAID-1 system again.  If 'A' fails during the copy, you're left with some
corrupt data, though the system will not panic and you'll be able to
salvage what you can from the raid.  Unfortunately, I've been caught in
this situation more times than I'd like to say -- there was a crop of bad
Seagate 500GB disks for a while and they had a tendency to fail en masse at
the same time.

-thanks
-Brian

On Oct 29,  1:53pm, Mouse wrote:
} Subject: Re: RAIDframe: what if a disc fails during copyback
} > In a RAIDframe RAID-1, a disc failed and I reconstructed on a spare.
} > Now I want to replace the failed component (actually by the same
} > disc, which needed a firmware update) and want to copyback to it.
} 
} So, let me make sure I understand you correctly.
} 
} So you have drives A, B, and C.  A and B were live.  Let's say B is the
} one that failed.  You reconstructed onto C and have been running with A
} and C.
} 
} Now you have a new B (which in this case is the same hardware with new
} firmware) and want to put it back into service.  I'm not sure whether
} you want to put it into service in place of A or in place of C.  I'm
} going to assume C.
} 
} So, you'd pull C, replace it with B, and initiate a reconstruct, which
} for RAID 1 means copying from A to B.  Right?
} 
} > How will RAIDframe behave if, during the copyback:
} > 1. The replaced component fails
} 
} Is this B?  Or C?  Because it sounds to me as though C would be out of
} service at this point.
} 
} > 2. The spare fails
} 
} Which is "the spare"?  Are you running with a hot spare?  I think a hot
} spare failing means nothing until/unless RAIDframe tries to fall back
} on it.
} 
} > 3. The other, non-replaced component fails?
} 
} That would be A?
} 
} > Specifically: Is there any scenario (other than more than one disc
} > failing) that will put the RAID into a non-redundant state?  I guess
} > 3. may?
} 
} For RAID 1 in general, as soon as you have only one non-failed drive,
} you have no redundancy.  Based on the assumption that RAIDframe RAID 1
} cannot handle more than two drives (always true as far as I know, and
} the 9.0 raidctl(8) manpage says it's still true as of 9.0), this means
} that
} 
} - If B fails while copying back to it, you are back to non-redundant
}   operation on A.
} 
} - If A fails while copying back, you have no operational set.  Your
}   only real option is to pull A and B, connect C alone, and fall back
}   to the state of things as of when you pulled it; then re-add A or B
}   and copyback from C.
} 
} - If C fails while copying from A to B, nothing in particular happens
}   except that you don't have the hot spare you thought you did.
} 
} /~\ The ASCII   Mouse
} \ / Ribbon Campaign
}  X  Against HTML  mo...@rodents-montreal.org
} / \ Email! 7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B
>-- End of excerpt from Mouse




Re: RAIDframe: what if a disc fails during copyback

2020-10-29 Thread Edgar Fuß
> So you have drives A, B, and C.  A and B were live.  Let's say B is the
> one that failed.  You reconstructed onto C and have been running with A
> and C.
Yes.

> Now you have a new B (which in this case is the same hardware with new
> firmware) and want to put it back into service.  I'm not sure whether
> you want to put it into service in place of A or in place of C.  I'm
> going to assume C.
Yes.

> So, you'd pull C, replace it with B
No. I don't pull C. I re-add B (I have lots of empty slots).

> and initiate a reconstruct
No, a copyback (raidctl -B).

> which for RAID 1 means copying from A to B.
I don't know. I would expect it to copy from C to B.

> > 1. The replaced component fails
> 
> Is this B?  Or C?  Because it sounds to me as though C would be out of
> service at this point.
I mean B.

> > 2. The spare fails
> 
> Which is "the spare"?
C.

> Are you running with a hot spare?
Yes. I added C as a hot spare when B failed and started a reconstruction.

> I think a hot spare failing means nothing until/unless RAIDframe 
> tries to fall back on it.
Yes.

> > 3. The other, non-replaced component fails?
> 
> That would be A?
Yes.

> Based on the assumption that RAIDframe RAID 1 cannot handle more than 
> two drives (always true as far as I know, and the 9.0 raidctl(8) manpage 
> says it's still true as of 9.0)
The RAID-1 I'm speaking of does only have two components, but I did operate 
a RAIDframe RAID-1 on three components with 5.1 or something.


Re: RAIDframe: what if a disc fails during copyback

2020-10-29 Thread Mouse
> In a RAIDframe RAID-1, a disc failed and I reconstructed on a spare.
> Now I want to replace the failed component (actually by the same
> disc, which needed a firmware update) and want to copyback to it.

So, let me make sure I understand you correctly.

So you have drives A, B, and C.  A and B were live.  Let's say B is the
one that failed.  You reconstructed onto C and have been running with A
and C.

Now you have a new B (which in this case is the same hardware with new
firmware) and want to put it back into service.  I'm not sure whether
you want to put it into service in place of A or in place of C.  I'm
going to assume C.

So, you'd pull C, replace it with B, and initiate a reconstruct, which
for RAID 1 means copying from A to B.  Right?

> How will RAIDframe behave if, during the copyback:
> 1. The replaced component fails

Is this B?  Or C?  Because it sounds to me as though C would be out of
service at this point.

> 2. The spare fails

Which is "the spare"?  Are you running with a hot spare?  I think a hot
spare failing means nothing until/unless RAIDframe tries to fall back
on it.

> 3. The other, non-replaced component fails?

That would be A?

> Specifically: Is there any scenario (other than more than one disc
> failing) that will put the RAID into a non-redundant state?  I guess
> 3. may?

For RAID 1 in general, as soon as you have only one non-failed drive,
you have no redundancy.  Based on the assumption that RAIDframe RAID 1
cannot handle more than two drives (always true as far as I know, and
the 9.0 raidctl(8) manpage says it's still true as of 9.0), this means
that

- If B fails while copying back to it, you are back to non-redundant
   operation on A.

- If A fails while copying back, you have no operational set.  Your
   only real option is to pull A and B, connect C alone, and fall back
   to the state of things as of when you pulled it; then re-add A or B
   and copyback from C.

- If C fails while copying from A to B, nothing in particular happens
   except that you don't have the hot spare you thought you did.

/~\ The ASCII Mouse
\ / Ribbon Campaign
 X  Against HTML  mo...@rodents-montreal.org
/ \ Email!   7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B