Did the asynchronous write stuff (as it was in fr1) ever get into kernel
software raid?
I see from the raid acceleration (ioat) patching going on that some
sort of asynchronicity is being contemplated, but blessed if I can make
head or tail of the descriptions I've read. It looks vaguely like
While travelling the last few days, a theory has occurred to me to
explain this sort of thing ...
A user has sent me a ps ax output showing an enbd client daemon
blocked in get_active_stripe (I presume in raid5.c).
ps ax -of,uid,pid,ppid,pri,ni,vsz,rss,wchan:30,stat,tty,time,command
Also sprach Gabor Gombas:
On Thu, Aug 17, 2006 at 08:28:07AM +0200, Peter T. Breuer wrote:
1) if the network disk device has decided to shut down wholesale
(temporarily) because of lack of contact over the net, then
retries and writes are _bound_ to fail for a while, so
Hi Neil ..
Also sprach Neil Brown:
On Wednesday August 16, [EMAIL PROTECTED] wrote:
1) I would like raid request retries to be done with exponential
delays, so that we get a chance to overcome network brownouts.
2) I would like some channel of communication to be available
with
Also sprach Molle Bestefich:
Peter T. Breuer wrote:
I would like raid request retries to be done with exponential
delays, so that we get a chance to overcome network brownouts.
Hmm, I don't think MD even does retries of requests.
I had a robust read patch in FR1, and I thought Neil
Also sprach Molle Bestefich:
Peter T. Breuer wrote:
You want to hurt performance for every single MD user out there, just
There's no performance drop! Exponentially staged retries on failure
are standard in all network protocols
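The exponentially staged retry being argued for could be sketched like this (a minimal userspace sketch; the function name, base delay and cap are illustrative, not anything from the md driver):

```c
#include <limits.h>

/* Hypothetical helper: delay (in ticks) before the n-th retry of a
 * failed request.  Doubles on each attempt, capped so that a long
 * network brownout cannot push the delay to absurd lengths -- which
 * is also what keeps the retry traffic a bounded fraction of the
 * channel bandwidth. */
unsigned long retry_delay(unsigned int attempt, unsigned long base,
                          unsigned long cap)
{
    unsigned long d = base;
    while (attempt-- > 0) {
        if (d > cap / 2)        /* avoid overflow and honour the cap */
            return cap;
        d *= 2;
    }
    return d < cap ? d : cap;
}
```

A caller would sleep for retry_delay(n, base, cap) between the n-th and (n+1)-th resubmission of the same request.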
Also sprach Molle Bestefich:
See above. The problem is generic to fixed bandwidth transmission
channels, which, in the abstract, is everything. As soon as one
does retransmits one has a kind of obligation to keep retransmissions
down to a fixed maximum percentage of the potential
Molle Bestefich [EMAIL PROTECTED] wrote:
There seems to be an obvious lack of a properly thought out interface
to notify userspace applications of MD events (disk failed -- go
light a LED, etc).
Well, that's probably truish. I've been meaning to ask for a per-device
sysctl interface for some
Robbie Hughes [EMAIL PROTECTED] wrote:
Number   Major   Minor   RaidDevice   State
   0       0       0        -1        removed
   1      22      66         1        active sync   /dev/hdd2
   2       3       3         0        spare         /dev/hda3
The main problem i have now is that
tmp [EMAIL PROTECTED] wrote:
I've read man mdadm and man mdadm.conf but I certainly don't have
an overview of software RAID.
Then try using it instead of / as well as reading about it, and you will
obtain a more comprehensive understanding.
OK. The HOWTO describes mostly a raidtools context,
Doug Ledford [EMAIL PROTECTED] wrote:
Now, if I recall correctly, Peter posted a patch that changed this
semantic in the raid1 code. The raid1 code does not complete a write to
the upper layers of the kernel until it's been completed on all devices
and his patch made it such that as
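The "complete to the upper layers only when all devices have finished" rule Doug describes has a familiar shape: each in-flight write carries a count of mirrors still pending, and the last per-mirror completion is the one that acknowledges upwards. A userspace sketch (all names invented for illustration):

```c
#include <stdatomic.h>
#include <stdbool.h>

struct write_bio {
    atomic_int remaining;   /* mirrors still to acknowledge this write */
    bool upper_done;        /* set exactly once, by the last mirror */
};

void write_begin(struct write_bio *w, int nmirrors)
{
    atomic_store(&w->remaining, nmirrors);
    w->upper_done = false;
}

/* Called once per mirror when its copy of the write finishes.  Only
 * the completion that drops the count to zero signals the upper layer,
 * so the filesystem never sees the write "done" while a mirror is
 * still working on it. */
void mirror_end_io(struct write_bio *w)
{
    if (atomic_fetch_sub(&w->remaining, 1) == 1)
        w->upper_done = true;   /* all mirrors done: ack upwards */
}
```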
I forgot to say thanks! Thanks for the breakdown.
Doug Ledford [EMAIL PROTECTED] wrote:
(of event count increment)
I think the best explanation is this: any change in array state that
OK ..
would necessitate kicking a drive out of the array if it didn't also
make this change in state with
Schuett Thomas EXT [EMAIL PROTECTED] wrote:
And here the fault happens:
By chance, it reads the transaction log from hda, then sees, that the
transaction was finished, and clears the overall unclean bit.
This cleaning is a write, so it goes to *both* HDs.
Don't put the journal on the raid
Neil Brown [EMAIL PROTECTED] wrote:
Due to the system crash the data on hdb is completely ignored. Data
Neil - can you explain the algorithm that stamps the superblocks with
an event count, once and for all? (until further amendment :-).
It goes without saying that sb's are not stamped at
Neil Brown [EMAIL PROTECTED] wrote:
On Tuesday March 29, [EMAIL PROTECTED] wrote:
Don't put the journal on the raid device, then - I'm not even sure why
people do that! (they probably have a reason that is good - to them).
Not good advice. DO put the journal on a raid device. It is
Luca Berra [EMAIL PROTECTED] wrote:
On Tue, Mar 29, 2005 at 01:29:22PM +0200, Peter T. Breuer wrote:
Neil Brown [EMAIL PROTECTED] wrote:
Due to the system crash the data on hdb is completely ignored. Data
Neil - can you explain the algorithm that stamps the superblocks with
an event count
md_exit calls mddev_put on each mddev during module exit. mddev_put
calls blk_put_queue under spinlock, although it can sleep (it clearly
calls kblockd_flush). This patch lifts the spinlock to do the flush.
--- md.c.orig Fri Dec 24 22:34:29 2004
+++ md.c Sun Mar 27 14:14:22 2005
@@
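The shape of the fix, modelled in userspace (a sketch only; the spinlock, struct and function names are invented, and the real code operates on mddev/request-queue objects): take what you need under the lock, drop the lock, and only then do the work that may sleep.

```c
#include <stdatomic.h>

/* Tiny spinlock standing in for the kernel's, to show the pattern. */
static atomic_flag devs_lock = ATOMIC_FLAG_INIT;
static void spin_lock(void)   { while (atomic_flag_test_and_set(&devs_lock)) ; }
static void spin_unlock(void) { atomic_flag_clear(&devs_lock); }

struct dev {
    int refcount;
    int flushed;    /* stands in for the kblockd flush having run */
};

/* May block; must therefore be called with the lock NOT held. */
static void flush_and_release(struct dev *d)
{
    d->flushed = 1;     /* imagine this sleeps, like kblockd_flush() */
}

void dev_put(struct dev *d)
{
    spin_lock();
    int last = (--d->refcount == 0);
    spin_unlock();              /* lift the lock BEFORE the flush */
    if (last)
        flush_and_release(d);
}
```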
Luca Berra [EMAIL PROTECTED] wrote:
we can have a series of failures which must be accounted for and dealt
with according to a policy that might be site specific.
A) Failure of the standby node
A.1) the active is allowed to continue in the absence of a data replica
A.2) disk writes from
Luca Berra [EMAIL PROTECTED] wrote:
If we want to do data-replication, access to the data-replicated device
should be controlled by the data replication process (*), md does not
guarantee this.
Well, if one writes to the md device, then md does guarantee this - but
I find it hard to parse the
Paul Clements [EMAIL PROTECTED] wrote:
system A
    [raid1]
    /      \
[disk]    [nbd]  --  system B
2) you're writing, say, block 10 to the raid1 when A crashes (block 10
is dirty in the bitmap, and you don't know whether it got written to the
disk on A or B, neither, or both)
Paul Clements [EMAIL PROTECTED] wrote:
At any rate, this is all irrelevant given the second part of that email
reply that I gave. You still have to do the bitmap combining, regardless
of whether two systems were active at the same time or not.
As I understand it, you want both bitmaps in
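The bitmap combining Paul refers to comes down to a bitwise OR: a region dirty on either node after the split must be resynced, so the combined map is the union of the two. A minimal sketch (plain C, names illustrative):

```c
#include <stddef.h>
#include <stdint.h>

/* Combine two write-intent bitmaps after a split: a chunk that is
 * dirty on either side may differ between the mirrors, so it goes
 * into the set to be resynced. */
void combine_bitmaps(uint8_t *dst, const uint8_t *a, const uint8_t *b,
                     size_t nbytes)
{
    for (size_t i = 0; i < nbytes; i++)
        dst[i] = a[i] | b[i];   /* dirty on either side => resync */
}
```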
Paul Clements [EMAIL PROTECTED] wrote:
OK - thanks for the reply, Paul ...
Peter T. Breuer wrote:
But why don't we already know from the _single_ bitmap on the array
node (the node with the array) what to rewrite in total? All writes
must go through the array. We know how many didn't go
Mario Holbe [EMAIL PROTECTED] wrote:
Peter T. Breuer [EMAIL PROTECTED] wrote:
Yes, you can sync them by writing any one of the two mirrors to the
other one, and need do so only on the union of the mapped data areas,
As far as I understand the issue, this is exactly what should be
possible
Michael Tokarev [EMAIL PROTECTED] wrote:
Luca Berra wrote:
On Fri, Mar 18, 2005 at 02:42:55PM +0100, Lars Marowsky-Bree wrote:
The problem is for multi-nodes, both sides have their own bitmap. When a
split scenario occurs, and both sides begin modifying the data, that
bitmap needs to
Michael Tokarev [EMAIL PROTECTED] wrote:
Ok, you intrigued me enough already.. what's the FR1 patch? I want
to give it a try... ;) Especially I'm interested in the Robust Read
thing...
That was published on this list a few weeks ago (probably needs updating,
but I am sure you can help :-).
Lars Marowsky-Bree [EMAIL PROTECTED] wrote:
On 2005-03-19T12:43:41, Peter T. Breuer [EMAIL PROTECTED] wrote:
Well, there is the right data from our point of view, and it is what
should be on (one/both?) devices by now. One doesn't get to recover that
right data by copying one disk over
Lars Marowsky-Bree [EMAIL PROTECTED] wrote:
On 2005-03-19T14:27:45, Peter T. Breuer [EMAIL PROTECTED] wrote:
Which one of the datasets you choose you could either arbitate via some
automatic mechanisms (drbd-0.8 has a couple) or let a human decide.
But how on earth can you get
Michael Tokarev [EMAIL PROTECTED] wrote:
Peter T. Breuer wrote:
[]
The patch was originally developed for 2.4, then ported to 2.6.3, and
then to 2.6.8.1. Neil has recently been doing stuff, so I don't
think it applies cleanly
Lars Marowsky-Bree [EMAIL PROTECTED] wrote:
On 2005-03-19T16:06:29, Peter T. Breuer [EMAIL PROTECTED] wrote:
I'm cutting out those parts of the discussion which are irrelevant (or
which I don't consider worth pursuing; maybe you'll find someone else
to explain with more patience).
Probably
Guy [EMAIL PROTECTED] wrote:
I agree, but I don't think a block device can do a re-sync without
corrupting both. How do you merge a superset at the block level? AND the 2
Don't worry - it's just a one-way copy done efficiently (i.e., leaving
out all the blocks known to be unmodified both
Michael Tokarev [EMAIL PROTECTED] wrote:
Uh OK. As I recall one only needs to count, one doesn't need a bitwise
map of what one has dealt with.
Well. I see read_balance() is now used to resubmit reads. There's
a reason to use it instead of choosing next disk, I think.
I can't think of
Lars Marowsky-Bree [EMAIL PROTECTED] wrote:
On 2005-03-18T13:52:54, Peter T. Breuer [EMAIL PROTECTED] wrote:
(proviso - I didn't read the post where you set out the error
situations, but surely, on theoretical grounds, all that can happen is
that the bitmap causes more to be synced than
Mario Holbe [EMAIL PROTECTED] wrote:
Peter T. Breuer [EMAIL PROTECTED] wrote:
different (legitimate) data. It doesn't seem relevant to me to consider
if they are equally up to date wrt the writes they have received. They
will be in the wrong even if they are up to date.
The goal
Neil Brown [EMAIL PROTECTED] wrote:
On Tuesday March 8, [EMAIL PROTECTED] wrote:
Have you remodelled the md/raid1 make_request() fn?
Somewhat. Write requests are queued, and raid1d submits them when
it is happy that all bitmap updates have been done.
OK - so a slight modification of the
Can Sar [EMAIL PROTECTED] wrote:
the driver just cycles through all devices that make up a soft raid
device and just calls generic_make_request on them. Is this correct, or
does some other function involved in the write process (starting from
the soft raid level down) actually wait on
NeilBrown [EMAIL PROTECTED] wrote:
The second two fix bugs that were introduced by the recent
bitmap-based-intent-logging patches and so are not relevant
Neil - can you describe for me (us all?) what is meant by
intent logging here.
Well, I can guess - I suppose the driver marks the bitmap
Paul Clements [EMAIL PROTECTED] wrote:
Peter T. Breuer wrote:
Neil - can you describe for me (us all?) what is meant by
intent-logging here.
Since I wrote a lot of the code, I guess I'll try...
Hi, Paul. Thanks.
Well, I can guess - I suppose the driver marks the bitmap before a write
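That guess can be made concrete with a sketch of the mark/clear protocol (hypothetical names and chunk size; the real driver's bitmap granularity and persistence are more involved): the bit covering a region is set before the write is issued and cleared only lazily once the write has reached all mirrors, so after a crash only set bits need resync.

```c
#include <stdbool.h>
#include <stdint.h>

#define CHUNK_SHIFT 6   /* hypothetical: one bit per 64-sector chunk */

/* Must reach stable storage BEFORE the data write is issued. */
static inline void intent_mark(uint8_t *bitmap, uint64_t sector)
{
    uint64_t bit = sector >> CHUNK_SHIFT;
    bitmap[bit / 8] |= 1u << (bit % 8);
}

/* Safe to do lazily, after the write completed on all mirrors. */
static inline void intent_clear(uint8_t *bitmap, uint64_t sector)
{
    uint64_t bit = sector >> CHUNK_SHIFT;
    bitmap[bit / 8] &= ~(1u << (bit % 8));
}

/* After a crash: does this region need resyncing? */
static inline bool intent_dirty(const uint8_t *bitmap, uint64_t sector)
{
    uint64_t bit = sector >> CHUNK_SHIFT;
    return bitmap[bit / 8] & (1u << (bit % 8));
}
```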
[EMAIL PROTECTED] wrote:
I've been going through the MD driver source, and to tell the truth, can't
figure out where the read error is detected and how to hook that event and
force a re-write of the failing sector. I would very much appreciate it if
I did that for RAID1, or at least most of
berk walker [EMAIL PROTECTED] wrote:
What might the proper [or functional] syntax be to do this?
I'm running 2.6.10-1.766-FC3, and mdadm 1.90.
Substitute the word missing for the corresponding device in the
mdadm create command.
(quotes manual page)
To create a degraded array in which
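The substitution reads like this in practice (device names illustrative; "missing" is the literal keyword):

```shell
# Create a degraded two-disk RAID1 with only one real member;
# the word "missing" holds the second slot open.
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/hda1 missing

# The real second disk can be added later and will resync:
mdadm /dev/md0 --add /dev/hdc1
```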
J. David Beutel [EMAIL PROTECTED] wrote:
I'd like to try this patch
http://marc.theaimsgroup.com/?l=linux-raid&m=110704868115609&w=2 with
EVMS BBR.
Has anyone tried it on 2.6.10 (with FC2 1.9 and EVMS patches)? Has
anyone tried the rewrite part at all? I don't know md or the kernel or
Michael Tokarev [EMAIL PROTECTED] wrote:
(note raid5 performs faster than a single drive; that is to be expected,
as it is possible to write to several drives in parallel).
Each raid5 write must include at least ONE write to a target. I think
you're saying that the writes go to different targets from
J. David Beutel [EMAIL PROTECTED] wrote:
Peter T. Breuer wrote, on 2005-Feb-23 1:50 AM:
Quite possibly - I never tested the rewrite part of the patch, just
wrote it to indicate how it should go and stuck it in to encourage
others to go on from there. It's disabled by default. You almost
Michael Tokarev [EMAIL PROTECTED] wrote:
When debugging some other problem, I noticed that
direct-io (O_DIRECT) write speed on a software raid5
And normal write speed (over 10 times the size of ram)?
is terribly slow. Here's a small table just to show
the idea (not numbers by itself as
No email [EMAIL PROTECTED] wrote:
Forgive me as this is probably a silly question and one that has been
answered many times, I have tried to search for the answers but have
ended up more confused than when I started. So thought maybe I could
ask the community to put me out of my misery
Peter T. Breuer [EMAIL PROTECTED] wrote:
Allow me to remind what the patch does: it allows raid1 to proceed
smoothly after a read error on a mirror component, without faulting the
component. If the information is on another component, it will be
returned. If all components are faulty
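The robust-read behaviour described here can be modelled in a few lines (a userspace sketch under invented names; the real patch works on bios inside raid1.c): on a read error, move to the next mirror rather than faulting the component, and fail the request only when every mirror has been tried.

```c
#include <stdbool.h>

typedef bool (*read_fn)(int mirror, long sector, void *buf);

/* Try each mirror in turn, starting from `first`.  Returns true as
 * soon as some mirror satisfies the read; an erroring mirror is
 * skipped, NOT faulted out of the array. */
bool robust_read(read_fn try_read, int nmirrors, int first,
                 long sector, void *buf)
{
    for (int i = 0; i < nmirrors; i++) {
        int mirror = (first + i) % nmirrors;
        if (try_read(mirror, sector, buf))
            return true;
    }
    return false;   /* all components failed for this sector */
}

/* Stub for demonstration only: mirror 0 always errors, others work. */
static bool demo_read(int mirror, long sector, void *buf)
{
    (void)sector; (void)buf;
    return mirror != 0;
}
```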
I've had the opportunity to test the robust read patch that I posted
earlier in the month (10 Jan, Subject: Re: Spares and partitioning huge
disks), and it needs one more change ... I assumed that the raid1 map
function would move a (retried) request to another disk, but it does not,
it always moves
Lars Marowsky-Bree [EMAIL PROTECTED] wrote:
On 2005-01-23T16:13:05, Luca Berra [EMAIL PROTECTED] wrote:
the first one adds an auto=dev parameter
rationale: udev does not create /dev/md* device files, so we need a way
to create them when assembling the md device.
Am I missing something
Luca Berra [EMAIL PROTECTED] wrote:
I believe the correct solution to this would be implementing a char-misc
/dev/mdadm device that mdadm would use instead of the block device,
like device-mapper does. Alas I have no time for this in the foreseeable
future.
It's a generic problem (or
Just a followup ...
Neil said he has never seen disks corrupt spontaneously. I'm just making
the rounds of checking the daily md5sums on one group of machines with
a view to estimating the corruption rates. Here's one of the typical
(one bit) corruptions:
doc013:/usr/oboe/ptb% cmp --verbose
David Dougall [EMAIL PROTECTED] wrote:
If I am running software raid1 and a disk device starts throwing I/O
errors, Is the filesystem supposed to see any indication of this? I
No - not if the error is on only one disk. The first error will fault
the disk from the array and the driver will
Hans Kristian Rosbach [EMAIL PROTECTED] wrote:
On Mon, 2005-01-17 at 17:46, Peter T. Breuer wrote:
Interesting. How did you measure latency? Do you have a script you
could post?
It's part of another application we use internally at work. I'll check
to see whether part of it could be GPL'ed
Hans Kristian Rosbach [EMAIL PROTECTED] wrote:
-It selects the disk that is closest to the wanted sector by remembering
what sector was last requested and what disk was used for it.
-For sequential reads (such as hdparm) it will override and use the
same disk anyways. (sector =
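The two rules above fit in one small function (a sketch; names invented, and real raid1 tracks the length of the previous request rather than assuming it was one sector): pick the mirror whose last-known head position is closest to the wanted sector, except that a sequential continuation stays on the same disk so streaming reads aren't split across mirrors.

```c
#include <stdlib.h>

struct mirror_state {
    long head;          /* last sector served by this mirror */
};

int pick_mirror(const struct mirror_state *m, int n,
                long last_sector, int last_mirror, long sector)
{
    /* Sequential continuation: keep the same disk (hdparm case). */
    if (sector == last_sector + 1)
        return last_mirror;

    /* Otherwise choose the mirror with the shortest seek distance. */
    int best = 0;
    long best_dist = labs(m[0].head - sector);
    for (int i = 1; i < n; i++) {
        long d = labs(m[i].head - sector);
        if (d < best_dist) { best_dist = d; best = i; }
    }
    return best;
}
```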
Michael Tokarev [EMAIL PROTECTED] wrote:
That is all to say: yes indeed, this lack of smart error handling is
a noticeable omission in linux software raid. There are quite a few
(sometimes fatal to the data) failure scenarios that would not have
happened had the smart error handling been in