nolargeio=1 ?

2005-03-14 Thread peter.greis
Greetings,

I found that for reiserfs filesystems the option nolargeio=1 is sometimes added 
to the fstab entry. At first blush this seems to be a workaround for a kernel 
bug. Does anyone have any more information? I am currently running reiserfs on 
lvm2 on raid5.
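
For example, an entry like the following (the device and mount point here are 
made up):

/dev/vg00/data   /data   reiserfs   defaults,nolargeio=1   0 2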

regards,

-Peter


Re: disaster. raid1 drive failure resync=DELAYED why?? please help

2005-03-14 Thread David Greaves
Mitchell Laks wrote:
On Sunday 13 March 2005 10:49 am, David Greaves wrote: [many helpful remarks]
 

David, I am grateful that you were there for me.
 

No probs - we've all been there!
My assessment (correct me if I am wrong) is that I have to rethink my 
architecture. As I continue to work with software raid, I will likely have to 
move the postgresql database to a separate partition, so that I do not mix 
points of failure.

Well, once things are calmer, post your layout and new thinking and I'm 
sure people will offer input.
Amongst other things, mdadm allows you to keep 1 or more hot spares 
in a system and 'share' them between multiple raid1 mirrors.
This kind of trick (learnt by hanging out here) may be the answer to 
multiple failures.
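
As a rough sketch (device names here are made up), the sharing is declared 
in mdadm.conf with a spare-group, and a running 'mdadm --monitor --scan' 
will then move a spare to whichever array in that group loses a disk:

ARRAY /dev/md0 devices=/dev/sda1,/dev/sdb1,/dev/sde1 spare-group=shared
ARRAY /dev/md1 devices=/dev/sdc1,/dev/sdd1 spare-group=shared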

David
PS don't forget the mdadm upgrade.


Re: [PATCH 1/2] md bitmap bug fixes

2005-03-14 Thread Lars Marowsky-Bree
On 2005-03-14T15:43:52, Neil Brown [EMAIL PROTECTED] wrote:

Hi there, just a question about how the bitmap stuff works with
1++-redundancy, say RAID1 with 2 mirrors, or RAID6.

One disk fails and is replaced/reattached, and resync begins. Now
another disk fails and is replaced. Is the bitmap local to each disk?

And in the case of RAID1 with 4 disks (and two of them resyncing), could
disk3 be rebuilt from disk1 and disk4 from disk2 (so as to optimize disk
bandwidth)?


Sincerely,
Lars Marowsky-Brée [EMAIL PROTECTED]

-- 
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business



Q: Moving raid1 array to another host, safe?

2005-03-14 Thread Chris Osicki

Hi

I have two Linux boxes running kernel 2.4.21 with access to two
devices over a fibre channel SAN.
What I'm trying to achieve is host-based mirroring with the ability to
move the storage from one host to another.
On the first host I created a raid1 array, put LVM on it, and created a
filesystem. To move the storage to the second host I do the following
(on the first host):

deactivate volume group: 
vgchange -an dxvg
stop array: 
mdadm --misc --stop /dev/md0

Then on the second host:

assemble the array: 
mdadm --assemble /dev/md0 /dev/emcpowera  /dev/emcpowerb
activate the volume group: 
vgchange -ay dxvg

This procedure seems to work OK.
However, I'm asking myself how safe what I'm doing really is.
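
For what it's worth, an extra sanity check along these lines (a sketch only, 
using the same device names as above) would be to confirm that the array is 
really stopped on the first host, and inspect the superblocks from the second 
host before assembling:

cat /proc/mdstat                (md0 should no longer be listed on the first host)
mdadm --examine /dev/emcpowera  (inspect the md superblock from the second host)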

Thanks for your time.

Regards,
Chris


Re: [PERFORM] Postgres on RAID5

2005-03-14 Thread Greg Stark

Alex Turner [EMAIL PROTECTED] writes:

 a 14 drive stripe will max out the PCI bus long before anything else,

Hopefully anyone with a 14 drive stripe is using some combination of 64-bit
PCI-X cards running at 66MHz...
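
(Rough numbers for illustration: plain 32-bit/33MHz PCI tops out around
133MB/s theoretical, while 14 drives streaming 40-50MB/s each add up to
550-700MB/s; a 64-bit/66MHz PCI-X slot raises the bus ceiling to roughly
533MB/s.)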

 the only reason for a stripe this size is to get a total accessible
 size up.  

Well, many drives also cut average latency. So even if you have no need for
more bandwidth, you still benefit from a lower average response time by adding
more drives.

-- 
greg



Re: [PERFORM] Postgres on RAID5

2005-03-14 Thread Michael Tokarev
Arshavir Grigorian wrote:
Alex Turner wrote:
[]
Well, by putting the pg_xlog directory on a separate disk/partition, I 
was able to increase this rate to about 50 or so per second (still 
pretty far from your numbers). Next I am going to try putting the 
pg_xlog on a RAID1+0 array and see if that helps.
pg_xlog is written synchronously, right?  It should be, or else the reliability
of the database would be in serious question...

I posted a question on Feb-22 here in linux-raid, titled *terrible*
direct-write performance with raid5.  There's a problem with the write
performance of raid4/5/6 arrays, which is due to the design.

Consider a raid5 array (raid4 is exactly the same, and for raid6, just
double the parity writes) with N data blocks and 1 parity block per stripe.
When a portion of data is written, the parity block must be updated too,
so the stripe stays consistent and recoverable.  And here the size of
the write plays a very significant role.  If your write is smaller
than chunk_size*N (N = number of data blocks in a stripe), then in order
to calculate correct parity you have to read data from the remaining
drives.  The only case where you don't need to read from the other
drives is when you write exactly chunk_size*N AND the write is
block-aligned.  By default, chunk_size is 64Kb (the minimum is 4Kb).
So the only reasonable direct-write size for N data drives is 64Kb*N,
or else the raid code has to read the missing data to calculate the
parity block.  Of course, in 99% of cases you're writing in much smaller
sizes, say 4Kb or so.  And then, the more drives you have, the
LESS write speed you get.
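
As a rough illustration (the path is made up, and this assumes a dd new
enough to support oflag=direct): on a 5-drive raid5 with the default 64Kb
chunk there are N = 4 data blocks per stripe, so a full stripe is
64Kb*4 = 256Kb.  Small direct writes force the read-modify-write path,
while full-stripe-sized, aligned writes can avoid the extra reads:

dd if=/dev/zero of=/mnt/raid/testfile bs=4k count=10000 oflag=direct
dd if=/dev/zero of=/mnt/raid/testfile bs=256k count=200 oflag=direct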

When using the O/S buffer and filesystem cache, the system has many
more chances to re-order requests and sometimes even to omit the reads
entirely (for example when you perform many sequential writes
without a sync in between), so buffered writes can be much faster.
But not direct or synchronous writes, again especially when you're
doing a lot of sequential writes...

So to me it looks like an inherent problem of the raid5 architecture
with respect to database-like workloads: databases tend to use synchronous
or direct writes to ensure good data consistency.

For pgsql, which (I don't know for sure, but reportedly) uses synchronous
writes only for the transaction log, it is a good idea to put that log
on a raid1 or raid10 array, but NOT on a raid5 array.
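
A minimal sketch of doing that with a symlink (the paths are hypothetical,
and the database must be stopped first):

/etc/init.d/postgresql stop
mv /var/lib/pgsql/data/pg_xlog /raid1/pg_xlog
ln -s /raid1/pg_xlog /var/lib/pgsql/data/pg_xlog
/etc/init.d/postgresql start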

Just IMHO, of course.
/mjt


Re: [PATCH 1/2] md bitmap bug fixes

2005-03-14 Thread Neil Brown
On Monday March 14, [EMAIL PROTECTED] wrote:
 On 2005-03-14T21:22:57, Neil Brown [EMAIL PROTECTED] wrote:
 
   Hi there, just a question about how the bitmap stuff works with
   1++-redundancy, say RAID1 with 2 mirrors, or RAID6.
  I assume you mean RAID1 with 3 drives (there isn't really one main
  drive and all the others are mirrors - all drives are nearly equal).
 
 Yeah, that's what I meant.
 
 (BTW, if they are all equal, how do you figure out where to sync
 from?

It arbitrarily chooses one.  It doesn't matter which.  The code
currently happens to choose the first, but this is not a significant choice.

 Isn't the first one also the first one to receive the writes, so
 unless it's somehow identified as bad, it's the one which will have the
 best data?)

Data is written to all drives in parallel (the request to the first
might be launched slightly before the second, but the difference is
insignificant compared to the time it takes for the write to
complete). 

There is no such thing as the best data.
Consider the situation where you want to make a transactional update
to a file that requires writing two blocks.
If the system dies while writing the first, the before data is
better.  If it dies while writing the second, the after data is
better.

 
  We haven't put any significant work into bitmap intent logging for
  levels other than raid1, so some of the answer may be pure theory.
 
 OK.
 
 (Though in particular for raid5 with its expensive parity, and raid6 with
 the even more expensive parity, this seems desirable.)

Yes.  We will get there.  We just aren't there yet so I cannot say
with confidence how it will work.

 
 I think each disk needs to have its own bitmap in the long run. On
 start, we need to merge them.

I think any scheme that involves multiple bitmaps would introduce
too much complexity.  Certainly your examples sound very far-fetched
(as I think you admitted yourself).  But I always try to be open to
new ideas.

NeilBrown


RE: [PERFORM] Postgres on RAID5

2005-03-14 Thread Guy
You said:
If your write size is smaller than chunk_size*N (N = number of data blocks
in a stripe), in order to calculate correct parity you have to read data
from the remaining drives.

Neil explained it in this message:
http://marc.theaimsgroup.com/?l=linux-raid&m=108682190730593&w=2

Guy



Re: [PATCH 1/2] md bitmap bug fixes

2005-03-14 Thread Paul Clements
Neil Brown wrote:
On Wednesday March 9, [EMAIL PROTECTED] wrote:

avoid setting of sb->events_lo = 1 when creating a 0.90 superblock -- it 
doesn't seem to be necessary and it was causing the event counters to 
start at 4 billion+ (events_lo is actually the high part of the events 
counter, on little endian machines anyway)

events_lo really should be the low part of the counter, and it is for
me; something funny must be happening for you...
Yikes...compiling mdadm against the kernel headers. I was trying to 
simplify things and avoid the inevitable breakage that occurs when 
kernel and mdadm headers get out of sync, but alas, it's causing 
problems because of differences between kernel and userland header 
definitions...my mdadm was wrongly assuming big endian for the events 
counters.


#if'ed out the super1 definition which is now in the kernel headers

I don't like this.  I don't want mdadm to include the kernel raid headers.
I want it to use its own.
Yes, I agree, see above... :/

included sys/time.h to avoid compile error

I wonder why I don't get an error... What error do you get?
The machine I happen to be compiling on has old gcc/libc (2.91) and it's 
not getting the definition for one of the time structures (I forget 
which...).

Thanks,
Paul


Re: [PERFORM] Postgres on RAID5

2005-03-14 Thread Jim Buttafuoco
All,

I have a 13 disk (250G each) software raid 5 set using one 16-port Adaptec 
SATA controller.  I am very happy with the performance.  The reason I went 
with the 13 disk raid 5 set was for the space, NOT performance.  I have a 
single postgresql database that is over 2 TB, with about 500 GB free on the 
disk.  This raid set performs about the same as my ICP SCSI raid controller 
(also with raid 5).

That said, now that postgresql 8 has tablespaces, I would NOT create 1 single 
raid 5 set, but 3 smaller sets.  I also do NOT have my WAL and logs on this 
raid set, but on a smaller 2 disk mirror.
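
For what it's worth, a minimal sketch of such a split using 8.0 tablespaces 
(the paths and names here are made up, and the directory must already exist 
and be owned by the postgres user):

psql -c "CREATE TABLESPACE raidset2 LOCATION '/raid_b/pgdata'"
psql -c "CREATE TABLE archive_data (id int, payload text) TABLESPACE raidset2"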

Jim
