nolargeio=1 ?
Greetings,

I found that for reiserfs filesystems the option nolargeio=1 is sometimes added to the fstab entry. At first blush this seems to be a workaround for a kernel bug. Does anyone have any more information? I am currently running reiserfs on lvm2 on raid5.

regards,
-Peter
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: disaster. raid1 drive failure rsync=DELAYED why?? please help
Mitchell Laks wrote:
> On Sunday 13 March 2005 10:49 am, David Greave wrote:
> > Many helpful remarks
> David, I am grateful that you were there for me.

No probs - we've all been there!

> My assessment (correct me if I am wrong) is that I have to rethink my
> architecture. As I continue to work with software raid, I will likely
> have to move the postgresql database to a separate partition, so I will
> not be mixing points of failure.

Well, once things are calmer, post your layout and new thinking and I'm sure people will input. Amongst other things, mdadm can allow you to keep one or more hot spares in a system that you can 'share' between multiple raid1 mirrors. This kind of trick (learnt by hanging out here) may be the answer to multiple failures.

David

PS don't forget the mdadm upgrade.
Re: [PATCH 1/2] md bitmap bug fixes
On 2005-03-14T15:43:52, Neil Brown [EMAIL PROTECTED] wrote:

Hi there, just a question about how the bitmap stuff works with 1++-redundancy, say RAID1 with 2 mirrors, or RAID6. One disk fails and is replaced/reattached, and resync begins. Now another disk fails and is replaced. Is the bitmap local to each disk? And in the case of RAID1 with 4 disks (two of them resyncing), could disk3 be rebuilt from disk1 and disk4 from disk2 (so as to optimize disk bandwidth)?

Sincerely,
    Lars Marowsky-Brée [EMAIL PROTECTED]

--
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business
Q: Moving raid1 array to another host, safe?
Hi,

I have two Linux boxes running kernel 2.4.21, both with access to two devices over a fibre channel SAN. What I'm trying to achieve is host-based mirroring with the ability to move the storage from one host to another. On the first host I created a raid1 array, put LVM on it, and created a filesystem. To move the storage to the second host I do the following (on the first host):

deactivate the volume group: vgchange -an dxvg
stop the array: mdadm --misc --stop /dev/md0

Then on the second host:

assemble the array: mdadm --assemble /dev/md0 /dev/emcpowera /dev/emcpowerb
activate the volume group: vgchange -ay dxvg

This procedure seems to be working OK. However, I'm asking myself: how safe is what I'm doing?

Thanks for your time.

Regards,
Chris
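Chris's steps, collected into one place as a rough sketch (device and VG names taken from his message; this assumes nothing else on either host holds the array or volume group open, and is not a tested failover script):

```shell
#!/bin/sh
# On the first host: quiesce and release the storage.
vgchange -an dxvg                  # deactivate the volume group
mdadm --misc --stop /dev/md0       # stop the raid1 array

# On the second host: take over the storage.
mdadm --assemble /dev/md0 /dev/emcpowera /dev/emcpowerb
vgchange -ay dxvg                  # activate the volume group
```

The obvious hazard is the array being assembled on both hosts at the same time, which would corrupt it; whatever drives this procedure needs to guarantee the first host has really released the devices before the second assembles.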
Re: [PERFORM] Postgres on RAID5
Alex Turner [EMAIL PROTECTED] writes:
> a 14 drive stripe will max out the PCI bus long before anything else,

Hopefully anyone with a 14 drive stripe is using some combination of 64-bit PCI-X cards running at 66MHz...

> the only reason for a stripe this size is to get a total accessible
> size up.

Well, many drives also cut average latency. So even if you have no need for more bandwidth, you still benefit from a lower average response time by adding more drives.

--
greg
Re: [PERFORM] Postgres on RAID5
Arshavir Grigorian wrote:
> Alex Turner wrote:
> []
> Well, by putting the pg_xlog directory on a separate disk/partition, I
> was able to increase this rate to about 50 or so per second (still
> pretty far from your numbers). Next I am going to try putting the
> pg_xlog on a RAID1+0 array and see if that helps.

pg_xlog is written synchronously, right? It should be, or else the reliability of the database will be in serious question... I posted a question on Feb-22 here in linux-raid, titled "*terrible* direct-write performance with raid5". There's a problem with the write performance of a raid4/5/6 array, which is due to the design.

Consider a raid5 array (raid4 will be exactly the same, and for raid6, just double the parity writes) with N data blocks and 1 parity block. At the time of writing a portion of data, the parity block should be updated too, to stay consistent and recoverable. And here, the size of the write plays a very significant role. If your write size is smaller than chunk_size*N (N = number of data blocks in a stripe), in order to calculate correct parity you have to read data from the remaining drives. The only case where you don't need to read data from the other drives is when you're writing exactly chunk_size*N bytes, AND the write is block-aligned. By default, chunk_size is 64Kb (min is 4Kb). So the only reasonable direct-write size for N drives is 64Kb*N, or else the raid code will have to read the missing data to calculate the parity block. Of course, in 99% of cases you're writing in much smaller sizes, say 4Kb or so. And here, the more drives you have, the LESS write speed you will have.

When using the O/S buffer and filesystem cache, the system has many more chances to re-order requests and sometimes even omit reading entirely (when you perform many sequential writes without a sync in between, for example), so buffered writes may be much faster. But not direct or synchronous writes, again especially when you're doing a lot of sequential writes...

So to me it looks like an inherent problem of the raid5 architecture wrt database-like workloads -- databases tend to use synchronous or direct writes to ensure good data consistency. For pgsql, which (I don't know for sure, but reportedly) uses synchronous writes only for the transaction log, it is a good idea to put that log on a raid1 or raid10 array, but NOT on a raid5 array. Just IMHO of course.

/mjt
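Michael's point about small writes can be sketched numerically. The following is a toy model of the two raid5 write paths (my own illustration, not the md driver's actual code), assuming parity is the plain XOR of the data chunks in a stripe:

```python
def parity(chunks):
    """RAID5 parity is the byte-wise XOR of all data chunks in the stripe."""
    p = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, byte in enumerate(chunk):
            p[i] ^= byte
    return bytes(p)

def small_write_ios():
    """Sub-stripe write (read-modify-write): read the old data chunk and
    the old parity, XOR the old data out and the new data in, then write
    both back -> 2 reads, 2 writes, regardless of array size."""
    return {"reads": 2, "writes": 2}

def full_stripe_write_ios(n_data_disks):
    """Aligned chunk_size*N write: parity is computed from the new data
    alone -> 0 reads, N data writes plus 1 parity write."""
    return {"reads": 0, "writes": n_data_disks + 1}

# A lost chunk is recoverable by XORing parity with the surviving chunks:
a, b, c = b"\x01\x02", b"\x10\x20", b"\xff\x00"
p = parity([a, b, c])
assert parity([p, b, c]) == a   # reconstructs the missing chunk
```

The key asymmetry is visible in the two cost functions: the small write must read before it can write, which is exactly the extra rotation-plus-seek penalty Michael describes for direct and synchronous I/O.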
Re: [PATCH 1/2] md bitmap bug fixes
On Monday March 14, [EMAIL PROTECTED] wrote:
> On 2005-03-14T21:22:57, Neil Brown [EMAIL PROTECTED] wrote:
> > > Hi there, just a question about how the bitmap stuff works with
> > > 1++-redundancy, say RAID1 with 2 mirrors, or RAID6.
> > I assume you mean RAID1 with 3 drives (there isn't really one main
> > drive and all the others are mirrors - all drives are nearly equal).
> Yeah, that's what I meant.
> > > (BTW, if they are all equal, how do you figure out where to sync
> > > from?
> > It arbitrarily chooses one. It doesn't matter which. The code
> > currently happens to choose the first, but this is not a significant
> > choice.
> Isn't the first one also the first one to receive the writes, so unless
> it's somehow identified as bad, it's the one which will have the best
> data?)

Data is written to all drives in parallel (the request to the first might be launched slightly before the second, but the difference is insignificant compared to the time it takes for the write to complete). There is no such thing as "best" data. Consider the situation where you want to make a transactional update to a file that requires writing two blocks. If the system dies while writing the first, the "before" data is better. If it dies while writing the second, the "after" data is better.

> > We haven't put any significant work into bitmap intent logging for
> > levels other than raid1, so some of the answer may be pure theory.
> OK. (Though in particular for raid5 with the expensive parity and raid6
> with the even more expensive parity this seems desirable.)

Yes. We will get there. We just aren't there yet, so I cannot say with confidence how it will work.

> I think each disk needs to have its own bitmap in the long run. On
> start, we need to merge them.

I think any scheme that involved multiple bitmaps would be introducing too much complexity. Certainly your examples sound very far-fetched (as I think you admitted yourself). But I always try to be open to new ideas.

NeilBrown
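Lars's "merge on start" idea can be illustrated with a toy sketch (purely hypothetical - this is not md code, and Neil argues against the multi-bitmap design above). Each disk would keep a bitmap of possibly-out-of-sync regions, and merging is a bitwise OR, since a region must be resynced if *any* disk flags it:

```python
def merge_bitmaps(bitmaps):
    """Union of per-disk intent bitmaps: a region is dirty overall
    if any disk's bitmap marks it dirty."""
    merged = 0
    for bm in bitmaps:
        merged |= bm
    return merged

# Regions 0 and 2 dirty on disk A, region 5 dirty on disk B:
dirty = merge_bitmaps([0b000101, 0b100000])
assert dirty == 0b100101  # all three regions need resync
```

The OR itself is trivial; the complexity Neil objects to lies in keeping the per-disk bitmaps consistent while disks come and go, which a sketch like this conveniently ignores.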
RE: [PERFORM] Postgres on RAID5
You said:
> If your write size is smaller than chunk_size*N (N = number of data
> blocks in a stripe), in order to calculate correct parity you have to
> read data from the remaining drives.

Neil explained it in this message:
http://marc.theaimsgroup.com/?l=linux-raid&m=108682190730593&w=2

Guy

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Michael Tokarev
Sent: Monday, March 14, 2005 5:47 PM
To: Arshavir Grigorian
Cc: linux-raid@vger.kernel.org; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Postgres on RAID5

[quoted message trimmed - see Michael Tokarev's post above]
Re: [PATCH 1/2] md bitmap bug fixes
Neil Brown wrote:
> On Wednesday March 9, [EMAIL PROTECTED] wrote:
> > avoid setting of sb->events_lo = 1 when creating a 0.90 superblock --
> > it doesn't seem to be necessary and it was causing the event counters
> > to start at 4 billion+ (events_lo is actually the high part of the
> > events counter, on little endian machines anyway)
> events_lo really should be the low part of the counter and it is for
> me... something funny must be happening for you...

Yikes... compiling mdadm against the kernel headers. I was trying to simplify things and avoid the inevitable breakage that occurs when kernel and mdadm headers get out of sync, but alas, it's causing problems because of differences between kernel and userland header definitions... my mdadm was wrongly assuming big endian for the events counters.

> > if'ed out super1 definition which is now in the kernel headers
> I don't like this. I don't want mdadm to include the kernel raid
> headers. I want it to use its own.

Yes, I agree, see above... :/

> > included sys/time.h to avoid compile error
> I wonder why I don't get an error.. What error do you get?

The machine I happen to be compiling on has an old gcc/libc (2.91) and it's not getting the definition for one of the time structures (I forget which...).

Thanks,
Paul
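The "4 billion+" symptom Paul describes is exactly what swapping the two 32-bit halves of a 64-bit counter produces. A toy illustration (a simplified model for this thread, not mdadm's actual superblock code):

```python
import struct

def write_counter(events):
    """Store a 64-bit event counter as two little-endian 32-bit words,
    low word first - the layout the writer intends."""
    lo = events & 0xFFFFFFFF
    hi = events >> 32
    return struct.pack("<II", lo, hi)

def read_counter_wrong(raw):
    """A reader that wrongly assumes the high word comes first (as a
    big-endian-minded reader of the same struct would)."""
    hi, lo = struct.unpack("<II", raw)
    return (hi << 32) | lo

raw = write_counter(1)
print(read_counter_wrong(raw))  # 4294967296 - the "4 billion+" counter
```

A counter of 1 lands in the low word; the confused reader treats that word as the high half, turning 1 into 2^32, which is precisely an event count "starting at 4 billion+".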
Re: [PERFORM] Postgres on RAID5
All,

I have a 13 disk (250G each) software raid 5 set using one 16-port Adaptec SATA controller, and I am very happy with the performance. The reason I went with the 13 disk raid 5 set was for the space, NOT performance. I have a single postgresql database that is over 2 TB, with about 500 GB free on the disk. This raid set performs about the same as my ICP SCSI raid controller (also with raid 5). That said, now that postgresql 8 has tablespaces, I would NOT create one single raid 5 set, but 3 smaller sets. I also DO NOT have my wal and logs on this raid set, but on a smaller 2 disk mirror.

Jim

------- Original Message -------
From: Greg Stark [EMAIL PROTECTED]
To: Alex Turner [EMAIL PROTECTED]
Cc: Greg Stark [EMAIL PROTECTED], Arshavir Grigorian [EMAIL PROTECTED], linux-raid@vger.kernel.org, pgsql-performance@postgresql.org
Sent: 14 Mar 2005 15:17:11 -0500
Subject: Re: [PERFORM] Postgres on RAID5

[quoted message trimmed - see Greg Stark's post above]
--- End of Original Message ---