Re: Hard drive lifetime: wear from spinning up or rebooting vs running
On Sun, 5 Feb 2006, David Liontooth wrote: In designing an archival system, we're trying to find data on when it pays to power or spin the drives down versus keeping them running. Is there a difference between spinning up the drives from sleep and from a reboot? Leaving out the cost imposed on the (separate) operating system drive. Hitachi claims 5 years for their Ultrastar 10K300 drives (at an HDA surface temperature of 45°C or less), and states that the life of the drive does not change if the drive is used intermittently. I suspect that the best estimates you're going to get are from the manufacturers, if you can find the right documents (OEM specifications, not marketing blurbs). For their Deskstar (SATA/PATA) drives I didn't find lifetime estimates beyond 5 start-stop cycles. /Mattias Wadenstein
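For readers wanting to experiment with the power-down side of this question, a minimal sketch using hdparm (assuming a data drive at /dev/sdb, a placeholder name; the timeout encoding and actual behaviour vary by drive and firmware, so treat this as illustrative only):

    # spin the drive down (standby) immediately
    hdparm -y /dev/sdb
    # set an idle spin-down timeout; 241 encodes 30 minutes in hdparm's -S scheme
    hdparm -S 241 /dev/sdb
    # query the power state without waking the drive
    hdparm -C /dev/sdb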
Question: read-only array
Hi, I've just noticed that setting an array readonly doesn't really make it readonly. I have a RAID1 array and LVM on top of it. When I run /sbin/mdadm --misc --readonly /dev/md0, /proc/mdstat shows: Personalities : [raid1] md0 : active (read-only) raid1 sda[0] sdb[1] 160436096 blocks [2/2] [UU] However, it doesn't prevent me from activating volume groups, mounting filesystems and writing files to them. Is this a bug, a feature, or my misunderstanding of what the readonly flag means? I'm using RedHat AS 4 (U1) on a dual-core Opteron machine, kernel 2.6.9-11.ELsmp as delivered with RH, mdadm - v1.12.0 - 14 June 2005. Thanks for your time. Regards, Chris
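A minimal sketch (not from the thread) of how one might narrow down where writes are or are not being blocked in such a stack; device names match the report above, and the dd line is destructive, so only try it on a scratch array:

    # what md itself reports
    mdadm --detail /dev/md0 | grep -i state
    cat /proc/mdstat
    # what the block layer reports (1 = read-only)
    blockdev --getro /dev/md0
    # attempt a direct write to the md device and see whether it is refused
    dd if=/dev/zero of=/dev/md0 bs=4k count=1
    sync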
Re: Hard drive lifetime: wear from spinning up or rebooting vs running
Mattias Wadenstein wrote: On Sun, 5 Feb 2006, David Liontooth wrote: In designing an archival system, we're trying to find data on when it pays to power or spin the drives down versus keeping them running. Hitachi claims 5 years for their Ultrastar 10K300 drives (at an HDA surface temperature of 45°C or less), and states that the life of the drive does not change if the drive is used intermittently. I suspect that the best estimates you're going to get are from the manufacturers, if you can find the right documents (OEM specifications, not marketing blurbs). "Intermittently" may assume the drive is powered on and in regular use, and may simply be a claim that spindle drive components are designed to fail simultaneously with disk platter and head motor components. Konstantin's observation that disks die about evenly from 3 causes - no spinning (dead spindle motor power electronics), heads do not move (dead head motor power electronics), or spontaneously developing bad sectors (disk platter contamination?) - is consistent with a rational goal of manufacturing components with similar lifetimes under normal use. For their Deskstar (SATA/PATA) drives I didn't find lifetime estimates beyond 5 start-stop cycles. If components are in fact manufactured to fail simultaneously under normal use (including a dozen or two start-stop cycles a day), then taking the drive off-line for more than a few hours should unproblematically extend its life. Appreciate all the good advice and references. While we have to rely on specifications rather than actual long-term tests, this should still move us in the right direction. One of the problems with creating a digital archive is that the technology has no archival history. We know acid-free paper lasts millennia; how long do modern hard drives last in cold storage? To some people's horror, we now know home-made CDs last a couple of years. Dave
Re: Hard drive lifetime: wear from spinning up or rebooting vs running
2006/2/6, David Liontooth [EMAIL PROTECTED]: Mattias Wadenstein wrote: On Sun, 5 Feb 2006, David Liontooth wrote: For their Deskstar (SATA/PATA) drives I didn't find lifetime estimates beyond 5 start-stop cycles. If components are in fact manufactured to fail simultaneously under normal use (including a dozen or two start-stop cycles a day), then taking the drive off-line for more than a few hours should unproblematically extend its life. IMHO, a single start-stop cycle is more costly in terms of lifetime than a couple of hours of spinning. As far as I know, on current disks (especially 7200 and 10k rpm ones), spin-up is a really critical and life-consuming action; the spindle motor is stressed far more than it is once the spin speed is stable. In our current storage design, disks are never stopped (sorry, Earth...), because it isn't worth spinning down for less than a couple of days. However, temperature has a real impact on the heads (including the head motors), because of thermal expansion of materials when they overheat, so cooling your drives is a major issue. how long do modern hard drives last in cold storage? Demagnetization? A few years back there were tools to read and then rewrite floppy contents to refresh their magnetization. I guess the same applies to a drive: periodically re-read and re-write each and every sector to keep the surface well magnetized. I would not give a drive more than 100 years before it loses all its content to demagnetization... and anyway, in 100 years no computer will still have a controller to plug a SATA or SCSI drive into :-p. I guess a long-lived system should not sit cold; it should re-activate and check its content periodically... we now know home-made CDs last a couple of years. I thought they were said to last at least a century... but with the enormous cost reductions in this area, it's no surprise the lifetime has decreased so much.
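A sketch of the "periodically exercise the surface" idea mentioned above, assuming a drive at /dev/sdX (placeholder name); the first two commands only read, which at least forces the drive to notice and report sectors that have become unreadable:

    # read every sector and discard the data (non-destructive)
    dd if=/dev/sdX of=/dev/null bs=1M
    # the same sweep with progress and error reporting
    badblocks -sv /dev/sdX
    # or let the drive sweep itself with a SMART long self-test
    smartctl -t long /dev/sdX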
Re: [RFC][PATCH 000 of 3] MD Acceleration and the ADMA interface: Introduction
On 2/5/06, Neil Brown [EMAIL PROTECTED] wrote: I've looked through the patches - not exhaustively, but hopefully enough to get a general idea of what is happening. There are some things I'm not clear on and some things that I could suggest alternatives to... I have a few questions to check that I understand your suggestions. - Each ADMA client (e.g. a raid5 array) gets a dedicated adma thread to handle all its requests. And it handles them all in series. I wonder if this is really optimal. If there are multiple adma engines, then a single client could only make use of one of them reliably. It would seem to make more sense to have just one thread - or maybe one per processor or one per adma engine - and have any ordering between requests made explicit in the interface. Actually, as each processor could be seen as an ADMA engine, maybe you want one thread per processor AND one per engine. If there are no engines, the per-processor threads run with high priority, else with low. ...so the engine thread would handle explicit client-requested ordering constraints and then hand the operations off to per-processor worker threads in the pio case, or queue directly to hardware in the presence of such an engine. In md_thread you talk about priority inversion deadlocks; do those same concerns apply here? - I have thought that the way md/raid5 currently does the 'copy-to-buffer' and 'xor' in two separate operations may not be the best use of the memory bus. If you could have a 3-address operation that read from A, stored into B, and xorred into C, then A would have to be read half as often. Would such an interface make sense with ADMA? I don't have sufficient knowledge of assembly to do it myself for the current 'xor' code. At the very least I can add a copy+xor command to ADMA; that way developers implementing engines can optimize for this case if the hardware supports it, and the hand-coded assembly guys can do their thing. - Your handling of highmem doesn't seem right. You shouldn't kmap it until you have decided that you have to do the operation 'by hand' (i.e. in the cpu, not in the DMA engine). If the dma engine can be used at all, kmap isn't needed at all. I made the assumption that if CONFIG_HIGHMEM is not set then the kmap call resolves to a simple page_address() call. I think it's ok, but it does look fishy so I will revise this code. I was also looking to handle the case where the underlying hardware DMA engine does not support high memory addresses. - The interfacing between raid5 and adma seems clumsy... Maybe this is just because you were trying to minimise changes to raid5.c. I think it would be better to make substantial but elegant changes to raid5.c - handle_stripe in particular - so that what is happening becomes very obvious. Yes, I went into this with the idea of being minimally intrusive, but you are right: the end result should have MD optimized for ADMA rather than ADMA shoe-horned into MD. For example, once it has been decided to initiate a write (there is enough data to correctly update the parity block), you need to perform a sequence of copies and xor operations, and then submit write requests. This is currently done by the copy/xor happening inline under the sh->lock spinlock, and then R5_WantWrite is set. Then, outside the spinlock, if WantWrite is set, generic_make_request is called as appropriate. I would change this so that a sequence of descriptors was assembled which described the copies and xors.
Appropriate call-backs would be set so that generic_make_request is called at the right time (after the copy, or after the last xor for the parity block). Then, outside the sh->lock spinlock, this sequence is passed to the ADMA manager. If there is no ADMA engine present, everything is performed inline - multiple xors are possibly combined into multi-way xors automatically. If there is an ADMA engine, it is scheduled to do the work. I like this idea of clearly separated stripe assembly (finding work while under the lock) and stripe execute (running copy+xor / touching disks) stages. Can you elaborate on a scenario where xors are combined into multi-way xors? The relevant blocks are all 'locked' as they are added to the sequence, and unlocked as the writes complete or, for unchanged blocks in RECONSTRUCT_WRITE, when the copy/xor that uses them completes. resync operations would construct similar descriptor sequences, and have a different call-back on completion. Doing this would require making sure that get_desc always succeeds. I notice that you currently allow for the possible failure of adma_get_desc and fall back to 'pio' in that case (I think). I think it would be better to use a mempool (or similar) to ensure that you never fail. There
Re: Hard drive lifetime: wear from spinning up or rebooting vs running
On Sun, 2006-02-05 at 15:42 -0800, David Liontooth wrote: In designing an archival system, we're trying to find data on when it pays to power or spin the drives down versus keeping them running. Is there a difference between spinning up the drives from sleep and from a reboot? Leaving out the cost imposed on the (separate) operating system drive. Temperature obviously matters -- a linear approximation might look like this: Lifetime (months) = 60 - 12 * (t - 40)/2.5 for t above 40, where 60 is the average maximum lifetime, achieved at 40 degrees C and below, and lifetime decreases by a year for every 2.5 degree rise in temperature. Does anyone have an actual formula? To keep it simple, let's assume we keep temperature at or below what is required to reach average maximum lifetime. What is the cost of spinning up the drives in the currency of lifetime months? My guess would be that the cost is tiny -- on the order of minutes. Or are different components stressed in a running drive versus one that is spinning up, so it's not possible to translate the cost of one into the currency of the other? Finally, is there passive decay of drive components in storage? Dave I read somewhere, still looking for the link, that the constant on/off of a drive actually decreases the drive's lifespan due to the heating/cooling of the bearings. It was actually determined to be best to leave the drive spinning. Brad Dameron SeaTab Software www.seatab.com
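As a worked example of the linear approximation above (the numbers are purely illustrative, not a vendor formula), evaluated in shell:

    # estimated lifetime in months at t degrees C, flat below 40 C
    awk -v t=45 'BEGIN { d = (t > 40) ? t - 40 : 0; print 60 - 12 * d / 2.5 }'
    # t=45 prints 36 (months); t<=40 prints 60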
Re: Hard drive lifetime: wear from spinning up or rebooting vs running
Drives are probably going to have a lifetime that is proportionate to a variety of things, and while I'm not a physicist or mechanical engineer, nor in the hard disk business, the things that come to mind first are: 1) Thermal stress due to temperature changes - with more rapid changes being more severe (expansion and contraction, I assume - viz. one of those projectors or cars that run hot, and leave a fan running for a while before fully powering off) 2) The amount of time a disk spends in a powered-off state (e.g., lubricants may congeal, and about every time my employer, UCI, has a campus-wide power outage, -some- piece of equipment somewhere on campus fails to come back up - probably due to thermal stress) 3) The number of times a disk goes to a powered-off state (thermal stress again) 4) The amount of bumping around the disk undergoes, which may to an extent be greater in disks that are surrounded by other disks, with disks on the physical periphery of your RAID solution bumping around a little less - those little rubber things that you screw the drive into may help here. 5) The materials used in the platters, heads, servo, etc. 6) The number of alternate blocks for remapping bad blocks 7) The degree of tendency for a head crash to peel off a bunch of material, or to just make a tiny scratch, and the degree of tendency for scratched-off particles to bang into platters or heads later and scrape off more particles - which can sometimes yield an exponential decay of drive usability 8) How good the clean room(s) the drive was built in was/were 9) How good a drive is at parking the heads over unimportant parts of the platters when bumped, dropped, in an earthquake, when turned off, etc. If you want to be thorough with this, you probably want to employ some materials scientists, some statisticians, get a bunch of different kinds of drives and characterize their designs somehow, do multiple longitudinal studies, hunt for correlations between drive attributes and lifetimes, etc. And I totally agree with a previous poster - this stuff may all change quite a bit by the time the study is done, so it'd be a really good idea to look for ways of increasing your characterizations' longevity somehow, possibly by delving down into individual parts of the drives and looking at their lifetime. But don't rule out holistic/chaotic effects unnecessarily, even if the light's better over here when looking at the reductionistic view of drives. PS: Letting a drive stay powered without spinning is sometimes called a warm spare, while a drive that's spinning all the time even while not in active use in a RAID array is usually called a hot spare. HTH :)
Raid 1 always degrades after a reboot.
Hi all. After every reboot, my brand new RAID1 array comes up degraded. It's always /dev/sdb1 that is unavailable or removed. The hardware is as follows: 2x 200GB Seagate SATA drives in RAID 1 (these are for data only; the OS is on a separate IDE disk), LVM partitions for my data on the RAID, Promise SATA300 TX2Plus SATA card (using their kernel module, ULSATA2), Asus P3B motherboard and 400MHz P2 (getting replaced in the near future). Software: Mandriva 2006 download edition, upgraded from Mandrake 9.1, kernel 2.6.12-15mdk (not rebuilt), mdadm version 00.90.01. Logs, etc.: dmesg Linux version 2.6.12-15mdk ([EMAIL PROTECTED]) (gcc version 4.0.1 (4.0.1-5mdk for Mandriva Linux release 2006.0)) #1 Mon Jan 9 17:08:48 MST 2006 BIOS-provided physical RAM map: BIOS-e820: - 0009d400 (usable) BIOS-e820: 0009d400 - 000a (reserved) BIOS-e820: 000f - 0010 (reserved) BIOS-e820: 0010 - 27ffc000 (usable) BIOS-e820: 27ffc000 - 27fff000 (ACPI data) BIOS-e820: 27fff000 - 2800 (ACPI NVS) BIOS-e820: - 0001 (reserved) 0MB HIGHMEM available. 639MB LOWMEM available. On node 0 totalpages: 163836 DMA zone: 4096 pages, LIFO batch:1 Normal zone: 159740 pages, LIFO batch:31 HighMem zone: 0 pages, LIFO batch:1 DMI 2.3 present. ACPI: RSDP (v000 ASUS ) @ 0x000f58a0 ACPI: RSDT (v001 ASUS P3B_F0x30303031 MSFT 0x31313031) @ 0x27ffc000 ACPI: FADT (v001 ASUS P3B_F0x30303031 MSFT 0x31313031) @ 0x27ffc080 ACPI: BOOT (v001 ASUS P3B_F0x30303031 MSFT 0x31313031) @ 0x27ffc040 ACPI: DSDT (v001 ASUS P3B_F0x1000 MSFT 0x010b) @ 0x ACPI: PM-Timer IO Port: 0xe408 Allocating PCI resources starting at 2800 (gap: 2800:d7ff) Built 1 zonelists Local APIC disabled by BIOS -- you can enable it with lapic mapped APIC to d000 (01503000) Initializing CPU#0 Kernel command line: auto BOOT_IMAGE=linux root=306 quiet acpi=ht resume=/dev/hda5 splash=silent bootsplash: silent mode. PID hash table entries: 4096 (order: 12, 65536 bytes) Detected 400.955 MHz processor. Using pmtmr for high-res timesource Console: colour dummy device 80x25 Dentry cache hash table entries: 131072 (order: 7, 524288 bytes) Inode-cache hash table entries: 65536 (order: 6, 262144 bytes) Memory: 644744k/655344k available (2348k kernel code, 10028k reserved, 717k data, 268k init, 0k highmem, 0k BadRAM) Checking if this processor honours the WP bit even in supervisor mode... Ok. Calibrating delay loop... 794.62 BogoMIPS (lpj=397312) Mount-cache hash table entries: 512 CPU: After generic identify, caps: 0183f9ff CPU: After vendor identify, caps: 0183f9ff CPU: L1 I cache: 16K, L1 D cache: 16K CPU: L2 cache: 512K CPU: After all inits, caps: 0183f9ff 0040 Intel machine check architecture supported. Intel machine check reporting enabled on CPU#0. CPU: Intel Pentium II (Deschutes) stepping 02 Enabling fast FPU save and restore... done. Checking 'hlt' instruction... OK. checking if image is initramfs...it isn't (bad gzip magic numbers); looks like an initrd Freeing initrd memory: 299k freed NET: Registered protocol family 16 PCI: PCI BIOS revision 2.10 entry at 0xf08b0, last bus=1 PCI: Using configuration type 1 mtrr: v2.0 (20020519) ACPI: Subsystem revision 20050309 ACPI: Interpreter disabled.
Linux Plug and Play Support v0.97 (c) Adam Belay pnp: PnP ACPI: disabled PnPBIOS: Disabled PCI: Probing PCI hardware PCI: Probing PCI hardware (bus 00) Boot video device is :01:00.0 PCI: Using IRQ router PIIX/ICH [8086/7110] at :00:04.0 Simple Boot Flag at 0x3a set to 0x1 apm: BIOS version 1.2 Flags 0x03 (Driver version 1.16ac) audit: initializing netlink socket (disabled) audit(1139207174.020:0): initialized VFS: Disk quotas dquot_6.5.1 Dquot-cache hash table entries: 1024 (order 0, 4096 bytes) devfs: 2004-01-31 Richard Gooch ([EMAIL PROTECTED]) devfs: boot_options: 0x0 Initializing Cryptographic API Limiting direct PCI/PCI transfers. vesafb: framebuffer at 0xe300, mapped to 0xe888, using 3750k, total 4096k vesafb: mode is 800x600x16, linelength=1600, pages=3 vesafb: protected mode interface info at c000:474c vesafb: scrolling: redraw vesafb: Truecolor: size=0:5:6:5, shift=0:11:5:0 bootsplash 3.1.6-2004/03/31: looking for picture...6 silentjpeg size 34430 bytes,6...found (800x600, 34382 bytes, v3). Console: switching to colour frame buffer device 93x30 fb0: VESA VGA frame buffer device isapnp: Scanning for PnP cards... isapnp: No Plug Play device found Real Time Clock Driver v1.12 PNP: No PS/2 controller found. Probing ports directly. serio: i8042 AUX port at 0x60,0x64 irq 12 serio: i8042 KBD port at 0x60,0x64 irq 1 Serial: 8250/16550 driver $Revision: 1.90 $ 8 ports,
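Not part of the original post (which is cut off above), but a common checklist for a member that goes missing on every boot, assuming the names given earlier (sdb1 may appear under a different name once the Promise driver loads, so confirm it in /proc/partitions first):

    # is the partition visible at all after boot?
    cat /proc/partitions
    # does it still carry an md superblock, and is its type 0xfd (Linux raid autodetect)?
    mdadm --examine /dev/sdb1
    fdisk -l /dev/sdb
    # re-add the dropped member and watch the resync
    mdadm /dev/md0 --add /dev/sdb1
    watch cat /proc/mdstat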
Re: [klibc] Re: Exporting which partitions to md-configure
Neil Brown wrote: What constitutes 'a piece of data'? A bit? A byte? I would say that msdos:fd is one piece of data. The 'fd' is useless without the 'msdos'. The 'msdos' is, I guess, not completely useless without the 'fd'. I would lean towards the composite, but I wouldn't fight a separation. Well, the two pieces come from different sources. Just as there is a direct, unambiguous causal path from something present at early boot to the root filesystem that is mounted (and the root filesystem specifies all other filesystems through fstab), similarly there should be an unambiguous causal path from something present at early boot to the array which holds the root filesystem - and the root filesystem should describe all other arrays via mdadm.conf. Does that make sense? It makes sense, but I disagree. I believe you are correct in that the current preferred-minor bit causes an invalid assumption that, e.g., /dev/md3 is always a certain thing, but since each array has a UUID, and one should be able to mount by either filesystem UUID or array UUID, there should be no need for such a conflict if one allows for dynamic md numbers. Requiring that mdadm.conf describe the actual state of all volumes would be an enormous step in the wrong direction. Right now, the Linux md system can handle some very oddball hardware changes (such as on hera.kernel.org, when the disks not only completely changed names due to a controller change, but changed from hd* to sd*!). Dynamicity is a good thing, although it needs to be harnessed. A kernel parameter md_root_uuid=xxyy:zzyy:aabb:ccdd... could be interpreted by an initramfs script to run mdadm to find and assemble the array with that uuid. The uuid of each array is reasonably unique. This, in fact, is *EXACTLY* what we're talking about; it does require autoassemble. Why do we care about the partition types at all? The reason is that since the md superblock is at the end, it doesn't get automatically wiped if the partition is used as a raw filesystem, and so it's important that there is a qualifier for it. -hpa
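For illustration, assembly by UUID already works from user space with mdadm along these lines (a sketch only; the UUID below is made up, the member name is a placeholder, and the md_root_uuid= kernel parameter itself is just a proposal in this thread):

    # read the array UUID from any member's superblock
    mdadm --examine /dev/sda1 | grep -i uuid
    # assemble whichever devices carry that UUID, scanning everything in /proc/partitions
    mdadm --assemble /dev/md0 --config=partitions --uuid=c9a22649:0d2400e8:64299f39:739aed32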
Re: Raid5 Debian Yaird Woes
On Sun, 5 Feb 2006, Lewis Shobbrook wrote: On Saturday 04 February 2006 11:22 am, you wrote: On Sat, 4 Feb 2006, Lewis Shobbrook wrote: Is there any way to avoid this requirement for input, so that the system skips the missing drive as the raid/initrd system did previously? what boot errors are you getting before it drops you to the root password prompt? Basically it just states waiting X seconds for /dev/sdx3 (corresponding to the missing raid5 member). Where X cycles from 2,4,8,16 and then drops you into a recovery console, no root pwd prompt. It will only occur if the partition is completely missing, such as a replacement disk with a blank partition table, or a completely missing/failed drive. is it trying to fsck some filesystem it doesn't have access to? No fsck seen for bad extX partitions etc. try something like this... cd /tmp mkdir t cd t zcat /boot/initrd.img-`uname -r` | cpio -i grep -r sd.3 . that should show us what script is directly accessing /dev/sdx3 ... maybe there's something more we can do about it. i did find a possible deficiency with the patch i posted... looking more closely at my yaird /init i see this: mkbdev '/dev/sdb' 'sdb' mkbdev '/dev/sdb4' 'sdb/sdb4' mkbdev '/dev/sda' 'sda' mkbdev '/dev/sda4' 'sda/sda4' and i think that means that mdadm -Ac partitions will fail if one of my root disks ends up somewhere other than sda or sdb... because the device nodes won't exist. i suspect i should update the patch to use mdrun instead of mdadm -Ac partitions... because mdrun will create temporary device nodes for everything in /proc/partitions in order to find all the possible raid pieces. -dean
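A sketch of the configuration-file approach that avoids depending on fixed device names like sda/sdb (conventional mdadm.conf syntax with a made-up UUID, not something taken from the thread):

    # /etc/mdadm.conf - consider every partition, identify the array by UUID
    DEVICE partitions
    ARRAY /dev/md0 UUID=c9a22649:0d2400e8:64299f39:739aed32
    # a line in this form can be generated from a running array with:
    #   mdadm --detail --scan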
Problems making a RAID-0 over 2 gpt partitions...
I recently acquired a 7TB Xserve RAID. It is configured in hardware as 2 RAID 5 arrays of 3TB each. Now I'm trying to configure a RAID 0 over these 2 drives (so RAID 50 in total). I only wanted to make 1 large partition on each array, so I used parted as follows: parted /dev/sd[bc] (parted) mklabel gpt (parted) mkpart primary 0 3000600 (parted) set 1 raid on (parted) q for each of the disks. Then I went to make the RAID array: mdadm -C -l 0 --raid-devices=2 /dev/md0 /dev/sdb1 /dev/sdc1 Everything seems ok at this point; /proc/mdstat lists the array as active. I then wanted to put LVM on top of this for future expansion: pvcreate /dev/md0 vgcreate imagery /dev/md0 lvcreate -l xxx -n image1 imagery (xxx is the number of PEs for the whole disk, couldn't remember the number off the top of my head) then a filesystem: mkfs.xfs /dev/imagery/image1 Everything works fine up to this point until I reboot. After reboot, the md array does not reassemble itself, and manually doing it results in: mdadm -A /dev/md0 /dev/sdb1 /dev/sdc1 /dev/sdb1: no RAID superblock Kernel is 2.6.14, mdadm is 1.12.0. Did I miss a partitioning step here (or do something else sufficiently stupid)? Thanks in advance, and please CC me for I am not subscribed. -- James
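A few hedged diagnostics for the "no RAID superblock" symptom, assuming the same device names as above (on large arrays behind some controllers these can change across reboots, so confirm them first):

    # does the kernel still see the GPT partitions after reboot?
    cat /proc/partitions
    parted /dev/sdb print
    # is there an md superblock on each member?
    mdadm --examine /dev/sdb1
    mdadm --examine /dev/sdc1
    # if the array assembles by hand, record it so it is found at boot
    mdadm --detail --scan >> /etc/mdadm.conf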
Re: [RFC][PATCH 000 of 3] MD Acceleration and the ADMA interface: Introduction
On Mon, Feb 06, 2006 at 12:25:22PM -0700, Dan Williams ([EMAIL PROTECTED]) wrote: On 2/5/06, Neil Brown [EMAIL PROTECTED] wrote: I've looked through the patches - not exhaustively, but hopefully enough to get a general idea of what is happening. There are some things I'm not clear on and some things that I could suggest alternatives to... I have a few questions to check that I understand your suggestions. - Each ADMA client (e.g. a raid5 array) gets a dedicated adma thread to handle all its requests. And it handles them all in series. I wonder if this is really optimal. If there are multiple adma engines, then a single client could only make use of one of them reliably. It would seem to make more sense to have just one thread - or maybe one per processor or one per adma engine - and have any ordering between requests made explicit in the interface. Actually, as each processor could be seen as an ADMA engine, maybe you want one thread per processor AND one per engine. If there are no engines, the per-processor threads run with high priority, else with low. ...so the engine thread would handle explicit client-requested ordering constraints and then hand the operations off to per-processor worker threads in the pio case, or queue directly to hardware in the presence of such an engine. In md_thread you talk about priority inversion deadlocks; do those same concerns apply here? Just for reference: the more threads you have, the less stable your system is. Ping-ponging work between several completely independent entities is always a bad idea. Even completing a request postponed to a workqueue from the current execution unit introduces noticeable latencies. A system should be able to process as much of its work as possible in one flow. Dan -- Evgeniy Polyakov