Re: Process Scheduling Issue using sg/libata
> SFF ATA controllers are peculiar in that...
>
> 1. they don't have a reliable IRQ pending bit.
> 2. they don't have a reliable IRQ mask bit.
> 3. some controllers tank the machine completely if the status or data
>    register is accessed differently than the chip likes.

And 4, which is a killer for a lot of RT users: an I/O cycle to a
taskfile-style controller generally goes at ISA-type speed down the wire
to the drive and back again. The CPU is stalled for this and there is
nothing we can do about it.

> So, it's not like we're all dickheads. We know it's good to take those
> out of the irq handler. The hardware just isn't very forgiving, and I
> bet you'll get obscure machine lockups if the RT kernel arbitrarily
> pushes ATA PIO data transfers into kernel threads.
>
> I think doing what IDE has been doing (disabling the IRQ at the
> interrupt controller) is the way to go.

Agreed - at which point, RT or otherwise, you can push it out. If you
need to do serious (sub-1ms) ATA, then also go get a non-SFF controller.

Alan
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Process Scheduling Issue using sg/libata
James Chapman wrote:
> Mark Lord wrote:
>> One way to deal with it in an embedded device is to force the
>> application that's generating the I/O to self-throttle.
>> Or modify the device driver to self-throttle.
>
> Does disk access have to be so interrupt driven? Could disk interrupt
> handling be done in a softirq/kthread, like the networking guys deal
> with network device interrupts? This would prevent the system from
> live-locking when it is being bombarded with disk IO events. It doesn't
> seem right that the disk IO subsystem can cause interrupt live-lock on
> relatively slow CPUs...
>
>> You may want to find an embedded Linux consultant to help out
>> with this situation if it's beyond your expertise.
>
> Check out the rtlinux patch, which pushes all interrupt handling out to
> per-cpu kernel threads (irqd). The kernel scheduler then regains
> control of what runs when.
>
> Another option is to change your ATA driver to do interrupt processing
> at task level using a workqueue or similar.

SFF ATA controllers are peculiar in that...

1. they don't have a reliable IRQ pending bit.
2. they don't have a reliable IRQ mask bit.
3. some controllers tank the machine completely if the status or data
   register is accessed differently than the chip likes.

So, it's not like we're all dickheads. We know it's good to take those
out of the irq handler. The hardware just isn't very forgiving, and I
bet you'll get obscure machine lockups if the RT kernel arbitrarily
pushes ATA PIO data transfers into kernel threads.

I think doing what IDE has been doing (disabling the IRQ at the
interrupt controller) is the way to go.

--
tejun
Re: Process Scheduling Issue using sg/libata
Mark Lord wrote:
> Fajun Chen wrote:
>> As a matter of fact, I'm using /dev/sg*. Due to the size of my test
>> application, I have not been able to compress it into a small and
>> publishable form. However, this issue can be easily reproduced on my
>> ARM XScale target using sg3_utils code as follows:
>> 1. Run printtime.c attached, which prints a message to the console in
>>    a loop.
>> 2. Run sgm_dd (part of the sg3_utils package, source code attached) on
>>    the same system as follows:
>>>     sgm_dd if=/dev/sg0 of=/dev/null count=10M bpt=1
>> The print task can be delayed for as many as 25 seconds. Surprisingly,
>> I can't reproduce the problem on an i386 test system with a more
>> powerful processor.
>>
>> Some clarification on the MAP_ANONYMOUS option in mmap(). After fixing
>> a bug and more testing, this option seems to make no difference to cpu
>> load. Sorry about the previous report. Back to the drawing board now :-)
> ..
>
> Okay, I don't see anything unusual here. The code is on a slow CPU,
> and is triggering 10 MBytes of PIO over a (probably) slow bus to an ATA
> device.
>
> This *will* tie up the CPU at 100% for the duration of the I/O,
> because the I/O happens in interrupt handlers, which are outside
> of the realm of the CPU scheduler.
>
> This is a known shortcoming of Linux for real-time uses.
>
> When the I/O uses DMA transfers, it *may* still have a similar effect,
> depending upon the caching in the ATA device, and on how the DMA shares
> the memory bus with the CPU.
>
> Again, no surprise here.
>
> One way to deal with it in an embedded device is to force the
> application that's generating the I/O to self-throttle.
> Or modify the device driver to self-throttle.

Does disk access have to be so interrupt driven? Could disk interrupt
handling be done in a softirq/kthread, like the networking guys deal
with network device interrupts? This would prevent the system from
live-locking when it is being bombarded with disk IO events. It doesn't
seem right that the disk IO subsystem can cause interrupt live-lock on
relatively slow CPUs...

> You may want to find an embedded Linux consultant to help out
> with this situation if it's beyond your expertise.

Check out the rtlinux patch, which pushes all interrupt handling out to
per-cpu kernel threads (irqd). The kernel scheduler then regains control
of what runs when.

Another option is to change your ATA driver to do interrupt processing
at task level using a workqueue or similar.

--
James Chapman
Katalix Systems Ltd
http://www.katalix.com
Catalysts for your Embedded Linux software development
Re: Process Scheduling Issue using sg/libata
Fajun Chen wrote:
> As a matter of fact, I'm using /dev/sg*. Due to the size of my test
> application, I have not been able to compress it into a small and
> publishable form. However, this issue can be easily reproduced on my
> ARM XScale target using sg3_utils code as follows:
> 1. Run printtime.c attached, which prints a message to the console in
>    a loop.
> 2. Run sgm_dd (part of the sg3_utils package, source code attached) on
>    the same system as follows:
>      sgm_dd if=/dev/sg0 of=/dev/null count=10M bpt=1
> The print task can be delayed for as many as 25 seconds. Surprisingly,
> I can't reproduce the problem on an i386 test system with a more
> powerful processor.
>
> Some clarification on the MAP_ANONYMOUS option in mmap(). After fixing
> a bug and more testing, this option seems to make no difference to cpu
> load. Sorry about the previous report. Back to the drawing board now :-)
..

Okay, I don't see anything unusual here. The code is on a slow CPU,
and is triggering 10 MBytes of PIO over a (probably) slow bus to an ATA
device.

This *will* tie up the CPU at 100% for the duration of the I/O,
because the I/O happens in interrupt handlers, which are outside
of the realm of the CPU scheduler.

This is a known shortcoming of Linux for real-time uses.

When the I/O uses DMA transfers, it *may* still have a similar effect,
depending upon the caching in the ATA device, and on how the DMA shares
the memory bus with the CPU.

Again, no surprise here.

One way to deal with it in an embedded device is to force the
application that's generating the I/O to self-throttle.
Or modify the device driver to self-throttle.

You may want to find an embedded Linux consultant to help out
with this situation if it's beyond your expertise.

Cheers
Re: Process Scheduling Issue using sg/libata
On 11/18/07, Mark Lord <[EMAIL PROTECTED]> wrote:
> Fajun Chen wrote:
> > ..
> > I verified your program works in my system, and my application works
> > as well if changed accordingly. However, this change (indirect IO in
> > sg terms) may come at a performance cost for IO-intensive
> > applications, since it does NOT utilize the mmaped buffer managed by
> > the sg driver. Please see the relevant sg documents below:
> > http://sg.torque.net/sg/p/sg_v3_ho.html#id2495330
> > http://sg.torque.net/sg/p/sg_v3_ho.html#dmmio
> > As an example, sg_rbuf.c in the sg3_utils package uses the
> > SG_FLAG_MMAP_IO flag in SG_IO. Please see source code attached. I
> > also noticed that MAP_ANONYMOUS is NOT used in the mmap() call in
> > sg_rbuf.c, which may not be desirable as you pointed out in previous
> > emails. So this brings up an interesting sg usage issue: can we use
> > MAP_ANONYMOUS with the SG_FLAG_MMAP_IO flag in SG_IO?
> ..
>
> The SG_FLAG_MMAP works only with /dev/sg* devices, not /dev/sd*
> devices. I don't know which kind you were trying to use, since you
> still have not provided your source code for examination.
>
> If you are using /dev/sg*, then you should be able to get your original
> mmap() code to work. But the behaviour described thus far seems to
> indicate that your secret program must have been using /dev/sd* instead.

As a matter of fact, I'm using /dev/sg*. Due to the size of my test
application, I have not been able to compress it into a small and
publishable form. However, this issue can be easily reproduced on my
ARM XScale target using sg3_utils code as follows:
1. Run printtime.c attached, which prints a message to the console in
   a loop.
2. Run sgm_dd (part of the sg3_utils package, source code attached) on
   the same system as follows:
     sgm_dd if=/dev/sg0 of=/dev/null count=10M bpt=1
The print task can be delayed for as many as 25 seconds. Surprisingly,
I can't reproduce the problem on an i386 test system with a more
powerful processor.

Some clarification on the MAP_ANONYMOUS option in mmap(). After fixing
a bug and more testing, this option seems to make no difference to cpu
load. Sorry about the previous report. Back to the drawing board now :-)

Thanks,
Fajun

printtime.c
Description: Binary data

sgm_dd.c
Description: Binary data
Re: Process Scheduling Issue using sg/libata
Fajun Chen wrote:
> ..
> I verified your program works in my system, and my application works as
> well if changed accordingly. However, this change (indirect IO in sg
> terms) may come at a performance cost for IO-intensive applications,
> since it does NOT utilize the mmaped buffer managed by the sg driver.
> Please see the relevant sg documents below:
> http://sg.torque.net/sg/p/sg_v3_ho.html#id2495330
> http://sg.torque.net/sg/p/sg_v3_ho.html#dmmio
> As an example, sg_rbuf.c in the sg3_utils package uses the
> SG_FLAG_MMAP_IO flag in SG_IO. Please see source code attached. I also
> noticed that MAP_ANONYMOUS is NOT used in the mmap() call in sg_rbuf.c,
> which may not be desirable as you pointed out in previous emails. So
> this brings up an interesting sg usage issue: can we use MAP_ANONYMOUS
> with the SG_FLAG_MMAP_IO flag in SG_IO?
..

The SG_FLAG_MMAP works only with /dev/sg* devices, not /dev/sd* devices.
I don't know which kind you were trying to use, since you still have
not provided your source code for examination.

If you are using /dev/sg*, then you should be able to get your original
mmap() code to work. But the behaviour described thus far seems to
indicate that your secret program must have been using /dev/sd* instead.

Cheers
Re: Process Scheduling Issue using sg/libata
On 11/18/07, Mark Lord <[EMAIL PROTECTED]> wrote:
> Fajun Chen wrote:
> > On 11/17/07, Mark Lord <[EMAIL PROTECTED]> wrote:
> ..
> >> What you probably intended to do instead, was to use mmap to just
> >> allocate some page-aligned RAM, not to actually mmap any on-disk
> >> data. Right?
> >>
> >> Here's how that's done:
> >>
> >> read_buffer = (U8 *)mmap(NULL, buf_sz, PROT_READ | PROT_WRITE,
> >>                          MAP_SHARED|MAP_ANONYMOUS, -1, 0);
> >>
> > What I intended to do is to write data to the disc or read data from
> > the disc via SG_IO, as requested by my user-space application. I
> > don't want any automatically scheduled kernel task to sync data with
> > the disc.
> ..
>
> Right. Then you definitely do NOT want to mmap your device,
> because that's exactly what would otherwise happen, by design!
>
> > I've experimented with memory mapping using MAP_ANONYMOUS as you
> > suggested; the good news is that it does free up the cpu load, and
> > my system is much more responsive with the change.
> ..
>
> Yes, that's what we expected to see.
>
> > The bad news is that the data read back from the disc (PIO or DMA
> > read) seems to be invisible to the user-space application. For
> > instance, the read buffer is all zeros after an Identify Device
> > command. Is this an expected side effect of the MAP_ANONYMOUS option?
> ..
>
> No, that would be a side effect of some other bug in the code.
>
> Here (attached) is a working program that performs (PACKET) IDENTIFY
> DEVICE commands, using a mmap() buffer to receive the data.

I verified your program works in my system, and my application works as
well if changed accordingly. However, this change (indirect IO in sg
terms) may come at a performance cost for IO-intensive applications,
since it does NOT utilize the mmaped buffer managed by the sg driver.
Please see the relevant sg documents below:
http://sg.torque.net/sg/p/sg_v3_ho.html#id2495330
http://sg.torque.net/sg/p/sg_v3_ho.html#dmmio
As an example, sg_rbuf.c in the sg3_utils package uses the
SG_FLAG_MMAP_IO flag in SG_IO. Please see source code attached. I also
noticed that MAP_ANONYMOUS is NOT used in the mmap() call in sg_rbuf.c,
which may not be desirable as you pointed out in previous emails. So
this brings up an interesting sg usage issue: can we use MAP_ANONYMOUS
with the SG_FLAG_MMAP_IO flag in SG_IO?

Thanks,
Fajun

sg_rbuf.c
Description: Binary data
Re: Process Scheduling Issue using sg/libata
Fajun Chen wrote:
> On 11/17/07, Mark Lord <[EMAIL PROTECTED]> wrote:
> ..
>> What you probably intended to do instead, was to use mmap to just
>> allocate some page-aligned RAM, not to actually mmap any on-disk data.
>> Right?
>>
>> Here's how that's done:
>>
>> read_buffer = (U8 *)mmap(NULL, buf_sz, PROT_READ | PROT_WRITE,
>>                          MAP_SHARED|MAP_ANONYMOUS, -1, 0);
>>
> What I intended to do is to write data to the disc or read data from
> the disc via SG_IO, as requested by my user-space application. I don't
> want any automatically scheduled kernel task to sync data with the disc.
..

Right. Then you definitely do NOT want to mmap your device,
because that's exactly what would otherwise happen, by design!

> I've experimented with memory mapping using MAP_ANONYMOUS as you
> suggested; the good news is that it does free up the cpu load, and my
> system is much more responsive with the change.
..

Yes, that's what we expected to see.

> The bad news is that the data read back from the disc (PIO or DMA read)
> seems to be invisible to the user-space application. For instance, the
> read buffer is all zeros after an Identify Device command. Is this an
> expected side effect of the MAP_ANONYMOUS option?
..

No, that would be a side effect of some other bug in the code.

Here (attached) is a working program that performs (PACKET) IDENTIFY
DEVICE commands, using a mmap() buffer to receive the data.

Cheers

/*
 * This code is copyright 2007 by Mark Lord,
 * and is made available to all under the terms
 * of the GNU General Public License v2.
 */
/*
 * Note: the header names below were lost in the list archive;
 * this is a plausible reconstruction of the original 14 includes.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>
#include <ctype.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/time.h>
#include <scsi/scsi.h>
#include <scsi/sg.h>

typedef unsigned long long u64;

enum {
	ATA_CMD_PIO_IDENTIFY	= 0xec,
	ATA_CMD_PIO_PIDENTIFY	= 0xa1,

	/* normal sector size (bytes) for PIO/DMA */
	ATA_SECT_SIZE		= 512,

	ATA_16			= 0x85,
	ATA_16_LEN		= 16,
	ATA_DEV_REG_LBA		= (1 << 6),
	ATA_LBA48		= 1,

	/* data transfer protocols; only basic PIO and DMA actually work */
	ATA_PROTO_NON_DATA	= ( 3 << 1),
	ATA_PROTO_PIO_IN	= ( 4 << 1),
	ATA_PROTO_PIO_OUT	= ( 5 << 1),
	ATA_PROTO_DMA		= ( 6 << 1),
	ATA_PROTO_UDMA_IN	= (11 << 1),	/* unsupported */
	ATA_PROTO_UDMA_OUT	= (12 << 1),	/* unsupported */
};

/*
 * Taskfile layout for ATA_16 cdb (LBA28/LBA48):
 *
 * cdb[ 4] = feature
 * cdb[ 6] = nsect
 * cdb[ 8] = lbal
 * cdb[10] = lbam
 * cdb[12] = lbah
 * cdb[13] = device
 * cdb[14] = command
 *
 * "high order byte" (hob) fields for LBA48 commands:
 *
 * cdb[ 3] = hob_feature
 * cdb[ 5] = hob_nsect
 * cdb[ 7] = hob_lbal
 * cdb[ 9] = hob_lbam
 * cdb[11] = hob_lbah
 *
 * dxfer_direction choices:
 *
 * SG_DXFER_TO_DEV	(writing to drive)
 * SG_DXFER_FROM_DEV	(reading from drive)
 * SG_DXFER_NONE	(non-data commands)
 */
static int sg_issue (int fd, unsigned char ata_op, void *buf)
{
	unsigned char cdb[ATA_16_LEN] = { ATA_16, 0, 0, 0, 0, 0, 0, 0,
					  0, 0, 0, 0, 0, 0, 0, 0 };
	unsigned char sense[32];
	unsigned int nsects = 1;
	struct sg_io_hdr hdr;

	cdb[ 1] = ATA_PROTO_PIO_IN;
	cdb[ 6] = nsects;
	cdb[14] = ata_op;

	memset(&hdr, 0, sizeof(struct sg_io_hdr));
	hdr.interface_id	= 'S';
	hdr.cmd_len		= ATA_16_LEN;
	hdr.mx_sb_len		= sizeof(sense);
	hdr.dxfer_direction	= SG_DXFER_FROM_DEV;
	hdr.dxfer_len		= nsects * ATA_SECT_SIZE;
	hdr.dxferp		= buf;
	hdr.cmdp		= cdb;
	hdr.sbp			= sense;
	hdr.timeout		= 5000;	/* milliseconds */

	memset(sense, 0, sizeof(sense));
	if (ioctl(fd, SG_IO, &hdr) < 0) {
		perror("ioctl(SG_IO)");
		return (-1);
	}
	if (hdr.status == 0 && hdr.host_status == 0 && hdr.driver_status == 0)
		return 0;	/* success */
	if (hdr.status > 0) {
		unsigned char *d = sense + 8;
		/* SCSI status is non-zero */
		fprintf(stderr, "SG_IO error: SCSI sense=0x%x/%02x/%02x, ATA=0x%02x/%02x\n",
			sense[1] & 0xf, sense[2], sense[3], d[13], d[3]);
		return -1;
	}
	/* some other error we don't know about yet */
	fprintf(stderr, "SG_IO returned: SCSI status=0x%x, host_status=0x%x, driver_status=0x%x\n",
		hdr.status, hdr.host_status, hdr.driver_status);
	return -1;
}

int main (int argc, char *argv[])
{
	const char *devpath;
	int i, rc, fd;
#if 0
	unsigned short id[ATA_SECT_SIZE / 2];
	memset(id, 0, sizeof(id));
#else
	unsigned short *id;
	id = mmap(NULL, getpagesize(), PROT_READ|PROT_WRITE,
		  MAP_SHARED|MAP_ANONYMOUS, -1, 0);
	if (id == MAP_FAILED) {
		perror("mmap");
		exit(1);
	}
#endif
	if (argc != 2) {
		fprintf(stderr, "%s: bad/missing parm: expected <devpath>\n", argv[0]);
		exit(1);
	}
	devpath = argv[1];
	fd = open(devpath, O_RDWR|O_NONBLOCK);
	if (fd == -1) {
		perror(devpath);
		exit(1);
	}
	rc = sg_issue(fd, ATA_CMD_PIO_IDENTIFY, id);
	if (rc != 0)
		rc = sg_issue(fd, ATA_CMD_PIO_PIDENTIFY, id);
	if (rc == 0) {
		unsigned short *d = id;
		for (i = 0; i < (256/8); ++i) {
			printf("%04x %04x %04x %04x %04x %04x %04x %04x\n",
				d[0], d[1], d[2], d[3], d[4], d[5], d[6], d[7]);
			d += 8;
		}
		exit(0);
	}
	exit(1);
}
Re: Process Scheduling Issue using sg/libata
On 11/17/07, Mark Lord <[EMAIL PROTECTED]> wrote:
> Fajun Chen wrote:
> > On 11/17/07, Mark Lord <[EMAIL PROTECTED]> wrote:
> >> Fajun Chen wrote:
> >>> On 11/16/07, Mark Lord <[EMAIL PROTECTED]> wrote:
> >>>> Fajun Chen wrote:
> ..
> >>> This problem also happens with R/W DMA ops. Below are simplified
> >>> code snippets:
> >>> // Open one sg device for read
> >>> if ((sg_fd = open(dev_name, O_RDWR)) < 0)
> >>> {
> >>>     ...
> >>> }
> >>> read_buffer = (U8 *)mmap(NULL, buf_sz, PROT_READ | PROT_WRITE,
> >>>                          MAP_SHARED, sg_fd, 0);
> >>>
> >>> // Open the same sg device for write
> >>> if ((sg_fd_wr = open(dev_name, O_RDWR)) < 0)
> >>> {
> >>>     ...
> >>> }
> >>> write_buffer = (U8 *)mmap(NULL, buf_sz, PROT_READ | PROT_WRITE,
> >>>                          MAP_SHARED, sg_fd_wr, 0);
> >> ..
> >>
> >> Mmmm.. what is the purpose of those two mmap'd areas?
> >> I think this is important and relevant here: what are they used for?
> >>
> >> As coded above, these are memory-mapped areas that (1) overlap,
> >> and (2) will be demand-paged automatically to/from the disk
> >> as they are accessed/modified. This *will* conflict with any SG_IO
> >> operations happening at the same time on the same device.
> ..
> > The purpose of using two memory-mapped areas is to meet our
> > requirement that certain data patterns for writing need to be kept
> > across commands. For instance, if one buffer is used for both reads
> > and writes, then this buffer will need to be re-populated with
> > certain write data after each read command, which would be very
> > costly for write-read mixed types of ops. This separate R/W buffer
> > setting also facilitates data comparison.
> >
> > These buffers are not used at the same time (one will be used only
> > after the command on the other is completed). My application is the
> > only program accessing the disk using sg/libata, and the rest of the
> > programs run from ramdisk. Also, each buffer is only about 0.5 MB,
> > and we have 64 MB of RAM on the target board.
> > With this setup, these two buffers should be pretty much independent
> > and free from the block layer/file system, correct?
> ..
>
> No. Those "buffers" as coded above are actually mmap'ed representations
> of portions of the device (disk drive). So any write into one of those
> buffers will trigger disk writes, and just accessing ("read") the
> buffers may trigger disk reads.
>
> So what could be happening here is that when you trigger manual disk
> accesses via SG_IO that result in data being copied into those
> "buffers", the kernel then automatically schedules disk writes to
> update the on-disk copies of those mmap'd regions.
>
> What you probably intended to do instead, was to use mmap to just
> allocate some page-aligned RAM, not to actually mmap any on-disk data.
> Right?
>
> Here's how that's done:
>
> read_buffer = (U8 *)mmap(NULL, buf_sz, PROT_READ | PROT_WRITE,
>                          MAP_SHARED|MAP_ANONYMOUS, -1, 0);

What I intended to do is to write data to the disc or read data from
the disc via SG_IO, as requested by my user-space application. I don't
want any automatically scheduled kernel task to sync data with the disc.

I've experimented with memory mapping using MAP_ANONYMOUS as you
suggested; the good news is that it does free up the cpu load, and my
system is much more responsive with the change.

The bad news is that the data read back from the disc (PIO or DMA read)
seems to be invisible to the user-space application. For instance, the
read buffer is all zeros after an Identify Device command. Is this an
expected side effect of the MAP_ANONYMOUS option?

Thanks,
Fajun
Re: Process Scheduling Issue using sg/libata
Fajun Chen wrote:
> On 11/17/07, Mark Lord <[EMAIL PROTECTED]> wrote:
>> Fajun Chen wrote:
>>> On 11/16/07, Mark Lord <[EMAIL PROTECTED]> wrote:
>>>> Fajun Chen wrote:
>> ..
>>> This problem also happens with R/W DMA ops. Below are simplified
>>> code snippets:
>>> // Open one sg device for read
>>> if ((sg_fd = open(dev_name, O_RDWR)) < 0)
>>> {
>>>     ...
>>> }
>>> read_buffer = (U8 *)mmap(NULL, buf_sz, PROT_READ | PROT_WRITE,
>>>                          MAP_SHARED, sg_fd, 0);
>>>
>>> // Open the same sg device for write
>>> if ((sg_fd_wr = open(dev_name, O_RDWR)) < 0)
>>> {
>>>     ...
>>> }
>>> write_buffer = (U8 *)mmap(NULL, buf_sz, PROT_READ | PROT_WRITE,
>>>                          MAP_SHARED, sg_fd_wr, 0);
>> ..
>>
>> Mmmm.. what is the purpose of those two mmap'd areas?
>> I think this is important and relevant here: what are they used for?
>>
>> As coded above, these are memory-mapped areas that (1) overlap,
>> and (2) will be demand-paged automatically to/from the disk
>> as they are accessed/modified. This *will* conflict with any SG_IO
>> operations happening at the same time on the same device.
> ..
> The purpose of using two memory-mapped areas is to meet our requirement
> that certain data patterns for writing need to be kept across commands.
> For instance, if one buffer is used for both reads and writes, then
> this buffer will need to be re-populated with certain write data after
> each read command, which would be very costly for write-read mixed
> types of ops. This separate R/W buffer setting also facilitates data
> comparison.
>
> These buffers are not used at the same time (one will be used only
> after the command on the other is completed). My application is the
> only program accessing the disk using sg/libata, and the rest of the
> programs run from ramdisk. Also, each buffer is only about 0.5 MB, and
> we have 64 MB of RAM on the target board.
> With this setup, these two buffers should be pretty much independent
> and free from the block layer/file system, correct?
..

No. Those "buffers" as coded above are actually mmap'ed representations
of portions of the device (disk drive). So any write into one of those
buffers will trigger disk writes, and just accessing ("read") the
buffers may trigger disk reads.

So what could be happening here is that when you trigger manual disk
accesses via SG_IO that result in data being copied into those
"buffers", the kernel then automatically schedules disk writes to update
the on-disk copies of those mmap'd regions.

What you probably intended to do instead, was to use mmap to just
allocate some page-aligned RAM, not to actually mmap any on-disk data.
Right?

Here's how that's done:

read_buffer = (U8 *)mmap(NULL, buf_sz, PROT_READ | PROT_WRITE,
                         MAP_SHARED|MAP_ANONYMOUS, -1, 0);

Cheers
Re: Process Scheduling Issue using sg/libata
On 11/17/07, James Chapman <[EMAIL PROTECTED]> wrote:
> Fajun Chen wrote:
> > On 11/16/07, Tejun Heo <[EMAIL PROTECTED]> wrote:
> >> Fajun Chen wrote:
> >>> I use sg/libata and ATA pass-through for reads/writes. Linux
> >>> 2.6.18-rc2 and libata version 2.00 are loaded on an ARM XScale
> >>> board. Under heavy cpu load (e.g. when blocks per transfer/sector
> >>> count is set to 1), I've observed that the test application can
> >>> suck the cpu away for a long time (more than 20 seconds), and
> >>> other processes, including a high-priority shell, can not get the
> >>> time slice to run. What's interesting is that if the application
> >>> is under heavy IO load (e.g. when blocks per transfer/sector count
> >>> is set to 256), the problem goes away. I also tested with the open
> >>> source code sg_utils and got the same result, so this is not a
> >>> problem specific to my user-space application.
> >>>
> >>> Since user preemption is checked when the kernel is about to
> >>> return to user-space from a system call, the process scheduler
> >>> should be invoked after each system call. Something seems to be
> >>> broken here. I found a similar issue below:
> >>> http://marc.info/?l=linux-arm-kernel&m=103121214521819&w=2
> >>> But that turns out to be an issue with the MTD/JFFS2 drivers,
> >>> which are not used in my system.
> >>>
> >>> Has anyone experienced similar issues with sg/libata? Any
> >>> information would be greatly appreciated.
> >> That's one weird story. Does the kernel say anything during those
> >> 20 seconds?
> >>
> > No. Nothing in the kernel log.
> >
> > Fajun
>
> Have you considered using oprofile to find out what the CPU is doing
> during the 20 seconds?

Haven't tried oprofile yet; not sure if it will get the time slice to
run, though. During these 20 seconds, I've verified that my application
is still busy with R/W ops.

> Does the problem occur when you put it under load using another method?
> What are the ATA and network drivers here? I've seen some awful
> out-of-tree device drivers hog the CPU with busy-waits and other crap.
> Oprofile results should show the culprit.

If blocks per transfer/sector count is set to 256, which means the cpu
has less load (any other implications?), this problem no longer occurs.
Our target system uses the libata sil24/pata680 drivers and has a
customized FIFO driver but no network driver. The relevant variable here
is blocks per transfer/sector count, which seems to matter only to
sg/libata.

Thanks,
Fajun
Re: Process Scheduling Issue using sg/libata
On 11/17/07, Mark Lord <[EMAIL PROTECTED]> wrote:
> Fajun Chen wrote:
> > On 11/16/07, Mark Lord <[EMAIL PROTECTED]> wrote:
> >> Fajun Chen wrote:
> >>> Hi All,
> >>>
> >>> I use sg/libata and ATA pass-through for reads/writes. Linux
> >>> 2.6.18-rc2 and libata version 2.00 are loaded on an ARM XScale
> >>> board. Under heavy cpu load (e.g. when blocks per transfer/sector
> >>> count is set to 1), I've observed that the test application can
> >>> suck the cpu away for a long time (more than 20 seconds), and
> >>> other processes, including a high-priority shell, can not get the
> >>> time slice to run. What's interesting is that if the application
> >>> is under heavy IO load (e.g. when blocks per transfer/sector count
> >>> is set to 256), the problem goes away. I also tested with the open
> >>> source code sg_utils and got the same result, so this is not a
> >>> problem specific to my user-space application.
> >> ..
> >>
> >> Post the relevant code here, and then we'll be able to better
> >> understand and explain it to you.
> >>
> >> For example, if the code is using ATA opcodes 0x20, 0x21, 0x24,
> >> 0x30, 0x31, 0x34, 0x29, 0x39, 0xc4 or 0xc5 (any of the R/W PIO
> >> ops), then this behaviour does not surprise me in the least. Fully
> >> expected and difficult to avoid.
> >>
> > This problem also happens with R/W DMA ops. Below are simplified
> > code snippets:
> > // Open one sg device for read
> > if ((sg_fd = open(dev_name, O_RDWR)) < 0)
> > {
> >     ...
> > }
> > read_buffer = (U8 *)mmap(NULL, buf_sz, PROT_READ | PROT_WRITE,
> >                          MAP_SHARED, sg_fd, 0);
> >
> > // Open the same sg device for write
> > if ((sg_fd_wr = open(dev_name, O_RDWR)) < 0)
> > {
> >     ...
> > }
> > write_buffer = (U8 *)mmap(NULL, buf_sz, PROT_READ | PROT_WRITE,
> >                          MAP_SHARED, sg_fd_wr, 0);
> ..
>
> Mmmm.. what is the purpose of those two mmap'd areas?
> I think this is important and relevant here: what are they used for?
>
> As coded above, these are memory-mapped areas that (1) overlap,
> and (2) will be demand-paged automatically to/from the disk
> as they are accessed/modified. This *will* conflict with any SG_IO
> operations happening at the same time on the same device.
>

The purpose of using two memory-mapped areas is to meet our requirement
that certain data patterns for writing need to be kept across commands.
For instance, if one buffer is used for both reads and writes, then this
buffer will need to be re-populated with certain write data after each
read command, which would be very costly for write-read mixed types of
ops. This separate R/W buffer setting also facilitates data comparison.

These buffers are not used at the same time (one will be used only after
the command on the other is completed). My application is the only
program accessing the disk using sg/libata, and the rest of the programs
run from ramdisk. Also, each buffer is only about 0.5 MB, and we have
64 MB of RAM on the target board. With this setup, these two buffers
should be pretty much independent and free from the block layer/file
system, correct?

One thing is worth mentioning here. If the application is set to low
priority (nice 19), or sched_yield() is called after each R/W command,
then this issue disappears, but performance suffers.

Some thoughts here. For a static process, the Linux scheduler could
assign some dynamic priority to it based on activity and age, etc. Any
chance that the scheduler favors my application unfairly due to the load
condition?

Thanks,
Fajun
Re: Process Scheduling Issue using sg/libata
Fajun Chen wrote: > On 11/16/07, Tejun Heo <[EMAIL PROTECTED]> wrote: >> Fajun Chen wrote: >>> I use sg/libata and ata pass through for read/writes. Linux 2.6.18-rc2 >>> and libata version 2.00 are loaded on ARM XScale board. Under heavy >>> cpu load (e.g. when blocks per transfer/sector count is set to 1), >>> I've observed that the test application can suck cpu away for long >>> time (more than 20 seconds) and other processes including high >>> priority shell can not get the time slice to run. What's interesting >>> is that if the application is under heavy IO load (e.g. when blocks >>> per transfer/sector count is set to 256), the problem goes away. I >>> also tested with open source code sg_utils and got the same result, so >>> this is not a problem specific to my user-space application. >>> >>> Since user preemption is checked when the kernel is about to return to >>> user-space from a system call, process scheduler should be invoked >>> after each system call. Something seems to be broken here. I found a >>> similar issue below: >>> http://marc.info/?l=linux-arm-kernel&m=103121214521819&w=2 >>> But that turns out to be an issue with MTD/JFFS2 drivers, which are >>> not used in my system. >>> >>> Has anyone experienced similar issues with sg/libata? Any information >>> would be greatly appreciated. >> That's one weird story. Does kernel say anything during that 20 seconds? >> > No. Nothing in kernel log. > > Fajun Have you considered using oprofile to find out what the CPU is doing during the 20 seconds? Does the problem occur when you put it under load using another method? What are the ATA and network drivers here? I've seen some awful out-of-tree device drivers hog the CPU with busy-waits and other crap. Oprofile results should show the culprit. 
--
James Chapman
Katalix Systems Ltd
http://www.katalix.com
Catalysts for your Embedded Linux software development
Re: Process Scheduling Issue using sg/libata
Fajun Chen wrote:
> On 11/16/07, Mark Lord <[EMAIL PROTECTED]> wrote:
>> Fajun Chen wrote:
>>> Hi All,
>>>
>>> I use sg/libata and ata pass through for read/writes. Linux 2.6.18-rc2
>>> and libata version 2.00 are loaded on ARM XScale board. Under heavy
>>> cpu load (e.g. when blocks per transfer/sector count is set to 1),
>>> I've observed that the test application can suck cpu away for long
>>> time (more than 20 seconds) and other processes including high
>>> priority shell can not get the time slice to run. What's interesting
>>> is that if the application is under heavy IO load (e.g. when blocks
>>> per transfer/sector count is set to 256), the problem goes away. I
>>> also tested with open source code sg_utils and got the same result, so
>>> this is not a problem specific to my user-space application.
>> ..
>> Post the relevant code here, and then we'll be able to better understand
>> and explain it to you.
>>
>> For example, if the code is using ATA opcodes 0x20, 0x21, 0x24,
>> 0x30, 0x31, 0x34, 0x29, 0x39, 0xc4 or 0xc5 (any of the R/W PIO ops),
>> then this behaviour does not surprise me in the least. Fully expected
>> and difficult to avoid.
>
> This problem also happens with R/W DMA ops. Below are simplified
> code snippets:
>
> // Open one sg device for read
> if ((sg_fd = open(dev_name, O_RDWR)) < 0) {
>     ...
> }
> read_buffer = (U8 *)mmap(NULL, buf_sz, PROT_READ | PROT_WRITE,
>                          MAP_SHARED, sg_fd, 0);
>
> // Open the same sg device for write
> if ((sg_fd_wr = open(dev_name, O_RDWR)) < 0) {
>     ...
> }
> write_buffer = (U8 *)mmap(NULL, buf_sz, PROT_READ | PROT_WRITE,
>                           MAP_SHARED, sg_fd_wr, 0);
..

Mmmm.. what is the purpose of those two mmap'd areas?
I think this is important and relevant here: what are they used for?

As coded above, these are memory mapped areas that (1) overlap,
and (2) will be demand paged automatically to/from the disk
as they are accessed/modified. This *will* conflict with any SG_IO
operations happening at the same time on the same device.

> sg_io_hdr_t io_hdr;
>
> memset(&io_hdr, 0, sizeof(sg_io_hdr_t));
> io_hdr.interface_id = 'S';
> io_hdr.mx_sb_len    = sizeof(sense_buffer);
> io_hdr.sbp          = sense_buffer;
> io_hdr.dxfer_len    = dxfer_len;
> io_hdr.cmd_len      = cmd_len;
> io_hdr.cmdp         = cmdp;            // ATA pass through command block
> io_hdr.timeout      = cmd_tmo * 1000;  // in milliseconds
> io_hdr.pack_id      = id;              // read/write counter for now
> io_hdr.iovec_count  = 0;               // scatter/gather elements, 0 = not used
>
> if (direction == 1) {
>     io_hdr.dxfer_direction = SG_DXFER_TO_DEV;
>     io_hdr.flags |= SG_FLAG_MMAP_IO;
>     status = ioctl(sg_fd_wr, SG_IO, &io_hdr);
> } else {
>     io_hdr.dxfer_direction = SG_DXFER_FROM_DEV;
>     io_hdr.flags |= SG_FLAG_MMAP_IO;
>     status = ioctl(sg_fd, SG_IO, &io_hdr);
> }
> ...
>
> Mmap'd IO is a moot point here since this problem is also observed
> when using direct IO.
>
> Thanks,
> Fajun
Re: Process Scheduling Issue using sg/libata
On 11/16/07, Mark Lord <[EMAIL PROTECTED]> wrote:
> Fajun Chen wrote:
> > Hi All,
> >
> > I use sg/libata and ata pass through for read/writes. Linux 2.6.18-rc2
> > and libata version 2.00 are loaded on ARM XScale board. Under heavy
> > cpu load (e.g. when blocks per transfer/sector count is set to 1),
> > I've observed that the test application can suck cpu away for long
> > time (more than 20 seconds) and other processes including high
> > priority shell can not get the time slice to run. What's interesting
> > is that if the application is under heavy IO load (e.g. when blocks
> > per transfer/sector count is set to 256), the problem goes away. I
> > also tested with open source code sg_utils and got the same result, so
> > this is not a problem specific to my user-space application.
> ..
>
> Post the relevant code here, and then we'll be able to better understand
> and explain it to you.
>
> For example, if the code is using ATA opcodes 0x20, 0x21, 0x24,
> 0x30, 0x31, 0x34, 0x29, 0x39, 0xc4 or 0xc5 (any of the R/W PIO ops),
> then this behaviour does not surprise me in the least. Fully expected
> and difficult to avoid.
>
This problem also happens with R/W DMA ops. Below are simplified
code snippets:

// Open one sg device for read
if ((sg_fd = open(dev_name, O_RDWR)) < 0) {
    ...
}
read_buffer = (U8 *)mmap(NULL, buf_sz, PROT_READ | PROT_WRITE,
                         MAP_SHARED, sg_fd, 0);

// Open the same sg device for write
if ((sg_fd_wr = open(dev_name, O_RDWR)) < 0) {
    ...
}
write_buffer = (U8 *)mmap(NULL, buf_sz, PROT_READ | PROT_WRITE,
                          MAP_SHARED, sg_fd_wr, 0);

sg_io_hdr_t io_hdr;

memset(&io_hdr, 0, sizeof(sg_io_hdr_t));
io_hdr.interface_id = 'S';
io_hdr.mx_sb_len    = sizeof(sense_buffer);
io_hdr.sbp          = sense_buffer;
io_hdr.dxfer_len    = dxfer_len;
io_hdr.cmd_len      = cmd_len;
io_hdr.cmdp         = cmdp;            // ATA pass through command block
io_hdr.timeout      = cmd_tmo * 1000;  // in milliseconds
io_hdr.pack_id      = id;              // read/write counter for now
io_hdr.iovec_count  = 0;               // scatter/gather elements, 0 = not used

if (direction == 1) {
    io_hdr.dxfer_direction = SG_DXFER_TO_DEV;
    io_hdr.flags |= SG_FLAG_MMAP_IO;
    status = ioctl(sg_fd_wr, SG_IO, &io_hdr);
} else {
    io_hdr.dxfer_direction = SG_DXFER_FROM_DEV;
    io_hdr.flags |= SG_FLAG_MMAP_IO;
    status = ioctl(sg_fd, SG_IO, &io_hdr);
}
...

Mmap'd IO is a moot point here since this problem is also observed
when using direct IO.

Thanks,
Fajun
Re: Process Scheduling Issue using sg/libata
On 11/16/07, Tejun Heo <[EMAIL PROTECTED]> wrote:
> Fajun Chen wrote:
> > I use sg/libata and ata pass through for read/writes. Linux 2.6.18-rc2
> > and libata version 2.00 are loaded on ARM XScale board. Under heavy
> > cpu load (e.g. when blocks per transfer/sector count is set to 1),
> > I've observed that the test application can suck cpu away for long
> > time (more than 20 seconds) and other processes including high
> > priority shell can not get the time slice to run. What's interesting
> > is that if the application is under heavy IO load (e.g. when blocks
> > per transfer/sector count is set to 256), the problem goes away. I
> > also tested with open source code sg_utils and got the same result, so
> > this is not a problem specific to my user-space application.
> >
> > Since user preemption is checked when the kernel is about to return to
> > user-space from a system call, process scheduler should be invoked
> > after each system call. Something seems to be broken here. I found a
> > similar issue below:
> > http://marc.info/?l=linux-arm-kernel&m=103121214521819&w=2
> > But that turns out to be an issue with MTD/JFFS2 drivers, which are
> > not used in my system.
> >
> > Has anyone experienced similar issues with sg/libata? Any information
> > would be greatly appreciated.
>
> That's one weird story. Does kernel say anything during that 20 seconds?
>
No. Nothing in kernel log.

Fajun
Re: Process Scheduling Issue using sg/libata
Fajun Chen wrote:
> Hi All,
>
> I use sg/libata and ata pass through for read/writes. Linux 2.6.18-rc2
> and libata version 2.00 are loaded on ARM XScale board. Under heavy
> cpu load (e.g. when blocks per transfer/sector count is set to 1),
> I've observed that the test application can suck cpu away for long
> time (more than 20 seconds) and other processes including high
> priority shell can not get the time slice to run. What's interesting
> is that if the application is under heavy IO load (e.g. when blocks
> per transfer/sector count is set to 256), the problem goes away. I
> also tested with open source code sg_utils and got the same result, so
> this is not a problem specific to my user-space application.
..

Post the relevant code here, and then we'll be able to better understand
and explain it to you.

For example, if the code is using ATA opcodes 0x20, 0x21, 0x24,
0x30, 0x31, 0x34, 0x29, 0x39, 0xc4 or 0xc5 (any of the R/W PIO ops),
then this behaviour does not surprise me in the least. Fully expected
and difficult to avoid.

Cheers
Re: Process Scheduling Issue using sg/libata
Fajun Chen wrote:
> I use sg/libata and ata pass through for read/writes. Linux 2.6.18-rc2
> and libata version 2.00 are loaded on ARM XScale board. Under heavy
> cpu load (e.g. when blocks per transfer/sector count is set to 1),
> I've observed that the test application can suck cpu away for long
> time (more than 20 seconds) and other processes including high
> priority shell can not get the time slice to run. What's interesting
> is that if the application is under heavy IO load (e.g. when blocks
> per transfer/sector count is set to 256), the problem goes away. I
> also tested with open source code sg_utils and got the same result, so
> this is not a problem specific to my user-space application.
>
> Since user preemption is checked when the kernel is about to return to
> user-space from a system call, process scheduler should be invoked
> after each system call. Something seems to be broken here. I found a
> similar issue below:
> http://marc.info/?l=linux-arm-kernel&m=103121214521819&w=2
> But that turns out to be an issue with MTD/JFFS2 drivers, which are
> not used in my system.
>
> Has anyone experienced similar issues with sg/libata? Any information
> would be greatly appreciated.

That's one weird story. Does the kernel say anything during those 20 seconds?

--
tejun
Process Scheduling Issue using sg/libata
Hi All,

I use sg/libata and ATA pass-through for reads/writes. Linux 2.6.18-rc2
and libata version 2.00 are loaded on an ARM XScale board. Under heavy
CPU load (e.g. when blocks per transfer/sector count is set to 1),
I've observed that the test application can monopolize the CPU for a
long time (more than 20 seconds), and other processes, including a
high-priority shell, cannot get a time slice to run. What's interesting
is that if the application is under heavy IO load (e.g. when blocks
per transfer/sector count is set to 256), the problem goes away. I
also tested with the open source sg_utils code and got the same result,
so this is not a problem specific to my user-space application.

Since user preemption is checked when the kernel is about to return to
user space from a system call, the process scheduler should be invoked
after each system call. Something seems to be broken here. I found a
similar issue below:
http://marc.info/?l=linux-arm-kernel&m=103121214521819&w=2
But that turned out to be an issue with the MTD/JFFS2 drivers, which
are not used in my system.

Has anyone experienced similar issues with sg/libata? Any information
would be greatly appreciated.

Thanks,
Fajun