Re: Process Scheduling Issue using sg/libata

2007-11-19 Thread Alan Cox
> SFF ATA controllers are peculiar in that...
> 
> 1. They don't have a reliable IRQ pending bit.
> 
> 2. They don't have a reliable IRQ mask bit.
> 
> 3. Some controllers tank the machine completely if the status or data
> register is accessed differently than the chip likes.

And 4, which is a killer for a lot of RT users:

An I/O cycle to a taskfile-style controller generally goes at ISA-type
speed down the wire to the drive and back again. The CPU is stalled for
this, and there is nothing we can do about it.

> 
> So, it's not like we're all dickheads.  We know it's good to take those
> out of the IRQ handler.  The hardware just isn't very forgiving, and I bet
> you'll get obscure machine lockups if the RT kernel arbitrarily pushes
> ATA PIO data transfers into kernel threads.
> 
> I think doing what IDE has been doing (disabling the IRQ at the interrupt
> controller) is the way to go.

Agreed - at that point, RT or otherwise, you can push it out. If you need
to do serious (sub-1 ms) ATA, then also go get a non-SFF controller.

Alan
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Process Scheduling Issue using sg/libata

2007-11-19 Thread Tejun Heo
James Chapman wrote:
> Mark Lord wrote:
>> One way to deal with this in an embedded device is to force the
>> application that's generating the I/O to self-throttle,
>> or to modify the device driver to self-throttle.
> 
> Does disk access have to be so interrupt driven? Could disk interrupt
> handling be done in a softirq/kthread, the way the networking guys handle
> network device interrupts? This would prevent the system from
> live-locking when it is being bombarded with disk I/O events. It doesn't
> seem right that the disk I/O subsystem can cause interrupt live-lock on
> relatively slow CPUs...
> 
>> You may want to find an embedded Linux consultant to help out
>> with this situation if it's beyond your expertise.
> 
> Check out the rtlinux patch, which pushes all interrupt handling out to
> per-cpu kernel threads (irqd). The kernel scheduler then regains control
> of what runs when.
> 
> Another option is to change your ATA driver to do interrupt processing
> at task level using a workqueue or similar.

SFF ATA controllers are peculiar in that...

1. They don't have a reliable IRQ pending bit.

2. They don't have a reliable IRQ mask bit.

3. Some controllers tank the machine completely if the status or data
register is accessed differently than the chip likes.

So, it's not like we're all dickheads.  We know it's good to take those
out of the IRQ handler.  The hardware just isn't very forgiving, and I bet
you'll get obscure machine lockups if the RT kernel arbitrarily pushes
ATA PIO data transfers into kernel threads.

I think doing what IDE has been doing (disabling the IRQ at the interrupt
controller) is the way to go.

-- 
tejun


Re: Process Scheduling Issue using sg/libata

2007-11-19 Thread James Chapman
Mark Lord wrote:
> Fajun Chen wrote:
>>
>> As a matter of fact, I'm using /dev/sg*.  Due to the size of my test
>> application, I have not been able to compress it into a small,
>> publishable form. However, this issue can be easily reproduced on my
>> ARM XScale target using sg3_utils code as follows:
>> 1. Run printtime.c attached, which prints a message to the console in a loop.
>> 2. Run sgm_dd (part of the sg3_utils package, source code attached) on the
>> same system as follows:
>>> sgm_dd if=/dev/sg0 of=/dev/null count=10M bpt=1
>> The print task can be delayed by as much as 25 seconds. Surprisingly,
>> I can't reproduce the problem on an i386 test system with a more
>> powerful processor.
>>
>> Some clarification on the MAP_ANONYMOUS option in mmap(): after fixing a
>> bug and doing more testing, this option seems to make no difference to
>> CPU load. Sorry about the previous report. Back to the drawing board now :-)
> ..
> 
> Okay, I don't see anything unusual here.  The code is on a slow CPU,
> and is triggering 10MBytes of PIO over a (probably) slow bus to an ATA
> device.
> 
> This *will* tie up the CPU at 100% for the duration of the I/O,
> because the I/O happens in interrupt handlers, which are outside
> of the realm of the CPU scheduler.
> 
> This is a known shortcoming of Linux for real-time uses.
> 
> When the I/O uses DMA transfers, it *may* still have a similar effect,
> depending upon the caching in the ATA device, and on how the DMA shares
> the memory bus with the CPU.
> 
> Again, no surprise here.
> 
> One way to deal with this in an embedded device is to force the
> application that's generating the I/O to self-throttle,
> or to modify the device driver to self-throttle.

Does disk access have to be so interrupt driven? Could disk interrupt
handling be done in a softirq/kthread, the way the networking guys handle
network device interrupts? This would prevent the system from
live-locking when it is being bombarded with disk I/O events. It doesn't
seem right that the disk I/O subsystem can cause interrupt live-lock on
relatively slow CPUs...

> You may want to find an embedded Linux consultant to help out
> with this situation if it's beyond your expertise.

Check out the rtlinux patch, which pushes all interrupt handling out to
per-cpu kernel threads (irqd). The kernel scheduler then regains control
of what runs when.

Another option is to change your ATA driver to do interrupt processing
at task level using a workqueue or similar.

-- 
James Chapman
Katalix Systems Ltd
http://www.katalix.com
Catalysts for your Embedded Linux software development



Re: Process Scheduling Issue using sg/libata

2007-11-18 Thread Mark Lord

Fajun Chen wrote:


As a matter of fact, I'm using /dev/sg*.  Due to the size of my test
application, I have not been able to compress it into a small,
publishable form. However, this issue can be easily reproduced on my
ARM XScale target using sg3_utils code as follows:
1. Run printtime.c attached, which prints a message to the console in a loop.
2. Run sgm_dd (part of the sg3_utils package, source code attached) on the
same system as follows:

sgm_dd if=/dev/sg0 of=/dev/null count=10M bpt=1

The print task can be delayed by as much as 25 seconds. Surprisingly,
I can't reproduce the problem on an i386 test system with a more
powerful processor.

Some clarification on the MAP_ANONYMOUS option in mmap(): after fixing a
bug and doing more testing, this option seems to make no difference to
CPU load. Sorry about the previous report. Back to the drawing board now :-)

..

Okay, I don't see anything unusual here.  The code is on a slow CPU,
and is triggering 10MBytes of PIO over a (probably) slow bus to an ATA device.

This *will* tie up the CPU at 100% for the duration of the I/O,
because the I/O happens in interrupt handlers, which are outside
of the realm of the CPU scheduler.

This is a known shortcoming of Linux for real-time uses.

When the I/O uses DMA transfers, it *may* still have a similar effect,
depending upon the caching in the ATA device, and on how the DMA shares
the memory bus with the CPU.

Again, no surprise here.

One way to deal with this in an embedded device is to force the
application that's generating the I/O to self-throttle,
or to modify the device driver to self-throttle.

You may want to find an embedded Linux consultant to help out
with this situation if it's beyond your expertise.

Cheers


Re: Process Scheduling Issue using sg/libata

2007-11-18 Thread Fajun Chen
On 11/18/07, Mark Lord <[EMAIL PROTECTED]> wrote:
> Fajun Chen wrote:
> >..
> > I verified your program works in my system, and my application works as
> > well if changed accordingly. However, this change (indirect I/O in sg
> > terms) may come at a performance cost for I/O-intensive applications,
> > since it does NOT use the mmap'ed buffer managed by the sg driver.  Please
> > see the relevant sg documentation below:
> > http://sg.torque.net/sg/p/sg_v3_ho.html#id2495330
> > http://sg.torque.net/sg/p/sg_v3_ho.html#dmmio
> > As an example, sg_rbuf.c in the sg3_utils package uses the SG_FLAG_MMAP_IO
> > flag in SG_IO. Please see the source code attached. I also noticed that
> > MAP_ANONYMOUS is NOT used in the mmap() call in sg_rbuf.c, which may not
> > be desirable, as you pointed out in previous emails. So this brings up
> > an interesting sg usage issue: can we use MAP_ANONYMOUS with the
> > SG_FLAG_MMAP_IO flag in SG_IO?
> ..
>
> SG_FLAG_MMAP_IO works only with /dev/sg* devices, not /dev/sd* devices.
> I don't know which kind you were trying to use, since you still have
> not provided your source code for examination.
>
> If you are using /dev/sg*, then you should be able to get your original mmap()
> code to work.  But the behaviour described thus far seems to indicate that
> your secret program must have been using /dev/sd* instead.
>
As a matter of fact, I'm using /dev/sg*.  Due to the size of my test
application, I have not been able to compress it into a small,
publishable form. However, this issue can be easily reproduced on my
ARM XScale target using sg3_utils code as follows:
1. Run printtime.c attached, which prints a message to the console in a loop.
2. Run sgm_dd (part of the sg3_utils package, source code attached) on the
same system as follows:
>sgm_dd if=/dev/sg0 of=/dev/null count=10M bpt=1
The print task can be delayed by as much as 25 seconds. Surprisingly,
I can't reproduce the problem on an i386 test system with a more
powerful processor.

Some clarification on the MAP_ANONYMOUS option in mmap(): after fixing a
bug and doing more testing, this option seems to make no difference to
CPU load. Sorry about the previous report. Back to the drawing board now :-)

Thanks,
Fajun


printtime.c
Description: Binary data


sgm_dd.c
Description: Binary data


Re: Process Scheduling Issue using sg/libata

2007-11-18 Thread Mark Lord

Fajun Chen wrote:

..
I verified your program works in my system, and my application works as
well if changed accordingly. However, this change (indirect I/O in sg
terms) may come at a performance cost for I/O-intensive applications,
since it does NOT use the mmap'ed buffer managed by the sg driver.  Please
see the relevant sg documentation below:
http://sg.torque.net/sg/p/sg_v3_ho.html#id2495330
http://sg.torque.net/sg/p/sg_v3_ho.html#dmmio
As an example, sg_rbuf.c in the sg3_utils package uses the SG_FLAG_MMAP_IO
flag in SG_IO. Please see the source code attached. I also noticed that
MAP_ANONYMOUS is NOT used in the mmap() call in sg_rbuf.c, which may not
be desirable, as you pointed out in previous emails. So this brings up
an interesting sg usage issue: can we use MAP_ANONYMOUS with the
SG_FLAG_MMAP_IO flag in SG_IO?

..

SG_FLAG_MMAP_IO works only with /dev/sg* devices, not /dev/sd* devices.
I don't know which kind you were trying to use, since you still have
not provided your source code for examination.

If you are using /dev/sg*, then you should be able to get your original mmap()
code to work.  But the behaviour described thus far seems to indicate that
your secret program must have been using /dev/sd* instead.

Cheers


Re: Process Scheduling Issue using sg/libata

2007-11-18 Thread Fajun Chen
On 11/18/07, Mark Lord <[EMAIL PROTECTED]> wrote:
> Fajun Chen wrote:
> > On 11/17/07, Mark Lord <[EMAIL PROTECTED]> wrote:
> ..
> >> What you probably intended to do instead was to use mmap to just allocate
> >> some page-aligned RAM, not to actually mmap any on-disk data.  Right?
> >>
> >> Here's how that's done:
> >>
> >>   read_buffer = (U8 *)mmap(NULL, buf_sz, PROT_READ | PROT_WRITE,
> >>  MAP_SHARED|MAP_ANONYMOUS, -1, 0);
> >>
> > What I intended to do is to write data into disc or read data from
> > disc via SG_IO as requested by my user-space application. I don't want
> > any automatically scheduled kernel task to sync data with disc.
> ..
>
> Right.  Then you definitely do NOT want to mmap your device,
> because that's exactly what would otherwise happen, by design!
>
>
> > I've experimented with memory mapping using MAP_ANONYMOUS as you
> > suggested. The good news is that it does free up the CPU load, and my
> > system is much more responsive with the change.
> ..
>
> Yes, that's what we expected to see.
>
>
> > The bad news is that
> > the data read back from disc (PIO or DMA read) seems to be invisible
> > to the user-space application. For instance, the read buffer is all zeros
> > after an Identify Device command. Is this an expected side effect of the
> > MAP_ANONYMOUS option?
> ..
>
> No, that would be a side effect of some other bug in the code.
>
> Here (attached) is a working program that performs (PACKET)IDENTIFY DEVICE
> commands, using a mmap() buffer to receive the data.
>

I verified your program works in my system, and my application works as
well if changed accordingly. However, this change (indirect I/O in sg
terms) may come at a performance cost for I/O-intensive applications,
since it does NOT use the mmap'ed buffer managed by the sg driver.  Please
see the relevant sg documentation below:
http://sg.torque.net/sg/p/sg_v3_ho.html#id2495330
http://sg.torque.net/sg/p/sg_v3_ho.html#dmmio
As an example, sg_rbuf.c in the sg3_utils package uses the SG_FLAG_MMAP_IO
flag in SG_IO. Please see the source code attached. I also noticed that
MAP_ANONYMOUS is NOT used in the mmap() call in sg_rbuf.c, which may not
be desirable, as you pointed out in previous emails. So this brings up
an interesting sg usage issue: can we use MAP_ANONYMOUS with the
SG_FLAG_MMAP_IO flag in SG_IO?

Thanks,
Fajun


sg_rbuf.c
Description: Binary data


Re: Process Scheduling Issue using sg/libata

2007-11-18 Thread Mark Lord

Fajun Chen wrote:

On 11/17/07, Mark Lord <[EMAIL PROTECTED]> wrote:

..

What you probably intended to do instead was to use mmap to just allocate
some page-aligned RAM, not to actually mmap any on-disk data.  Right?

Here's how that's done:

  read_buffer = (U8 *)mmap(NULL, buf_sz, PROT_READ | PROT_WRITE,
 MAP_SHARED|MAP_ANONYMOUS, -1, 0);


What I intended to do is to write data into disc or read data from
disc via SG_IO as requested by my user-space application. I don't want
any automatically scheduled kernel task to sync data with disc.

..

Right.  Then you definitely do NOT want to mmap your device,
because that's exactly what would otherwise happen, by design!



I've experimented with memory mapping using MAP_ANONYMOUS as you
suggested. The good news is that it does free up the CPU load, and my
system is much more responsive with the change.

..

Yes, that's what we expected to see.



The bad news is that
the data read back from disc (PIO or DMA read) seems to be invisible
to the user-space application. For instance, the read buffer is all zeros
after an Identify Device command. Is this an expected side effect of the
MAP_ANONYMOUS option?

..

No, that would be a side effect of some other bug in the code.

Here (attached) is a working program that performs (PACKET)IDENTIFY DEVICE
commands, using a mmap() buffer to receive the data.

Cheers
/*
 * This code is copyright 2007 by Mark Lord,
 * and is made available to all under the terms
 * of the GNU General Public License v2.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>

#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <scsi/sg.h>

typedef unsigned long long u64;

enum {
	ATA_CMD_PIO_IDENTIFY		= 0xec,
	ATA_CMD_PIO_PIDENTIFY		= 0xa1,

	/* normal sector size (bytes) for PIO/DMA */
	ATA_SECT_SIZE			= 512,

	ATA_16				= 0x85,
	ATA_16_LEN			= 16,

	ATA_DEV_REG_LBA			= (1 << 6),

	ATA_LBA48			= 1,

	/* data transfer protocols; only basic PIO and DMA actually work */
	ATA_PROTO_NON_DATA		= ( 3 << 1),
	ATA_PROTO_PIO_IN		= ( 4 << 1),
	ATA_PROTO_PIO_OUT		= ( 5 << 1),
	ATA_PROTO_DMA			= ( 6 << 1),
	ATA_PROTO_UDMA_IN		= (11 << 1), /* unsupported */
	ATA_PROTO_UDMA_OUT		= (12 << 1), /* unsupported */
};

/*
 * Taskfile layout for ATA_16 cdb (LBA28/LBA48):
 *
 *	cdb[ 4] = feature
 *	cdb[ 6] = nsect
 *	cdb[ 8] = lbal
 *	cdb[10] = lbam
 *	cdb[12] = lbah
 *	cdb[13] = device
 *	cdb[14] = command
 *
 * "high order byte" (hob) fields for LBA48 commands:
 *
 *	cdb[ 3] = hob_feature
 *	cdb[ 5] = hob_nsect
 *	cdb[ 7] = hob_lbal
 *	cdb[ 9] = hob_lbam
 *	cdb[11] = hob_lbah
 *
 * dxfer_direction choices:
 *
 *	SG_DXFER_TO_DEV		(writing to drive)
 *	SG_DXFER_FROM_DEV	(reading from drive)
 *	SG_DXFER_NONE		(non-data commands)
 */

static int sg_issue (int fd, unsigned char ata_op, void *buf)
{
	unsigned char cdb[ATA_16_LEN]
		= { ATA_16, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
	unsigned char sense[32];
	unsigned int nsects = 1;
	struct sg_io_hdr hdr;

	cdb[ 1] = ATA_PROTO_PIO_IN;
	cdb[ 6] = nsects;
	cdb[14] = ata_op;

	memset(&hdr, 0, sizeof(struct sg_io_hdr));
	hdr.interface_id	= 'S';
	hdr.cmd_len		= ATA_16_LEN;
	hdr.mx_sb_len		= sizeof(sense);
	hdr.dxfer_direction	= SG_DXFER_FROM_DEV;
	hdr.dxfer_len		= nsects * ATA_SECT_SIZE;
	hdr.dxferp		= buf;
	hdr.cmdp		= cdb;
	hdr.sbp			= sense;
	hdr.timeout		= 5000; /* milliseconds */

	memset(sense, 0, sizeof(sense));
	if (ioctl(fd, SG_IO, &hdr) < 0) {
		perror("ioctl(SG_IO)");
		return (-1);
	}
	if (hdr.status == 0 && hdr.host_status == 0 && hdr.driver_status == 0)
		return 0; /* success */

	if (hdr.status > 0) {
		unsigned char *d = sense + 8;
		/* SCSI status is non-zero */
		fprintf(stderr, "SG_IO error: SCSI sense=0x%x/%02x/%02x, ATA=0x%02x/%02x\n",
			sense[1] & 0xf, sense[2], sense[3], d[13], d[3]);
		return -1;
	}
	/* some other error we don't know about yet */
	fprintf(stderr, "SG_IO returned: SCSI status=0x%x, host_status=0x%x, driver_status=0x%x\n",
		hdr.status, hdr.host_status, hdr.driver_status);
	return -1;
}

int main (int argc, char *argv[])
{
	const char *devpath;
	int i, rc, fd;
#if 0
	unsigned short id[ATA_SECT_SIZE / 2];
	memset(id, 0, sizeof(id));
#else
	unsigned short *id;
	id = mmap(NULL, getpagesize(), PROT_READ|PROT_WRITE, MAP_SHARED|MAP_ANONYMOUS, -1, 0);
	if (id == MAP_FAILED) {
		perror("mmap");
		exit(1);
	}
#endif
	if (argc != 2) {
		fprintf(stderr, "%s: bad/missing parm: expected a device path\n", argv[0]);
		exit(1);
	}
	devpath = argv[1];

	fd = open(devpath, O_RDWR|O_NONBLOCK);
	if (fd == -1) {
		perror(devpath);
		exit(1);
	}
	rc = sg_issue(fd, ATA_CMD_PIO_IDENTIFY, id);
	if (rc != 0)
		rc = sg_issue(fd, ATA_CMD_PIO_PIDENTIFY, id);
	if (rc == 0) {
		unsigned short *d = id;
		for (i = 0; i < (256/8); ++i) {
			printf("%04x %04x %04x %04x %04x %04x %04x %04x\n",
				d[0], d[1], d[2], d[3], d[4], d[5], d[6], d[7]);
			d += 8;
		}
		exit(0);
	}
	exit(1);
}


Re: Process Scheduling Issue using sg/libata

2007-11-17 Thread Fajun Chen
On 11/17/07, Mark Lord <[EMAIL PROTECTED]> wrote:
> Fajun Chen wrote:
> > On 11/17/07, Mark Lord <[EMAIL PROTECTED]> wrote:
> >> Fajun Chen wrote:
> >>> On 11/16/07, Mark Lord <[EMAIL PROTECTED]> wrote:
>  Fajun Chen wrote:
> ..
> >>> This problem also happens with R/W DMA ops. Below are simplified code 
> >>> snippets:
> >>> // Open one sg device for read
> >>>   if ((sg_fd  = open(dev_name, O_RDWR))<0)
> >>>   {
> >>>   ...
> >>>   }
> >>>   read_buffer = (U8 *)mmap(NULL, buf_sz, PROT_READ | PROT_WRITE,
> >>>  MAP_SHARED, sg_fd, 0);
> >>>
> >>> // Open the same sg device for write
> >>>   if ((sg_fd_wr = open(dev_name, O_RDWR))<0)
> >>>   {
> >>>  ...
> >>>   }
> >>>   write_buffer = (U8 *)mmap(NULL, buf_sz, PROT_READ | PROT_WRITE,
> >>>  MAP_SHARED, sg_fd_wr, 0);
> >> ..
> >>
> >> Mmmm.. what is the purpose of those two mmap'd areas ?
> >> I think this is important and relevant here:  what are they used for?
> >>
> >> As coded above, these are memory mapped areas taht (1) overlap,
> >> and (2) will be demand paged automatically to/from the disk
> >> as they are accessed/modified.  This *will* conflict with any SG_IO
> >> operations happening at the same time on the same device.
> ..
> > The purpose of using two memory mapped areas is to meet our
> > requirement that certain data patterns for writing need to be kept
> > across commands. For instance, if one buffer is used for both reads
> > and writes, then this buffer will need to be re-populated with certain
> > write data after each read command, which would be very costly for
> > write-read mixed type of ops. This separate R/W buffer setting also
> > facilitates data comparison.
> >
> > These buffers are not used at the same time (one will be used only
> > after the command on the other is completed). My application is the
> > only program accessing disk using sg/libata and the rest of the
> > programs run from ramdisk. Also, each buffer is only about 0.5MB and
> > we have 64MB RAM on the target board.
> > With this setup,  these two buffers should be pretty much independent
> > and free from block layer/file system, correct?
> ..
>
> No.  Those "buffers" as coded above are actually mmap'ed representations
> of portions of the device (disk drive).  So any write into one of those
> buffers will trigger disk writes, and just accessing ("read") the buffers
> may trigger disk reads.
>
> So what could be happening here, is when you trigger manual disk accesses
> via SG_IO, that result in data being copied into those "buffers", the kernel
> then automatically schedules disk writes to update the on-disk copies of
> those mmap'd regions.
>
> What you probably intended to do instead was to use mmap to just allocate
> some page-aligned RAM, not to actually mmap any on-disk data.  Right?
>
> Here's how that's done:
>
>   read_buffer = (U8 *)mmap(NULL, buf_sz, PROT_READ | PROT_WRITE,
>  MAP_SHARED|MAP_ANONYMOUS, -1, 0);
>
What I intended to do is to write data into disc or read data from
disc via SG_IO as requested by my user-space application. I don't want
any automatically scheduled kernel task to sync data with disc.

I've experimented with memory mapping using MAP_ANONYMOUS as you
suggested. The good news is that it does free up the CPU load, and my
system is much more responsive with the change. The bad news is that
the data read back from disc (PIO or DMA read) seems to be invisible
to the user-space application. For instance, the read buffer is all zeros
after an Identify Device command. Is this an expected side effect of the
MAP_ANONYMOUS option?

Thanks,
Fajun


Re: Process Scheduling Issue using sg/libata

2007-11-17 Thread Mark Lord

Fajun Chen wrote:

On 11/17/07, Mark Lord <[EMAIL PROTECTED]> wrote:

Fajun Chen wrote:

On 11/16/07, Mark Lord <[EMAIL PROTECTED]> wrote:

Fajun Chen wrote:

..

This problem also happens with R/W DMA ops. Below are simplified code snippets:
// Open one sg device for read
  if ((sg_fd  = open(dev_name, O_RDWR))<0)
  {
  ...
  }
  read_buffer = (U8 *)mmap(NULL, buf_sz, PROT_READ | PROT_WRITE,
 MAP_SHARED, sg_fd, 0);

// Open the same sg device for write
  if ((sg_fd_wr = open(dev_name, O_RDWR))<0)
  {
 ...
  }
  write_buffer = (U8 *)mmap(NULL, buf_sz, PROT_READ | PROT_WRITE,
 MAP_SHARED, sg_fd_wr, 0);

..

Mmmm.. what is the purpose of those two mmap'd areas ?
I think this is important and relevant here:  what are they used for?

As coded above, these are memory-mapped areas that (1) overlap,
and (2) will be demand paged automatically to/from the disk
as they are accessed/modified.  This *will* conflict with any SG_IO
operations happening at the same time on the same device.

..

The purpose of using two memory mapped areas is to meet our
requirement that certain data patterns for writing need to be kept
across commands. For instance, if one buffer is used for both reads
and writes, then this buffer will need to be re-populated with certain
write data after each read command, which would be very costly for
write-read mixed type of ops. This separate R/W buffer setting also
facilitates data comparison.

These buffers are not used at the same time (one will be used only
after the command on the other is completed). My application is the
only program accessing disk using sg/libata and the rest of the
programs run from ramdisk. Also, each buffer is only about 0.5MB and
we have 64MB RAM on the target board.
With this setup,  these two buffers should be pretty much independent
and free from block layer/file system, correct?

..

No.  Those "buffers" as coded above are actually mmap'ed representations
of portions of the device (disk drive).  So any write into one of those
buffers will trigger disk writes, and just accessing ("read") the buffers
may trigger disk reads.

So what could be happening here, is when you trigger manual disk accesses
via SG_IO, that result in data being copied into those "buffers", the kernel
then automatically schedules disk writes to update the on-disk copies of
those mmap'd regions.

What you probably intended to do instead was to use mmap to just allocate
some page-aligned RAM, not to actually mmap any on-disk data.  Right?

Here's how that's done:

  read_buffer = (U8 *)mmap(NULL, buf_sz, PROT_READ | PROT_WRITE,
 MAP_SHARED|MAP_ANONYMOUS, -1, 0);

Cheers


Re: Process Scheduling Issue using sg/libata

2007-11-17 Thread Fajun Chen
On 11/17/07, James Chapman <[EMAIL PROTECTED]> wrote:
> Fajun Chen wrote:
> > On 11/16/07, Tejun Heo <[EMAIL PROTECTED]> wrote:
> >> Fajun Chen wrote:
> >>> I use sg/libata and ata pass through for read/writes. Linux 2.6.18-rc2
> >>> and libata version 2.00 are loaded on ARM XScale board.  Under heavy
> >>> cpu load (e.g. when blocks per transfer/sector count is set to 1),
> >>> I've observed that the test application can suck cpu away for long
> >>> time (more than 20 seconds) and other processes including high
> >>> priority shell can not get the time slice to run.  What's interesting
> >>> is that if the application is under heavy IO load (e.g. when blocks
> >>> per transfer/sector count is set to 256),  the problem goes away. I
> >>> also tested with open source code sg_utils and got the same result, so
> >>> this is not a problem specific to my user-space application.
> >>>
> >>> Since user preemption is checked when the kernel is about to return to
> >>> user-space from a system call,  process scheduler should be invoked
> >>> after each system call. Something seems to be broken here.  I found a
> >>> similar issue below:
> >>> http://marc.info/?l=linux-arm-kernel&m=103121214521819&w=2
> >>> But that turns out to be an issue with MTD/JFFS2 drivers, which are
> >>> not used in my system.
> >>>
> >>> Has anyone experienced similar issues with sg/libata? Any information
> >>> would be greatly appreciated.
> >> That's one weird story.  Does kernel say anything during that 20 seconds?
> >>
> > No. Nothing in kernel log.
> >
> > Fajun
>
> Have you considered using oprofile to find out what the CPU is doing
> during the 20 seconds?
>
Haven't tried oprofile yet; not sure it would get the time slice to
run, though. During this 20 seconds, I've verified that my application
is still busy with R/W ops.

> Does the problem occur when you put it under load using another method?
> What are the ATA and network drivers here? I've seen some awful
> out-of-tree device drivers hog the CPU with busy-waits and other crap.
> Oprofile results should show the culprit.
If blocks per transfer/sector count is set to 256, which means the CPU
has less load (any other implications?), this problem no longer occurs.
Our target system uses the libata sil24/pata680 drivers and has a
customized FIFO driver but no network driver. The relevant variable here
is blocks per transfer/sector count, which seems to matter only to
sg/libata.

Thanks,
Fajun


Re: Process Scheduling Issue using sg/libata

2007-11-17 Thread Fajun Chen
On 11/17/07, Mark Lord <[EMAIL PROTECTED]> wrote:
> Fajun Chen wrote:
> > On 11/16/07, Mark Lord <[EMAIL PROTECTED]> wrote:
> >> Fajun Chen wrote:
> >>> Hi All,
> >>>
> >>> I use sg/libata and ata pass through for read/writes. Linux 2.6.18-rc2
> >>> and libata version 2.00 are loaded on ARM XScale board.  Under heavy
> >>> cpu load (e.g. when blocks per transfer/sector count is set to 1),
> >>> I've observed that the test application can suck cpu away for long
> >>> time (more than 20 seconds) and other processes including high
> >>> priority shell can not get the time slice to run.  What's interesting
> >>> is that if the application is under heavy IO load (e.g. when blocks
> >>> per transfer/sector count is set to 256),  the problem goes away. I
> >>> also tested with open source code sg_utils and got the same result, so
> >>> this is not a problem specific to my user-space application.
> >> ..
> >>
> >> Post the relevant code here, and then we'll be able to better understand
> >> and explain it to you.
> >>
> >> For example, if the code is using ATA opcodes 0x20, 0x21, 0x24,
> >> 0x30, 0x31, 0x34, 0x29, 0x39, 0xc4 or 0xc5 (any of the R/W PIO ops),
> >> then this behaviour does not surprise me in the least.  Fully expected
> >> and difficult to avoid.
> >>
> >
> > This problem also happens with R/W DMA ops. Below are simplified code 
> > snippets:
> > // Open one sg device for read
> >   if ((sg_fd  = open(dev_name, O_RDWR))<0)
> >   {
> >   ...
> >   }
> >   read_buffer = (U8 *)mmap(NULL, buf_sz, PROT_READ | PROT_WRITE,
> >  MAP_SHARED, sg_fd, 0);
> >
> > // Open the same sg device for write
> >   if ((sg_fd_wr = open(dev_name, O_RDWR))<0)
> >   {
> >  ...
> >   }
> >   write_buffer = (U8 *)mmap(NULL, buf_sz, PROT_READ | PROT_WRITE,
> >  MAP_SHARED, sg_fd_wr, 0);
> ..
>
> Mmmm.. what is the purpose of those two mmap'd areas ?
> I think this is important and relevant here:  what are they used for?
>
> As coded above, these are memory-mapped areas that (1) overlap,
> and (2) will be demand paged automatically to/from the disk
> as they are accessed/modified.  This *will* conflict with any SG_IO
> operations happening at the same time on the same device.
>
> 

The purpose of using two memory mapped areas is to meet our
requirement that certain data patterns for writing need to be kept
across commands. For instance, if one buffer is used for both reads
and writes, then this buffer will need to be re-populated with certain
write data after each read command, which would be very costly for
write-read mixed type of ops. This separate R/W buffer setting also
facilitates data comparison.

These buffers are not used at the same time (one will be used only
after the command on the other is completed). My application is the
only program accessing the disk using sg/libata, and the rest of the
programs run from ramdisk. Also, each buffer is only about 0.5 MB, and
we have 64 MB RAM on the target board.
With this setup, these two buffers should be pretty much independent
of the block layer/file system, correct?

One thing worth mentioning here: if the application is set to
low priority (nice 19), or sched_yield() is called after each R/W
command, then this issue disappears, but performance suffers.

Some thoughts here. For a static process, the Linux scheduler can
assign a dynamic priority based on activity, age, etc. Any chance
the scheduler favors my application unfairly under this load condition?

Thanks,
Fajun
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Process Scheduling Issue using sg/libata

2007-11-17 Thread James Chapman
Fajun Chen wrote:
> On 11/16/07, Tejun Heo <[EMAIL PROTECTED]> wrote:
>> Fajun Chen wrote:
>>> I use sg/libata and ata pass through for read/writes. Linux 2.6.18-rc2
>>> and libata version 2.00 are loaded on ARM XScale board.  Under heavy
>>> cpu load (e.g. when blocks per transfer/sector count is set to 1),
>>> I've observed that the test application can suck cpu away for long
>>> time (more than 20 seconds) and other processes including high
>>> priority shell can not get the time slice to run.  What's interesting
>>> is that if the application is under heavy IO load (e.g. when blocks
>>> per transfer/sector count is set to 256),  the problem goes away. I
>>> also tested with open source code sg_utils and got the same result, so
>>> this is not a problem specific to my user-space application.
>>>
>>> Since user preemption is checked when the kernel is about to return to
>>> user-space from a system call,  process scheduler should be invoked
>>> after each system call. Something seems to be broken here.  I found a
>>> similar issue below:
>>> http://marc.info/?l=linux-arm-kernel&m=103121214521819&w=2
>>> But that turns out to be an issue with MTD/JFFS2 drivers, which are
>>> not used in my system.
>>>
>>> Has anyone experienced similar issues with sg/libata? Any information
>>> would be greatly appreciated.
>> That's one weird story.  Does kernel say anything during that 20 seconds?
>>
> No. Nothing in kernel log.
> 
> Fajun

Have you considered using oprofile to find out what the CPU is doing
during the 20 seconds?

Does the problem occur when you put it under load using another method?
What are the ATA and network drivers here? I've seen some awful
out-of-tree device drivers hog the CPU with busy-waits and other crap.
Oprofile results should show the culprit.

-- 
James Chapman
Katalix Systems Ltd
http://www.katalix.com
Catalysts for your Embedded Linux software development



Re: Process Scheduling Issue using sg/libata

2007-11-17 Thread Mark Lord

Fajun Chen wrote:

On 11/16/07, Mark Lord <[EMAIL PROTECTED]> wrote:

Fajun Chen wrote:

Hi All,

I use sg/libata and ata pass through for read/writes. Linux 2.6.18-rc2
and libata version 2.00 are loaded on ARM XScale board.  Under heavy
cpu load (e.g. when blocks per transfer/sector count is set to 1),
I've observed that the test application can suck cpu away for long
time (more than 20 seconds) and other processes including high
priority shell can not get the time slice to run.  What's interesting
is that if the application is under heavy IO load (e.g. when blocks
per transfer/sector count is set to 256),  the problem goes away. I
also tested with open source code sg_utils and got the same result, so
this is not a problem specific to my user-space application.

..

Post the relevant code here, and then we'll be able to better understand
and explain it to you.

For example, if the code is using ATA opcodes 0x20, 0x21, 0x24,
0x30, 0x31, 0x34, 0x29, 0x39, 0xc4 or 0xc5 (any of the R/W PIO ops),
then this behaviour does not surprise me in the least.  Fully expected
and difficult to avoid.



This problem also happens with R/W DMA ops. Below are simplified code snippets:
// Open one sg device for read
  if ((sg_fd  = open(dev_name, O_RDWR))<0)
  {
  ...
  }
  read_buffer = (U8 *)mmap(NULL, buf_sz, PROT_READ | PROT_WRITE,
 MAP_SHARED, sg_fd, 0);

// Open the same sg device for write
  if ((sg_fd_wr = open(dev_name, O_RDWR))<0)
  {
 ...
  }
  write_buffer = (U8 *)mmap(NULL, buf_sz, PROT_READ | PROT_WRITE,
 MAP_SHARED, sg_fd_wr, 0);

..

Mmmm.. what is the purpose of those two mmap'd areas ?
I think this is important and relevant here:  what are they used for?

As coded above, these are memory mapped areas that (1) overlap,
and (2) will be demand paged automatically to/from the disk
as they are accessed/modified.  This *will* conflict with any SG_IO
operations happening at the same time on the same device.






  sg_io_hdr_t io_hdr;

  memset(&io_hdr, 0, sizeof(sg_io_hdr_t));

  io_hdr.interface_id = 'S';
  io_hdr.mx_sb_len= sizeof(sense_buffer);
  io_hdr.sbp  = sense_buffer;
  io_hdr.dxfer_len= dxfer_len;
  io_hdr.cmd_len  = cmd_len;
  io_hdr.cmdp = cmdp;// ATA pass through command block
  io_hdr.timeout  = cmd_tmo * 1000;   // In millisecs
  io_hdr.pack_id = id;  // Read/write counter for now
  io_hdr.iovec_count=0;   // scatter gather elements, 0=not being used

  if (direction == 1)
  {
io_hdr.dxfer_direction = SG_DXFER_TO_DEV;
io_hdr.flags |= SG_FLAG_MMAP_IO;
status = ioctl(sg_fd_wr, SG_IO, &io_hdr);
  }
  else
  {
io_hdr.dxfer_direction = SG_DXFER_FROM_DEV;
io_hdr.flags |= SG_FLAG_MMAP_IO;
status = ioctl(sg_fd, SG_IO, &io_hdr);
  }
  ...
Memory-mapped I/O is a moot point here, since this problem is also
observed when using direct I/O.

Thanks,
Fajun




Re: Process Scheduling Issue using sg/libata

2007-11-16 Thread Fajun Chen
On 11/16/07, Mark Lord <[EMAIL PROTECTED]> wrote:
> Fajun Chen wrote:
> > Hi All,
> >
> > I use sg/libata and ata pass through for read/writes. Linux 2.6.18-rc2
> > and libata version 2.00 are loaded on ARM XScale board.  Under heavy
> > cpu load (e.g. when blocks per transfer/sector count is set to 1),
> > I've observed that the test application can suck cpu away for long
> > time (more than 20 seconds) and other processes including high
> > priority shell can not get the time slice to run.  What's interesting
> > is that if the application is under heavy IO load (e.g. when blocks
> > per transfer/sector count is set to 256),  the problem goes away. I
> > also tested with open source code sg_utils and got the same result, so
> > this is not a problem specific to my user-space application.
> ..
>
> Post the relevant code here, and then we'll be able to better understand
> and explain it to you.
>
> For example, if the code is using ATA opcodes 0x20, 0x21, 0x24,
> 0x30, 0x31, 0x34, 0x29, 0x39, 0xc4 or 0xc5 (any of the R/W PIO ops),
> then this behaviour does not surprise me in the least.  Fully expected
> and difficult to avoid.
>

This problem also happens with R/W DMA ops. Below are simplified code snippets:
// Open one sg device for read
  if ((sg_fd  = open(dev_name, O_RDWR))<0)
  {
  ...
  }
  read_buffer = (U8 *)mmap(NULL, buf_sz, PROT_READ | PROT_WRITE,
 MAP_SHARED, sg_fd, 0);

// Open the same sg device for write
  if ((sg_fd_wr = open(dev_name, O_RDWR))<0)
  {
 ...
  }
  write_buffer = (U8 *)mmap(NULL, buf_sz, PROT_READ | PROT_WRITE,
 MAP_SHARED, sg_fd_wr, 0);

  sg_io_hdr_t io_hdr;

  memset(&io_hdr, 0, sizeof(sg_io_hdr_t));

  io_hdr.interface_id = 'S';
  io_hdr.mx_sb_len= sizeof(sense_buffer);
  io_hdr.sbp  = sense_buffer;
  io_hdr.dxfer_len= dxfer_len;
  io_hdr.cmd_len  = cmd_len;
  io_hdr.cmdp = cmdp;// ATA pass through command block
  io_hdr.timeout  = cmd_tmo * 1000;   // In millisecs
  io_hdr.pack_id = id;  // Read/write counter for now
  io_hdr.iovec_count=0;   // scatter gather elements, 0=not being used

  if (direction == 1)
  {
io_hdr.dxfer_direction = SG_DXFER_TO_DEV;
io_hdr.flags |= SG_FLAG_MMAP_IO;
status = ioctl(sg_fd_wr, SG_IO, &io_hdr);
  }
  else
  {
io_hdr.dxfer_direction = SG_DXFER_FROM_DEV;
io_hdr.flags |= SG_FLAG_MMAP_IO;
status = ioctl(sg_fd, SG_IO, &io_hdr);
  }
  ...
Memory-mapped I/O is a moot point here, since this problem is also
observed when using direct I/O.

Thanks,
Fajun


Re: Process Scheduling Issue using sg/libata

2007-11-16 Thread Fajun Chen
On 11/16/07, Tejun Heo <[EMAIL PROTECTED]> wrote:
> Fajun Chen wrote:
> > I use sg/libata and ata pass through for read/writes. Linux 2.6.18-rc2
> > and libata version 2.00 are loaded on ARM XScale board.  Under heavy
> > cpu load (e.g. when blocks per transfer/sector count is set to 1),
> > I've observed that the test application can suck cpu away for long
> > time (more than 20 seconds) and other processes including high
> > priority shell can not get the time slice to run.  What's interesting
> > is that if the application is under heavy IO load (e.g. when blocks
> > per transfer/sector count is set to 256),  the problem goes away. I
> > also tested with open source code sg_utils and got the same result, so
> > this is not a problem specific to my user-space application.
> >
> > Since user preemption is checked when the kernel is about to return to
> > user-space from a system call,  process scheduler should be invoked
> > after each system call. Something seems to be broken here.  I found a
> > similar issue below:
> > http://marc.info/?l=linux-arm-kernel&m=103121214521819&w=2
> > But that turns out to be an issue with MTD/JFFS2 drivers, which are
> > not used in my system.
> >
> > Has anyone experienced similar issues with sg/libata? Any information
> > would be greatly appreciated.
>
> That's one weird story.  Does kernel say anything during that 20 seconds?
>
No. Nothing in kernel log.

Fajun


Re: Process Scheduling Issue using sg/libata

2007-11-16 Thread Mark Lord

Fajun Chen wrote:

Hi All,

I use sg/libata and ata pass through for read/writes. Linux 2.6.18-rc2
and libata version 2.00 are loaded on ARM XScale board.  Under heavy
cpu load (e.g. when blocks per transfer/sector count is set to 1),
I've observed that the test application can suck cpu away for long
time (more than 20 seconds) and other processes including high
priority shell can not get the time slice to run.  What's interesting
is that if the application is under heavy IO load (e.g. when blocks
per transfer/sector count is set to 256),  the problem goes away. I
also tested with open source code sg_utils and got the same result, so
this is not a problem specific to my user-space application.

..

Post the relevant code here, and then we'll be able to better understand
and explain it to you.

For example, if the code is using ATA opcodes 0x20, 0x21, 0x24,
0x30, 0x31, 0x34, 0x29, 0x39, 0xc4 or 0xc5 (any of the R/W PIO ops),
then this behaviour does not surprise me in the least.  Fully expected
and difficult to avoid.

Cheers



Re: Process Scheduling Issue using sg/libata

2007-11-16 Thread Tejun Heo
Fajun Chen wrote:
> I use sg/libata and ata pass through for read/writes. Linux 2.6.18-rc2
> and libata version 2.00 are loaded on ARM XScale board.  Under heavy
> cpu load (e.g. when blocks per transfer/sector count is set to 1),
> I've observed that the test application can suck cpu away for long
> time (more than 20 seconds) and other processes including high
> priority shell can not get the time slice to run.  What's interesting
> is that if the application is under heavy IO load (e.g. when blocks
> per transfer/sector count is set to 256),  the problem goes away. I
> also tested with open source code sg_utils and got the same result, so
> this is not a problem specific to my user-space application.
> 
> Since user preemption is checked when the kernel is about to return to
> user-space from a system call,  process scheduler should be invoked
> after each system call. Something seems to be broken here.  I found a
> similar issue below:
> http://marc.info/?l=linux-arm-kernel&m=103121214521819&w=2
> But that turns out to be an issue with MTD/JFFS2 drivers, which are
> not used in my system.
> 
> Has anyone experienced similar issues with sg/libata? Any information
> would be greatly appreciated.

That's one weird story.  Does kernel say anything during that 20 seconds?

-- 
tejun


Process Scheduling Issue using sg/libata

2007-11-16 Thread Fajun Chen
Hi All,

I use sg/libata and ata pass through for read/writes. Linux 2.6.18-rc2
and libata version 2.00 are loaded on ARM XScale board.  Under heavy
cpu load (e.g. when blocks per transfer/sector count is set to 1),
I've observed that the test application can suck cpu away for long
time (more than 20 seconds) and other processes including high
priority shell can not get the time slice to run.  What's interesting
is that if the application is under heavy IO load (e.g. when blocks
per transfer/sector count is set to 256),  the problem goes away. I
also tested with open source code sg_utils and got the same result, so
this is not a problem specific to my user-space application.

Since user preemption is checked when the kernel is about to return to
user-space from a system call,  process scheduler should be invoked
after each system call. Something seems to be broken here.  I found a
similar issue below:
http://marc.info/?l=linux-arm-kernel&m=103121214521819&w=2
But that turns out to be an issue with MTD/JFFS2 drivers, which are
not used in my system.

Has anyone experienced similar issues with sg/libata? Any information
would be greatly appreciated.

Thanks,
Fajun