Re: Preempt & Xfs Question

2005-01-27 Thread Steve Lord
Matthias-Christian Ott wrote:
Hi!
I have a question: Why do I get such debug messages:
BUG: using smp_processor_id() in preemptible [0001] code: khelper/892
caller is _pagebuf_lookup_pages+0x11b/0x362
[<c03119c7>] smp_processor_id+0xa3/0xb4
[<c02ef802>] _pagebuf_lookup_pages+0x11b/0x362
[<c02ef802>] _pagebuf_lookup_pages+0x11b/0x362
.
Does the XFS Module avoid preemption rules? If so, why?
It is probably coming from these macros which keep various statistics
inside xfs as per cpu variables.
in fs/xfs/linux-2.6/xfs_stats.h:
DECLARE_PER_CPU(struct xfsstats, xfsstats);
/* We don't disable preempt, not too worried about poking the
 * wrong cpu's stat for now */
#define XFS_STATS_INC(count)		(__get_cpu_var(xfsstats).count++)
#define XFS_STATS_DEC(count)		(__get_cpu_var(xfsstats).count--)
#define XFS_STATS_ADD(count, inc)	(__get_cpu_var(xfsstats).count += (inc))
So the code knows that preemption can skew these counts, but that does not
really matter for the purpose they are used for here. The stats are just
informational, but very handy for working out what is going on inside XFS.
Using a single global instead of a per-cpu variable would lead to cache
line contention.
If you want the messages to go away on a preemptible kernel, then use the
alternate definition of the stat macros which is just below the code
quoted above.
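For reference, a preempt-safe variant looks roughly like the sketch below
(a sketch of the idea, not necessarily the exact alternate macros shipped in
xfs_stats.h): get_cpu_var() disables preemption so the cpu cannot change
under the update, and put_cpu_var() re-enables it.

/* Sketch: preempt-safe counterparts to the macros above. */
#define XFS_STATS_INC(count)	\
	do { get_cpu_var(xfsstats).count++; put_cpu_var(xfsstats); } while (0)
#define XFS_STATS_DEC(count)	\
	do { get_cpu_var(xfsstats).count--; put_cpu_var(xfsstats); } while (0)
#define XFS_STATS_ADD(count, inc)	\
	do { get_cpu_var(xfsstats).count += (inc); put_cpu_var(xfsstats); } while (0)

The cost is a preempt_disable()/preempt_enable() pair around every counter
update, which is presumably why the unprotected version is the default.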
Steve
p.s. try running xfs_stats.pl -f which comes with the xfs-cmds source to
watch the stats.


Re: [PATCH] Avoiding fragmentation through different allocator

2005-01-25 Thread Steve Lord
Mukker, Atul wrote:
LSI would leave no stone unturned to make the performance better for
megaraid controllers under Linux. If you have some hard data in relation to
comparison of performance for adapters from other vendors, please share with
us. We would definitely strive to better it.
The megaraid driver is open source; do you see anything the driver can do
to improve performance? We would greatly appreciate any feedback in this
regard and will definitely incorporate it in the driver. The FW under Linux
and Windows is the same, so I do not see how the megaraid stack should
perform differently under Linux and Windows?
It is not the driver per se, but the way the memory which is the I/O
source/target is presented to the driver. In Linux there is a good
chance the driver will have to use more scatter-gather elements to
represent the same amount of data.
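To illustrate why that matters, here is a small hedged user-space sketch
(purely illustrative, not driver code): pages that are physically contiguous
can be merged into one scatter-gather entry, while each discontiguity costs
another entry.

#include <stdio.h>

#define PAGE_SIZE 4096UL

/* Count the scatter-gather entries needed for a run of pages, merging
 * physically contiguous neighbours into a single entry. */
static unsigned int sg_entries(const unsigned long *phys, unsigned int npages)
{
	unsigned int entries = 0, i;

	for (i = 0; i < npages; i++)
		if (i == 0 || phys[i] != phys[i - 1] + PAGE_SIZE)
			entries++;
	return entries;
}

int main(void)
{
	unsigned long contig[4]  = { 0x10000, 0x11000, 0x12000, 0x13000 };
	unsigned long scatter[4] = { 0x10000, 0x30000, 0x22000, 0x51000 };

	printf("contiguous pages: %u entry\n", sg_entries(contig, 4));    /* 1 */
	printf("fragmented pages: %u entries\n", sg_entries(scatter, 4)); /* 4 */
	return 0;
}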
Steve


Re: [PATCH] Avoiding fragmentation through different allocator

2005-01-24 Thread Steve Lord
James Bottomley wrote:
Well, the basic advice would be not to worry too much about
fragmentation from the point of view of I/O devices.  They mostly all do
scatter gather (SG) onboard as an intelligent processing operation and
they're very good at it.
No one has ever really measured an effect we can say "This is due to the
card's SG engine".  So, the rule we tend to follow is that if SG element
reduction comes for free, we take it.  The issue that actually causes
problems isn't the reduction in processing overhead, it's that the
device's SG list is usually finite in size and so it's worth conserving
if we can; however it's mostly not worth conserving at the expense of
processor cycles.
It depends on the device at the other end of the scsi/fibre channel.
We have seen the processor in raid devices get maxed out by Linux
when it is not maxed out by Windows. Windows tends to be more device
friendly (I hate to say it), sending larger and fewer scatter-gather
elements than Linux does.
Running an LSI raid over fibre channel with 4 ports, Windows was
able to sustain ~830 Mbytes/sec, basically channel speed, using
only 1500 commands a second - well over 500 Kbytes per command. Linux
peaked at 550 Mbytes/sec using over 4000 scsi commands to do it, and
the sustained rate was more like 350 Mbytes/sec; I think at the end of
the day Linux was sending 128K per scsi request. These numbers predate
the current Linux scsi and io code, and I do not have the hardware to
rerun them right now.
I realize this is one data point on one end of the scale, but I
just wanted to make the point that there are cases where it
does matter. Hopefully William's little change from last
year has helped out a lot.
Steve


Re: LVM2

2005-01-20 Thread Steve Lord
Trever L. Adams wrote:
It is for a group. For the most part it is data access/retention. Writes
and such would be more similar to a desktop. I would use SATA if they
were (nearly) equally priced and there were awesome 1394 to SATA bridge
chips that worked well with Linux. So, right now, I am looking at ATA to
1394.
So, to get 2TB of RAID5 you have 6 500 GB disks right? So, will this
work within one LV? Or is it 2TB of disk space total? So, are volume
groups pretty fault tolerant if you have a bunch of RAID5 LVs below
them? This is my one worry about this.
Second, you mentioned file systems. We were talking about ext3. I have
never used any others in Linux (barring ext2, minixfs, and fat). I had
heard XFS from IBM was pretty good. I would rather not use reiserfs.
Any recommendations.
Trever
They all forgot to mention one more limitation: the maximum filesystem
size supported by the address_space structure in Linux. If you are running
on ia32, then you get stuck with 2^32 filesystem blocks, or 16 Tbytes in
one filesystem with 4K blocks, because of the way an address_space structure
is used to cache the metadata. If you use an Athlon 64, that limitation goes
away.
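As a quick, hedged sanity check on that figure (assuming the usual 4 Kbyte
page/block size on ia32):

#include <stdio.h>

int main(void)
{
	unsigned long long blocks = 1ULL << 32;	/* max page-cache index on a 32-bit index */
	unsigned long long block_size = 4096;	/* 4 Kbyte blocks/pages assumed */

	/* 2^32 * 2^12 = 2^44 bytes = 16 Tbytes */
	printf("%llu Tbytes\n", (blocks * block_size) >> 40);
	return 0;
}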
Steve


Re: Bounce buffer deadlock

2001-06-30 Thread Steve Lord

> 
> On Sat, 30 Jun 2001, Steve Lord wrote:
> >
> > OK, sounds reasonable, time to go download and merge again I guess!
> 
> For 2.4.7 or so, I'll make a backwards-compatibility define (ie make
> GFP_BUFFER be the same as the new GFP_NOIO, which is the historical
> behaviour and the anally safe value, if not very efficient), but I'm
> planning on releasing 2.4.6 without it, to try to flush out people who are
> able to take advantage of the new extended semantics out of the
> woodworks..
> 
>   Linus

Consider XFS flushed out (once I merge). This, for us, is the tricky part:


[EMAIL PROTECTED] said:
>> That allows us to do the best we can - still flushing out dirty
>> buffers when that's ok (like when a filesystem wants more memory), and
>> giving the allocator better control over exactly _what_ he objects to.

XFS introduces the notion that the low-level flush of a buffer is not always
really a low-level flush, since a delayed-allocate buffer can end
up reentering the filesystem in order to create the true on-disk allocation.
This in turn can cause a transaction and more memory allocations. The really
nasty case we were using GFP_BUFFER for is a memory allocation made from
within a transaction: we cannot afford to have another transaction start out
of the bottom of that memory allocation, as it may require resources locked by
the transaction which is doing the allocating.
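Under the new split semantics that case maps naturally onto GFP_NOFS. A rough
sketch of the pattern (the function name is illustrative only, not an actual
XFS entry point):

#include <linux/errno.h>
#include <linux/slab.h>

/* Illustrative only: allocate memory while a transaction is open.
 * GFP_NOFS still allows reclaim and I/O, but stops the allocator from
 * calling back into filesystem code, so reclaim cannot start a second
 * transaction that needs resources locked by this one. */
static int example_alloc_in_transaction(size_t size, void **out)
{
	void *buf = kmalloc(size, GFP_NOFS);

	if (!buf)
		return -ENOMEM;
	*out = buf;
	return 0;
}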

Steve






Re: Bounce buffer deadlock

2001-06-30 Thread Steve Lord


> Yes. 2.4.6-pre8 fixes that (not sure if its up already). 

It is up.

> 
> > If the fix is to avoid page_launder in these cases then the number of
> > occurrences when an alloc_pages fails will go up. 
> 
> > I was attempting to come up with a way of making try_to_free_buffers
> > fail on buffers which are being processed in the generic_make_request
> > path by marking them, the problem is there is no single place to reset
> > the state of a buffer so that try_to_free_buffers will wait for it.
> > Doing it after the end of the loop in generic_make_request is race
> > prone to say the least.
> 
> I really want to fix things like this in 2.5. (ie not avoid the deadlock
> by completly avoiding physical IO, but avoid the deadlock by avoiding
> physical IO on the "device" which is doing the allocation)
> 
> Could you send me your code ? No problem if it does not work at all :)
> 

Well, the basic idea is simple, but I suspect the implementation might
rapidly become historical in 2.5. Basically I added a new buffer state bit,
although BH_Req looks like it could be cannibalized, no one appears to check
for it (is it really dead code?). 

Using a flag to skip buffers in try_to_free_buffers is easy:


===
Index: linux/fs/buffer.c
===

--- /usr/tmp/TmpDir.3237-0/linux/fs/buffer.c_1.68   Sat Jun 30 12:56:29 2001
+++ linux/fs/buffer.c   Sat Jun 30 12:57:52 2001
@@ -2365,7 +2365,7 @@
 /*
  * Can the buffer be thrown out?
  */
-#define BUFFER_BUSY_BITS	((1<<BH_Dirty) | (1<<BH_Lock) | (1<<BH_Protected))
+#define BUFFER_BUSY_BITS	((1<<BH_Dirty) | (1<<BH_Lock) | (1<<BH_Protected) | (1<<BH_Clamped))
 #define buffer_busy(bh)		(atomic_read(&(bh)->b_count) | ((bh)->b_state & BUFFER_BUSY_BITS))
 
 /*
@@ -2430,7 +2430,11 @@
 	spin_unlock(&free_list[index].lock);
 	write_unlock(&hash_table_lock);
 	spin_unlock(&lru_list_lock);
-	if (wait) {
+	/* Buffers in the middle of generic_make_request processing cannot
+	 * be waited for, they may be allocating memory right now and be
+	 * locked by this thread.
+	 */
+	if (wait && !buffer_clamped(tmp)) {
 		sync_page_buffers(bh, wait);
 		/* We waited synchronously, so we can free the buffers. */
 		if (wait > 1 && !loop) {

===
Index: linux/include/linux/fs.h
===

--- /usr/tmp/TmpDir.3237-0/linux/include/linux/fs.h_1.99	Sat Jun 30 12:56:29 2001
+++ linux/include/linux/fs.h	Sat Jun 30 07:05:37 2001
@@ -224,6 +224,8 @@
BH_Mapped,  /* 1 if the buffer has a disk mapping */
BH_New, /* 1 if the buffer is new and not yet written out */
BH_Protected,   /* 1 if the buffer is protected */
+   BH_Clamped, /* 1 if the buffer cannot be reclaimed
+* in it's current state */
BH_Delay,   /* 1 if the buffer is delayed allocate */
 
BH_PrivateStart,/* not a state bit, but the first bit available
@@ -286,6 +288,7 @@
 #define buffer_mapped(bh)  __buffer_state(bh,Mapped)
 #define buffer_new(bh) __buffer_state(bh,New)
 #define buffer_protected(bh)   __buffer_state(bh,Protected)
+#define buffer_clamped(bh) __buffer_state(bh,Clamped)
 #define buffer_delay(bh)   __buffer_state(bh,Delay)
 
 #define bh_offset(bh)  ((unsigned long)(bh)->b_data & ~PAGE_MASK)


The tricky part, which I had not worked out how to do yet, is managing the
clearing of the state bit in all the correct places. You would have to set it
when the buffer got locked as I/O was about to start, and it becomes clearable
only after the last memory allocation during the I/O submission process. I do
not like the approach because there are so many ways a buffer can go
once you get into generic_make_request. At first I thought I could just
explicitly set and clear the flag around memory allocations like the bounce
buffer path. However, that can lead to AB-BA deadlocks between multiple
threads submitting I/O requests. At this point I started to think I was
going to build an unmaintainable rat's nest and decided I had not got
the correct answer.

I am not sure that an approach which avoids a specific device will fly either:
all the I/O can be on one device, and what does "device" mean when it comes
to md/lvm and request remapping?

Steve





Re: Bounce buffer deadlock

2001-06-30 Thread Steve Lord

> 
> On Sat, 30 Jun 2001, Steve Lord wrote:
> >
> > It looks to me as if all memory allocations of type GFP_BUFFER which happen
> > in generic_make_request downwards can hit the same type of deadlock, so
> > bounce buffers, the request functions of the raid and lvm paths can all
> > end up in try_to_free_buffers on a buffer they themselves hold the lock on.
> 
> .. which is why GFP_BUFFER doesn't exist any more in the most recent
> pre-kernels (oops, this is pre8 only, not pre7 like I said in the previous
> email)
> 
> The problem is that GFP_BUFFER used to mean two things: "don't call
> low-level filesystem" and "don't do IO". Some of the pre-kernels starting
> to make it mean "don't call low-level FS" only. The later ones split up
> the semantics, so that the cases which care about FS deadlocks use
> "GFP_NOFS", and the cases that care about IO recursion use "GFP_NOIO", so
> that we don't overload the meaning of GFP_BUFFER.
> 
> That allows us to do the best we can - still flushing out dirty buffers
> when that's ok (like when a filesystem wants more memory), and giving the
> allocator better control over exactly _what_ he objects to.
> 
>   Linus

OK, sounds reasonable, time to go download and merge again I guess!

Steve






Re: Bounce buffer deadlock

2001-06-30 Thread Steve Lord

> 
> 
> On Fri, 29 Jun 2001, Steve Lord wrote:
> 
> > 
> > Has anyone else seen a hang like this:
> > 
> >   bdflush()
> > flush_dirty_buffers()
> >   ll_rw_block()
> > submit_bh(buffer X)
> >   generic_make_request()
> > __make_request()
> > create_bounce()
> >   alloc_bounce_page()
> > alloc_page()
> >   try_to_free_pages()
> > do_try_to_free_pages()
> >   page_launder()
> > try_to_free_buffers( , 2)  -- i.e. wait for buffers
> >   sync_page_buffers()
> > __wait_on_buffer(buffer X)
> > 
> > Where the buffer head X going in the top of the stack is the same as the on
> e
> > we wait on at the bottom.
> > 
> > There still seems to be nothing to prevent the try to free buffers from
> > blocking on a buffer like this. Setting a flag on the buffer around the
> > create_bounce call, and skipping it in the try_to_free_buffers path would
> > be one approach to avoiding this.
> 
> Yes there is a bug: Linus is going to put a new fix soon. 
> 
> I hit this in 2.4.6-pre6, and I don't see anything in the ac series to
> protect against it.
> 
> That's because the -ac series does not contain the __GFP_BUFFER/__GFP_IO
> modifications which are in the -ac series.

It looks to me as if all memory allocations of type GFP_BUFFER which happen
from generic_make_request downwards can hit the same type of deadlock, so
bounce buffers and the request functions of the raid and lvm paths can all
end up in try_to_free_buffers on a buffer they themselves hold the lock on.

If the fix is to avoid page_launder in these cases, then the number of
occasions when an alloc_pages fails will go up. I was attempting to
come up with a way of making try_to_free_buffers fail on buffers which
are being processed in the generic_make_request path by marking them;
the problem is there is no single place to reset the state of a buffer so
that try_to_free_buffers will wait for it. Doing it after the end of the
loop in generic_make_request is race-prone, to say the least.

Steve






Bounce buffer deadlock

2001-06-29 Thread Steve Lord


Has anyone else seen a hang like this:

  bdflush()
flush_dirty_buffers()
  ll_rw_block()
submit_bh(buffer X)
  generic_make_request()
__make_request()
create_bounce()
  alloc_bounce_page()
alloc_page()
  try_to_free_pages()
do_try_to_free_pages()
  page_launder()
try_to_free_buffers( , 2)  -- i.e. wait for buffers
  sync_page_buffers()
__wait_on_buffer(buffer X)

Where the buffer head X going in at the top of the stack is the same as the
one we wait on at the bottom.

There still seems to be nothing to prevent try_to_free_buffers from
blocking on a buffer like this. Setting a flag on the buffer around the
create_bounce call, and skipping such buffers in the try_to_free_buffers
path, would be one approach to avoiding this.

I hit this in 2.4.6-pre6, and I don't see anything in the ac series to protect
against it.

Steve





Re: O_DIRECT please; Sybase 12.5

2001-06-29 Thread Steve Lord


XFS supports O_DIRECT on Linux, and has done for a while.
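For anyone wanting to try it from user space, here is a minimal hedged sketch
(not a definitive example; it assumes a 512-byte alignment requirement, and
real code should check the device's actual constraints):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	char *buf;
	int fd;
	ssize_t n;

	if (argc < 2)
		return 1;

	/* O_DIRECT bypasses the page cache; buffer, offset and length
	 * must meet the device's alignment constraints (512 bytes assumed). */
	fd = open(argv[1], O_RDONLY | O_DIRECT);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (posix_memalign((void **)&buf, 512, 4096)) {
		close(fd);
		return 1;
	}
	n = read(fd, buf, 4096);
	printf("read %zd bytes\n", n);

	free(buf);
	close(fd);
	return 0;
}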

Steve

> At work I had to sit through a meeting where I heard
> the boss say "If Linux makes Sybase go through the page cache on
> reads, maybe we'll just have to switch to Solaris.  That's
> a serious performance problem."
> All I could say was "I expect Linux will support O_DIRECT
> soon, and Sybase will support that within a year."  
> 
> Er, so did I promise too much?  Andrea mentioned O_DIRECT recently
> ( http://marc.theaimsgroup.com/?l=linux-kernel&m=99253913516599&w=2,
>  http://lwn.net/2001/0510/bigpage.php3 )
> Is it supported yet in 2.4, or is this a 2.5 thing?
> 
> And what are the chances Sybase will support that flag any time
> soon?  I just read on news://forums.sybase.com/sybase.public.ase.linux
> that Sybase ASE 12.5 was released today, and a 60 day eval is downloadable
> for NT and Linux.  I'm downloading now; it's a biggie.
> 
> It supports raw partitions, which is good; that might satisfy my
> boss (although the administration will be a pain, and I'm not
> sure whether it's really supported by Dell RAID devices).
> I'd prefer O_DIRECT :-(
> 
> Hope somebody can give me encouraging news.
> 
> Thanks,
> Dan





Re: Announcing Journaled File System (JFS) release 1.0.0 available

2001-06-28 Thread Steve Lord

> Hi,

> So I only hope that the smart guys at SGI find a way to prepare the 
> patches the way Linus loves because now the file 
> "patch-2.4.5-xfs-1.0.1-core" (which contains the modifs to the kernel 
> and not the new files) is about 174090 bytes which is a lot.
> 
> YA
> 

But that is not a patch intended for Linus; it is intended to enable all
the XFS features. I have a couple of kernel patches totalling 46298 bytes
which get you a working XFS filesystem in the kernel, and I could do
lots of things to make them smaller. When you hit the header files in the
correct manner for the different platforms, the size tends to mushroom.
These lines are all in different fcntl.h files, for example:

+#define O_INVISIBLE	0100 /* invisible I/O, for DMAPI/XDSM */
+#define O_INVISIBLE	0x8 /* invisible I/O, for DMAPI/XDSM */
+#define O_INVISIBLE	0200 /* invisible I/O, for DMAPI/XDSM */
+#define O_INVISIBLE	0100 /* invisible I/O, for DMAPI/XDSM */
+#define O_INVISIBLE	0100 /* invisible I/O, for DMAPI/XDSM */
+#define O_INVISIBLE	0x20 /* invisible I/O, for DMAPI/XDSM */
+#define O_INVISIBLE	0100 /* invisible I/O, for DMAPI/XDSM */
+#define O_INVISIBLE	0x20 /* invisible I/O, for DMAPI/XDSM */
+#define O_INVISIBLE	0100 /* invisible I/O, for DMAPI/XDSM */
+#define O_INVISIBLE	0200 /* invisible I/O, for DMAPI/XDSM */
+#define O_INVISIBLE	0x8 /* invisible I/O, for DMAPI/XDSM */
+#define O_INVISIBLE	0200 /* invisible I/O, for DMAPI/XDSM */

You make the patches look a lot bigger than they really are. There is
a difference between a patch which is placing things in the correct
places and one which is designed to be as short as possible.

Steve






Busy buffers and try_to_free_pages

2001-06-07 Thread Steve Lord


I am chasing around in circles with an issue where buffers pointing at
highmem pages are getting put onto the buffer free list, and later on
causing an oops in ext2 when it gets assigned them for metadata via getblk.

Say one thread is performing a truncate on an inode and is currently in
truncate_inode_pages, walking the pages and removing them from the
address space of the inode. If the try_to_free_buffers call fails to remove
the buffers from the page because the buffer_busy test fails, then
the buffers become anonymous and we disconnect the page from the address
space anyway. During unmount, these anonymous buffers get put on the free
list. A simple sync call running in parallel with the truncate can cause
the buffer to be seen as busy and get the system into this state.

What is supposed to prevent this from happening? It seems that pages
allocated from highmem should never be allowed to be cleaned up this
way.

Steve





Re: kernel oops with 2.4.3-xfs

2001-05-23 Thread Steve Lord


Hmm, we 'released' version 1 of XFS against a 2.4.2 base and packaged
it into a RedHat 7.1 kernel RPM; we also have a development CVS tree
currently running at 2.4.4. If you are running a production server
with what you describe below, you might want to switch to one of those
two kernels. The development tree just got a fix
which makes xfs much more responsive to cache pruning, and it has
numerous other xfs fixes, so you might want to use that.

Steve 

> hi,
> i use kernel 2.4.3 with xfs (release 1) on a dell poweredge 2450.
> 
> it happens about every week that the system completely hangs (network
> down,console does not accept
> any input,sysreq useless...).
> i think this has anything to do with xfs or other fs issues,because kupdated
> always uses about 98% of cpu
> time.i couldn't report any errors because no oops or something else was
> generated until yesterday night:
> 
> ksymoops 2.4.0 on i686 2.4.3-XFS.  Options used
>  -V (default)
>  -k /proc/ksyms (default)
>  -l /proc/modules (default)
>  -o /lib/modules/2.4.3-XFS/ (default)
>  -m /boot/System.map-2.4.3-XFS (default)
> 
> Warning: You did not tell me where to find symbol information.  I will
> assume that the log matches the kernel and modules that are running
> right now and I'll use the default options above for symbol resolution.
> If the current kernel and/or modules do not match the log, you can get
> more accurate output by telling me the kernel version and where to find
> map, modules, ksyms etc.  ksymoops -h explains the options.
> 
> No modules in ksyms, skipping objects
> Warning (read_lsmod): no symbols in lsmod, is /proc/modules a valid lsmod fil
> e?
> May 23 00:43:08 twasrv1 kernel: invalid operand: 
> May 23 00:43:08 twasrv1 kernel: CPU:1
> May 23 00:43:08 twasrv1 kernel: EIP:0010:[do_timer+67/152]
> May 23 00:43:08 twasrv1 kernel: EFLAGS: 00010002
> May 23 00:43:08 twasrv1 kernel: eax: 0020   ebx: d121dfc4   ecx: 0086
>   
> edx: 
> May 23 00:43:08 twasrv1 kernel: esi: 2001   edi: 0020   ebp: 
>   
> esp: d121df68
> May 23 00:43:08 twasrv1 kernel: ds: 0018   es: 0018   ss: 0018
> May 23 00:43:08 twasrv1 kernel: Process exp816 (pid: 15671, stackpage=d121d00
> 0)
> May 23 00:43:08 twasrv1 kernel: Stack: c010b7ec d121dfc4 c033c234 c01086c1
>   d121dfc4 c03cc880 
> May 23 00:43:08 twasrv1 kernel:c03ad800  d121dfbc c01088a6
>  d121dfc4 c033c234 40572404 
> May 23 00:43:08 twasrv1 kernel:080c928c 080cf3e1 0001 0020
> c033c234 bfffc5f4 c0107014 40572404 
> May 23 00:43:08 twasrv1 kernel: Call Trace: [timer_interrupt+168/304]
> [handle_IRQ_event+77/120] [do_IRQ+166/244] [ret_from_intr+0/32]
> [startup_32+43/203] 
> May 23 00:43:08 twasrv1 kernel: Code: c7 9d 81 3d 4c db 33 c0 4c db 33 c0 0f 
> 94
> c0 0f b6 c0 85 c0 
> Using defaults from ksymoops -t elf32-i386 -a i386
> 
> Code;   Before first symbol
>  <_EIP>:
> Code;   Before first symbol
>0:   c7 9d 81 3d 4c db 33  movl   $0xdb4cc033,0xdb4c3d81(%ebp)
> Code;  0007 Before first symbol
>7:   c0 4c db 
> Code;  000a Before first symbol
>a:   33 c0 xor%eax,%eax
> Code;  000c Before first symbol
>c:   0f 94 c0  sete   %al
> Code;  000f Before first symbol
>f:   0f b6 c0  movzbl %al,%eax
> Code;  0012 Before first symbol
>   12:   85 c0 test   %eax,%eax
> 
> 
> 2 warnings issued.  Results may not be reliable.
> 
> 
> i'm using a suse7.1 system,the kernel was compiled with gcc-2.95.3.
> 
> i'm not sure if this oops has anything to do with xfs or fs in general,but th
> is
> is the first oops i ever found
> in the logfiles.
> 
> if anyone needs more information,please let me know.
> the system is a production system and i need to get this thing stable.
> 
> please cc me because i'm not subscribed to the list.
> 
> regards,
> gerald weber





Re: Oops with 2.4.3-XFS

2001-05-16 Thread Steve Lord

> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA1
> 
> Hi,
> 
> [ I'm not subscribed to linux-xfs, please cc me ]
> 
> We have managed to get a Debian potato system (with the 2.4 updates from
> http://people.debian.org/~bunk/debian plus xfs-tools which we imported
> from woody) to run 2.4.3-XFS.
> 
> However, in testing a directory with lots (~177000) of files, we get the
> following oops (copied by hand, and run through ksymoops on a Red Hat box
> since the Debian one segfaulted :( )
> 
> HTH,
> 
> Matt
> 

Can you describe your testing beyond using a directory with 177000 files
in it?

Also, can you explain how you obtained the XFS code: from a patch, from
the CVS development tree, or from somewhere else?

Thanks

   Steve





Re: reiserfs, xfs, ext2, ext3

2001-05-09 Thread Steve Lord


Hans Reiser wrote:
> XFS used to have the performance problems that Alan described but fixed
> them in the linux port, yes?
> 
> Hans

Hmm, we do things somewhat differently on linux, but I suspect most of it
is due to hardware getting faster underneath us. 

Steve





Re: reiserfs, xfs, ext2, ext3

2001-05-09 Thread Steve Lord


> 
> XFS is very fast most of the time (deleting a file is so slow it's like
> using old BSD systems). I'm not familiar enough with its behaviour under
> Linux yet.

Hmm, I just removed 2.2 Gbytes of data in 3 files in 37 seconds (14.4
seconds system time), not too slow. And that is on a pretty vanilla 2 cpu
Linux box with a not very exciting scsi drive.

> 
> What you might want to do is to make a partition for 'mystery journalling fs'
> and benchmark a bit.
> 
> Alan
> 

I agree with Alan here: the only sure-fire way to find out which filesystem
will work best for your application is to try it out. I have found reiserfs
to be very fast in some tests, especially those operating on lots of small
files, but contrary to some people's belief, XFS is good for a lot more than
just messing with Gbyte-long data files.

Steve Lord





Re: 64-bit block sizes on 32-bit systems

2001-03-27 Thread Steve Lord


Hi,

Just a brief add to the discussion, besides which I have a vested interest
in this!

I do not believe that you can make the addressability of a device larger at
the expense of granularity of address space at the bottom end. Just because
ext2 has a single size for metadata does not mean everything you put on the
disks does. XFS filesystems, for example, can be made with block sizes from
512 bytes to 64Kbytes (ok not working on linux across this range yet, but it
will).
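
To make the granularity trade-off concrete, here is a quick calculation (not
part of the original mail) of how far a 32-bit block index reaches at various
block sizes; the capacity only grows by giving up the small units that some
metadata still needs:

#include <stdio.h>

int main(void)
{
    unsigned long long nblocks = 1ULL << 32;        /* 32-bit block numbers */
    unsigned int sizes[] = { 512, 1024, 4096, 65536 };
    unsigned int i;

    for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
        unsigned long long bytes = nblocks * sizes[i];
        printf("blocksize %6u -> %8llu GiB addressable\n",
               sizes[i], bytes >> 30);
    }
    return 0;
}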

In all of these cases we have chunks of metadata which are 512 bytes
long, and we have chunks bigger than the blocksize.  The 512 byte chunks
are the superblock and the heads of the freespace structures; there
are multiples of them through the filesystem.

To top that, we have disk write ordering constraints that could mean that
for two of the 512 byte chunks next to each other one must be written to
disk now to free log space, the other must not be written to disk because it
is in a transaction. We would be forced to do read-modify-write down at
some lower level - wait, the lower levels would not have the addressability.

There are probably other things which will not fly if you lose the
addressing granularity. Volume headers and such like would be one
possibility.

No I don't have a magic bullet solution, but I do not think that just
increasing the granularity of the addressing is the correct answer,
and yes I do agree that just growing the buffer_head fields is not
perfect either.

Steve Lord

p.s. there was mention of bigger page size, it is not hard to fix, but the
swap path will not even work with 64K pages right now.



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: scsi vs ide performance on fsync's

2001-03-02 Thread Steve Lord

> 
> 
> On Friday, March 02, 2001 01:25:25 PM -0600 Steve Lord <[EMAIL PROTECTED]> wrote:
> 
> >> For why ide is beating scsi in this benchmark...make sure tagged queueing
> >> is on (or increase the queue length?).  For the xlog.c test posted, I
> >> would expect scsi to get faster than ide as the size of the write
> >> increases.
> > 
> > I think the issue is the call being used now is going to get slower the
> > larger the device is, just from the point of view of how many buffers it
> > has to scan.
> 
> filemap_fdatawait, filemap_fdatasync, and fsync_inode_buffers all restrict
> their scans to a list of dirty buffers for that specific file.  Only
> file_fsync goes through all the dirty buffers on the device, and the ext2
> fsync path never calls file_fsync.
> 
> Or am I missing something?
> 
> -chris
> 
> 

No you are not, I will now go put on the brown paper bag.

The scsi thing is weird though, we have seen it here too.

Steve



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: scsi vs ide performance on fsync's

2001-03-02 Thread Steve Lord

> 
> 
> On Friday, March 02, 2001 12:39:01 PM -0600 Steve Lord <[EMAIL PROTECTED]> wrote:
> 
> [ file_fsync syncs all dirty buffers on the FS ]
> > 
> > So it looks like fsync is going to cost more for bigger devices. Given the
> > O_SYNC changes Stephen Tweedie did, couldnt fsync look more like this:
> > 
> >  down(&inode->i_sem);
> > filemap_fdatasync(ip->i_mapping);
> > fsync_inode_buffers(ip);
> > filemap_fdatawait(ip->i_mapping);
> >  up(&inode->i_sem);
> > 
> 
> reiserfs might need to trigger a commit on fsync, so the fs specific fsync
> op needs to be called.  But, you should not need to call file_fsync in the
> XFS fsync call (check out ext2's)


Right, this was just a generic example; the fsync_inode_buffers would be in
the filesystem specific fsync callout - this was more of a logical
example of what ext2 could do. XFS does completely different stuff in there
anyway. 

> 
> For why ide is beating scsi in this benchmark...make sure tagged queueing
> is on (or increase the queue length?).  For the xlog.c test posted, I would
> expect scsi to get faster than ide as the size of the write increases.

I think the issue is the call being used now is going to get slower the
larger the device is, just from the point of view of how many buffers it
has to scan.

> 
> -chris

Steve


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: scsi vs ide performance on fsync's

2001-03-02 Thread Steve Lord


> 
> 
> We're doing some mysql benchmarking.  For some reason it seems that ide
> drives are currently beating a scsi raid array and it seems to be related
> to fsync's.  Bonnie stats show the scsi array to blow away ide as
> expected, but mysql tests still have the idea beating on plain insert
> speeds.  Can anyone explain how this is possible, or perhaps explain how
> our testing may be flawed?
> 

The fsync system call does this:

down(&inode->i_sem);
filemap_fdatasync(inode->i_mapping);
err = file->f_op->fsync(file, dentry, 0);
filemap_fdatawait(inode->i_mapping);
up(&inode->i_sem);

the f_op->fsync part of this calls file_fsync() which does:

sync_buffers(dev, 1);

So it looks like fsync is going to cost more for bigger devices. Given the
O_SYNC changes Stephen Tweedie did, couldn't fsync look more like this:

down(&inode->i_sem);
filemap_fdatasync(ip->i_mapping);
fsync_inode_buffers(ip);
filemap_fdatawait(ip->i_mapping);
up(&inode->i_sem);
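
For comparing the two setups directly, a minimal userspace sketch of the
write-plus-fsync pattern an insert-heavy database generates could look like
the following (not part of the original mail; the file name and sizes are
arbitrary):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    const int iterations = 1000;
    char buf[512];
    struct timespec t0, t1;
    int fd, i;

    memset(buf, 'x', sizeof(buf));
    fd = open("fsync-test.dat", O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < iterations; i++) {
        /* each iteration forces the data out to the device before returning */
        if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf) || fsync(fd) != 0) {
            perror("write/fsync");
            return 1;
        }
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    printf("%d write+fsync cycles in %.2f seconds\n", iterations,
           (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
    close(fd);
    return 0;
}

Run it against a file on each device to separate the fsync cost itself from
whatever the higher-level benchmark is doing.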

Steve Lord



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: File IO performance

2001-02-14 Thread Steve Lord


Marcelo Tosatti wrote:
> 
> On Wed, 14 Feb 2001, Steve Lord wrote:
> 
> 
> 
> > A break in the on disk mapping of data could be used to stop readahead
> > I suppose, especially if getting that readahead page is going to
> > involve evicting other pages. I suspect that doing this time of thing
> > is probably getting too complex for it's own good though.
> >
> > Try breaking the readahead loop apart, folding the page_cache_read into
> > the loop, doing all the page allocates first, and then all the readpage
> > calls. 
> 
> Its too dangerous it seems --- the amount of pages which are
> allocated/locked/mapped/submitted together must be based on the number of
> free pages otherwise you can run into an oom deadlock when you have a
> relatively high number of pages allocated/locked. 

Which says that as you ask for pages to put the readahead in, you want to
get a failure back under memory pressure, at which point you push out what
you have already allocated and carry on.

> 
> > I suspect you really need to go a bit further and get the mapping of
> > all the pages fixed up before you do the actual reads.
> 
> Hum, also think about a no-buffer-head deadlock when we're under a
> critical number of buffer heads while having quite a few buffer heads
> locked which are not going to be queued until all needed buffer heads are 
> allocated.

All this is probably attempting to be too clever for its own good; there is
probably a much simpler way to get more things happening in parallel. Plus, in
reality, lots of apps will spend some time between read calls processing
data, so there is overlap; a benchmark doing just reads is the end case
of all of this.

Steve



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: File IO performance

2001-02-14 Thread Steve Lord

> 
> On Wed, 14 Feb 2001,  wrote:
> 
> > I have been performing some IO tests under Linux on SCSI disks.
> 
> ext2 filesystem? 
> 
> > I noticed gaps between the commands and decided to investigate.
> > I am new to the kernel and do not profess to understand what 
> > actually happens. My observations suggest that the file 
> > structured part of the io consists of the following file phases 
> > which mainly reside in mm/filemap.c . The user read call ends up in
> > a generic file read routine. 
> >
> > If the requested buffer is not in the file cache then the data is
> > requested from disk via the disk readahead routine.
> >
> > When this routine completes the data is copied to user space. I have
> > been looking at these phases on an analyzer and it seems that none of
> > them overlap for a single user process.
> > 
> > This creates gaps in the scsi commands which significantly reduce
> > bandwidth, particularly at todays disk speeds.
> > 
> > I am interested in making changes to the readahead routine. In this 
> > routine there is a loop
> > 
> >  /* Try to read ahead pages.
> >   * We hope that ll_rw_blk() plug/unplug, coalescence, requests sort
> >   * and the scheduler, will work enough for us to avoid too bad 
> >   * actuals IO requests. 
> >   */ 
> > 
> >  while (ahead < max_ahead) {
> >          ahead ++;
> >          if ((raend + ahead) >= end_index)
> >                  break;
> >          if (page_cache_read(filp, raend + ahead) < 0)
> >                  break;
> >  }
> > 
> > 
> > this whole loop completes before the disk command starts. If the 
> > commands are large and it is for a maximum read ahead this loops 
> > takes some time and is followed by disk commands.
> 
> Well in reality its worse than you think ;)
> 
> > It seems that the performance could be improved if the disk commands 
> > were overlapped in some way with the time taken in this loop. 
> > I have not traced page_cache_read so I have no idea what is happening
> > but I guess this is some page location and entry onto the specific
> > device buffer queues ?
> 
> page_cache_read searches for the given page in the page cache and returns
> it in case its found. 
> 
> If the page is not already in cache, a new page is allocated.
> 
> This allocation can block if we're running out of free memory. To free
> more memory, the allocation routines may try to sync dirty pages and/or
> swap out pages.
> 
> After the page is allocated, the mapping->readpage() function is called to
> read the page. The ->readpage() job is to map the page to its correct
> on-disk block (which may involve reading indirect blocks).
> 
> Finally, the page is queued to IO which again may block in case the
> request queue is full.
> 
> Another issue is that we do readahead of logically contiguous pages, which
> means we may be queuing pages for readahead which are not physically
> contiguous. In this case, we are generating disk seeks.
> 
> > I am really looking for some help in understanding what is happening 
> > here and suggestions in ways which operations may be overlapped.
> 
> I have some ideas...
> 
> The main problem of file readahead, IMHO, is its completely "per page"
> behaviour --- allocation, mapping, and queuing are done separately for
> each page and each of these three steps can block multiple times. This is
> bad because we can lose the chance for queuing the IOs together while
> we're blocked, resulting in several smaller reads which suck.
> 
> The nicest solution for that, IMHO, is to make the IO clustering at
> generic_file_read() context and send big requests to the IO layer instead
> "cluster if we're lucky", which is more or less what happens today.
> 
> Unfortunately stock Linux 2.4 maximum request size is one page.
> 
> SGI's XFS CVS tree contains a different kind of IO mechanism which can
> make bigger requests. We will probably have the current IO mechanism
> support bigger request sizes as well sometime in the future. However,
> both are 2.5 only things.
> 
> Additionally, the way Linux caches on-disk physical block information is
> not very efficient and can be optimized, resulting in less reads of fs
> data to map pages and/or know if pages are physically contiguous (the
> latter is very welcome for write clustering, too).
> 
> However, we may still optimize readahead a bit on Linux 2.4 without too
> much efforts: an IO read command which fails (and returns an error code
> back to the caller) if merging with other requests fail. 
> 
> Using this command for readahead pages (and quitting the read loop if we
> fail) can "fix" the logically!=physically contiguous problem and it also
> fixes the case where we sleep and the previous IO commands have been
> already sent to disk when we wake up. This fix is ugly and not as good as the
> IO clustering one, but _much_ simpler and that's all we can do for 2.4, I
> suppose.

We could break the loop apart somewhat and grab pages first, map them,
then submit all the I/Os together. This has other costs associated with
it, the earlier pages in the readahead - the ones likely to be used
first, will be delayed by the setup of the other pages. So the calling
thread is less likely to find the first of these pages in cache.
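
As a userspace illustration of the same overlap idea (this is not the
kernel-side change being discussed, and posix_fadvise postdates this thread;
the file name is a placeholder), one can hint a whole range ahead of time so
the disk works while earlier data is being processed:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const size_t chunk = 64 * 1024;
    char *buf = malloc(chunk);
    ssize_t n;
    int rc, fd = open("input.dat", O_RDONLY);

    if (fd < 0 || buf == NULL) {
        perror("setup");
        return 1;
    }

    /* One hint for the whole file: the kernel can queue the readahead I/O
     * up front instead of discovering it one read() at a time. */
    rc = posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);
    if (rc != 0)
        fprintf(stderr, "posix_fadvise: %s\n", strerror(rc));

    while ((n = read(fd, buf, chunk)) > 0) {
        /* process n bytes here; I/O for later chunks overlaps with this work */
    }

    close(fd);
    free(buf);
    return 0;
}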


Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

2001-02-06 Thread Steve Lord

> 
> On Tue, 6 Feb 2001, Marcelo Tosatti wrote:
> 
> > Think about a given number of pages which are physically contiguous on
> > disk -- you dont need to cache the block number for each page, you
> > just need to cache the physical block number of the first page of the
> > "cluster".
> 
> ranges are a hell of a lot more trouble to get right than page or
> block-sized objects - and typical access patterns are rarely 'ranged'. As
> long as the basic unit is not 'too small' (ie. not 512 byte, but something
> more sane, like 4096 bytes), i dont think ranging done in higher levels
> buys us anything valuable. And we do ranging at the request layer already
> ... Guess why most CPUs ended up having pages, and not "memory ranges"?
> It's simpler, thus faster in the common case and easier to debug.
> 
> > Usually we need to cache only block information (for clustering), and
> > not all the other stuff which buffer_head holds.
> 
> well, the other issue is that buffer_heads hold buffer-cache details as
> well. But i think it's too small right now to justify any splitup - and
> those issues are related enough to have significant allocation-merging
> effects.
> 
>   Ingo

Think about it from the point of view of being able to reduce the number of
times you need to talk to the allocator in a filesystem. You can talk to
the allocator about all of your readahead pages in one go, or you can do
things like allocate on flush rather than allocating a page at a time (that is
a bit more complex, but not too much).

Having to talk to the allocator on a page by page basis is my pet peeve about
the current mechanisms.

Steve



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Direct (unbuffered) I/O status ...

2001-02-02 Thread Steve Lord

> We're trying to port some code that currently runs on SGI using the IRIX
> direct I/O facility.  From searching the web, it appears that a similar
> feature either already is or will soon be available under Linux.  Could
> anyone fill me in on what the status is?
> 
> (I know about mapping block devices to raw devices, but that alone will
> not work for the application we're contemplating: we'd like conventional
> file-system support as well as unbuffered I/O capability).
> 
> Thanks in advance!
> 
> -Arun
>

I was going to let Stephen Tweedie respond to this one, but since he has
not got to it yet...

Yes, there has been talk of implementing filesystem I/O direct between user
memory and the disk device. Stephen's approach was to use similar techniques to
the raw I/O path to lock down the user pages; these would then be placed
in the address space of the inode, and the filesystem would do its usual
thing in terms of read or write. There are lots of end cases to this
which make it more complex than it sounds: what happens if there is already
data in the cache, what happens if someone memory maps the file in the
middle of the I/O, and lots of other goodies.

I suspect implementing this is quite a ways off yet, and almost certainly
a 2.5 feature for quite a while before it could possibly get into a 2.4
kernel.

Stephen is the one to give a real explanation of how he sees this working
and when it might be done.

However, given the open source work SGI is doing with XFS, we are pretty much
committed to supporting O_DIRECT on Linux XFS before this. There is
a very basic implementation of O_DIRECT read in the current Linux XFS,
it has not been tested in quite some time (i.e. it may be broken), and it is
not coherent with the buffer cache. I hope we can have this cleaned up and
write implemented in the next month or so.

This would have the added advantage that even if you are moving stuff from
Irix to Linux, you could at least take your existing filesystems with you.
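
For reference, application code using O_DIRECT on Linux typically looks
something like the sketch below. This is not the XFS implementation discussed
above; the alignment requirement varies by filesystem and kernel (4096 is just
a common safe choice), and the file name is a placeholder.

#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    const size_t align = 4096, len = 64 * 1024;
    void *buf = NULL;
    ssize_t n;
    int fd;

    /* O_DIRECT requires the buffer (and usually offset/length) to be aligned */
    if (posix_memalign(&buf, align, len) != 0) {
        fprintf(stderr, "posix_memalign failed\n");
        return 1;
    }

    fd = open("datafile", O_RDONLY | O_DIRECT);
    if (fd < 0) {
        perror("open(O_DIRECT)");
        return 1;
    }

    n = read(fd, buf, len);     /* bypasses the page cache entirely */
    if (n < 0)
        perror("read");
    else
        printf("read %zd bytes directly from disk\n", n);

    close(fd);
    free(buf);
    return 0;
}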

Steve

 
> 
> --
> Arun Rao
> Pixar Animation Studios
> 1200 Park Ave
> Emeryville, CA 94608
> (510) 752-3526
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [EMAIL PROTECTED]
> Please read the FAQ at http://www.tux.org/lkml/
> 


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: XFS file system Pre-Release

2001-02-01 Thread Steve Lord

> Hi!
> 
> > We will be demonstrating XFS as the root file system for high availability
> > and clustering solutions in SGI systems at LinuxWorld New York from January
> > 31 to February 2. Free XFS CDs will also be available at LinuxWorld.
> 
> What support does XFS provide for clustering?
>   Pavel

This statement is a little misleading: the clustering software is other
stuff from SGI; they just have xfs filesystems on the machines. Now CXFS
is another story, but it only exists for Irix now and it is almost certainly
not going to be open source when it is available for Linux (and yes I know
that makes packaging and support really interesting).

Steve

> -- 
> I'm [EMAIL PROTECTED] "In my country we have almost anarchy and I don't care."
> Panos Katsaloulis describing me w.r.t. patents at [EMAIL PROTECTED]
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [EMAIL PROTECTED]
> Please read the FAQ at http://www.tux.org/lkml/
> 


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-01 Thread Steve Lord

> On Thu, Feb 01, 2001 at 02:56:47PM -0600, Steve Lord wrote:
> > And if you are writing to a striped volume via a filesystem which can do
> > it's own I/O clustering, e.g. I throw 500 pages at LVM in one go and LVM
> > is striped on 64K boundaries.
> 
> But usually I want to have pages 0-63, 128-191, etc together, because they are
> contiguous on disk, or?

I was just giving an example of how kiobufs might need splitting up more often
than you think, crossing a stripe boundary is one obvious case. Yes you do
want to keep the pages which are contiguous on disk together, but you will
often get requests which cover multiple stripes, otherwise you don't really
get much out of stripes and may as well just concatenate drives.

Ideally the file is striped across the various disks in the volume, and one
large write (direct or from the cache) gets scattered across the disks. All
the I/O's run in parallel (and on different controllers if you have the 
budget).
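
A small worked example (not from the original mail) of how byte offsets map
onto the members of a volume striped on 64K boundaries shows why one large
contiguous request ends up split into per-disk chunks; the four-disk width is
just an illustration:

#include <stdio.h>

int main(void)
{
    const unsigned long long stripe_unit = 64 * 1024;   /* 64K stripe unit */
    const unsigned int ndisks = 4;                       /* example width */
    unsigned long long offsets[] = { 0, 65536, 131072, 500 * 4096ULL };
    unsigned int i;

    for (i = 0; i < sizeof(offsets) / sizeof(offsets[0]); i++) {
        unsigned long long stripe = offsets[i] / stripe_unit;
        unsigned int disk = (unsigned int)(stripe % ndisks);
        unsigned long long disk_off =
            (stripe / ndisks) * stripe_unit + offsets[i] % stripe_unit;
        printf("byte %10llu -> stripe %4llu, disk %u, offset on disk %llu\n",
               offsets[i], stripe, disk, disk_off);
    }
    return 0;
}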

Steve

> 
>   Christoph
> 
> -- 
> Of course it doesn't work. We've performed a software upgrade.


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-01 Thread Steve Lord

> In article <[EMAIL PROTECTED]> you wrote:
> > Hi,
> 
> > On Thu, Feb 01, 2001 at 05:34:49PM +, Alan Cox wrote:
> > In the disk IO case, you basically don't get that (the only thing
> > which comes close is raid5 parity blocks).  The data which the user
> > started with is the data sent out on the wire.  You do get some
> > interesting cases such as soft raid and LVM, or even in the scsi stack
> > if you run out of mailbox space, where you need to send only a
> > sub-chunk of the input buffer. 
> 
> Though your description is right, I don't think the case is very common:
> Sometimes in LVM on a pv boundary and maybe sometimes in the scsi code.


And if you are writing to a striped volume via a filesystem which can do
its own I/O clustering, e.g. I throw 500 pages at LVM in one go and LVM
is striped on 64K boundaries.

Steve


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-01 Thread Steve Lord

Christoph Hellwig wrote:
> On Thu, Feb 01, 2001 at 08:14:58PM +0530, [EMAIL PROTECTED] wrote:
> > 
> > That would require the vfs interfaces themselves (address space
> > readpage/writepage ops) to take kiobufs as arguments, instead of struct
> > page *  . That's not the case right now, is it ?
> 
> No, and with the current kiobufs it would not make sense, because they
> are too heavy-weight.  With page,length,offset iobufs this makes sense
> and is IMHO the way to go.
> 
>   Christoph
> 

Enquiring minds would like to know if you are working towards this 
revamp of the kiobuf structure at the moment; you have been very quiet
recently. 

Steve


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: XFS file system Pre-Release

2001-01-29 Thread Steve Lord

> 
>Any information on XFS interoperability with current kernel nfsd? 

You can NFS export XFS; I would have to say that this is not something we
test regularly, and you may find problems under high load.

Steve

> 
> 
> Pedro
> 

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/


