Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?

2007-12-17 Thread Randy Dunlap
On Sun, 16 Dec 2007 21:55:20 + Mel Gorman wrote:

   Just using cp to read the file is enough to cause problems but I included
   a very basic program below that produces the BUG_ON checks. Is this a 
   known
   issue or am I using the interface incorrectly?
  
  I'd say you're using it correctly but you've found a hitherto unknown bug. 
  On i386 highmem machines with CONFIG_HIGHPTE (at least) pte_offset_map()
  takes kmap_atomic(), so pagemap_pte_range() can't do copy_to_user() as it
  presently does.
  
  Drat.
  
  Still, that shouldn't really disrupt the testing which you're doing.  You
  could disable CONFIG_HIGHPTE to shut it up.
  
 
 Yes, that did the trick. Using pagemap, it was trivial to show that the
 2.6.24-rc5-mm1 kernel was placing pages in reverse physical order, as
 the following output shows:
 
 b:  32763 v:   753091 p:65559 . 65558 contig: 1
 b:  32764 v:   753092 p:65558 . 65557 contig: 1
 b:  32765 v:   753093 p:65557 . 65556 contig: 1
 b:  32766 v:   753094 p:65556 . 65555 contig: 1
 b:  32767 v:   753095 p:65555 . 65554 contig: 1
 
 p: is the PFN of the page, v: is the page offset within an anonymous
 mapping, and b: is the number of non-contiguous blocks in the anonymous
 mapping. With the patch applied, it looks more like:
 
 b:   1232 v:   752964 p:58944  87328 contig: 15
 b:   1233 v:   752980 p:87328  91200 contig: 15
 b:   1234 v:   752996 p:91200  40272 contig: 15
 b:   1235 v:   753012 p:40272  85664 contig: 15
 b:   1236 v:   753028 p:85664  87312 contig: 15
 
 so mappings are using contiguous pages again. This was the final test
 program I used, in case it's of any interest.
 
 Thanks
 
 /*
  * showcontiguous.c
  *
  * Use the /proc/pid/pagemap interface to give an indication of how contiguous
  * physical memory is in an anonymous virtual memory mapping
  */

Matt,
Did you ever make your python pagemap scripts available?
If not, would you?

---
~Randy


Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?

2007-12-17 Thread Matt Mackall
On Mon, Dec 17, 2007 at 11:24:57AM -0800, Randy Dunlap wrote:
 On Sun, 16 Dec 2007 21:55:20 + Mel Gorman wrote:
 
Just using cp to read the file is enough to cause problems but I 
included
a very basic program below that produces the BUG_ON checks. Is this a 
known
issue or am I using the interface incorrectly?
   
   I'd say you're using it correctly but you've found a hitherto unknown 
   bug. 
   On i386 highmem machines with CONFIG_HIGHPTE (at least) pte_offset_map()
   takes kmap_atomic(), so pagemap_pte_range() can't do copy_to_user() as it
   presently does.
   
   Drat.
   
   Still, that shouldn't really disrupt the testing which you're doing.  You
   could disable CONFIG_HIGHPTE to shut it up.
   
  
  Yes, that did the trick. Using pagemap, it was trivial to show that the
  2.6.24-rc5-mm1 kernel was placing pages in reverse physical order, as
  the following output shows:
  
  b:  32763 v:   753091 p:65559 . 65558 contig: 1
  b:  32764 v:   753092 p:65558 . 65557 contig: 1
  b:  32765 v:   753093 p:65557 . 65556 contig: 1
  b:  32766 v:   753094 p:65556 . 65555 contig: 1
  b:  32767 v:   753095 p:65555 . 65554 contig: 1
  
  p: is the PFN of the page, v: is the page offset within an anonymous
  mapping, and b: is the number of non-contiguous blocks in the anonymous
  mapping. With the patch applied, it looks more like:
  
  b:   1232 v:   752964 p:58944  87328 contig: 15
  b:   1233 v:   752980 p:87328  91200 contig: 15
  b:   1234 v:   752996 p:91200  40272 contig: 15
  b:   1235 v:   753012 p:40272  85664 contig: 15
  b:   1236 v:   753028 p:85664  87312 contig: 15
  
  so mappings are using contiguous pages again. This was the final test
  program I used, in case it's of any interest.
  
  Thanks
  
  /*
   * showcontiguous.c
   *
   * Use the /proc/pid/pagemap interface to give an indication of how 
  contiguous
   * physical memory is in an anonymous virtual memory mapping
   */
 
 Matt,
 Did you ever make your python pagemap scripts available?
 If not, would you?

There's a collection of them at http://selenic.com/repo/pagemap.
They're largely proof of concept, and I'm not sure I finished adapting
them all to the final 64-bit interface.

As it happens, I actually spotted the above regression immediately by
doing a simple hexdump on my very first test of the interface - lots
of pfns counting backwards. I mentioned it a few times to various people
on the cc: list and on lkml but never got around to tracking it down
myself..

-- 
Mathematics is the supreme nostalgia of our time.


Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?

2007-12-16 Thread Mel Gorman
  Just using cp to read the file is enough to cause problems but I included
  a very basic program below that produces the BUG_ON checks. Is this a known
  issue or am I using the interface incorrectly?
 
 I'd say you're using it correctly but you've found a hitherto unknown bug. 
 On i386 highmem machines with CONFIG_HIGHPTE (at least) pte_offset_map()
 takes kmap_atomic(), so pagemap_pte_range() can't do copy_to_user() as it
 presently does.
 
 Drat.
 
 Still, that shouldn't really disrupt the testing which you're doing.  You
 could disable CONFIG_HIGHPTE to shut it up.
 

Yes, that did the trick. Using pagemap, it was trivial to show that the
2.6.24-rc5-mm1 kernel was placing pages in reverse physical order, as
the following output shows:

b:  32763 v:   753091 p:65559 . 65558 contig: 1
b:  32764 v:   753092 p:65558 . 65557 contig: 1
b:  32765 v:   753093 p:65557 . 65556 contig: 1
b:  32766 v:   753094 p:65556 . 65555 contig: 1
b:  32767 v:   753095 p:65555 . 65554 contig: 1

p: is the PFN of the page, v: is the page offset within an anonymous
mapping, and b: is the number of non-contiguous blocks in the anonymous
mapping. With the patch applied, it looks more like:

b:   1232 v:   752964 p:58944  87328 contig: 15
b:   1233 v:   752980 p:87328  91200 contig: 15
b:   1234 v:   752996 p:91200  40272 contig: 15
b:   1235 v:   753012 p:40272  85664 contig: 15
b:   1236 v:   753028 p:85664  87312 contig: 15

so mappings are using contiguous pages again. This was the final test
program I used, in case it's of any interest.

Thanks

/*
 * showcontiguous.c
 *
 * Use the /proc/pid/pagemap interface to give an indication of how contiguous
 * physical memory is in an anonymous virtual memory mapping
 */
#include <stdio.h>
#include <sys/mman.h>
#include <stdlib.h>
#include <unistd.h>
#include <linux/types.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

#define MAPSIZE (128*1048576)
#define PM_ENTRY_BYTES sizeof(__u64)

int main(int argc, char **argv)
{
	int pagemap_fd;
	unsigned long *anonmapping;
	__u64 pagemap_entry = 0ULL;

	unsigned long vpfn, ppfn, ppfn_last;
	int block_number = 0;
	int contig_count = 1;
	size_t mmap_offset;
	int pagesize = getpagesize();

	if (sizeof(pagemap_entry) < PM_ENTRY_BYTES) {
		printf("ERROR: Failed assumption on size of pagemap_entry\n");
		exit(EXIT_FAILURE);
	}

	/* Open the pagemap interface */
	pagemap_fd = open("/proc/self/pagemap", O_RDONLY);
	if (pagemap_fd == -1) {
		perror("open");
		exit(EXIT_FAILURE);
	}

	/* Create an anonymous mapping */
	anonmapping = mmap(NULL, MAPSIZE,
			PROT_READ|PROT_WRITE,
			MAP_PRIVATE|MAP_ANONYMOUS|MAP_POPULATE,
			-1, 0);
	if (anonmapping == MAP_FAILED) {
		perror("mmap");
		exit(1);
	}

	/* Work out the VPN the mapping is at and seek to it in pagemap */
	vpfn = ((unsigned long)anonmapping) / pagesize;
	mmap_offset = lseek(pagemap_fd, vpfn * PM_ENTRY_BYTES, SEEK_SET);
	if (mmap_offset == -1) {
		perror("lseek");
		exit(EXIT_FAILURE);
	}
	ppfn_last = 0;

	/* Read the PFN of each page in the mapping */
	for (mmap_offset = 0; mmap_offset < MAPSIZE; mmap_offset += pagesize) {
		vpfn = ((unsigned long)anonmapping + mmap_offset) / pagesize;

		if (read(pagemap_fd, &pagemap_entry, PM_ENTRY_BYTES) == 0) {
			perror("read");
			exit(EXIT_FAILURE);
		}

		/* In this version of the interface, the entry is the PFN */
		ppfn = (unsigned long)pagemap_entry;
		if (ppfn == ppfn_last + 1) {
			printf(".");
			contig_count++;
		} else {
			printf(" %lu contig: %d\nb: %6d v: %8lu p: %8lu .",
				ppfn, contig_count,
				block_number, vpfn, ppfn);
			contig_count = 1;
			block_number++;
		}
		ppfn_last = ppfn;
	}
	printf(" %lu contig: %d\n", ppfn, contig_count);

	close(pagemap_fd);
	munmap(anonmapping, MAPSIZE);
	exit(EXIT_SUCCESS);
}
-- 
Mel Gorman
Part-time Phd Student  Linux Technology Center
University of Limerick IBM Dublin Software Lab


Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?

2007-12-14 Thread Andrew Morton
On Sat, 15 Dec 2007 01:09:41 + Mel Gorman [EMAIL PROTECTED] wrote:

 On (13/12/07 14:29), Andrew Morton didst pronounce:
   The simple way seems to be to malloc a large area, touch every page and
   then look at the physical pages assigned ... they now mostly seem to be
   descending in physical address.
   
  
  OIC.  -mm's /proc/pid/pagemap can be used to get the pfn's...
  
 
 I tried using pagemap to verify the patch but it triggered BUG_ON
 checks. Perhaps I am using the interface wrong but I would still not
 expect it to break in this fashion. I tried 2.6.24-rc4-mm1, 2.6.24-rc5-mm1,
 2.6.24-rc5 with just the maps4 patches applied and 2.6.23 with maps4 patches
 applied. Each time I get errors like this;
 
 [   90.108315] BUG: sleeping function called from invalid context at 
 include/asm/uaccess_32.h:457
 [   90.211227] in_atomic():1, irqs_disabled():0
 [   90.262251] no locks held by showcontiguous/2814.
 [   90.318475] Pid: 2814, comm: showcontiguous Not tainted 2.6.24-rc5 #1
 [   90.395344]  [<c010522a>] show_trace_log_lvl+0x1a/0x30
 [   90.456948]  [<c0105bb2>] show_trace+0x12/0x20
 [   90.510173]  [<c0105eee>] dump_stack+0x6e/0x80
 [   90.563409]  [<c01205b3>] __might_sleep+0xc3/0xe0
 [   90.619765]  [<c02264fd>] copy_to_user+0x3d/0x60
 [   90.675153]  [<c01b3e9c>] add_to_pagemap+0x5c/0x80
 [   90.732513]  [<c01b43e8>] pagemap_pte_range+0x68/0xb0
 [   90.793010]  [<c0175ed2>] walk_page_range+0x112/0x210
 [   90.853482]  [<c01b47c6>] pagemap_read+0x176/0x220
 [   90.910863]  [<c0182dc4>] vfs_read+0x94/0x150
 [   90.963058]  [<c01832fd>] sys_read+0x3d/0x70
 [   91.014219]  [<c0104262>] syscall_call+0x7/0xb
 
 ...

 Just using cp to read the file is enough to cause problems but I included
 a very basic program below that produces the BUG_ON checks. Is this a known
 issue or am I using the interface incorrectly?

I'd say you're using it correctly but you've found a hitherto unknown bug. 
On i386 highmem machines with CONFIG_HIGHPTE (at least) pte_offset_map()
takes kmap_atomic(), so pagemap_pte_range() can't do copy_to_user() as it
presently does.

Drat.

Still, that shouldn't really disrupt the testing which you're doing.  You
could disable CONFIG_HIGHPTE to shut it up.
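
To make the failure mode concrete, here is a minimal sketch of the
pattern in question, using 2.6.24-era kernel APIs; the surrounding page
walker is elided and the variable names are illustrative, not the actual
fs/proc code:

	pte_t *pte = pte_offset_map(pmd, addr);	/* implicit kmap_atomic() */
	u64 pfn = pte_pfn(*pte);
	int ret;

	/* BUG: copy_to_user() may fault and sleep, but kmap_atomic()
	 * has disabled preemption, hence the __might_sleep() splat. */
	ret = copy_to_user(out, &pfn, sizeof(pfn));

	pte_unmap(pte);				/* implicit kunmap_atomic() */

The usual way out is to collect entries into a kernel-side buffer while
the PTE page is mapped and only call copy_to_user() after pte_unmap().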


Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?

2007-12-14 Thread Matt Mackall
On Fri, Dec 14, 2007 at 06:02:06PM -0800, Andrew Morton wrote:
 On Sat, 15 Dec 2007 01:09:41 + Mel Gorman [EMAIL PROTECTED] wrote:
 
  On (13/12/07 14:29), Andrew Morton didst pronounce:
The simple way seems to be to malloc a large area, touch every page and
then look at the physical pages assigned ... they now mostly seem to be
descending in physical address.

   
   OIC.  -mm's /proc/pid/pagemap can be used to get the pfn's...
   
  
  I tried using pagemap to verify the patch but it triggered BUG_ON
  checks. Perhaps I am using the interface wrong but I would still not
  expect it to break in this fashion. I tried 2.6.24-rc4-mm1, 2.6.24-rc5-mm1,
  2.6.24-rc5 with just the maps4 patches applied and 2.6.23 with maps4 patches
  applied. Each time I get errors like this;
  
  [   90.108315] BUG: sleeping function called from invalid context at 
  include/asm/uaccess_32.h:457
  [   90.211227] in_atomic():1, irqs_disabled():0
  [   90.262251] no locks held by showcontiguous/2814.
  [   90.318475] Pid: 2814, comm: showcontiguous Not tainted 2.6.24-rc5 #1
  [   90.395344]  [<c010522a>] show_trace_log_lvl+0x1a/0x30
  [   90.456948]  [<c0105bb2>] show_trace+0x12/0x20
  [   90.510173]  [<c0105eee>] dump_stack+0x6e/0x80
  [   90.563409]  [<c01205b3>] __might_sleep+0xc3/0xe0
  [   90.619765]  [<c02264fd>] copy_to_user+0x3d/0x60
  [   90.675153]  [<c01b3e9c>] add_to_pagemap+0x5c/0x80
  [   90.732513]  [<c01b43e8>] pagemap_pte_range+0x68/0xb0
  [   90.793010]  [<c0175ed2>] walk_page_range+0x112/0x210
  [   90.853482]  [<c01b47c6>] pagemap_read+0x176/0x220
  [   90.910863]  [<c0182dc4>] vfs_read+0x94/0x150
  [   90.963058]  [<c01832fd>] sys_read+0x3d/0x70
  [   91.014219]  [<c0104262>] syscall_call+0x7/0xb
  
  ...
 
  Just using cp to read the file is enough to cause problems but I included
  a very basic program below that produces the BUG_ON checks. Is this a known
  issue or am I using the interface incorrectly?
 
 I'd say you're using it correctly but you've found a hitherto unknown bug. 
 On i386 highmem machines with CONFIG_HIGHPTE (at least) pte_offset_map()
 takes kmap_atomic(), so pagemap_pte_range() can't do copy_to_user() as it
 presently does.

Damn, I coulda sworn I fixed that.

-- 
Mathematics is the supreme nostalgia of our time.


Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?

2007-12-13 Thread Matthew Wilcox
On Thu, Dec 13, 2007 at 01:37:59PM -0500, Mark Lord wrote:
 The problem is, the block layer *never* sends an SG entry larger than 8192 
 bytes,
 and even that size is exceptionally rare.  Nearly all I/O segments are 4096 
 bytes,
 so I never see a single I/O larger than 512KB (128 * 4096).
 
 If I patch various parts of block and SCSI, this limit doesn't budge,
 but when I change the hardware PRD limit in libata, it scales by exactly
 whatever I set the new value to.  This tells me that adjacent I/O segments
 are not being combined.
 
 I thought that QUEUE_FLAG_CLUSTER (aka. SCSI host .use_clustering=1) should
 result in adjacent single pages being combined into larger physical 
 segments?

I was recently debugging a driver and noticed that consecutive pages in
an sg list are in the reverse order.  ie first you get page 918, then
917, 916, 915, 914, etc.  I vaguely remember James having patches to
correct this, but maybe they weren't merged?

-- 
Intel are signing my paycheques ... these opinions are still mine
Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step.


Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?

2007-12-13 Thread James Bottomley

On Thu, 2007-12-13 at 11:42 -0700, Matthew Wilcox wrote:
 On Thu, Dec 13, 2007 at 01:37:59PM -0500, Mark Lord wrote:
  The problem is, the block layer *never* sends an SG entry larger than 8192 
  bytes,
  and even that size is exceptionally rare.  Nearly all I/O segments are 4096 
  bytes,
  so I never see a single I/O larger than 512KB (128 * 4096).
  
  If I patch various parts of block and SCSI, this limit doesn't budge,
  but when I change the hardware PRD limit in libata, it scales by exactly
  whatever I set the new value to.  This tells me that adjacent I/O segments
  are not being combined.
  
  I thought that QUEUE_FLAG_CLUSTER (aka. SCSI host .use_clustering=1) should
  result in adjacent single pages being combined into larger physical 
  segments?
 
 I was recently debugging a driver and noticed that consecutive pages in
 an sg list are in the reverse order.  ie first you get page 918, then
 917, 916, 915, 914, etc.  I vaguely remember James having patches to
 correct this, but maybe they weren't merged?

Yes, they were ... it was actually Bill Irwin's patch.  The old problem
was that we fault allocations in reverse order (because we were taking
from the end of the zone list).  I can't remember when his patches went
in, but it was several years ago.  After they did, I was getting a 33%
chance of physical merging (as opposed to zero before).  Probably
someone redid the vm or the zones without understanding this and we've
gone back to the original position.

James




Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?

2007-12-13 Thread Mark Lord

Mark Lord wrote:

(resending with corrected email address for Jens)

Jens,

I'm experimenting here with trying to generate large I/O through libata,
and not having much luck.

The limit seems to be the number of hardware PRD (SG) entries permitted
by the driver (libata:ata_piix), which is 128 by default.

The problem is, the block layer *never* sends an SG entry larger than 
8192 bytes,
and even that size is exceptionally rare.  Nearly all I/O segments are 
4096 bytes,

so I never see a single I/O larger than 512KB (128 * 4096).

If I patch various parts of block and SCSI, this limit doesn't budge,
but when I change the hardware PRD limit in libata, it scales by exactly
whatever I set the new value to.  This tells me that adjacent I/O segments
are not being combined.

I thought that QUEUE_FLAG_CLUSTER (aka. SCSI host .use_clustering=1) should
result in adjacent single pages being combined into larger physical 
segments?


This is x86-32 with latest 2.6.24-rc*.
I'll re-test on older kernels next.

...

Problem confirmed.  2.6.23.8 regularly generates segments up to 64KB for libata,
but 2.6.24 uses only 4KB segments and a *few* 8KB segments.

???


Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?

2007-12-13 Thread Matthew Wilcox
On Thu, Dec 13, 2007 at 01:48:18PM -0500, Mark Lord wrote:
 Problem confirmed.  2.6.23.8 regularly generates segments up to 64KB for 
 libata,
 but 2.6.24 uses only 4KB segments and a *few* 8KB segments.

Just a suspicion ... could this be slab vs slub?  ie check your configs
are the same / similar between the two kernels.

-- 
Intel are signing my paycheques ... these opinions are still mine
Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step.


Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?

2007-12-13 Thread Mark Lord

Matthew Wilcox wrote:

On Thu, Dec 13, 2007 at 01:48:18PM -0500, Mark Lord wrote:
Problem confirmed.  2.6.23.8 regularly generates segments up to 64KB for 
libata,

but 2.6.24 uses only 4KB segments and a *few* 8KB segments.


Just a suspicion ... could this be slab vs slub?  ie check your configs
are the same / similar between the two kernels.

..

Mmmm.. a good thought, that one.
But I just rechecked, and both have CONFIG_SLAB=y

My guess is that something got changed around when Jens
reworked the block layer for 2.6.24.
I'm going to dig around in there now.



Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?

2007-12-13 Thread Jens Axboe
On Thu, Dec 13 2007, Mark Lord wrote:
 Matthew Wilcox wrote:
 On Thu, Dec 13, 2007 at 01:48:18PM -0500, Mark Lord wrote:
 Problem confirmed.  2.6.23.8 regularly generates segments up to 64KB for 
 libata,
 but 2.6.24 uses only 4KB segments and a *few* 8KB segments.
 
 Just a suspicion ... could this be slab vs slub?  ie check your configs
 are the same / similar between the two kernels.
 ..
 
 Mmmm.. a good thought, that one.
 But I just rechecked, and both have CONFIG_SLAB=y
 
 My guess is that something got changed around when Jens
 reworked the block layer for 2.6.24.
 I'm going to dig around in there now.

I didn't rework the block layer for 2.6.24 :-). The core block layer
changes since 2.6.23 are:

- Support for empty barriers. Not a likely candidate.
- Shared tag queue fixes. Totally unlikely.
- sg chaining support. Not likely.
- The bio changes from Neil. Of the bunch, the most likely suspects in
  this area, since it changes some of the code involved with merges and
  blk_rq_map_sg().
- Lots of simple stuff, again very unlikely.

Anyway, it sounds odd for this to be a block layer problem if you do see
occasional segments being merged. So it sounds more like the input data
having changed.

Why not just bisect it?

-- 
Jens Axboe



Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?

2007-12-13 Thread Mark Lord

Jens Axboe wrote:

On Thu, Dec 13 2007, Mark Lord wrote:

Matthew Wilcox wrote:

On Thu, Dec 13, 2007 at 01:48:18PM -0500, Mark Lord wrote:
Problem confirmed.  2.6.23.8 regularly generates segments up to 64KB for 
libata,

but 2.6.24 uses only 4KB segments and a *few* 8KB segments.

Just a suspicion ... could this be slab vs slub?  ie check your configs
are the same / similar between the two kernels.

..

Mmmm.. a good thought, that one.
But I just rechecked, and both have CONFIG_SLAB=y

My guess is that something got changed around when Jens
reworked the block layer for 2.6.24.
I'm going to dig around in there now.


I didn't rework the block layer for 2.6.24 :-). The core block layer
changes since 2.6.23 are:

- Support for empty barriers. Not a likely candidate.
- Shared tag queue fixes. Totally unlikely.
- sg chaining support. Not likely.
- The bio changes from Neil. Of the bunch, the most likely suspects in
  this area, since it changes some of the code involved with merges and
  blk_rq_map_sg().
- Lots of simple stuff, again very unlikely.

Anyway, it sounds odd for this to be a block layer problem if you do see
occasional segments being merged. So it sounds more like the input data
having changed.

Why not just bisect it?

..

Because the early 2.6.24 series failed to boot on this machine
due to bugs in the block layer -- so the code that caused this regression
is probably in the stuff from before the kernels became usable here.

Cheers



Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?

2007-12-13 Thread Mark Lord

Mark Lord wrote:

Jens Axboe wrote:

On Thu, Dec 13 2007, Mark Lord wrote:

Matthew Wilcox wrote:

On Thu, Dec 13, 2007 at 01:48:18PM -0500, Mark Lord wrote:
Problem confirmed.  2.6.23.8 regularly generates segments up to 
64KB for libata,

but 2.6.24 uses only 4KB segments and a *few* 8KB segments.

Just a suspicion ... could this be slab vs slub?  ie check your configs
are the same / similar between the two kernels.

..

Mmmm.. a good thought, that one.
But I just rechecked, and both have CONFIG_SLAB=y

My guess is that something got changed around when Jens
reworked the block layer for 2.6.24.
I'm going to dig around in there now.


I didn't rework the block layer for 2.6.24 :-). The core block layer
changes since 2.6.23 are:

- Support for empty barriers. Not a likely candidate.
- Shared tag queue fixes. Totally unlikely.
- sg chaining support. Not likely.
- The bio changes from Neil. Of the bunch, the most likely suspects in
  this area, since it changes some of the code involved with merges and
  blk_rq_map_sg().
- Lots of simple stuff, again very unlikely.

Anyway, it sounds odd for this to be a block layer problem if you do see
occasional segments being merged. So it sounds more like the input data
having changed.

Why not just bisect it?

..

Because the early 2.6.24 series failed to boot on this machine
due to bugs in the block layer -- so the code that caused this regression
is probably in the stuff from before the kernels became usable here.

..

That sounds more harsh than intended -- the earlier 2.6.24 kernels (up to
the first couple of -rc* ones) failed here because of incompatibilities
between the block/bio changes and libata.

That's better, I think! 


Cheers


Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?

2007-12-13 Thread Jens Axboe
On Thu, Dec 13 2007, Mark Lord wrote:
 Mark Lord wrote:
 Jens Axboe wrote:
 On Thu, Dec 13 2007, Mark Lord wrote:
 Matthew Wilcox wrote:
 On Thu, Dec 13, 2007 at 01:48:18PM -0500, Mark Lord wrote:
 Problem confirmed.  2.6.23.8 regularly generates segments up to 
 64KB for libata,
 but 2.6.24 uses only 4KB segments and a *few* 8KB segments.
 Just a suspicion ... could this be slab vs slub?  ie check your configs
 are the same / similar between the two kernels.
 ..
 
 Mmmm.. a good thought, that one.
 But I just rechecked, and both have CONFIG_SLAB=y
 
 My guess is that something got changed around when Jens
 reworked the block layer for 2.6.24.
 I'm going to dig around in there now.
 
 I didn't rework the block layer for 2.6.24 :-). The core block layer
 changes since 2.6.23 are:
 
 - Support for empty barriers. Not a likely candidate.
 - Shared tag queue fixes. Totally unlikely.
 - sg chaining support. Not likely.
 - The bio changes from Neil. Of the bunch, the most likely suspects in
   this area, since it changes some of the code involved with merges and
   blk_rq_map_sg().
 - Lots of simple stuff, again very unlikely.
 
 Anyway, it sounds odd for this to be a block layer problem if you do see
 occasional segments being merged. So it sounds more like the input data
 having changed.
 
 Why not just bisect it?
 ..
 
 Because the early 2.6.24 series failed to boot on this machine
 due to bugs in the block layer -- so the code that caused this regression
 is probably in the stuff from before the kernels became usable here.
 ..
 
 That sounds more harsh than intended -- the earlier 2.6.24 kernels (up to
 the first couple of -rc* ones) failed here because of incompatibilities
 between the block/bio changes and libata.
 
 That's better, I think! 

No worries, I didn't pick it up as harsh just as an odd conclusion :-)

If I were you, I'd just start from the first -rc that booted for you. If
THAT has the bug, then we'll think of something else. If you don't get
anywhere, I can run some tests tomorrow and see if I can reproduce it
here.

-- 
Jens Axboe



Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?

2007-12-13 Thread Mark Lord

Jens Axboe wrote:

On Thu, Dec 13 2007, Mark Lord wrote:

Jens Axboe wrote:

On Thu, Dec 13 2007, Mark Lord wrote:

Mark Lord wrote:

Jens Axboe wrote:

On Thu, Dec 13 2007, Mark Lord wrote:

Matthew Wilcox wrote:

On Thu, Dec 13, 2007 at 01:48:18PM -0500, Mark Lord wrote:
Problem confirmed.  2.6.23.8 regularly generates segments up to 
64KB for libata,

but 2.6.24 uses only 4KB segments and a *few* 8KB segments.
Just a suspicion ... could this be slab vs slub?  ie check your 
configs

are the same / similar between the two kernels.

..

Mmmm.. a good thought, that one.
But I just rechecked, and both have CONFIG_SLAB=y

My guess is that something got changed around when Jens
reworked the block layer for 2.6.24.
I'm going to dig around in there now.

I didn't rework the block layer for 2.6.24 :-). The core block layer
changes since 2.6.23 are:

- Support for empty barriers. Not a likely candidate.
- Shared tag queue fixes. Totally unlikely.
- sg chaining support. Not likely.
- The bio changes from Neil. Of the bunch, the most likely suspects in
this area, since it changes some of the code involved with merges and
blk_rq_map_sg().
- Lots of simple stuff, again very unlikely.

Anyway, it sounds odd for this to be a block layer problem if you do see
occasional segments being merged. So it sounds more like the input data
having changed.

Why not just bisect it?

..

Because the early 2.6.24 series failed to boot on this machine
due to bugs in the block layer -- so the code that caused this regression
is probably in the stuff from before the kernels became usable here.

..

That sounds more harsh than intended -- the earlier 2.6.24 kernels (up to
the first couple of -rc* ones) failed here because of incompatibilities
between the block/bio changes and libata.

That's better, I think! 

No worries, I didn't pick it up as harsh just as an odd conclusion :-)

If I were you, I'd just start from the first -rc that booted for you. If
THAT has the bug, then we'll think of something else. If you don't get
anywhere, I can run some tests tomorrow and see if I can reproduce it
here.

..

I believe that *anyone* can reproduce it, since it's broken long before
the requests ever get to SCSI or libata.  Which also means that *anyone*
who wants to can bisect it, as well.

I don't do bisects.


It was just a suggestion on how to narrow it down, do as you see fit.


But I will dig a bit more and see if I can find the culprit.


Sure, I'll dig around as well.

..

I wonder if it's 9dfa52831e96194b8649613e3131baa2c109f7dc:
Merge blk_recount_segments into blk_recalc_rq_segments ?

That particular commit does some rather innocent code-shuffling,
but also introduces a couple of new if (nr_hw_segs == 1) conditions
that were not there before.

Okay git experts: how do I pull out a kernel at the point of this exact commit?

Thanks!






Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?

2007-12-13 Thread Jens Axboe
On Thu, Dec 13 2007, Jens Axboe wrote:
 On Thu, Dec 13 2007, Mark Lord wrote:
  Jens Axboe wrote:
  On Thu, Dec 13 2007, Mark Lord wrote:
  Mark Lord wrote:
  Jens Axboe wrote:
  On Thu, Dec 13 2007, Mark Lord wrote:
  Matthew Wilcox wrote:
  On Thu, Dec 13, 2007 at 01:48:18PM -0500, Mark Lord wrote:
  Problem confirmed.  2.6.23.8 regularly generates segments up to 
  64KB for libata,
  but 2.6.24 uses only 4KB segments and a *few* 8KB segments.
  Just a suspicion ... could this be slab vs slub?  ie check your 
  configs
  are the same / similar between the two kernels.
  ..
  
  Mmmm.. a good thought, that one.
  But I just rechecked, and both have CONFIG_SLAB=y
  
  My guess is that something got changed around when Jens
  reworked the block layer for 2.6.24.
  I'm going to dig around in there now.
  I didn't rework the block layer for 2.6.24 :-). The core block layer
  changes since 2.6.23 are:
  
  - Support for empty barriers. Not a likely candidate.
  - Shared tag queue fixes. Totally unlikely.
  - sg chaining support. Not likely.
  - The bio changes from Neil. Of the bunch, the most likely suspects in
   this area, since it changes some of the code involved with merges and
   blk_rq_map_sg().
  - Lots of simple stuff, again very unlikely.
  
  Anyway, it sounds odd for this to be a block layer problem if you do see
  occasional segments being merged. So it sounds more like the input data
  having changed.
  
  Why not just bisect it?
  ..
  
  Because the early 2.6.24 series failed to boot on this machine
  due to bugs in the block layer -- so the code that caused this regression
  is probably in the stuff from before the kernels became usable here.
  ..
  
  That sounds more harsh than intended -- the earlier 2.6.24 kernels (up to
  the first couple of -rc* ones) failed here because of incompatibilities
  between the block/bio changes and libata.
  
  That's better, I think! 
  
  No worries, I didn't pick it up as harsh just as an odd conclusion :-)
  
  If I were you, I'd just start from the first -rc that booted for you. If
  THAT has the bug, then we'll think of something else. If you don't get
  anywhere, I can run some tests tomorrow and see if I can reproduce it
  here.
  ..
  
  I believe that *anyone* can reproduce it, since it's broken long before
  the requests ever get to SCSI or libata.  Which also means that *anyone*
  who wants to can bisect it, as well.
  
  I don't do bisects.
 
 It was just a suggestion on how to narrow it down, do as you see fit.
 
  But I will dig a bit more and see if I can find the culprit.
 
 Sure, I'll dig around as well.

Just tried something simple. I only see one 12kb segment so far, so not
a lot by any stretch. I also DON'T see any missed-merge signs, so it
would appear that the pages in the request are simply not contiguous
physically.

diff --git a/block/ll_rw_blk.c b/block/ll_rw_blk.c
index e30b1a4..1e34b6f 100644
--- a/block/ll_rw_blk.c
+++ b/block/ll_rw_blk.c
@@ -1330,6 +1330,8 @@ int blk_rq_map_sg(struct request_queue *q, struct request 
*rq,
goto new_segment;
 
sg->length += nbytes;
+   if (sg->length > 8192)
+   printk("sg_len=%d\n", sg->length);
} else {
 new_segment:
if (!sg)
@@ -1349,6 +1351,8 @@ new_segment:
sg = sg_next(sg);
}
 
+   if (bvprv && (page_address(bvprv->bv_page) + 
bvprv->bv_len == page_address(bvec->bv_page)))
+   printk("missed merge\n");
sg_set_page(sg, bvec->bv_page, nbytes, bvec->bv_offset);
nsegs++;
}

-- 
Jens Axboe



Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?

2007-12-13 Thread Jens Axboe
On Thu, Dec 13 2007, Mark Lord wrote:
 Jens Axboe wrote:
 On Thu, Dec 13 2007, Mark Lord wrote:
 Jens Axboe wrote:
 On Thu, Dec 13 2007, Mark Lord wrote:
 Mark Lord wrote:
 Jens Axboe wrote:
 On Thu, Dec 13 2007, Mark Lord wrote:
 Matthew Wilcox wrote:
 On Thu, Dec 13, 2007 at 01:48:18PM -0500, Mark Lord wrote:
 Problem confirmed.  2.6.23.8 regularly generates segments up to 
 64KB for libata,
 but 2.6.24 uses only 4KB segments and a *few* 8KB segments.
 Just a suspicion ... could this be slab vs slub?  ie check your 
 configs
 are the same / similar between the two kernels.
 ..
 
 Mmmm.. a good thought, that one.
 But I just rechecked, and both have CONFIG_SLAB=y
 
 My guess is that something got changed around when Jens
 reworked the block layer for 2.6.24.
 I'm going to dig around in there now.
 I didn't rework the block layer for 2.6.24 :-). The core block layer
 changes since 2.6.23 are:
 
 - Support for empty barriers. Not a likely candidate.
 - Shared tag queue fixes. Totally unlikely.
 - sg chaining support. Not likely.
 - The bio changes from Neil. Of the bunch, the most likely suspects in
 this area, since it changes some of the code involved with merges and
 blk_rq_map_sg().
 - Lots of simple stuff, again very unlikely.
 
 Anyway, it sounds odd for this to be a block layer problem if you do 
 see
 occasional segments being merged. So it sounds more like the input 
 data
 having changed.
 
 Why not just bisect it?
 ..
 
 Because the early 2.6.24 series failed to boot on this machine
 due to bugs in the block layer -- so the code that caused this 
 regression
 is probably in the stuff from before the kernels became usable here.
 ..
 
 That sounds more harsh than intended -- the earlier 2.6.24 kernels (up to
 the first couple of -rc* ones) failed here because of incompatibilities
 between the block/bio changes and libata.
 
 That's better, I think! 
 No worries, I didn't pick it up as harsh just as an odd conclusion :-)
 
 If I were you, I'd just start from the first -rc that booted for you. If
 THAT has the bug, then we'll think of something else. If you don't get
 anywhere, I can run some tests tomorrow and see if I can reproduce it
 here.
 ..
 
 I believe that *anyone* can reproduce it, since it's broken long before
 the requests ever get to SCSI or libata.  Which also means that *anyone*
 who wants to can bisect it, as well.
 
 I don't do bisects.
 
 It was just a suggestion on how to narrow it down, do as you see fit.
 
 But I will dig a bit more and see if I can find the culprit.
 
 Sure, I'll dig around as well.
 ..
 
 I wonder if it's 9dfa52831e96194b8649613e3131baa2c109f7dc:
 Merge blk_recount_segments into blk_recalc_rq_segments ?
 
 That particular commit does some rather innocent code-shuffling,
 but also introduces a couple of new if (nr_hw_segs == 1) conditions
 that were not there before.

You can try and revert it of course, but I think you are looking at the
wrong bits. If the segment counts were totally off, you'd never be
anywhere close to reaching the set limit. Your problem seems to be
missed contig segment merges.

 Okay git experts: how do I pull out a kernel at the point of this exact
 commit?

Dummy approach - git log and grep for
9dfa52831e96194b8649613e3131baa2c109f7dc, then see what commit is before
that. Then do a git checkout <commit>.

-- 
Jens Axboe



Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?

2007-12-13 Thread Mark Lord

Jens Axboe wrote:

On Thu, Dec 13 2007, Jens Axboe wrote:

On Thu, Dec 13 2007, Mark Lord wrote:

Jens Axboe wrote:

On Thu, Dec 13 2007, Mark Lord wrote:

Mark Lord wrote:

Jens Axboe wrote:

On Thu, Dec 13 2007, Mark Lord wrote:

Matthew Wilcox wrote:

On Thu, Dec 13, 2007 at 01:48:18PM -0500, Mark Lord wrote:
Problem confirmed.  2.6.23.8 regularly generates segments up to 
64KB for libata,

but 2.6.24 uses only 4KB segments and a *few* 8KB segments.
Just a suspicion ... could this be slab vs slub?  ie check your 
configs

are the same / similar between the two kernels.

..

Mmmm.. a good thought, that one.
But I just rechecked, and both have CONFIG_SLAB=y

My guess is that something got changed around when Jens
reworked the block layer for 2.6.24.
I'm going to dig around in there now.

I didn't rework the block layer for 2.6.24 :-). The core block layer
changes since 2.6.23 are:

- Support for empty barriers. Not a likely candidate.
- Shared tag queue fixes. Totally unlikely.
- sg chaining support. Not likely.
- The bio changes from Neil. Of the bunch, the most likely suspects in
this area, since it changes some of the code involved with merges and
blk_rq_map_sg().
- Lots of simple stuff, again very unlikely.

Anyway, it sounds odd for this to be a block layer problem if you do see
occasional segments being merged. So it sounds more like the input data
having changed.

Why not just bisect it?

..

Because the early 2.6.24 series failed to boot on this machine
due to bugs in the block layer -- so the code that caused this regression
is probably in the stuff from before the kernels became usable here.

..

That sounds more harsh than intended -- the earlier 2.6.24 kernels (up to
the first couple of -rc* ones) failed here because of incompatibilities
between the block/bio changes and libata.

That's better, I think! 

No worries, I didn't pick it up as harsh just as an odd conclusion :-)

If I were you, I'd just start from the first -rc that booted for you. If
THAT has the bug, then we'll think of something else. If you don't get
anywhere, I can run some tests tomorrow and see if I can reproduce it
here.

..

I believe that *anyone* can reproduce it, since it's broken long before
the requests ever get to SCSI or libata.  Which also means that *anyone*
who wants to can bisect it, as well.

I don't do bisects.

It was just a suggestion on how to narrow it down, do as you see fit.


But I will dig a bit more and see if I can find the culprit.

Sure, I'll dig around as well.


Just tried something simple. I only see one 12kb segment so far, so not
a lot by any stretch. I also DON'T see any missed-merge signs, so it
would appear that the pages in the request are simply not contiguous
physically.

diff --git a/block/ll_rw_blk.c b/block/ll_rw_blk.c
index e30b1a4..1e34b6f 100644
--- a/block/ll_rw_blk.c
+++ b/block/ll_rw_blk.c
@@ -1330,6 +1330,8 @@ int blk_rq_map_sg(struct request_queue *q, struct request 
*rq,
goto new_segment;
 
			sg->length += nbytes;

+   if (sg->length > 8192)
+   printk("sg_len=%d\n", sg->length);
} else {
 new_segment:
if (!sg)
@@ -1349,6 +1351,8 @@ new_segment:
sg = sg_next(sg);
}
 
+			if (bvprv && (page_address(bvprv->bv_page) + bvprv->bv_len == page_address(bvec->bv_page)))

+   printk("missed merge\n");
sg_set_page(sg, bvec->bv_page, nbytes, bvec->bv_offset);
nsegs++;
}


..

Yeah, the first part is similar to my own hack.

For testing, try dd if=/dev/sda of=/dev/null bs=4096k.
That *really* should end up using contiguous pages on most systems.

I figured out the git thing, and am now building some in-between kernels to try.

Cheers


Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?

2007-12-13 Thread Jens Axboe
On Thu, Dec 13 2007, Mark Lord wrote:
 Jens Axboe wrote:
 On Thu, Dec 13 2007, Jens Axboe wrote:
 On Thu, Dec 13 2007, Mark Lord wrote:
 Jens Axboe wrote:
 On Thu, Dec 13 2007, Mark Lord wrote:
 Mark Lord wrote:
 Jens Axboe wrote:
 On Thu, Dec 13 2007, Mark Lord wrote:
 Matthew Wilcox wrote:
 On Thu, Dec 13, 2007 at 01:48:18PM -0500, Mark Lord wrote:
 Problem confirmed.  2.6.23.8 regularly generates segments up to 
 64KB for libata,
 but 2.6.24 uses only 4KB segments and a *few* 8KB segments.
 Just a suspicion ... could this be slab vs slub?  ie check your 
 configs
 are the same / similar between the two kernels.
 ..
 
 Mmmm.. a good thought, that one.
 But I just rechecked, and both have CONFIG_SLAB=y
 
 My guess is that something got changed around when Jens
 reworked the block layer for 2.6.24.
 I'm going to dig around in there now.
 I didn't rework the block layer for 2.6.24 :-). The core block layer
 changes since 2.6.23 are:
 
 - Support for empty barriers. Not a likely candidate.
 - Shared tag queue fixes. Totally unlikely.
 - sg chaining support. Not likely.
 - The bio changes from Neil. Of the bunch, the most likely suspects 
 in
 this area, since it changes some of the code involved with merges and
 blk_rq_map_sg().
 - Lots of simple stuff, again very unlikely.
 
 Anyway, it sounds odd for this to be a block layer problem if you do 
 see
 occasional segments being merged. So it sounds more like the input 
 data
 having changed.
 
 Why not just bisect it?
 ..
 
 Because the early 2.6.24 series failed to boot on this machine
 due to bugs in the block layer -- so the code that caused this 
 regression
 is probably in the stuff from before the kernels became usable here.
 ..
 
 That sounds more harsh than intended -- the earlier 2.6.24 kernels (up to
 the first couple of -rc* ones) failed here because of incompatibilities
 between the block/bio changes and libata.
 
 That's better, I think! 
 No worries, I didn't pick it up as harsh just as an odd conclusion :-)
 
 If I were you, I'd just start from the first -rc that booted for you. If
 THAT has the bug, then we'll think of something else. If you don't get
 anywhere, I can run some tests tomorrow and see if I can reproduce it
 here.
 ..
 
 I believe that *anyone* can reproduce it, since it's broken long before
 the requests ever get to SCSI or libata.  Which also means that *anyone*
 who wants to can bisect it, as well.
 
 I don't do bisects.
 It was just a suggestion on how to narrow it down, do as you see fit.
 
 But I will dig a bit more and see if I can find the culprit.
 Sure, I'll dig around as well.
 
 Just tried something simple. I only see one 12kb segment so far, so not
 a lot by any stretch. I also DON'T see any missed-merge signs, so it
 would appear that the pages in the request are simply not contiguous
 physically.
 
 diff --git a/block/ll_rw_blk.c b/block/ll_rw_blk.c
 index e30b1a4..1e34b6f 100644
 --- a/block/ll_rw_blk.c
 +++ b/block/ll_rw_blk.c
 @@ -1330,6 +1330,8 @@ int blk_rq_map_sg(struct request_queue *q, struct 
 request *rq,
  goto new_segment;
  
  sg->length += nbytes;
 +if (sg->length > 8192)
 +printk("sg_len=%d\n", sg->length);
  } else {
  new_segment:
  if (!sg)
 @@ -1349,6 +1351,8 @@ new_segment:
  sg = sg_next(sg);
  }
  
 +if (bvprv && (page_address(bvprv->bv_page) + 
 bvprv->bv_len == page_address(bvec->bv_page)))
 +printk("missed merge\n");
  sg_set_page(sg, bvec->bv_page, nbytes, 
  bvec->bv_offset);
  nsegs++;
  }
 
 ..
 
 Yeah, the first part is similar to my own hack.
 
 For testing, try dd if=/dev/sda of=/dev/null bs=4096k.
 That *really* should end up using contiguous pages on most systems.
 
 I figured out the git thing, and am now building some in-between kernels to 
 try.

OK, it's a vm issue, I have tens of thousands of backward pages after a
boot - IOW, bvec->bv_page is the page before bvprv->bv_page, not the
reverse. So it looks like that bug got reintroduced.

-- 
Jens Axboe



Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?

2007-12-13 Thread Mark Lord

Jens Axboe wrote:

On Thu, Dec 13 2007, Mark Lord wrote:

Jens Axboe wrote:

On Thu, Dec 13 2007, Jens Axboe wrote:

On Thu, Dec 13 2007, Mark Lord wrote:

Jens Axboe wrote:

On Thu, Dec 13 2007, Mark Lord wrote:

Mark Lord wrote:

Jens Axboe wrote:

On Thu, Dec 13 2007, Mark Lord wrote:

Matthew Wilcox wrote:

On Thu, Dec 13, 2007 at 01:48:18PM -0500, Mark Lord wrote:
Problem confirmed.  2.6.23.8 regularly generates segments up to 
64KB for libata,

but 2.6.24 uses only 4KB segments and a *few* 8KB segments.
Just a suspicion ... could this be slab vs slub?  ie check your 
configs

are the same / similar between the two kernels.

..

Mmmm.. a good thought, that one.
But I just rechecked, and both have CONFIG_SLAB=y

My guess is that something got changed around when Jens
reworked the block layer for 2.6.24.
I'm going to dig around in there now.

I didn't rework the block layer for 2.6.24 :-). The core block layer
changes since 2.6.23 are:

- Support for empty barriers. Not a likely candidate.
- Shared tag queue fixes. Totally unlikely.
- sg chaining support. Not likely.
- The bio changes from Neil. Of the bunch, the most likely suspects 
in

this area, since it changes some of the code involved with merges and
blk_rq_map_sg().
- Lots of simple stuff, again very unlikely.

Anyway, it sounds odd for this to be a block layer problem if you do 
see
occasional segments being merged. So it sounds more like the input 
data

having changed.

Why not just bisect it?

..

Because the early 2.6.24 series failed to boot on this machine
due to bugs in the block layer -- so the code that caused this 
regression

is probably in the stuff from before the kernels became usable here.

..

That sounds more harsh than intended -- the earlier 2.6.24 kernels (up to
the first couple of -rc* ones) failed here because of incompatibilities
between the block/bio changes and libata.

That's better, I think! 

No worries, I didn't pick it up as harsh just as an odd conclusion :-)

If I were you, I'd just start from the first -rc that booted for you. If
THAT has the bug, then we'll think of something else. If you don't get
anywhere, I can run some tests tomorrow and see if I can reproduce it
here.

..

I believe that *anyone* can reproduce it, since it's broken long before
the requests ever get to SCSI or libata.  Which also means that *anyone*
who wants to can bisect it, as well.

I don't do bisects.

It was just a suggestion on how to narrow it down, do as you see fit.


But I will dig a bit more and see if I can find the culprit.

Sure, I'll dig around as well.

Just tried something simple. I only see one 12kb segment so far, so not
a lot by any stretch. I also DON'T see any missed-merge signs, so it
would appear that the pages in the request are simply not contiguous
physically.

diff --git a/block/ll_rw_blk.c b/block/ll_rw_blk.c
index e30b1a4..1e34b6f 100644
--- a/block/ll_rw_blk.c
+++ b/block/ll_rw_blk.c
@@ -1330,6 +1330,8 @@ int blk_rq_map_sg(struct request_queue *q, struct 
request *rq,

goto new_segment;

sg->length += nbytes;
+   if (sg->length > 8192)
+   printk("sg_len=%d\n", sg->length);
} else {
new_segment:
if (!sg)
@@ -1349,6 +1351,8 @@ new_segment:
sg = sg_next(sg);
}

+			if (bvprv && (page_address(bvprv->bv_page) + 
bvprv->bv_len == page_address(bvec->bv_page)))

+   printk("missed merge\n");
			sg_set_page(sg, bvec->bv_page, nbytes, 
			bvec->bv_offset);

nsegs++;
}


..

Yeah, the first part is similar to my own hack.

For testing, try dd if=/dev/sda of=/dev/null bs=4096k.
That *really* should end up using contiguous pages on most systems.

I figured out the git thing, and am now building some in-between kernels to 
try.


OK, it's a vm issue, I have tens of thousands of backward pages after a
boot - IOW, bvec->bv_page is the page before bvprv->bv_page, not the
reverse. So it looks like that bug got reintroduced.

...

Mmm.. shouldn't one of the front- or back- merge logics work for either order?




Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?

2007-12-13 Thread Mark Lord

Mark Lord wrote:

Jens Axboe wrote:

..

OK, it's a vm issue, I have tens of thousands of backward pages after a
boot - IOW, bvec->bv_page is the page before bvprv->bv_page, not the
reverse. So it looks like that bug got reintroduced.

...

Mmm.. shouldn't one of the front- or back- merge logics work for either order?

..

Belay that thought.  I'm slowly remembering how this is supposed to work now.  
:)


Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?

2007-12-13 Thread Jens Axboe
On Thu, Dec 13 2007, Mark Lord wrote:
 Jens Axboe wrote:
 On Thu, Dec 13 2007, Mark Lord wrote:
 Jens Axboe wrote:
 On Thu, Dec 13 2007, Jens Axboe wrote:
 On Thu, Dec 13 2007, Mark Lord wrote:
 Jens Axboe wrote:
 On Thu, Dec 13 2007, Mark Lord wrote:
 Mark Lord wrote:
 Jens Axboe wrote:
 On Thu, Dec 13 2007, Mark Lord wrote:
 Matthew Wilcox wrote:
 On Thu, Dec 13, 2007 at 01:48:18PM -0500, Mark Lord wrote:
 Problem confirmed.  2.6.23.8 regularly generates segments up to 
 64KB for libata,
 but 2.6.24 uses only 4KB segments and a *few* 8KB segments.
 Just a suspicion ... could this be slab vs slub?  ie check your 
 configs
 are the same / similar between the two kernels.
 ..
 
 Mmmm.. a good thought, that one.
 But I just rechecked, and both have CONFIG_SLAB=y
 
 My guess is that something got changed around when Jens
 reworked the block layer for 2.6.24.
 I'm going to dig around in there now.
 I didn't rework the block layer for 2.6.24 :-). The core block 
 layer
 changes since 2.6.23 are:
 
 - Support for empty barriers. Not a likely candidate.
 - Shared tag queue fixes. Totally unlikely.
 - sg chaining support. Not likely.
 - The bio changes from Neil. Of the bunch, the most likely 
 suspects in
 this area, since it changes some of the code involved with merges 
 and
 blk_rq_map_sg().
 - Lots of simple stuff, again very unlikely.
 
 Anyway, it sounds odd for this to be a block layer problem if you 
 do see
 occasional segments being merged. So it sounds more like the input 
 data
 having changed.
 
 Why not just bisect it?
 ..
 
 Because the early 2.6.24 series failed to boot on this machine
 due to bugs in the block layer -- so the code that caused this 
 regression
 is probably in the stuff from before the kernels became usable here.
 ..
 
 That sounds more harsh than intended -- the earlier 2.6.24 kernels (up to
 the first couple of -rc* ones) failed here because of incompatibilities
 between the block/bio changes and libata.
 
 That's better, I think! 
 No worries, I didn't pick it up as harsh just as an odd conclusion :-)
 
 If I were you, I'd just start from the first -rc that booted for you. 
 If
 THAT has the bug, then we'll think of something else. If you don't get
 anywhere, I can run some tests tomorrow and see if I can reproduce it
 here.
 ..
 
 I believe that *anyone* can reproduce it, since it's broken long before
 the requests ever get to SCSI or libata.  Which also means that 
 *anyone*
 who wants to can bisect it, as well.
 
 I don't do bisects.
 It was just a suggestion on how to narrow it down, do as you see fit.
 
 But I will dig a bit more and see if I can find the culprit.
 Sure, I'll dig around as well.
 Just tried something simple. I only see one 12kb segment so far, so not
 a lot by any stretch. I also DON'T see any missed-merge signs, so it
 would appear that the pages in the request are simply not contiguous
 physically.
 
 diff --git a/block/ll_rw_blk.c b/block/ll_rw_blk.c
 index e30b1a4..1e34b6f 100644
 --- a/block/ll_rw_blk.c
 +++ b/block/ll_rw_blk.c
 @@ -1330,6 +1330,8 @@ int blk_rq_map_sg(struct request_queue *q, struct 
 request *rq,
goto new_segment;
 
sg->length += nbytes;
 +  if (sg->length > 8192)
 +  printk("sg_len=%d\n", sg->length);
} else {
 new_segment:
if (!sg)
 @@ -1349,6 +1351,8 @@ new_segment:
sg = sg_next(sg);
}
 
 +  if (bvprv && (page_address(bvprv->bv_page) + 
 bvprv->bv_len == page_address(bvec->bv_page)))
 +  printk("missed merge\n");
sg_set_page(sg, bvec->bv_page, nbytes, 
bvec->bv_offset);
nsegs++;
}
 
 ..
 
 Yeah, the first part is similar to my own hack.
 
 For testing, try dd if=/dev/sda of=/dev/null bs=4096k.
 That *really* should end up using contiguous pages on most systems.
 
 I figured out the git thing, and am now building some in-between kernels 
 to try.
 
 OK, it's a vm issue, I have tens of thousands of backward pages after a
 boot - IOW, bvec->bv_page is the page before bvprv->bv_page, not the
 reverse. So it looks like that bug got reintroduced.
 ...
 
 Mmm.. shouldn't one of the front- or back- merge logics work for either 
 order?

I think you are misunderstanding the merging. The front/back bits are
for contig on disk; this is sg segment merging. We can only join pieces
that are contig in memory, otherwise the result would not be pretty :-)
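
For reference, the clustering test that blk_rq_map_sg() is effectively
applying looks roughly like the sketch below; this is an illustration
rather than the verbatim 2.6.24 code, and it omits the max_segment_size
and segment-boundary checks the real code also makes:

	/*
	 * Two bio_vecs may share one scatterlist segment only if the
	 * queue allows clustering and the data is physically adjacent.
	 */
	static int can_cluster(struct request_queue *q,
			       struct bio_vec *prv, struct bio_vec *cur)
	{
		if (!test_bit(QUEUE_FLAG_CLUSTER, &q->queue_flags))
			return 0;
		return page_to_phys(prv->bv_page) + prv->bv_offset + prv->bv_len ==
		       page_to_phys(cur->bv_page) + cur->bv_offset;
	}

The adjacency test only fires for ascending physical addresses, which is
why an allocator handing out descending pages defeats it even though the
request is still perfectly contiguous on disk.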

-- 
Jens Axboe



Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?

2007-12-13 Thread Andrew Morton
On Thu, 13 Dec 2007 21:09:59 +0100
Jens Axboe [EMAIL PROTECTED] wrote:


 OK, it's a vm issue,

cc linux-mm and probable culprit.

  I have tens of thousands of backward pages after a
 boot - IOW, bvec->bv_page is the page before bvprv->bv_page, not the
 reverse. So it looks like that bug got reintroduced.

Bill Irwin fixed this a couple of years back: changed the page allocator so
that it mostly hands out pages in ascending physical-address order.

I guess we broke that, quite possibly in Mel's page allocator rework.

It would help if you could provide us with a simple recipe for
demonstrating this problem, please.



Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?

2007-12-13 Thread James Bottomley

On Thu, 2007-12-13 at 14:02 -0800, Andrew Morton wrote:
 On Thu, 13 Dec 2007 21:09:59 +0100
 Jens Axboe [EMAIL PROTECTED] wrote:
 
 
  OK, it's a vm issue,
 
 cc linux-mm and probable culprit.
 
   I have tens of thousands of backward pages after a
  boot - IOW, bvec->bv_page is the page before bvprv->bv_page, not the
  reverse. So it looks like that bug got reintroduced.
 
 Bill Irwin fixed this a couple of years back: changed the page allocator so
 that it mostly hands out pages in ascending physical-address order.
 
 I guess we broke that, quite possibly in Mel's page allocator rework.
 
 It would help if you could provide us with a simple recipe for
 demonstrating this problem, please.

The simple way seems to be to malloc a large area, touch every page and
then look at the physical pages assigned ... they now mostly seem to be
descending in physical address.
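
A minimal userspace sketch of that recipe (assuming the pagemap ABI that
eventually shipped: one 64-bit entry per virtual page, PFN in bits 0-54,
present flag in bit 63; reading PFNs may require root on later kernels):

/* pfnorder.c - malloc a large area, touch every page, then read
 * /proc/self/pagemap to see whether the backing PFNs ascend or
 * descend.  Build: cc -O2 -o pfnorder pfnorder.c */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	size_t pagesz = (size_t)sysconf(_SC_PAGESIZE);
	size_t npages = 1024;
	char *buf = malloc(npages * pagesz);
	uint64_t entry, prev = 0;
	size_t i;
	int fd;

	if (!buf)
		return 1;
	memset(buf, 1, npages * pagesz);	/* touch every page */

	fd = open("/proc/self/pagemap", O_RDONLY);
	if (fd < 0)
		return 1;

	for (i = 0; i < npages; i++) {
		uint64_t vpn = ((uintptr_t)buf + i * pagesz) / pagesz;

		if (pread(fd, &entry, sizeof(entry), vpn * 8) != sizeof(entry))
			break;
		if (!(entry & (1ULL << 63)))	/* page not present */
			continue;
		uint64_t pfn = entry & ((1ULL << 55) - 1);
		printf("v:%5zu p:%8llu %s\n", i, (unsigned long long)pfn,
		       pfn == prev + 1 ? "ascending" :
		       pfn + 1 == prev ? "DESCENDING" : "-");
		prev = pfn;
	}
	close(fd);
	return 0;
}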

James




Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?

2007-12-13 Thread Jens Axboe
On Thu, Dec 13 2007, Andrew Morton wrote:
 On Thu, 13 Dec 2007 21:09:59 +0100
 Jens Axboe [EMAIL PROTECTED] wrote:
 
 
  OK, it's a VM issue,
 
 cc linux-mm and probable culprit.
 
   I have tens of thousands of backward pages after a
  boot - IOW, bvec->bv_page is the page before bvprv->bv_page, not the
  reverse. So it looks like that bug got reintroduced.
 
 Bill Irwin fixed this a couple of years back: changed the page allocator so
 that it mostly hands out pages in ascending physical-address order.
 
 I guess we broke that, quite possibly in Mel's page allocator rework.
 
 It would help if you could provide us with a simple recipe for
 demonstrating this problem, please.

Basically anything involving IO :-). A boot here showed a handful of
good merges, and probably on the order of 100,000 descending
allocations. A kernel make is a fine test as well.

Something like the below should work fine - if you see oodles of these
while doing basically any type of IO, then you are screwed.

diff --git a/block/ll_rw_blk.c b/block/ll_rw_blk.c
index e30b1a4..8ce3fcc 100644
--- a/block/ll_rw_blk.c
+++ b/block/ll_rw_blk.c
@@ -1349,6 +1349,10 @@ new_segment:
 				sg = sg_next(sg);
 			}
 
+			if (bvprv) {
+				if (page_address(bvec->bv_page) + PAGE_SIZE == page_address(bvprv->bv_page) && printk_ratelimit())
+					printk("page alloc order backwards\n");
+			}
 			sg_set_page(sg, bvec->bv_page, nbytes, bvec->bv_offset);
 			nsegs++;
 		}

-- 
Jens Axboe



Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?

2007-12-13 Thread Mark Lord

Andrew Morton wrote:

On Thu, 13 Dec 2007 17:15:06 -0500
James Bottomley [EMAIL PROTECTED] wrote:

..

The simple way seems to be to malloc a large area, touch every page and
then look at the physical pages assigned ... they now mostly seem to be
descending in physical address.



OIC.  -mm's /proc/pid/pagemap can be used to get the pfn's...

..

I'm actually running the treadmill right now (have been for many hours,
actually) to bisect it to a specific commit.

Thought I was almost done, and then noticed that git-bisect doesn't keep
the Makefile VERSION lines the same, so I was actually running the wrong
kernel after the first few times.. duh.

Wrote a script to fix it now.

-ml


Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?

2007-12-13 Thread Andrew Morton
On Thu, 13 Dec 2007 17:15:06 -0500
James Bottomley [EMAIL PROTECTED] wrote:

 
 On Thu, 2007-12-13 at 14:02 -0800, Andrew Morton wrote:
  On Thu, 13 Dec 2007 21:09:59 +0100
  Jens Axboe [EMAIL PROTECTED] wrote:
  
  
   OK, it's a VM issue,
  
  cc linux-mm and probable culprit.
  
    I have tens of thousands of backward pages after a
   boot - IOW, bvec->bv_page is the page before bvprv->bv_page, not the
   reverse. So it looks like that bug got reintroduced.
  
  Bill Irwin fixed this a couple of years back: changed the page allocator so
  that it mostly hands out pages in ascending physical-address order.
  
  I guess we broke that, quite possibly in Mel's page allocator rework.
  
  It would help if you could provide us with a simple recipe for
  demonstrating this problem, please.
 
 The simple way seems to be to malloc a large area, touch every page and
 then look at the physical pages assigned ... they now mostly seem to be
 descending in physical address.
 

OIC.  -mm's /proc/pid/pagemap can be used to get the pfn's...


Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?

2007-12-13 Thread Mark Lord

Mark Lord wrote:

..

I'm actually running the treadmill right now (have been for many hours,
actually) to bisect it to a specific commit.

Thought I was almost done, and then noticed that git-bisect doesn't keep
the Makefile VERSION lines the same, so I was actually running the wrong
kernel after the first few times.. duh.

Wrote a script to fix it now.

..

Well, that was a waste of three hours.

Somebody else can try it now.


Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?

2007-12-13 Thread Mark Lord

Mark Lord wrote:

..

I'm actually running the treadmill right now (have been for many hours,
actually) to bisect it to a specific commit.

Thought I was almost done, and then noticed that git-bisect doesn't keep
the Makefile VERSION lines the same, so I was actually running the wrong
kernel after the first few times.. duh.

Wrote a script to fix it now.

..

Well, that was a waste of three hours.

..

Ahh.. it seems to be sensitive to one or both of these:

CONFIG_HIGHMEM64G=y with 4GB RAM:  not so bad, frequently does 20KB - 48KB segments.
CONFIG_HIGHMEM4G=y  with 2GB RAM:  very severe, rarely does more than 8KB segments.
CONFIG_HIGHMEM4G=y  with 3GB RAM:  very severe, rarely does more than 8KB segments.

So if you want to reproduce this on a large-memory machine, use mem=2GB
for starters.

Still testing..





Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?

2007-12-13 Thread Mark Lord

Andrew Morton wrote:

On Thu, 13 Dec 2007 19:30:00 -0500
Mark Lord [EMAIL PROTECTED] wrote:


Here's the commit that causes the regression:

...

--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -760,7 +760,8 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 		struct page *page = __rmqueue(zone, order, migratetype);
 		if (unlikely(page == NULL))
 			break;
-		list_add_tail(&page->lru, list);
+		list_add(&page->lru, list);


well that looks fishy.

..

Yeah.  I missed that, and instead just posted a patch
to search the list in reverse order, which seems to work for me.

I'll try just reversing that line above here now.. gimme 5 minutes or so.

Cheers


Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?

2007-12-13 Thread Andrew Morton
On Thu, 13 Dec 2007 19:30:00 -0500
Mark Lord [EMAIL PROTECTED] wrote:

 Here's the commit that causes the regression:
 
 ...

 --- a/mm/page_alloc.c
 +++ b/mm/page_alloc.c
 @@ -760,7 +760,8 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 		struct page *page = __rmqueue(zone, order, migratetype);
 		if (unlikely(page == NULL))
 			break;
 -		list_add_tail(&page->lru, list);
 +		list_add(&page->lru, list);

well that looks fishy.


Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?

2007-12-13 Thread Mark Lord

Mark Lord wrote:

Andrew Morton wrote:

On Thu, 13 Dec 2007 19:30:00 -0500
Mark Lord [EMAIL PROTECTED] wrote:


Here's the commit that causes the regression:

...

--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -760,7 +760,8 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 		struct page *page = __rmqueue(zone, order, migratetype);
 		if (unlikely(page == NULL))
 			break;
-		list_add_tail(&page->lru, list);
+		list_add(&page->lru, list);


well that looks fishy.

..

Yeah.  I missed that, and instead just posted a patch
to search the list in reverse order, which seems to work for me.

I'll try just reversing that line above here now.. gimme 5 minutes or so.

..

Yep, that works too.  Alternative improved patch now posted.
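
For reference: list_add inserts at the head of the per-cpu list while
list_add_tail appends, so the order pages come back out of the list gets
reversed.  The one-line reversal tested above is presumably just this
revert (a sketch, not the actual patch posted):

--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -760,7 +760,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 		struct page *page = __rmqueue(zone, order, migratetype);
 		if (unlikely(page == NULL))
 			break;
-		list_add(&page->lru, list);
+		list_add_tail(&page->lru, list);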