Re: [BUG][ext2] XIP does not work on ext2

2013-11-11 Thread Jan Kara
On Fri 08-11-13 16:28:15, Andiry Xu wrote:
> On Thu, Nov 7, 2013 at 2:45 PM, Andiry Xu  wrote:
> > On Thu, Nov 7, 2013 at 2:20 PM, Jan Kara  wrote:
> >> On Thu 07-11-13 13:50:09, Andiry Xu wrote:
> >>> On Thu, Nov 7, 2013 at 1:07 PM, Jan Kara  wrote:
> >>> > On Thu 07-11-13 12:14:13, Andiry Xu wrote:
> >>> >> On Wed, Nov 6, 2013 at 1:18 PM, Jan Kara  wrote:
> >>> >> > On Tue 05-11-13 17:28:35, Andiry Xu wrote:
> >>> >> >> >> Do you know the reason why write() outperforms mmap() in some 
> >>> >> >> >> cases? I
> >>> >> >> >> know it's not related the thread but I really appreciate if you 
> >>> >> >> >> can
> >>> >> >> >> answer my question.
> >>> >> >> >   Well, I'm not completely sure. mmap()ed memory always works on 
> >>> >> >> > page-by-page
> >>> >> >> > basis - you first access the page, it gets faulted in and you can 
> >>> >> >> > further
> >>> >> >> > access it. So for small (sub page size) accesses this is a win 
> >>> >> >> > because you
> >>> >> >> > don't have an overhead of syscall and fs write path. For accesses 
> >>> >> >> > larger
> >>> >> >> > than page size the overhead of syscall and some initial checks is 
> >>> >> >> > well
> >>> >> >> > hidden by other things. I guess write() ends up being more 
> >>> >> >> > efficient
> >>> >> >> > because write path taken for each page is somewhat lighter than 
> >>> >> >> > full page
> >>> >> >> > fault. But you'd need to look into perf data to get some hard 
> >>> >> >> > numbers on
> >>> >> >> > where the time is spent.
> >>> >> >> >
> >>> >> >>
> >>> >> >> Thanks for the reply. However I have filled up the whole RAM disk
> >>> >> >> before doing the test, i.e. asked the brd driver to allocate all the
> >>> >> >> pages initially.
> >>> >> >   Well, pages in ramdisk are always present, that's not an issue. 
> >>> >> > But you
> >>> >> > will get a page fault to map a particular physical page in process'
> >>> >> > virtual address space when you first access that virtual address in 
> >>> >> > the
> >>> >> > mapping from the process. The cost of setting up this 
> >>> >> > virtual->physical
> >>> >> > mapping is what I'm talking about.
> >>> >> >
> >>> >>
> >>> >> Yes, you are right, there are page faults observed with perf. I
> >>> >> misunderstood page fault as copying pages between backing store and
> >>> >> physical memory.
> >>> >>
> >>> >> > If you had a process which first mmaps the file and writes to all 
> >>> >> > pages in
> >>> >> > the mapping and *then* measure the cost of another round of writing 
> >>> >> > to the
> >>> >> > mapping, I would expect you should see speeds close to those of 
> >>> >> > memory bus.
> >>> >> >
> >>> >>
> >>> >> I've tried this as well. mmap() performance improves but still not as
> >>> >> good as write().
> >>> >> I used the perf report to compare write() and mmap() applications. For
> >>> >> write() version, top of perf report shows as:
> >>> >> 33.33%  __copy_user_nocache
> >>> >> 4.72%ext2_get_blocks
> >>> >> 4.42%mutex_unlock
> >>> >> 3.59%__find_get_block
> >>> >>
> >>> >> which looks reasonable.
> >>> >>
> >>> >> However, for mmap() version, the perf report looks strange:
> >>> >> 94.98% libc-2.15.so   [.] 0x0014698d
> >>> >> 2.25%   page_fault
> >>> >> 0.18%   handle_mm_fault
> >>> >>
> >>> >> I don't know what the first item is but it took the majority of cycles.
> >>> >   The first item means that it's some userspace code in libc. My guess
> >>> > would be that it's libc's memcpy() function (or whatever you use to 
> >>> > write
> >>> > to mmap). How do you access the mmap?
> >>> >
> >>>
> >>> Like this:
> >>>
> >>> fd = open(file_name, O_CREAT | O_RDWR | O_DIRECT, 0755);
> >>> dest = (char *)mmap(NULL, FILE_SIZE, PROT_WRITE, MAP_SHARED, fd, 0);
> >>> for (i = 0; i < count; i++)
> >>> {
> >>>memcpy(dest, src, request_size);
> >>>dest += request_size;
> >>> }
> >>   OK, maybe libc memcpy isn't very well optimized for you cpu? Not sure how
> >> to tune that though...
> >>
> >
> > Hmm, I will try some different kinds of memcpy to see if there is a
> > difference. Just want to make sure I do not make some stupid mistakes
> > before trying that.
> > Thanks a lot for your help!
> >
> 
> Your advice does makes difference. I use a optimized version of memcpy
> and it does improve the mmap application performance: on a Ramdisk
> with Ext2 xip, mmap() version now achieves 11GB/s of bandwidth,
> comparing to posix write version with 7GB/s.
  Good :).

> Now I wonder if they have a plan to update the memcpy() in libc..
  You better ask at glibc devel list... I've google for a while whether
memcpy() in glibc can be somehow tuned (for a particular instruction set)
but didn't find anything useful.

Honza
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  

Re: [BUG][ext2] XIP does not work on ext2

2013-11-11 Thread Jan Kara
On Fri 08-11-13 16:28:15, Andiry Xu wrote:
 On Thu, Nov 7, 2013 at 2:45 PM, Andiry Xu and...@gmail.com wrote:
  On Thu, Nov 7, 2013 at 2:20 PM, Jan Kara j...@suse.cz wrote:
  On Thu 07-11-13 13:50:09, Andiry Xu wrote:
  On Thu, Nov 7, 2013 at 1:07 PM, Jan Kara j...@suse.cz wrote:
   On Thu 07-11-13 12:14:13, Andiry Xu wrote:
   On Wed, Nov 6, 2013 at 1:18 PM, Jan Kara j...@suse.cz wrote:
On Tue 05-11-13 17:28:35, Andiry Xu wrote:
 Do you know the reason why write() outperforms mmap() in some 
 cases? I
 know it's not related the thread but I really appreciate if you 
 can
 answer my question.
   Well, I'm not completely sure. mmap()ed memory always works on 
 page-by-page
 basis - you first access the page, it gets faulted in and you can 
 further
 access it. So for small (sub page size) accesses this is a win 
 because you
 don't have an overhead of syscall and fs write path. For accesses 
 larger
 than page size the overhead of syscall and some initial checks is 
 well
 hidden by other things. I guess write() ends up being more 
 efficient
 because write path taken for each page is somewhat lighter than 
 full page
 fault. But you'd need to look into perf data to get some hard 
 numbers on
 where the time is spent.

   
Thanks for the reply. However I have filled up the whole RAM disk
before doing the test, i.e. asked the brd driver to allocate all the
pages initially.
  Well, pages in ramdisk are always present, that's not an issue. 
But you
will get a page fault to map a particular physical page in process'
virtual address space when you first access that virtual address in 
the
mapping from the process. The cost of setting up this 
virtual-physical
mapping is what I'm talking about.
   
  
   Yes, you are right, there are page faults observed with perf. I
   misunderstood page fault as copying pages between backing store and
   physical memory.
  
If you had a process which first mmaps the file and writes to all 
pages in
the mapping and *then* measure the cost of another round of writing 
to the
mapping, I would expect you should see speeds close to those of 
memory bus.
   
  
   I've tried this as well. mmap() performance improves but still not as
   good as write().
   I used the perf report to compare write() and mmap() applications. For
   write() version, top of perf report shows as:
   33.33%  __copy_user_nocache
   4.72%ext2_get_blocks
   4.42%mutex_unlock
   3.59%__find_get_block
  
   which looks reasonable.
  
   However, for mmap() version, the perf report looks strange:
   94.98% libc-2.15.so   [.] 0x0014698d
   2.25%   page_fault
   0.18%   handle_mm_fault
  
   I don't know what the first item is but it took the majority of cycles.
 The first item means that it's some userspace code in libc. My guess
   would be that it's libc's memcpy() function (or whatever you use to 
   write
   to mmap). How do you access the mmap?
  
 
  Like this:
 
  fd = open(file_name, O_CREAT | O_RDWR | O_DIRECT, 0755);
  dest = (char *)mmap(NULL, FILE_SIZE, PROT_WRITE, MAP_SHARED, fd, 0);
  for (i = 0; i  count; i++)
  {
 memcpy(dest, src, request_size);
 dest += request_size;
  }
OK, maybe libc memcpy isn't very well optimized for you cpu? Not sure how
  to tune that though...
 
 
  Hmm, I will try some different kinds of memcpy to see if there is a
  difference. Just want to make sure I do not make some stupid mistakes
  before trying that.
  Thanks a lot for your help!
 
 
 Your advice does makes difference. I use a optimized version of memcpy
 and it does improve the mmap application performance: on a Ramdisk
 with Ext2 xip, mmap() version now achieves 11GB/s of bandwidth,
 comparing to posix write version with 7GB/s.
  Good :).

 Now I wonder if they have a plan to update the memcpy() in libc..
  You better ask at glibc devel list... I've google for a while whether
memcpy() in glibc can be somehow tuned (for a particular instruction set)
but didn't find anything useful.

Honza
-- 
Jan Kara j...@suse.cz
SUSE Labs, CR
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [BUG][ext2] XIP does not work on ext2

2013-11-08 Thread Andiry Xu
On Thu, Nov 7, 2013 at 2:45 PM, Andiry Xu  wrote:
> On Thu, Nov 7, 2013 at 2:20 PM, Jan Kara  wrote:
>> On Thu 07-11-13 13:50:09, Andiry Xu wrote:
>>> On Thu, Nov 7, 2013 at 1:07 PM, Jan Kara  wrote:
>>> > On Thu 07-11-13 12:14:13, Andiry Xu wrote:
>>> >> On Wed, Nov 6, 2013 at 1:18 PM, Jan Kara  wrote:
>>> >> > On Tue 05-11-13 17:28:35, Andiry Xu wrote:
>>> >> >> >> Do you know the reason why write() outperforms mmap() in some 
>>> >> >> >> cases? I
>>> >> >> >> know it's not related the thread but I really appreciate if you can
>>> >> >> >> answer my question.
>>> >> >> >   Well, I'm not completely sure. mmap()ed memory always works on 
>>> >> >> > page-by-page
>>> >> >> > basis - you first access the page, it gets faulted in and you can 
>>> >> >> > further
>>> >> >> > access it. So for small (sub page size) accesses this is a win 
>>> >> >> > because you
>>> >> >> > don't have an overhead of syscall and fs write path. For accesses 
>>> >> >> > larger
>>> >> >> > than page size the overhead of syscall and some initial checks is 
>>> >> >> > well
>>> >> >> > hidden by other things. I guess write() ends up being more efficient
>>> >> >> > because write path taken for each page is somewhat lighter than 
>>> >> >> > full page
>>> >> >> > fault. But you'd need to look into perf data to get some hard 
>>> >> >> > numbers on
>>> >> >> > where the time is spent.
>>> >> >> >
>>> >> >>
>>> >> >> Thanks for the reply. However I have filled up the whole RAM disk
>>> >> >> before doing the test, i.e. asked the brd driver to allocate all the
>>> >> >> pages initially.
>>> >> >   Well, pages in ramdisk are always present, that's not an issue. But 
>>> >> > you
>>> >> > will get a page fault to map a particular physical page in process'
>>> >> > virtual address space when you first access that virtual address in the
>>> >> > mapping from the process. The cost of setting up this virtual->physical
>>> >> > mapping is what I'm talking about.
>>> >> >
>>> >>
>>> >> Yes, you are right, there are page faults observed with perf. I
>>> >> misunderstood page fault as copying pages between backing store and
>>> >> physical memory.
>>> >>
>>> >> > If you had a process which first mmaps the file and writes to all 
>>> >> > pages in
>>> >> > the mapping and *then* measure the cost of another round of writing to 
>>> >> > the
>>> >> > mapping, I would expect you should see speeds close to those of memory 
>>> >> > bus.
>>> >> >
>>> >>
>>> >> I've tried this as well. mmap() performance improves but still not as
>>> >> good as write().
>>> >> I used the perf report to compare write() and mmap() applications. For
>>> >> write() version, top of perf report shows as:
>>> >> 33.33%  __copy_user_nocache
>>> >> 4.72%ext2_get_blocks
>>> >> 4.42%mutex_unlock
>>> >> 3.59%__find_get_block
>>> >>
>>> >> which looks reasonable.
>>> >>
>>> >> However, for mmap() version, the perf report looks strange:
>>> >> 94.98% libc-2.15.so   [.] 0x0014698d
>>> >> 2.25%   page_fault
>>> >> 0.18%   handle_mm_fault
>>> >>
>>> >> I don't know what the first item is but it took the majority of cycles.
>>> >   The first item means that it's some userspace code in libc. My guess
>>> > would be that it's libc's memcpy() function (or whatever you use to write
>>> > to mmap). How do you access the mmap?
>>> >
>>>
>>> Like this:
>>>
>>> fd = open(file_name, O_CREAT | O_RDWR | O_DIRECT, 0755);
>>> dest = (char *)mmap(NULL, FILE_SIZE, PROT_WRITE, MAP_SHARED, fd, 0);
>>> for (i = 0; i < count; i++)
>>> {
>>>memcpy(dest, src, request_size);
>>>dest += request_size;
>>> }
>>   OK, maybe libc memcpy isn't very well optimized for you cpu? Not sure how
>> to tune that though...
>>
>
> Hmm, I will try some different kinds of memcpy to see if there is a
> difference. Just want to make sure I do not make some stupid mistakes
> before trying that.
> Thanks a lot for your help!
>

Your advice does makes difference. I use a optimized version of memcpy
and it does improve the mmap application performance: on a Ramdisk
with Ext2 xip, mmap() version now achieves 11GB/s of bandwidth,
comparing to posix write version with 7GB/s.

Now I wonder if they have a plan to update the memcpy() in libc..

Thanks,
Andiry
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [BUG][ext2] XIP does not work on ext2

2013-11-08 Thread Andiry Xu
On Thu, Nov 7, 2013 at 2:45 PM, Andiry Xu and...@gmail.com wrote:
 On Thu, Nov 7, 2013 at 2:20 PM, Jan Kara j...@suse.cz wrote:
 On Thu 07-11-13 13:50:09, Andiry Xu wrote:
 On Thu, Nov 7, 2013 at 1:07 PM, Jan Kara j...@suse.cz wrote:
  On Thu 07-11-13 12:14:13, Andiry Xu wrote:
  On Wed, Nov 6, 2013 at 1:18 PM, Jan Kara j...@suse.cz wrote:
   On Tue 05-11-13 17:28:35, Andiry Xu wrote:
Do you know the reason why write() outperforms mmap() in some 
cases? I
know it's not related the thread but I really appreciate if you can
answer my question.
  Well, I'm not completely sure. mmap()ed memory always works on 
page-by-page
basis - you first access the page, it gets faulted in and you can 
further
access it. So for small (sub page size) accesses this is a win 
because you
don't have an overhead of syscall and fs write path. For accesses 
larger
than page size the overhead of syscall and some initial checks is 
well
hidden by other things. I guess write() ends up being more efficient
because write path taken for each page is somewhat lighter than 
full page
fault. But you'd need to look into perf data to get some hard 
numbers on
where the time is spent.
   
  
   Thanks for the reply. However I have filled up the whole RAM disk
   before doing the test, i.e. asked the brd driver to allocate all the
   pages initially.
 Well, pages in ramdisk are always present, that's not an issue. But 
   you
   will get a page fault to map a particular physical page in process'
   virtual address space when you first access that virtual address in the
   mapping from the process. The cost of setting up this virtual-physical
   mapping is what I'm talking about.
  
 
  Yes, you are right, there are page faults observed with perf. I
  misunderstood page fault as copying pages between backing store and
  physical memory.
 
   If you had a process which first mmaps the file and writes to all 
   pages in
   the mapping and *then* measure the cost of another round of writing to 
   the
   mapping, I would expect you should see speeds close to those of memory 
   bus.
  
 
  I've tried this as well. mmap() performance improves but still not as
  good as write().
  I used the perf report to compare write() and mmap() applications. For
  write() version, top of perf report shows as:
  33.33%  __copy_user_nocache
  4.72%ext2_get_blocks
  4.42%mutex_unlock
  3.59%__find_get_block
 
  which looks reasonable.
 
  However, for mmap() version, the perf report looks strange:
  94.98% libc-2.15.so   [.] 0x0014698d
  2.25%   page_fault
  0.18%   handle_mm_fault
 
  I don't know what the first item is but it took the majority of cycles.
The first item means that it's some userspace code in libc. My guess
  would be that it's libc's memcpy() function (or whatever you use to write
  to mmap). How do you access the mmap?
 

 Like this:

 fd = open(file_name, O_CREAT | O_RDWR | O_DIRECT, 0755);
 dest = (char *)mmap(NULL, FILE_SIZE, PROT_WRITE, MAP_SHARED, fd, 0);
 for (i = 0; i  count; i++)
 {
memcpy(dest, src, request_size);
dest += request_size;
 }
   OK, maybe libc memcpy isn't very well optimized for you cpu? Not sure how
 to tune that though...


 Hmm, I will try some different kinds of memcpy to see if there is a
 difference. Just want to make sure I do not make some stupid mistakes
 before trying that.
 Thanks a lot for your help!


Your advice does makes difference. I use a optimized version of memcpy
and it does improve the mmap application performance: on a Ramdisk
with Ext2 xip, mmap() version now achieves 11GB/s of bandwidth,
comparing to posix write version with 7GB/s.

Now I wonder if they have a plan to update the memcpy() in libc..

Thanks,
Andiry
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [BUG][ext2] XIP does not work on ext2

2013-11-07 Thread Andiry Xu
On Thu, Nov 7, 2013 at 2:20 PM, Jan Kara  wrote:
> On Thu 07-11-13 13:50:09, Andiry Xu wrote:
>> On Thu, Nov 7, 2013 at 1:07 PM, Jan Kara  wrote:
>> > On Thu 07-11-13 12:14:13, Andiry Xu wrote:
>> >> On Wed, Nov 6, 2013 at 1:18 PM, Jan Kara  wrote:
>> >> > On Tue 05-11-13 17:28:35, Andiry Xu wrote:
>> >> >> >> Do you know the reason why write() outperforms mmap() in some 
>> >> >> >> cases? I
>> >> >> >> know it's not related the thread but I really appreciate if you can
>> >> >> >> answer my question.
>> >> >> >   Well, I'm not completely sure. mmap()ed memory always works on 
>> >> >> > page-by-page
>> >> >> > basis - you first access the page, it gets faulted in and you can 
>> >> >> > further
>> >> >> > access it. So for small (sub page size) accesses this is a win 
>> >> >> > because you
>> >> >> > don't have an overhead of syscall and fs write path. For accesses 
>> >> >> > larger
>> >> >> > than page size the overhead of syscall and some initial checks is 
>> >> >> > well
>> >> >> > hidden by other things. I guess write() ends up being more efficient
>> >> >> > because write path taken for each page is somewhat lighter than full 
>> >> >> > page
>> >> >> > fault. But you'd need to look into perf data to get some hard 
>> >> >> > numbers on
>> >> >> > where the time is spent.
>> >> >> >
>> >> >>
>> >> >> Thanks for the reply. However I have filled up the whole RAM disk
>> >> >> before doing the test, i.e. asked the brd driver to allocate all the
>> >> >> pages initially.
>> >> >   Well, pages in ramdisk are always present, that's not an issue. But 
>> >> > you
>> >> > will get a page fault to map a particular physical page in process'
>> >> > virtual address space when you first access that virtual address in the
>> >> > mapping from the process. The cost of setting up this virtual->physical
>> >> > mapping is what I'm talking about.
>> >> >
>> >>
>> >> Yes, you are right, there are page faults observed with perf. I
>> >> misunderstood page fault as copying pages between backing store and
>> >> physical memory.
>> >>
>> >> > If you had a process which first mmaps the file and writes to all pages 
>> >> > in
>> >> > the mapping and *then* measure the cost of another round of writing to 
>> >> > the
>> >> > mapping, I would expect you should see speeds close to those of memory 
>> >> > bus.
>> >> >
>> >>
>> >> I've tried this as well. mmap() performance improves but still not as
>> >> good as write().
>> >> I used the perf report to compare write() and mmap() applications. For
>> >> write() version, top of perf report shows as:
>> >> 33.33%  __copy_user_nocache
>> >> 4.72%ext2_get_blocks
>> >> 4.42%mutex_unlock
>> >> 3.59%__find_get_block
>> >>
>> >> which looks reasonable.
>> >>
>> >> However, for mmap() version, the perf report looks strange:
>> >> 94.98% libc-2.15.so   [.] 0x0014698d
>> >> 2.25%   page_fault
>> >> 0.18%   handle_mm_fault
>> >>
>> >> I don't know what the first item is but it took the majority of cycles.
>> >   The first item means that it's some userspace code in libc. My guess
>> > would be that it's libc's memcpy() function (or whatever you use to write
>> > to mmap). How do you access the mmap?
>> >
>>
>> Like this:
>>
>> fd = open(file_name, O_CREAT | O_RDWR | O_DIRECT, 0755);
>> dest = (char *)mmap(NULL, FILE_SIZE, PROT_WRITE, MAP_SHARED, fd, 0);
>> for (i = 0; i < count; i++)
>> {
>>memcpy(dest, src, request_size);
>>dest += request_size;
>> }
>   OK, maybe libc memcpy isn't very well optimized for you cpu? Not sure how
> to tune that though...
>

Hmm, I will try some different kinds of memcpy to see if there is a
difference. Just want to make sure I do not make some stupid mistakes
before trying that.
Thanks a lot for your help!

Thanks,
Andiry
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [BUG][ext2] XIP does not work on ext2

2013-11-07 Thread Jan Kara
On Thu 07-11-13 13:50:09, Andiry Xu wrote:
> On Thu, Nov 7, 2013 at 1:07 PM, Jan Kara  wrote:
> > On Thu 07-11-13 12:14:13, Andiry Xu wrote:
> >> On Wed, Nov 6, 2013 at 1:18 PM, Jan Kara  wrote:
> >> > On Tue 05-11-13 17:28:35, Andiry Xu wrote:
> >> >> >> Do you know the reason why write() outperforms mmap() in some cases? 
> >> >> >> I
> >> >> >> know it's not related the thread but I really appreciate if you can
> >> >> >> answer my question.
> >> >> >   Well, I'm not completely sure. mmap()ed memory always works on 
> >> >> > page-by-page
> >> >> > basis - you first access the page, it gets faulted in and you can 
> >> >> > further
> >> >> > access it. So for small (sub page size) accesses this is a win 
> >> >> > because you
> >> >> > don't have an overhead of syscall and fs write path. For accesses 
> >> >> > larger
> >> >> > than page size the overhead of syscall and some initial checks is well
> >> >> > hidden by other things. I guess write() ends up being more efficient
> >> >> > because write path taken for each page is somewhat lighter than full 
> >> >> > page
> >> >> > fault. But you'd need to look into perf data to get some hard numbers 
> >> >> > on
> >> >> > where the time is spent.
> >> >> >
> >> >>
> >> >> Thanks for the reply. However I have filled up the whole RAM disk
> >> >> before doing the test, i.e. asked the brd driver to allocate all the
> >> >> pages initially.
> >> >   Well, pages in ramdisk are always present, that's not an issue. But you
> >> > will get a page fault to map a particular physical page in process'
> >> > virtual address space when you first access that virtual address in the
> >> > mapping from the process. The cost of setting up this virtual->physical
> >> > mapping is what I'm talking about.
> >> >
> >>
> >> Yes, you are right, there are page faults observed with perf. I
> >> misunderstood page fault as copying pages between backing store and
> >> physical memory.
> >>
> >> > If you had a process which first mmaps the file and writes to all pages 
> >> > in
> >> > the mapping and *then* measure the cost of another round of writing to 
> >> > the
> >> > mapping, I would expect you should see speeds close to those of memory 
> >> > bus.
> >> >
> >>
> >> I've tried this as well. mmap() performance improves but still not as
> >> good as write().
> >> I used the perf report to compare write() and mmap() applications. For
> >> write() version, top of perf report shows as:
> >> 33.33%  __copy_user_nocache
> >> 4.72%ext2_get_blocks
> >> 4.42%mutex_unlock
> >> 3.59%__find_get_block
> >>
> >> which looks reasonable.
> >>
> >> However, for mmap() version, the perf report looks strange:
> >> 94.98% libc-2.15.so   [.] 0x0014698d
> >> 2.25%   page_fault
> >> 0.18%   handle_mm_fault
> >>
> >> I don't know what the first item is but it took the majority of cycles.
> >   The first item means that it's some userspace code in libc. My guess
> > would be that it's libc's memcpy() function (or whatever you use to write
> > to mmap). How do you access the mmap?
> >
> 
> Like this:
> 
> fd = open(file_name, O_CREAT | O_RDWR | O_DIRECT, 0755);
> dest = (char *)mmap(NULL, FILE_SIZE, PROT_WRITE, MAP_SHARED, fd, 0);
> for (i = 0; i < count; i++)
> {
>memcpy(dest, src, request_size);
>dest += request_size;
> }
  OK, maybe libc memcpy isn't very well optimized for you cpu? Not sure how
to tune that though...

Honza
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [BUG][ext2] XIP does not work on ext2

2013-11-07 Thread Andiry Xu
On Thu, Nov 7, 2013 at 1:07 PM, Jan Kara  wrote:
> On Thu 07-11-13 12:14:13, Andiry Xu wrote:
>> On Wed, Nov 6, 2013 at 1:18 PM, Jan Kara  wrote:
>> > On Tue 05-11-13 17:28:35, Andiry Xu wrote:
>> >> >> Do you know the reason why write() outperforms mmap() in some cases? I
>> >> >> know it's not related the thread but I really appreciate if you can
>> >> >> answer my question.
>> >> >   Well, I'm not completely sure. mmap()ed memory always works on 
>> >> > page-by-page
>> >> > basis - you first access the page, it gets faulted in and you can 
>> >> > further
>> >> > access it. So for small (sub page size) accesses this is a win because 
>> >> > you
>> >> > don't have an overhead of syscall and fs write path. For accesses larger
>> >> > than page size the overhead of syscall and some initial checks is well
>> >> > hidden by other things. I guess write() ends up being more efficient
>> >> > because write path taken for each page is somewhat lighter than full 
>> >> > page
>> >> > fault. But you'd need to look into perf data to get some hard numbers on
>> >> > where the time is spent.
>> >> >
>> >>
>> >> Thanks for the reply. However I have filled up the whole RAM disk
>> >> before doing the test, i.e. asked the brd driver to allocate all the
>> >> pages initially.
>> >   Well, pages in ramdisk are always present, that's not an issue. But you
>> > will get a page fault to map a particular physical page in process'
>> > virtual address space when you first access that virtual address in the
>> > mapping from the process. The cost of setting up this virtual->physical
>> > mapping is what I'm talking about.
>> >
>>
>> Yes, you are right, there are page faults observed with perf. I
>> misunderstood page fault as copying pages between backing store and
>> physical memory.
>>
>> > If you had a process which first mmaps the file and writes to all pages in
>> > the mapping and *then* measure the cost of another round of writing to the
>> > mapping, I would expect you should see speeds close to those of memory bus.
>> >
>>
>> I've tried this as well. mmap() performance improves but still not as
>> good as write().
>> I used the perf report to compare write() and mmap() applications. For
>> write() version, top of perf report shows as:
>> 33.33%  __copy_user_nocache
>> 4.72%ext2_get_blocks
>> 4.42%mutex_unlock
>> 3.59%__find_get_block
>>
>> which looks reasonable.
>>
>> However, for mmap() version, the perf report looks strange:
>> 94.98% libc-2.15.so   [.] 0x0014698d
>> 2.25%   page_fault
>> 0.18%   handle_mm_fault
>>
>> I don't know what the first item is but it took the majority of cycles.
>   The first item means that it's some userspace code in libc. My guess
> would be that it's libc's memcpy() function (or whatever you use to write
> to mmap). How do you access the mmap?
>

Like this:

fd = open(file_name, O_CREAT | O_RDWR | O_DIRECT, 0755);
dest = (char *)mmap(NULL, FILE_SIZE, PROT_WRITE, MAP_SHARED, fd, 0);
for (i = 0; i < count; i++)
{
   memcpy(dest, src, request_size);
   dest += request_size;
}

Thanks,
Andiry
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [BUG][ext2] XIP does not work on ext2

2013-11-07 Thread Jan Kara
On Thu 07-11-13 12:14:13, Andiry Xu wrote:
> On Wed, Nov 6, 2013 at 1:18 PM, Jan Kara  wrote:
> > On Tue 05-11-13 17:28:35, Andiry Xu wrote:
> >> >> Do you know the reason why write() outperforms mmap() in some cases? I
> >> >> know it's not related the thread but I really appreciate if you can
> >> >> answer my question.
> >> >   Well, I'm not completely sure. mmap()ed memory always works on 
> >> > page-by-page
> >> > basis - you first access the page, it gets faulted in and you can further
> >> > access it. So for small (sub page size) accesses this is a win because 
> >> > you
> >> > don't have an overhead of syscall and fs write path. For accesses larger
> >> > than page size the overhead of syscall and some initial checks is well
> >> > hidden by other things. I guess write() ends up being more efficient
> >> > because write path taken for each page is somewhat lighter than full page
> >> > fault. But you'd need to look into perf data to get some hard numbers on
> >> > where the time is spent.
> >> >
> >>
> >> Thanks for the reply. However I have filled up the whole RAM disk
> >> before doing the test, i.e. asked the brd driver to allocate all the
> >> pages initially.
> >   Well, pages in ramdisk are always present, that's not an issue. But you
> > will get a page fault to map a particular physical page in process'
> > virtual address space when you first access that virtual address in the
> > mapping from the process. The cost of setting up this virtual->physical
> > mapping is what I'm talking about.
> >
> 
> Yes, you are right, there are page faults observed with perf. I
> misunderstood page fault as copying pages between backing store and
> physical memory.
> 
> > If you had a process which first mmaps the file and writes to all pages in
> > the mapping and *then* measure the cost of another round of writing to the
> > mapping, I would expect you should see speeds close to those of memory bus.
> >
> 
> I've tried this as well. mmap() performance improves but still not as
> good as write().
> I used the perf report to compare write() and mmap() applications. For
> write() version, top of perf report shows as:
> 33.33%  __copy_user_nocache
> 4.72%ext2_get_blocks
> 4.42%mutex_unlock
> 3.59%__find_get_block
> 
> which looks reasonable.
> 
> However, for mmap() version, the perf report looks strange:
> 94.98% libc-2.15.so   [.] 0x0014698d
> 2.25%   page_fault
> 0.18%   handle_mm_fault
> 
> I don't know what the first item is but it took the majority of cycles.
  The first item means that it's some userspace code in libc. My guess
would be that it's libc's memcpy() function (or whatever you use to write
to mmap). How do you access the mmap?

Honza
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [BUG][ext2] XIP does not work on ext2

2013-11-07 Thread Andiry Xu
On Wed, Nov 6, 2013 at 1:18 PM, Jan Kara  wrote:
> On Tue 05-11-13 17:28:35, Andiry Xu wrote:
>> Hi,
>>
>> On Tue, Nov 5, 2013 at 6:32 AM, Jan Kara  wrote:
>> >   Hello,
>> >
>> > On Mon 04-11-13 18:37:40, Andiry Xu wrote:
>> >> On Mon, Nov 4, 2013 at 4:37 PM, Jan Kara  wrote:
>> >> >   Hello,
>> >> >
>> >> > On Mon 04-11-13 14:31:34, Andiry Xu wrote:
>> >> >> When I'm trying XIP on ext2, I find that xip does not work on ext2
>> >> >> with latest kernel.
>> >> >>
>> >> >> Reproduce steps:
>> >> >> Compile kernel with following configs:
>> >> >> CONFIG_BLK_DEV_XIP=y
>> >> >> CONFIG_EXT2_FS_XIP=y
>> >> >>
>> >> >> And run following commands:
>> >> >> # mke2fs -b 4096 /dev/ram0
>> >> >> # mount -t ext2 -o xip /dev/ram0 /mnt/ramdisk/
>> >> >> # dd if=/dev/zero of=/mnt/ramdisk/test1 bs=1M count=16
>> >> >>
>> >> >> And it shows:
>> >> >> dd: writing `/mnt/ramdisk/test1': No space left on device
>> >> >>
>> >> >> df also shows /mnt/ramdisk is 100% full. Its default size is 64MB so a
>> >> >> 16MB write should only occupy 1/4 capacity.
>> >> >>
>> >> >> Criminal commit:
>> >> >> After git bisect, it points to the following commit:
>> >> >> 8e3dffc651cb668e1ff4d8b89cc1c3dde7540d3b
>> >> >> Ext2: mark inode dirty after the function dquot_free_block_nodirty is 
>> >> >> called
>> >> >   Thanks for report and the bisection!
>> >> >
>> >> >> Particularly, the following code:
>> >> >> @@ -1412,9 +1415,11 @@ allocated:
>> >> >> *errp = 0;
>> >> >> brelse(bitmap_bh);
>> >> >> -dquot_free_block_nodirty(inode, *count-num);
>> >> >> -mark_inode_dirty(inode);
>> >> >> -*count = num;
>> >> >> +if (num < *count) {
>> >> >> +dquot_free_block_nodirty(inode, *count-num);
>> >> >> +mark_inode_dirty(inode);
>> >> >> +*count = num;
>> >> >> +}
>> >> >>   return ret_block;
>> >> >>
>> >> >> Not mark_inode_dirty() is called only when num is less than *count.
>> >> >> However, I've seen
>> >> >> with the dd command, there is case where num >= *count.
>> >> >>
>> >> >> Fix:
>> >> >> I've verified that the following patch fixes the issue:
>> >> >> diff --git a/fs/ext2/balloc.c b/fs/ext2/balloc.c
>> >> >> index 9f9992b..5446a52 100644
>> >> >> --- a/fs/ext2/balloc.c
>> >> >> +++ b/fs/ext2/balloc.c
>> >> >> @@ -1406,11 +1406,10 @@ allocated:
>> >> >>
>> >> >> *errp = 0;
>> >> >> brelse(bitmap_bh);
>> >> >> -   if (num < *count) {
>> >> >> +   if (num <= *count)
>> >> >> dquot_free_block_nodirty(inode, *count-num);
>> >> >> -   mark_inode_dirty(inode);
>> >> >> -   *count = num;
>> >> >> -   }
>> >> >> +   mark_inode_dirty(inode);
>> >> >> +   *count = num;
>> >> >> return ret_block;
>> >> >>
>> >> >>  io_error:
>> >> >>
>> >> >> However, I'm not familiar with ext2 source code and cannot tell if
>> >> >> this is the correct fix. At least it fixes my issue.
>> >> >   With this, you have essentially reverted a hunk from commit
>> >> > 8e3dffc651cb668e1ff4d8b89cc1c3dde7540d3b. But I don't see a reason why 
>> >> > it
>> >> > should be reverted. num should never ever be greater than *count and 
>> >> > when
>> >> > num == count, we the code inside if doesn't do anything useful.
>> >> >
>> >> > I've looked into the code and I think I see the problem. It is a long
>> >> > standing bug in __ext2_get_block() in fs/ext2/xip.c. It calls
>> >> > ext2_get_block() asking for 0 blocks to map (while we really want 1 
>> >> > block).
>> >> > ext2_get_block() just passes that request and ext2_get_blocks() actually
>> >> > allocates 1 block. And that's were the commit you have identified makes 
>> >> > a
>> >> > difference because previously we returned that 1 block was allocated 
>> >> > while
>> >> > now we return that 0 blocks were allocated and thus allocation is 
>> >> > repeated
>> >> > until all free blocks are exhaused.
>> >> >
>> >> > Attached patch should fix the problem.
>> >> >
>> >>
>> >> Thanks for the reply. I've verified that your patch fixes my issue.
>> >> And it's absolutely better than my solution.
>> >>
>> >> Tested-by: Andiry Xu 
>> >   Thanks for testing!
>> >
>> >> I have another question about ext2 XIP performance, although it's not
>> >> quite related to this thread.
>> >>
>> >> I'm testing xip with ext2 on a ram disk drive, the driver is brd.c.
>> >> The RAM disk size is 2GB and I pre-fill it to guarantee that all pages
>> >> reside in main memory.
>> >> Then I use two different applications to write to the ram disk. One is
>> >> open() with O_DIRECT flag, and writing with Posix write(). Another is
>> >> open() with O_DIRECT, mmap() it to user space, then use memcpy() to
>> >> write data. I use different request size to write data, from 512 bytes
>> >> to 64MB.
>> >>
>> >> In my understanding, the mmap version bypasses the file system and
>> >> does not go to kernel space, hence it should have better performance
>> >> than the Posix-write version. However, my test result shows it's not
>> >> always 

Re: [BUG][ext2] XIP does not work on ext2

2013-11-07 Thread Andiry Xu
On Wed, Nov 6, 2013 at 1:18 PM, Jan Kara j...@suse.cz wrote:
 On Tue 05-11-13 17:28:35, Andiry Xu wrote:
 Hi,

 On Tue, Nov 5, 2013 at 6:32 AM, Jan Kara j...@suse.cz wrote:
Hello,
 
  On Mon 04-11-13 18:37:40, Andiry Xu wrote:
  On Mon, Nov 4, 2013 at 4:37 PM, Jan Kara j...@suse.cz wrote:
 Hello,
  
   On Mon 04-11-13 14:31:34, Andiry Xu wrote:
   When I'm trying XIP on ext2, I find that xip does not work on ext2
   with latest kernel.
  
   Reproduce steps:
   Compile kernel with following configs:
   CONFIG_BLK_DEV_XIP=y
   CONFIG_EXT2_FS_XIP=y
  
   And run following commands:
   # mke2fs -b 4096 /dev/ram0
   # mount -t ext2 -o xip /dev/ram0 /mnt/ramdisk/
   # dd if=/dev/zero of=/mnt/ramdisk/test1 bs=1M count=16
  
   And it shows:
   dd: writing `/mnt/ramdisk/test1': No space left on device
  
   df also shows /mnt/ramdisk is 100% full. Its default size is 64MB so a
   16MB write should only occupy 1/4 capacity.
  
   Criminal commit:
   After git bisect, it points to the following commit:
   8e3dffc651cb668e1ff4d8b89cc1c3dde7540d3b
   Ext2: mark inode dirty after the function dquot_free_block_nodirty is 
   called
 Thanks for report and the bisection!
  
   Particularly, the following code:
   @@ -1412,9 +1415,11 @@ allocated:
   *errp = 0;
   brelse(bitmap_bh);
   -dquot_free_block_nodirty(inode, *count-num);
   -mark_inode_dirty(inode);
   -*count = num;
   +if (num  *count) {
   +dquot_free_block_nodirty(inode, *count-num);
   +mark_inode_dirty(inode);
   +*count = num;
   +}
 return ret_block;
  
   Not mark_inode_dirty() is called only when num is less than *count.
   However, I've seen
   with the dd command, there is case where num = *count.
  
   Fix:
   I've verified that the following patch fixes the issue:
   diff --git a/fs/ext2/balloc.c b/fs/ext2/balloc.c
   index 9f9992b..5446a52 100644
   --- a/fs/ext2/balloc.c
   +++ b/fs/ext2/balloc.c
   @@ -1406,11 +1406,10 @@ allocated:
  
   *errp = 0;
   brelse(bitmap_bh);
   -   if (num  *count) {
   +   if (num = *count)
   dquot_free_block_nodirty(inode, *count-num);
   -   mark_inode_dirty(inode);
   -   *count = num;
   -   }
   +   mark_inode_dirty(inode);
   +   *count = num;
   return ret_block;
  
io_error:
  
   However, I'm not familiar with ext2 source code and cannot tell if
   this is the correct fix. At least it fixes my issue.
 With this, you have essentially reverted a hunk from commit
   8e3dffc651cb668e1ff4d8b89cc1c3dde7540d3b. But I don't see a reason why 
   it
   should be reverted. num should never ever be greater than *count and 
   when
   num == count, we the code inside if doesn't do anything useful.
  
   I've looked into the code and I think I see the problem. It is a long
   standing bug in __ext2_get_block() in fs/ext2/xip.c. It calls
   ext2_get_block() asking for 0 blocks to map (while we really want 1 
   block).
   ext2_get_block() just passes that request and ext2_get_blocks() actually
   allocates 1 block. And that's were the commit you have identified makes 
   a
   difference because previously we returned that 1 block was allocated 
   while
   now we return that 0 blocks were allocated and thus allocation is 
   repeated
   until all free blocks are exhaused.
  
   Attached patch should fix the problem.
  
 
  Thanks for the reply. I've verified that your patch fixes my issue.
  And it's absolutely better than my solution.
 
  Tested-by: Andiry Xu andiry...@gmail.com
Thanks for testing!
 
  I have another question about ext2 XIP performance, although it's not
  quite related to this thread.
 
  I'm testing xip with ext2 on a ram disk drive, the driver is brd.c.
  The RAM disk size is 2GB and I pre-fill it to guarantee that all pages
  reside in main memory.
  Then I use two different applications to write to the ram disk. One is
  open() with O_DIRECT flag, and writing with Posix write(). Another is
  open() with O_DIRECT, mmap() it to user space, then use memcpy() to
  write data. I use different request size to write data, from 512 bytes
  to 64MB.
 
  In my understanding, the mmap version bypasses the file system and
  does not go to kernel space, hence it should have better performance
  than the Posix-write version. However, my test result shows it's not
  always true: when the request size is between 8KB and 1MB, the
  Posix-write() version has bandwidth about 7GB/s and mmap version only
  has 5GB/s. The test is performed on a i7-3770K machine with 8GB
  memory, kernel 3.12. Also I have tested on kernel 3.2, in which mmap
  has really bad performance, only 2GB/s for all request sizes.
 
  Do you know the reason why write() outperforms mmap() in some cases? I
  know it's not related the thread but I really appreciate if you can
  answer my question.
Well, I'm not completely sure. mmap()ed memory always works on 
  page-by-page
  basis - you 

Re: [BUG][ext2] XIP does not work on ext2

2013-11-07 Thread Jan Kara
On Thu 07-11-13 12:14:13, Andiry Xu wrote:
 On Wed, Nov 6, 2013 at 1:18 PM, Jan Kara j...@suse.cz wrote:
  On Tue 05-11-13 17:28:35, Andiry Xu wrote:
   Do you know the reason why write() outperforms mmap() in some cases? I
   know it's not related the thread but I really appreciate if you can
   answer my question.
 Well, I'm not completely sure. mmap()ed memory always works on 
   page-by-page
   basis - you first access the page, it gets faulted in and you can further
   access it. So for small (sub page size) accesses this is a win because 
   you
   don't have an overhead of syscall and fs write path. For accesses larger
   than page size the overhead of syscall and some initial checks is well
   hidden by other things. I guess write() ends up being more efficient
   because write path taken for each page is somewhat lighter than full page
   fault. But you'd need to look into perf data to get some hard numbers on
   where the time is spent.
  
 
  Thanks for the reply. However I have filled up the whole RAM disk
  before doing the test, i.e. asked the brd driver to allocate all the
  pages initially.
Well, pages in ramdisk are always present, that's not an issue. But you
  will get a page fault to map a particular physical page in process'
  virtual address space when you first access that virtual address in the
  mapping from the process. The cost of setting up this virtual-physical
  mapping is what I'm talking about.
 
 
 Yes, you are right, there are page faults observed with perf. I
 misunderstood page fault as copying pages between backing store and
 physical memory.
 
  If you had a process which first mmaps the file and writes to all pages in
  the mapping and *then* measure the cost of another round of writing to the
  mapping, I would expect you should see speeds close to those of memory bus.
 
 
 I've tried this as well. mmap() performance improves but still not as
 good as write().
 I used the perf report to compare write() and mmap() applications. For
 write() version, top of perf report shows as:
 33.33%  __copy_user_nocache
 4.72%ext2_get_blocks
 4.42%mutex_unlock
 3.59%__find_get_block
 
 which looks reasonable.
 
 However, for mmap() version, the perf report looks strange:
 94.98% libc-2.15.so   [.] 0x0014698d
 2.25%   page_fault
 0.18%   handle_mm_fault
 
 I don't know what the first item is but it took the majority of cycles.
  The first item means that it's some userspace code in libc. My guess
would be that it's libc's memcpy() function (or whatever you use to write
to mmap). How do you access the mmap?

Honza
-- 
Jan Kara j...@suse.cz
SUSE Labs, CR
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [BUG][ext2] XIP does not work on ext2

2013-11-07 Thread Andiry Xu
On Thu, Nov 7, 2013 at 1:07 PM, Jan Kara j...@suse.cz wrote:
 On Thu 07-11-13 12:14:13, Andiry Xu wrote:
 On Wed, Nov 6, 2013 at 1:18 PM, Jan Kara j...@suse.cz wrote:
  On Tue 05-11-13 17:28:35, Andiry Xu wrote:
   Do you know the reason why write() outperforms mmap() in some cases? I
   know it's not related the thread but I really appreciate if you can
   answer my question.
 Well, I'm not completely sure. mmap()ed memory always works on 
   page-by-page
   basis - you first access the page, it gets faulted in and you can 
   further
   access it. So for small (sub page size) accesses this is a win because 
   you
   don't have an overhead of syscall and fs write path. For accesses larger
   than page size the overhead of syscall and some initial checks is well
   hidden by other things. I guess write() ends up being more efficient
   because write path taken for each page is somewhat lighter than full 
   page
   fault. But you'd need to look into perf data to get some hard numbers on
   where the time is spent.
  
 
  Thanks for the reply. However I have filled up the whole RAM disk
  before doing the test, i.e. asked the brd driver to allocate all the
  pages initially.
Well, pages in ramdisk are always present, that's not an issue. But you
  will get a page fault to map a particular physical page in process'
  virtual address space when you first access that virtual address in the
  mapping from the process. The cost of setting up this virtual-physical
  mapping is what I'm talking about.
 

 Yes, you are right, there are page faults observed with perf. I
 misunderstood page fault as copying pages between backing store and
 physical memory.

  If you had a process which first mmaps the file and writes to all pages in
  the mapping and *then* measure the cost of another round of writing to the
  mapping, I would expect you should see speeds close to those of memory bus.
 

 I've tried this as well. mmap() performance improves but still not as
 good as write().
 I used the perf report to compare write() and mmap() applications. For
 write() version, top of perf report shows as:
 33.33%  __copy_user_nocache
 4.72%ext2_get_blocks
 4.42%mutex_unlock
 3.59%__find_get_block

 which looks reasonable.

 However, for mmap() version, the perf report looks strange:
 94.98% libc-2.15.so   [.] 0x0014698d
 2.25%   page_fault
 0.18%   handle_mm_fault

 I don't know what the first item is but it took the majority of cycles.
   The first item means that it's some userspace code in libc. My guess
 would be that it's libc's memcpy() function (or whatever you use to write
 to mmap). How do you access the mmap?


Like this:

fd = open(file_name, O_CREAT | O_RDWR | O_DIRECT, 0755);
dest = (char *)mmap(NULL, FILE_SIZE, PROT_WRITE, MAP_SHARED, fd, 0);
for (i = 0; i  count; i++)
{
   memcpy(dest, src, request_size);
   dest += request_size;
}

Thanks,
Andiry
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [BUG][ext2] XIP does not work on ext2

2013-11-07 Thread Jan Kara
On Thu 07-11-13 13:50:09, Andiry Xu wrote:
 On Thu, Nov 7, 2013 at 1:07 PM, Jan Kara j...@suse.cz wrote:
  On Thu 07-11-13 12:14:13, Andiry Xu wrote:
  On Wed, Nov 6, 2013 at 1:18 PM, Jan Kara j...@suse.cz wrote:
   On Tue 05-11-13 17:28:35, Andiry Xu wrote:
Do you know the reason why write() outperforms mmap() in some cases? 
I
know it's not related the thread but I really appreciate if you can
answer my question.
  Well, I'm not completely sure. mmap()ed memory always works on 
page-by-page
basis - you first access the page, it gets faulted in and you can 
further
access it. So for small (sub page size) accesses this is a win 
because you
don't have an overhead of syscall and fs write path. For accesses 
larger
than page size the overhead of syscall and some initial checks is well
hidden by other things. I guess write() ends up being more efficient
because write path taken for each page is somewhat lighter than full 
page
fault. But you'd need to look into perf data to get some hard numbers 
on
where the time is spent.
   
  
   Thanks for the reply. However I have filled up the whole RAM disk
   before doing the test, i.e. asked the brd driver to allocate all the
   pages initially.
 Well, pages in ramdisk are always present, that's not an issue. But you
   will get a page fault to map a particular physical page in process'
   virtual address space when you first access that virtual address in the
   mapping from the process. The cost of setting up this virtual-physical
   mapping is what I'm talking about.
  
 
  Yes, you are right, there are page faults observed with perf. I
  misunderstood page fault as copying pages between backing store and
  physical memory.
 
   If you had a process which first mmaps the file and writes to all pages 
   in
   the mapping and *then* measure the cost of another round of writing to 
   the
   mapping, I would expect you should see speeds close to those of memory 
   bus.
  
 
  I've tried this as well. mmap() performance improves but still not as
  good as write().
  I used the perf report to compare write() and mmap() applications. For
  write() version, top of perf report shows as:
  33.33%  __copy_user_nocache
  4.72%ext2_get_blocks
  4.42%mutex_unlock
  3.59%__find_get_block
 
  which looks reasonable.
 
  However, for mmap() version, the perf report looks strange:
  94.98% libc-2.15.so   [.] 0x0014698d
  2.25%   page_fault
  0.18%   handle_mm_fault
 
  I don't know what the first item is but it took the majority of cycles.
The first item means that it's some userspace code in libc. My guess
  would be that it's libc's memcpy() function (or whatever you use to write
  to mmap). How do you access the mmap?
 
 
 Like this:
 
 fd = open(file_name, O_CREAT | O_RDWR | O_DIRECT, 0755);
 dest = (char *)mmap(NULL, FILE_SIZE, PROT_WRITE, MAP_SHARED, fd, 0);
 for (i = 0; i  count; i++)
 {
memcpy(dest, src, request_size);
dest += request_size;
 }
  OK, maybe libc memcpy isn't very well optimized for you cpu? Not sure how
to tune that though...

Honza
-- 
Jan Kara j...@suse.cz
SUSE Labs, CR
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [BUG][ext2] XIP does not work on ext2

2013-11-07 Thread Andiry Xu
On Thu, Nov 7, 2013 at 2:20 PM, Jan Kara j...@suse.cz wrote:
 On Thu 07-11-13 13:50:09, Andiry Xu wrote:
 On Thu, Nov 7, 2013 at 1:07 PM, Jan Kara j...@suse.cz wrote:
  On Thu 07-11-13 12:14:13, Andiry Xu wrote:
  On Wed, Nov 6, 2013 at 1:18 PM, Jan Kara j...@suse.cz wrote:
   On Tue 05-11-13 17:28:35, Andiry Xu wrote:
Do you know the reason why write() outperforms mmap() in some 
cases? I
know it's not related the thread but I really appreciate if you can
answer my question.
  Well, I'm not completely sure. mmap()ed memory always works on 
page-by-page
basis - you first access the page, it gets faulted in and you can 
further
access it. So for small (sub page size) accesses this is a win 
because you
don't have an overhead of syscall and fs write path. For accesses 
larger
than page size the overhead of syscall and some initial checks is 
well
hidden by other things. I guess write() ends up being more efficient
because write path taken for each page is somewhat lighter than full 
page
fault. But you'd need to look into perf data to get some hard 
numbers on
where the time is spent.
   
  
   Thanks for the reply. However I have filled up the whole RAM disk
   before doing the test, i.e. asked the brd driver to allocate all the
   pages initially.
 Well, pages in ramdisk are always present, that's not an issue. But 
   you
   will get a page fault to map a particular physical page in process'
   virtual address space when you first access that virtual address in the
   mapping from the process. The cost of setting up this virtual-physical
   mapping is what I'm talking about.
  
 
  Yes, you are right, there are page faults observed with perf. I
  misunderstood page fault as copying pages between backing store and
  physical memory.
 
   If you had a process which first mmaps the file and writes to all pages 
   in
   the mapping and *then* measure the cost of another round of writing to 
   the
   mapping, I would expect you should see speeds close to those of memory 
   bus.
  
 
  I've tried this as well. mmap() performance improves but still not as
  good as write().
  I used the perf report to compare write() and mmap() applications. For
  write() version, top of perf report shows as:
  33.33%  __copy_user_nocache
  4.72%ext2_get_blocks
  4.42%mutex_unlock
  3.59%__find_get_block
 
  which looks reasonable.
 
  However, for mmap() version, the perf report looks strange:
  94.98% libc-2.15.so   [.] 0x0014698d
  2.25%   page_fault
  0.18%   handle_mm_fault
 
  I don't know what the first item is but it took the majority of cycles.
The first item means that it's some userspace code in libc. My guess
  would be that it's libc's memcpy() function (or whatever you use to write
  to mmap). How do you access the mmap?
 

 Like this:

 fd = open(file_name, O_CREAT | O_RDWR | O_DIRECT, 0755);
 dest = (char *)mmap(NULL, FILE_SIZE, PROT_WRITE, MAP_SHARED, fd, 0);
 for (i = 0; i  count; i++)
 {
memcpy(dest, src, request_size);
dest += request_size;
 }
   OK, maybe libc memcpy isn't very well optimized for you cpu? Not sure how
 to tune that though...


Hmm, I will try some different kinds of memcpy to see if there is a
difference. Just want to make sure I do not make some stupid mistakes
before trying that.
Thanks a lot for your help!

Thanks,
Andiry
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [BUG][ext2] XIP does not work on ext2

2013-11-06 Thread Jan Kara
On Tue 05-11-13 17:28:35, Andiry Xu wrote:
> Hi,
> 
> On Tue, Nov 5, 2013 at 6:32 AM, Jan Kara  wrote:
> >   Hello,
> >
> > On Mon 04-11-13 18:37:40, Andiry Xu wrote:
> >> On Mon, Nov 4, 2013 at 4:37 PM, Jan Kara  wrote:
> >> >   Hello,
> >> >
> >> > On Mon 04-11-13 14:31:34, Andiry Xu wrote:
> >> >> When I'm trying XIP on ext2, I find that xip does not work on ext2
> >> >> with latest kernel.
> >> >>
> >> >> Reproduce steps:
> >> >> Compile kernel with following configs:
> >> >> CONFIG_BLK_DEV_XIP=y
> >> >> CONFIG_EXT2_FS_XIP=y
> >> >>
> >> >> And run following commands:
> >> >> # mke2fs -b 4096 /dev/ram0
> >> >> # mount -t ext2 -o xip /dev/ram0 /mnt/ramdisk/
> >> >> # dd if=/dev/zero of=/mnt/ramdisk/test1 bs=1M count=16
> >> >>
> >> >> And it shows:
> >> >> dd: writing `/mnt/ramdisk/test1': No space left on device
> >> >>
> >> >> df also shows /mnt/ramdisk is 100% full. Its default size is 64MB so a
> >> >> 16MB write should only occupy 1/4 capacity.
> >> >>
> >> >> Criminal commit:
> >> >> After git bisect, it points to the following commit:
> >> >> 8e3dffc651cb668e1ff4d8b89cc1c3dde7540d3b
> >> >> Ext2: mark inode dirty after the function dquot_free_block_nodirty is 
> >> >> called
> >> >   Thanks for report and the bisection!
> >> >
> >> >> Particularly, the following code:
> >> >> @@ -1412,9 +1415,11 @@ allocated:
> >> >> *errp = 0;
> >> >> brelse(bitmap_bh);
> >> >> -dquot_free_block_nodirty(inode, *count-num);
> >> >> -mark_inode_dirty(inode);
> >> >> -*count = num;
> >> >> +if (num < *count) {
> >> >> +dquot_free_block_nodirty(inode, *count-num);
> >> >> +mark_inode_dirty(inode);
> >> >> +*count = num;
> >> >> +}
> >> >>   return ret_block;
> >> >>
> >> >> Not mark_inode_dirty() is called only when num is less than *count.
> >> >> However, I've seen
> >> >> with the dd command, there is case where num >= *count.
> >> >>
> >> >> Fix:
> >> >> I've verified that the following patch fixes the issue:
> >> >> diff --git a/fs/ext2/balloc.c b/fs/ext2/balloc.c
> >> >> index 9f9992b..5446a52 100644
> >> >> --- a/fs/ext2/balloc.c
> >> >> +++ b/fs/ext2/balloc.c
> >> >> @@ -1406,11 +1406,10 @@ allocated:
> >> >>
> >> >> *errp = 0;
> >> >> brelse(bitmap_bh);
> >> >> -   if (num < *count) {
> >> >> +   if (num <= *count)
> >> >> dquot_free_block_nodirty(inode, *count-num);
> >> >> -   mark_inode_dirty(inode);
> >> >> -   *count = num;
> >> >> -   }
> >> >> +   mark_inode_dirty(inode);
> >> >> +   *count = num;
> >> >> return ret_block;
> >> >>
> >> >>  io_error:
> >> >>
> >> >> However, I'm not familiar with ext2 source code and cannot tell if
> >> >> this is the correct fix. At least it fixes my issue.
> >> >   With this, you have essentially reverted a hunk from commit
> >> > 8e3dffc651cb668e1ff4d8b89cc1c3dde7540d3b. But I don't see a reason why it
> >> > should be reverted. num should never ever be greater than *count and when
> >> > num == count, we the code inside if doesn't do anything useful.
> >> >
> >> > I've looked into the code and I think I see the problem. It is a long
> >> > standing bug in __ext2_get_block() in fs/ext2/xip.c. It calls
> >> > ext2_get_block() asking for 0 blocks to map (while we really want 1 
> >> > block).
> >> > ext2_get_block() just passes that request and ext2_get_blocks() actually
> >> > allocates 1 block. And that's were the commit you have identified makes a
> >> > difference because previously we returned that 1 block was allocated 
> >> > while
> >> > now we return that 0 blocks were allocated and thus allocation is 
> >> > repeated
> >> > until all free blocks are exhaused.
> >> >
> >> > Attached patch should fix the problem.
> >> >
> >>
> >> Thanks for the reply. I've verified that your patch fixes my issue.
> >> And it's absolutely better than my solution.
> >>
> >> Tested-by: Andiry Xu 
> >   Thanks for testing!
> >
> >> I have another question about ext2 XIP performance, although it's not
> >> quite related to this thread.
> >>
> >> I'm testing xip with ext2 on a ram disk drive, the driver is brd.c.
> >> The RAM disk size is 2GB and I pre-fill it to guarantee that all pages
> >> reside in main memory.
> >> Then I use two different applications to write to the ram disk. One is
> >> open() with O_DIRECT flag, and writing with Posix write(). Another is
> >> open() with O_DIRECT, mmap() it to user space, then use memcpy() to
> >> write data. I use different request size to write data, from 512 bytes
> >> to 64MB.
> >>
> >> In my understanding, the mmap version bypasses the file system and
> >> does not go to kernel space, hence it should have better performance
> >> than the Posix-write version. However, my test result shows it's not
> >> always true: when the request size is between 8KB and 1MB, the
> >> Posix-write() version has bandwidth about 7GB/s and mmap version only
> >> has 5GB/s. The test is performed on a i7-3770K machine with 8GB

Re: [BUG][ext2] XIP does not work on ext2

2013-11-06 Thread Jan Kara
On Tue 05-11-13 17:28:35, Andiry Xu wrote:
 Hi,
 
 On Tue, Nov 5, 2013 at 6:32 AM, Jan Kara j...@suse.cz wrote:
Hello,
 
  On Mon 04-11-13 18:37:40, Andiry Xu wrote:
  On Mon, Nov 4, 2013 at 4:37 PM, Jan Kara j...@suse.cz wrote:
 Hello,
  
   On Mon 04-11-13 14:31:34, Andiry Xu wrote:
   When I'm trying XIP on ext2, I find that xip does not work on ext2
   with latest kernel.
  
   Reproduce steps:
   Compile kernel with following configs:
   CONFIG_BLK_DEV_XIP=y
   CONFIG_EXT2_FS_XIP=y
  
   And run following commands:
   # mke2fs -b 4096 /dev/ram0
   # mount -t ext2 -o xip /dev/ram0 /mnt/ramdisk/
   # dd if=/dev/zero of=/mnt/ramdisk/test1 bs=1M count=16
  
   And it shows:
   dd: writing `/mnt/ramdisk/test1': No space left on device
  
   df also shows /mnt/ramdisk is 100% full. Its default size is 64MB so a
   16MB write should only occupy 1/4 capacity.
  
   Criminal commit:
   After git bisect, it points to the following commit:
   8e3dffc651cb668e1ff4d8b89cc1c3dde7540d3b
   Ext2: mark inode dirty after the function dquot_free_block_nodirty is 
   called
 Thanks for report and the bisection!
  
   Particularly, the following code:
   @@ -1412,9 +1415,11 @@ allocated:
   *errp = 0;
   brelse(bitmap_bh);
   -dquot_free_block_nodirty(inode, *count-num);
   -mark_inode_dirty(inode);
   -*count = num;
   +if (num  *count) {
   +dquot_free_block_nodirty(inode, *count-num);
   +mark_inode_dirty(inode);
   +*count = num;
   +}
 return ret_block;
  
   Not mark_inode_dirty() is called only when num is less than *count.
   However, I've seen
   with the dd command, there is case where num = *count.
  
   Fix:
   I've verified that the following patch fixes the issue:
   diff --git a/fs/ext2/balloc.c b/fs/ext2/balloc.c
   index 9f9992b..5446a52 100644
   --- a/fs/ext2/balloc.c
   +++ b/fs/ext2/balloc.c
   @@ -1406,11 +1406,10 @@ allocated:
  
   *errp = 0;
   brelse(bitmap_bh);
   -   if (num  *count) {
   +   if (num = *count)
   dquot_free_block_nodirty(inode, *count-num);
   -   mark_inode_dirty(inode);
   -   *count = num;
   -   }
   +   mark_inode_dirty(inode);
   +   *count = num;
   return ret_block;
  
io_error:
  
   However, I'm not familiar with ext2 source code and cannot tell if
   this is the correct fix. At least it fixes my issue.
 With this, you have essentially reverted a hunk from commit
   8e3dffc651cb668e1ff4d8b89cc1c3dde7540d3b. But I don't see a reason why it
   should be reverted. num should never ever be greater than *count and when
   num == count, we the code inside if doesn't do anything useful.
  
   I've looked into the code and I think I see the problem. It is a long
   standing bug in __ext2_get_block() in fs/ext2/xip.c. It calls
   ext2_get_block() asking for 0 blocks to map (while we really want 1 
   block).
   ext2_get_block() just passes that request and ext2_get_blocks() actually
   allocates 1 block. And that's were the commit you have identified makes a
   difference because previously we returned that 1 block was allocated 
   while
   now we return that 0 blocks were allocated and thus allocation is 
   repeated
   until all free blocks are exhaused.
  
   Attached patch should fix the problem.
  
 
  Thanks for the reply. I've verified that your patch fixes my issue.
  And it's absolutely better than my solution.
 
  Tested-by: Andiry Xu andiry...@gmail.com
Thanks for testing!
 
  I have another question about ext2 XIP performance, although it's not
  quite related to this thread.
 
  I'm testing xip with ext2 on a ram disk drive, the driver is brd.c.
  The RAM disk size is 2GB and I pre-fill it to guarantee that all pages
  reside in main memory.
  Then I use two different applications to write to the ram disk. One is
  open() with O_DIRECT flag, and writing with Posix write(). Another is
  open() with O_DIRECT, mmap() it to user space, then use memcpy() to
  write data. I use different request size to write data, from 512 bytes
  to 64MB.
 
  In my understanding, the mmap version bypasses the file system and
  does not go to kernel space, hence it should have better performance
  than the Posix-write version. However, my test result shows it's not
  always true: when the request size is between 8KB and 1MB, the
  Posix-write() version has bandwidth about 7GB/s and mmap version only
  has 5GB/s. The test is performed on a i7-3770K machine with 8GB
  memory, kernel 3.12. Also I have tested on kernel 3.2, in which mmap
  has really bad performance, only 2GB/s for all request sizes.
 
  Do you know the reason why write() outperforms mmap() in some cases? I
  know it's not related the thread but I really appreciate if you can
  answer my question.
Well, I'm not completely sure. mmap()ed memory always works on 
  page-by-page
  basis - you first access the page, it gets faulted in and you can further
  access 

Re: [BUG][ext2] XIP does not work on ext2

2013-11-05 Thread Andiry Xu
Hi,

On Tue, Nov 5, 2013 at 6:32 AM, Jan Kara  wrote:
>   Hello,
>
> On Mon 04-11-13 18:37:40, Andiry Xu wrote:
>> On Mon, Nov 4, 2013 at 4:37 PM, Jan Kara  wrote:
>> >   Hello,
>> >
>> > On Mon 04-11-13 14:31:34, Andiry Xu wrote:
>> >> When I'm trying XIP on ext2, I find that xip does not work on ext2
>> >> with latest kernel.
>> >>
>> >> Reproduce steps:
>> >> Compile kernel with following configs:
>> >> CONFIG_BLK_DEV_XIP=y
>> >> CONFIG_EXT2_FS_XIP=y
>> >>
>> >> And run following commands:
>> >> # mke2fs -b 4096 /dev/ram0
>> >> # mount -t ext2 -o xip /dev/ram0 /mnt/ramdisk/
>> >> # dd if=/dev/zero of=/mnt/ramdisk/test1 bs=1M count=16
>> >>
>> >> And it shows:
>> >> dd: writing `/mnt/ramdisk/test1': No space left on device
>> >>
>> >> df also shows /mnt/ramdisk is 100% full. Its default size is 64MB so a
>> >> 16MB write should only occupy 1/4 capacity.
>> >>
>> >> Criminal commit:
>> >> After git bisect, it points to the following commit:
>> >> 8e3dffc651cb668e1ff4d8b89cc1c3dde7540d3b
>> >> Ext2: mark inode dirty after the function dquot_free_block_nodirty is 
>> >> called
>> >   Thanks for report and the bisection!
>> >
>> >> Particularly, the following code:
>> >> @@ -1412,9 +1415,11 @@ allocated:
>> >> *errp = 0;
>> >> brelse(bitmap_bh);
>> >> -dquot_free_block_nodirty(inode, *count-num);
>> >> -mark_inode_dirty(inode);
>> >> -*count = num;
>> >> +if (num < *count) {
>> >> +dquot_free_block_nodirty(inode, *count-num);
>> >> +mark_inode_dirty(inode);
>> >> +*count = num;
>> >> +}
>> >>   return ret_block;
>> >>
>> >> Not mark_inode_dirty() is called only when num is less than *count.
>> >> However, I've seen
>> >> with the dd command, there is case where num >= *count.
>> >>
>> >> Fix:
>> >> I've verified that the following patch fixes the issue:
>> >> diff --git a/fs/ext2/balloc.c b/fs/ext2/balloc.c
>> >> index 9f9992b..5446a52 100644
>> >> --- a/fs/ext2/balloc.c
>> >> +++ b/fs/ext2/balloc.c
>> >> @@ -1406,11 +1406,10 @@ allocated:
>> >>
>> >> *errp = 0;
>> >> brelse(bitmap_bh);
>> >> -   if (num < *count) {
>> >> +   if (num <= *count)
>> >> dquot_free_block_nodirty(inode, *count-num);
>> >> -   mark_inode_dirty(inode);
>> >> -   *count = num;
>> >> -   }
>> >> +   mark_inode_dirty(inode);
>> >> +   *count = num;
>> >> return ret_block;
>> >>
>> >>  io_error:
>> >>
>> >> However, I'm not familiar with ext2 source code and cannot tell if
>> >> this is the correct fix. At least it fixes my issue.
>> >   With this, you have essentially reverted a hunk from commit
>> > 8e3dffc651cb668e1ff4d8b89cc1c3dde7540d3b. But I don't see a reason why it
>> > should be reverted. num should never ever be greater than *count and when
>> > num == count, we the code inside if doesn't do anything useful.
>> >
>> > I've looked into the code and I think I see the problem. It is a long
>> > standing bug in __ext2_get_block() in fs/ext2/xip.c. It calls
>> > ext2_get_block() asking for 0 blocks to map (while we really want 1 block).
>> > ext2_get_block() just passes that request and ext2_get_blocks() actually
>> > allocates 1 block. And that's were the commit you have identified makes a
>> > difference because previously we returned that 1 block was allocated while
>> > now we return that 0 blocks were allocated and thus allocation is repeated
>> > until all free blocks are exhaused.
>> >
>> > Attached patch should fix the problem.
>> >
>>
>> Thanks for the reply. I've verified that your patch fixes my issue.
>> And it's absolutely better than my solution.
>>
>> Tested-by: Andiry Xu 
>   Thanks for testing!
>
>> I have another question about ext2 XIP performance, although it's not
>> quite related to this thread.
>>
>> I'm testing xip with ext2 on a ram disk drive, the driver is brd.c.
>> The RAM disk size is 2GB and I pre-fill it to guarantee that all pages
>> reside in main memory.
>> Then I use two different applications to write to the ram disk. One is
>> open() with O_DIRECT flag, and writing with Posix write(). Another is
>> open() with O_DIRECT, mmap() it to user space, then use memcpy() to
>> write data. I use different request size to write data, from 512 bytes
>> to 64MB.
>>
>> In my understanding, the mmap version bypasses the file system and
>> does not go to kernel space, hence it should have better performance
>> than the Posix-write version. However, my test result shows it's not
>> always true: when the request size is between 8KB and 1MB, the
>> Posix-write() version has bandwidth about 7GB/s and mmap version only
>> has 5GB/s. The test is performed on a i7-3770K machine with 8GB
>> memory, kernel 3.12. Also I have tested on kernel 3.2, in which mmap
>> has really bad performance, only 2GB/s for all request sizes.
>>
>> Do you know the reason why write() outperforms mmap() in some cases? I
>> know it's not related the thread but I really appreciate if you can
>> answer my 

Re: [BUG][ext2] XIP does not work on ext2

2013-11-05 Thread Jan Kara
  Hello,

On Mon 04-11-13 18:37:40, Andiry Xu wrote:
> On Mon, Nov 4, 2013 at 4:37 PM, Jan Kara  wrote:
> >   Hello,
> >
> > On Mon 04-11-13 14:31:34, Andiry Xu wrote:
> >> When I'm trying XIP on ext2, I find that xip does not work on ext2
> >> with latest kernel.
> >>
> >> Reproduce steps:
> >> Compile kernel with following configs:
> >> CONFIG_BLK_DEV_XIP=y
> >> CONFIG_EXT2_FS_XIP=y
> >>
> >> And run following commands:
> >> # mke2fs -b 4096 /dev/ram0
> >> # mount -t ext2 -o xip /dev/ram0 /mnt/ramdisk/
> >> # dd if=/dev/zero of=/mnt/ramdisk/test1 bs=1M count=16
> >>
> >> And it shows:
> >> dd: writing `/mnt/ramdisk/test1': No space left on device
> >>
> >> df also shows /mnt/ramdisk is 100% full. Its default size is 64MB so a
> >> 16MB write should only occupy 1/4 capacity.
> >>
> >> Criminal commit:
> >> After git bisect, it points to the following commit:
> >> 8e3dffc651cb668e1ff4d8b89cc1c3dde7540d3b
> >> Ext2: mark inode dirty after the function dquot_free_block_nodirty is 
> >> called
> >   Thanks for report and the bisection!
> >
> >> Particularly, the following code:
> >> @@ -1412,9 +1415,11 @@ allocated:
> >> *errp = 0;
> >> brelse(bitmap_bh);
> >> -dquot_free_block_nodirty(inode, *count-num);
> >> -mark_inode_dirty(inode);
> >> -*count = num;
> >> +if (num < *count) {
> >> +dquot_free_block_nodirty(inode, *count-num);
> >> +mark_inode_dirty(inode);
> >> +*count = num;
> >> +}
> >>   return ret_block;
> >>
> >> Not mark_inode_dirty() is called only when num is less than *count.
> >> However, I've seen
> >> with the dd command, there is case where num >= *count.
> >>
> >> Fix:
> >> I've verified that the following patch fixes the issue:
> >> diff --git a/fs/ext2/balloc.c b/fs/ext2/balloc.c
> >> index 9f9992b..5446a52 100644
> >> --- a/fs/ext2/balloc.c
> >> +++ b/fs/ext2/balloc.c
> >> @@ -1406,11 +1406,10 @@ allocated:
> >>
> >> *errp = 0;
> >> brelse(bitmap_bh);
> >> -   if (num < *count) {
> >> +   if (num <= *count)
> >> dquot_free_block_nodirty(inode, *count-num);
> >> -   mark_inode_dirty(inode);
> >> -   *count = num;
> >> -   }
> >> +   mark_inode_dirty(inode);
> >> +   *count = num;
> >> return ret_block;
> >>
> >>  io_error:
> >>
> >> However, I'm not familiar with ext2 source code and cannot tell if
> >> this is the correct fix. At least it fixes my issue.
> >   With this, you have essentially reverted a hunk from commit
> > 8e3dffc651cb668e1ff4d8b89cc1c3dde7540d3b. But I don't see a reason why it
> > should be reverted. num should never ever be greater than *count and when
> > num == count, we the code inside if doesn't do anything useful.
> >
> > I've looked into the code and I think I see the problem. It is a long
> > standing bug in __ext2_get_block() in fs/ext2/xip.c. It calls
> > ext2_get_block() asking for 0 blocks to map (while we really want 1 block).
> > ext2_get_block() just passes that request and ext2_get_blocks() actually
> > allocates 1 block. And that's were the commit you have identified makes a
> > difference because previously we returned that 1 block was allocated while
> > now we return that 0 blocks were allocated and thus allocation is repeated
> > until all free blocks are exhaused.
> >
> > Attached patch should fix the problem.
> >
> 
> Thanks for the reply. I've verified that your patch fixes my issue.
> And it's absolutely better than my solution.
> 
> Tested-by: Andiry Xu 
  Thanks for testing!

> I have another question about ext2 XIP performance, although it's not
> quite related to this thread.
> 
> I'm testing xip with ext2 on a ram disk drive, the driver is brd.c.
> The RAM disk size is 2GB and I pre-fill it to guarantee that all pages
> reside in main memory.
> Then I use two different applications to write to the ram disk. One is
> open() with O_DIRECT flag, and writing with Posix write(). Another is
> open() with O_DIRECT, mmap() it to user space, then use memcpy() to
> write data. I use different request size to write data, from 512 bytes
> to 64MB.
> 
> In my understanding, the mmap version bypasses the file system and
> does not go to kernel space, hence it should have better performance
> than the Posix-write version. However, my test result shows it's not
> always true: when the request size is between 8KB and 1MB, the
> Posix-write() version has bandwidth about 7GB/s and mmap version only
> has 5GB/s. The test is performed on a i7-3770K machine with 8GB
> memory, kernel 3.12. Also I have tested on kernel 3.2, in which mmap
> has really bad performance, only 2GB/s for all request sizes.
> 
> Do you know the reason why write() outperforms mmap() in some cases? I
> know it's not related the thread but I really appreciate if you can
> answer my question.
  Well, I'm not completely sure. mmap()ed memory always works on page-by-page
basis - you first access the page, it gets faulted in and you can further
access it. So for 

Re: [BUG][ext2] XIP does not work on ext2

2013-11-05 Thread Jan Kara
  Hello,

On Mon 04-11-13 18:37:40, Andiry Xu wrote:
 On Mon, Nov 4, 2013 at 4:37 PM, Jan Kara j...@suse.cz wrote:
Hello,
 
  On Mon 04-11-13 14:31:34, Andiry Xu wrote:
  When I'm trying XIP on ext2, I find that xip does not work on ext2
  with latest kernel.
 
  Reproduce steps:
  Compile kernel with following configs:
  CONFIG_BLK_DEV_XIP=y
  CONFIG_EXT2_FS_XIP=y
 
  And run following commands:
  # mke2fs -b 4096 /dev/ram0
  # mount -t ext2 -o xip /dev/ram0 /mnt/ramdisk/
  # dd if=/dev/zero of=/mnt/ramdisk/test1 bs=1M count=16
 
  And it shows:
  dd: writing `/mnt/ramdisk/test1': No space left on device
 
  df also shows /mnt/ramdisk is 100% full. Its default size is 64MB so a
  16MB write should only occupy 1/4 capacity.
 
  Criminal commit:
  After git bisect, it points to the following commit:
  8e3dffc651cb668e1ff4d8b89cc1c3dde7540d3b
  Ext2: mark inode dirty after the function dquot_free_block_nodirty is 
  called
Thanks for report and the bisection!
 
  Particularly, the following code:
  @@ -1412,9 +1415,11 @@ allocated:
  *errp = 0;
  brelse(bitmap_bh);
  -dquot_free_block_nodirty(inode, *count-num);
  -mark_inode_dirty(inode);
  -*count = num;
  +if (num  *count) {
  +dquot_free_block_nodirty(inode, *count-num);
  +mark_inode_dirty(inode);
  +*count = num;
  +}
return ret_block;
 
  Not mark_inode_dirty() is called only when num is less than *count.
  However, I've seen
  with the dd command, there is case where num = *count.
 
  Fix:
  I've verified that the following patch fixes the issue:
  diff --git a/fs/ext2/balloc.c b/fs/ext2/balloc.c
  index 9f9992b..5446a52 100644
  --- a/fs/ext2/balloc.c
  +++ b/fs/ext2/balloc.c
  @@ -1406,11 +1406,10 @@ allocated:
 
  *errp = 0;
  brelse(bitmap_bh);
  -   if (num  *count) {
  +   if (num = *count)
  dquot_free_block_nodirty(inode, *count-num);
  -   mark_inode_dirty(inode);
  -   *count = num;
  -   }
  +   mark_inode_dirty(inode);
  +   *count = num;
  return ret_block;
 
   io_error:
 
  However, I'm not familiar with ext2 source code and cannot tell if
  this is the correct fix. At least it fixes my issue.
With this, you have essentially reverted a hunk from commit
  8e3dffc651cb668e1ff4d8b89cc1c3dde7540d3b. But I don't see a reason why it
  should be reverted. num should never ever be greater than *count and when
  num == count, we the code inside if doesn't do anything useful.
 
  I've looked into the code and I think I see the problem. It is a long
  standing bug in __ext2_get_block() in fs/ext2/xip.c. It calls
  ext2_get_block() asking for 0 blocks to map (while we really want 1 block).
  ext2_get_block() just passes that request and ext2_get_blocks() actually
  allocates 1 block. And that's were the commit you have identified makes a
  difference because previously we returned that 1 block was allocated while
  now we return that 0 blocks were allocated and thus allocation is repeated
  until all free blocks are exhaused.
 
  Attached patch should fix the problem.
 
 
 Thanks for the reply. I've verified that your patch fixes my issue.
 And it's absolutely better than my solution.
 
 Tested-by: Andiry Xu andiry...@gmail.com
  Thanks for testing!

 I have another question about ext2 XIP performance, although it's not
 quite related to this thread.
 
 I'm testing xip with ext2 on a ram disk drive, the driver is brd.c.
 The RAM disk size is 2GB and I pre-fill it to guarantee that all pages
 reside in main memory.
 Then I use two different applications to write to the ram disk. One is
 open() with O_DIRECT flag, and writing with Posix write(). Another is
 open() with O_DIRECT, mmap() it to user space, then use memcpy() to
 write data. I use different request size to write data, from 512 bytes
 to 64MB.
 
 In my understanding, the mmap version bypasses the file system and
 does not go to kernel space, hence it should have better performance
 than the Posix-write version. However, my test result shows it's not
 always true: when the request size is between 8KB and 1MB, the
 Posix-write() version has bandwidth about 7GB/s and mmap version only
 has 5GB/s. The test is performed on a i7-3770K machine with 8GB
 memory, kernel 3.12. Also I have tested on kernel 3.2, in which mmap
 has really bad performance, only 2GB/s for all request sizes.
 
 Do you know the reason why write() outperforms mmap() in some cases? I
 know it's not related the thread but I really appreciate if you can
 answer my question.
  Well, I'm not completely sure. mmap()ed memory always works on page-by-page
basis - you first access the page, it gets faulted in and you can further
access it. So for small (sub page size) accesses this is a win because you
don't have an overhead of syscall and fs write path. For accesses larger
than page size the overhead of syscall and some initial checks is well
hidden by other things. I guess write() 

Re: [BUG][ext2] XIP does not work on ext2

2013-11-05 Thread Andiry Xu
Hi,

On Tue, Nov 5, 2013 at 6:32 AM, Jan Kara j...@suse.cz wrote:
   Hello,

 On Mon 04-11-13 18:37:40, Andiry Xu wrote:
 On Mon, Nov 4, 2013 at 4:37 PM, Jan Kara j...@suse.cz wrote:
Hello,
 
  On Mon 04-11-13 14:31:34, Andiry Xu wrote:
  When I'm trying XIP on ext2, I find that xip does not work on ext2
  with latest kernel.
 
  Reproduce steps:
  Compile kernel with following configs:
  CONFIG_BLK_DEV_XIP=y
  CONFIG_EXT2_FS_XIP=y
 
  And run following commands:
  # mke2fs -b 4096 /dev/ram0
  # mount -t ext2 -o xip /dev/ram0 /mnt/ramdisk/
  # dd if=/dev/zero of=/mnt/ramdisk/test1 bs=1M count=16
 
  And it shows:
  dd: writing `/mnt/ramdisk/test1': No space left on device
 
  df also shows /mnt/ramdisk is 100% full. Its default size is 64MB so a
  16MB write should only occupy 1/4 capacity.
 
  Criminal commit:
  After git bisect, it points to the following commit:
  8e3dffc651cb668e1ff4d8b89cc1c3dde7540d3b
  Ext2: mark inode dirty after the function dquot_free_block_nodirty is 
  called
Thanks for report and the bisection!
 
  Particularly, the following code:
  @@ -1412,9 +1415,11 @@ allocated:
  *errp = 0;
  brelse(bitmap_bh);
  -dquot_free_block_nodirty(inode, *count-num);
  -mark_inode_dirty(inode);
  -*count = num;
  +if (num  *count) {
  +dquot_free_block_nodirty(inode, *count-num);
  +mark_inode_dirty(inode);
  +*count = num;
  +}
return ret_block;
 
  Not mark_inode_dirty() is called only when num is less than *count.
  However, I've seen
  with the dd command, there is case where num = *count.
 
  Fix:
  I've verified that the following patch fixes the issue:
  diff --git a/fs/ext2/balloc.c b/fs/ext2/balloc.c
  index 9f9992b..5446a52 100644
  --- a/fs/ext2/balloc.c
  +++ b/fs/ext2/balloc.c
  @@ -1406,11 +1406,10 @@ allocated:
 
  *errp = 0;
  brelse(bitmap_bh);
  -   if (num  *count) {
  +   if (num = *count)
  dquot_free_block_nodirty(inode, *count-num);
  -   mark_inode_dirty(inode);
  -   *count = num;
  -   }
  +   mark_inode_dirty(inode);
  +   *count = num;
  return ret_block;
 
   io_error:
 
  However, I'm not familiar with ext2 source code and cannot tell if
  this is the correct fix. At least it fixes my issue.
With this, you have essentially reverted a hunk from commit
  8e3dffc651cb668e1ff4d8b89cc1c3dde7540d3b. But I don't see a reason why it
  should be reverted. num should never ever be greater than *count and when
  num == count, we the code inside if doesn't do anything useful.
 
  I've looked into the code and I think I see the problem. It is a long
  standing bug in __ext2_get_block() in fs/ext2/xip.c. It calls
  ext2_get_block() asking for 0 blocks to map (while we really want 1 block).
  ext2_get_block() just passes that request and ext2_get_blocks() actually
  allocates 1 block. And that's were the commit you have identified makes a
  difference because previously we returned that 1 block was allocated while
  now we return that 0 blocks were allocated and thus allocation is repeated
  until all free blocks are exhaused.
 
  Attached patch should fix the problem.
 

 Thanks for the reply. I've verified that your patch fixes my issue.
 And it's absolutely better than my solution.

 Tested-by: Andiry Xu andiry...@gmail.com
   Thanks for testing!

 I have another question about ext2 XIP performance, although it's not
 quite related to this thread.

 I'm testing xip with ext2 on a ram disk drive, the driver is brd.c.
 The RAM disk size is 2GB and I pre-fill it to guarantee that all pages
 reside in main memory.
 Then I use two different applications to write to the ram disk. One is
 open() with O_DIRECT flag, and writing with Posix write(). Another is
 open() with O_DIRECT, mmap() it to user space, then use memcpy() to
 write data. I use different request size to write data, from 512 bytes
 to 64MB.

 In my understanding, the mmap version bypasses the file system and
 does not go to kernel space, hence it should have better performance
 than the Posix-write version. However, my test result shows it's not
 always true: when the request size is between 8KB and 1MB, the
 Posix-write() version has bandwidth about 7GB/s and mmap version only
 has 5GB/s. The test is performed on a i7-3770K machine with 8GB
 memory, kernel 3.12. Also I have tested on kernel 3.2, in which mmap
 has really bad performance, only 2GB/s for all request sizes.

 Do you know the reason why write() outperforms mmap() in some cases? I
 know it's not related the thread but I really appreciate if you can
 answer my question.
   Well, I'm not completely sure. mmap()ed memory always works on page-by-page
 basis - you first access the page, it gets faulted in and you can further
 access it. So for small (sub page size) accesses this is a win because you
 don't have an overhead of syscall and fs write path. For accesses larger
 than page size the overhead of syscall and 

Re: [BUG][ext2] XIP does not work on ext2

2013-11-04 Thread Andiry Xu
Hi Jan,

On Mon, Nov 4, 2013 at 4:37 PM, Jan Kara  wrote:
>   Hello,
>
> On Mon 04-11-13 14:31:34, Andiry Xu wrote:
>> When I'm trying XIP on ext2, I find that xip does not work on ext2
>> with latest kernel.
>>
>> Reproduce steps:
>> Compile kernel with following configs:
>> CONFIG_BLK_DEV_XIP=y
>> CONFIG_EXT2_FS_XIP=y
>>
>> And run following commands:
>> # mke2fs -b 4096 /dev/ram0
>> # mount -t ext2 -o xip /dev/ram0 /mnt/ramdisk/
>> # dd if=/dev/zero of=/mnt/ramdisk/test1 bs=1M count=16
>>
>> And it shows:
>> dd: writing `/mnt/ramdisk/test1': No space left on device
>>
>> df also shows /mnt/ramdisk is 100% full. Its default size is 64MB so a
>> 16MB write should only occupy 1/4 capacity.
>>
>> Criminal commit:
>> After git bisect, it points to the following commit:
>> 8e3dffc651cb668e1ff4d8b89cc1c3dde7540d3b
>> Ext2: mark inode dirty after the function dquot_free_block_nodirty is called
>   Thanks for report and the bisection!
>
>> Particularly, the following code:
>> @@ -1412,9 +1415,11 @@ allocated:
>> *errp = 0;
>> brelse(bitmap_bh);
>> -dquot_free_block_nodirty(inode, *count-num);
>> -mark_inode_dirty(inode);
>> -*count = num;
>> +if (num < *count) {
>> +dquot_free_block_nodirty(inode, *count-num);
>> +mark_inode_dirty(inode);
>> +*count = num;
>> +}
>>   return ret_block;
>>
>> Not mark_inode_dirty() is called only when num is less than *count.
>> However, I've seen
>> with the dd command, there is case where num >= *count.
>>
>> Fix:
>> I've verified that the following patch fixes the issue:
>> diff --git a/fs/ext2/balloc.c b/fs/ext2/balloc.c
>> index 9f9992b..5446a52 100644
>> --- a/fs/ext2/balloc.c
>> +++ b/fs/ext2/balloc.c
>> @@ -1406,11 +1406,10 @@ allocated:
>>
>> *errp = 0;
>> brelse(bitmap_bh);
>> -   if (num < *count) {
>> +   if (num <= *count)
>> dquot_free_block_nodirty(inode, *count-num);
>> -   mark_inode_dirty(inode);
>> -   *count = num;
>> -   }
>> +   mark_inode_dirty(inode);
>> +   *count = num;
>> return ret_block;
>>
>>  io_error:
>>
>> However, I'm not familiar with ext2 source code and cannot tell if
>> this is the correct fix. At least it fixes my issue.
>   With this, you have essentially reverted a hunk from commit
> 8e3dffc651cb668e1ff4d8b89cc1c3dde7540d3b. But I don't see a reason why it
> should be reverted. num should never ever be greater than *count and when
> num == count, we the code inside if doesn't do anything useful.
>
> I've looked into the code and I think I see the problem. It is a long
> standing bug in __ext2_get_block() in fs/ext2/xip.c. It calls
> ext2_get_block() asking for 0 blocks to map (while we really want 1 block).
> ext2_get_block() just passes that request and ext2_get_blocks() actually
> allocates 1 block. And that's were the commit you have identified makes a
> difference because previously we returned that 1 block was allocated while
> now we return that 0 blocks were allocated and thus allocation is repeated
> until all free blocks are exhaused.
>
> Attached patch should fix the problem.
>

Thanks for the reply. I've verified that your patch fixes my issue.
And it's absolutely better than my solution.

Tested-by: Andiry Xu 

I have another question about ext2 XIP performance, although it's not
quite related to this thread.

I'm testing xip with ext2 on a ram disk drive, the driver is brd.c.
The RAM disk size is 2GB and I pre-fill it to guarantee that all pages
reside in main memory.
Then I use two different applications to write to the ram disk. One is
open() with O_DIRECT flag, and writing with Posix write(). Another is
open() with O_DIRECT, mmap() it to user space, then use memcpy() to
write data. I use different request size to write data, from 512 bytes
to 64MB.

In my understanding, the mmap version bypasses the file system and
does not go to kernel space, hence it should have better performance
than the Posix-write version. However, my test result shows it's not
always true: when the request size is between 8KB and 1MB, the
Posix-write() version has bandwidth about 7GB/s and mmap version only
has 5GB/s. The test is performed on a i7-3770K machine with 8GB
memory, kernel 3.12. Also I have tested on kernel 3.2, in which mmap
has really bad performance, only 2GB/s for all request sizes.

Do you know the reason why write() outperforms mmap() in some cases? I
know it's not related the thread but I really appreciate if you can
answer my question.

Thanks,
Andiry
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [BUG][ext2] XIP does not work on ext2

2013-11-04 Thread Jan Kara
  Hello,

On Mon 04-11-13 14:31:34, Andiry Xu wrote:
> When I'm trying XIP on ext2, I find that xip does not work on ext2
> with latest kernel.
> 
> Reproduce steps:
> Compile kernel with following configs:
> CONFIG_BLK_DEV_XIP=y
> CONFIG_EXT2_FS_XIP=y
> 
> And run following commands:
> # mke2fs -b 4096 /dev/ram0
> # mount -t ext2 -o xip /dev/ram0 /mnt/ramdisk/
> # dd if=/dev/zero of=/mnt/ramdisk/test1 bs=1M count=16
> 
> And it shows:
> dd: writing `/mnt/ramdisk/test1': No space left on device
> 
> df also shows /mnt/ramdisk is 100% full. Its default size is 64MB so a
> 16MB write should only occupy 1/4 capacity.
> 
> Criminal commit:
> After git bisect, it points to the following commit:
> 8e3dffc651cb668e1ff4d8b89cc1c3dde7540d3b
> Ext2: mark inode dirty after the function dquot_free_block_nodirty is called
  Thanks for report and the bisection!

> Particularly, the following code:
> @@ -1412,9 +1415,11 @@ allocated:
> *errp = 0;
> brelse(bitmap_bh);
> -dquot_free_block_nodirty(inode, *count-num);
> -mark_inode_dirty(inode);
> -*count = num;
> +if (num < *count) {
> +dquot_free_block_nodirty(inode, *count-num);
> +mark_inode_dirty(inode);
> +*count = num;
> +}
>   return ret_block;
> 
> Not mark_inode_dirty() is called only when num is less than *count.
> However, I've seen
> with the dd command, there is case where num >= *count.
> 
> Fix:
> I've verified that the following patch fixes the issue:
> diff --git a/fs/ext2/balloc.c b/fs/ext2/balloc.c
> index 9f9992b..5446a52 100644
> --- a/fs/ext2/balloc.c
> +++ b/fs/ext2/balloc.c
> @@ -1406,11 +1406,10 @@ allocated:
> 
> *errp = 0;
> brelse(bitmap_bh);
> -   if (num < *count) {
> +   if (num <= *count)
> dquot_free_block_nodirty(inode, *count-num);
> -   mark_inode_dirty(inode);
> -   *count = num;
> -   }
> +   mark_inode_dirty(inode);
> +   *count = num;
> return ret_block;
> 
>  io_error:
> 
> However, I'm not familiar with ext2 source code and cannot tell if
> this is the correct fix. At least it fixes my issue.
  With this, you have essentially reverted a hunk from commit 
8e3dffc651cb668e1ff4d8b89cc1c3dde7540d3b. But I don't see a reason why it
should be reverted. num should never ever be greater than *count and when
num == count, we the code inside if doesn't do anything useful.

I've looked into the code and I think I see the problem. It is a long
standing bug in __ext2_get_block() in fs/ext2/xip.c. It calls
ext2_get_block() asking for 0 blocks to map (while we really want 1 block).
ext2_get_block() just passes that request and ext2_get_blocks() actually
allocates 1 block. And that's were the commit you have identified makes a
difference because previously we returned that 1 block was allocated while
now we return that 0 blocks were allocated and thus allocation is repeated
until all free blocks are exhaused.

Attached patch should fix the problem.

Honza

-- 
Jan Kara 
SUSE Labs, CR
>From ce14b6595c9f23db4a3fbeccd921f0687c9c73d4 Mon Sep 17 00:00:00 2001
From: Jan Kara 
Date: Tue, 5 Nov 2013 01:15:38 +0100
Subject: [PATCH] ext2: Fix fs corruption in ext2_get_xip_mem()

Commit 8e3dffc651cb "Ext2: mark inode dirty after the function
dquot_free_block_nodirty is called" unveiled a bug in __ext2_get_block()
called from ext2_get_xip_mem(). That function called ext2_get_block()
mistakenly asking it to map 0 blocks while 1 was intended. Before the
above mentioned commit things worked out fine by luck but after that commit
we started returning that we allocated 0 blocks while we in fact
allocated 1 block and thus allocation was looping until all blocks in
the filesystem were exhausted.

Fix the problem by properly asking for one block and also add assertion
in ext2_get_blocks() to catch similar problems.

Signed-off-by: Jan Kara 
---
 fs/ext2/inode.c | 2 ++
 fs/ext2/xip.c   | 1 +
 2 files changed, 3 insertions(+)

diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index c260de6d7b6d..8a337640a46a 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -632,6 +632,8 @@ static int ext2_get_blocks(struct inode *inode,
 	int count = 0;
 	ext2_fsblk_t first_block = 0;
 
+	BUG_ON(maxblocks == 0);
+
 	depth = ext2_block_to_path(inode,iblock,offsets,_to_boundary);
 
 	if (depth == 0)
diff --git a/fs/ext2/xip.c b/fs/ext2/xip.c
index 1c3312858fcf..e98171a11cfe 100644
--- a/fs/ext2/xip.c
+++ b/fs/ext2/xip.c
@@ -35,6 +35,7 @@ __ext2_get_block(struct inode *inode, pgoff_t pgoff, int create,
 	int rc;
 
 	memset(, 0, sizeof(struct buffer_head));
+	tmp.b_size = 1 << inode->i_blkbits;
 	rc = ext2_get_block(inode, pgoff, , create);
 	*result = tmp.b_blocknr;
 
-- 
1.8.1.4



Re: [BUG][ext2] XIP does not work on ext2

2013-11-04 Thread Jan Kara
  Hello,

On Mon 04-11-13 14:31:34, Andiry Xu wrote:
 When I'm trying XIP on ext2, I find that xip does not work on ext2
 with latest kernel.
 
 Reproduce steps:
 Compile kernel with following configs:
 CONFIG_BLK_DEV_XIP=y
 CONFIG_EXT2_FS_XIP=y
 
 And run following commands:
 # mke2fs -b 4096 /dev/ram0
 # mount -t ext2 -o xip /dev/ram0 /mnt/ramdisk/
 # dd if=/dev/zero of=/mnt/ramdisk/test1 bs=1M count=16
 
 And it shows:
 dd: writing `/mnt/ramdisk/test1': No space left on device
 
 df also shows /mnt/ramdisk is 100% full. Its default size is 64MB so a
 16MB write should only occupy 1/4 capacity.
 
 Criminal commit:
 After git bisect, it points to the following commit:
 8e3dffc651cb668e1ff4d8b89cc1c3dde7540d3b
 Ext2: mark inode dirty after the function dquot_free_block_nodirty is called
  Thanks for report and the bisection!

 Particularly, the following code:
 @@ -1412,9 +1415,11 @@ allocated:
 *errp = 0;
 brelse(bitmap_bh);
 -dquot_free_block_nodirty(inode, *count-num);
 -mark_inode_dirty(inode);
 -*count = num;
 +if (num  *count) {
 +dquot_free_block_nodirty(inode, *count-num);
 +mark_inode_dirty(inode);
 +*count = num;
 +}
   return ret_block;
 
 Not mark_inode_dirty() is called only when num is less than *count.
 However, I've seen
 with the dd command, there is case where num = *count.
 
 Fix:
 I've verified that the following patch fixes the issue:
 diff --git a/fs/ext2/balloc.c b/fs/ext2/balloc.c
 index 9f9992b..5446a52 100644
 --- a/fs/ext2/balloc.c
 +++ b/fs/ext2/balloc.c
 @@ -1406,11 +1406,10 @@ allocated:
 
 *errp = 0;
 brelse(bitmap_bh);
 -   if (num  *count) {
 +   if (num = *count)
 dquot_free_block_nodirty(inode, *count-num);
 -   mark_inode_dirty(inode);
 -   *count = num;
 -   }
 +   mark_inode_dirty(inode);
 +   *count = num;
 return ret_block;
 
  io_error:
 
 However, I'm not familiar with ext2 source code and cannot tell if
 this is the correct fix. At least it fixes my issue.
  With this, you have essentially reverted a hunk from commit 
8e3dffc651cb668e1ff4d8b89cc1c3dde7540d3b. But I don't see a reason why it
should be reverted. num should never ever be greater than *count and when
num == count, we the code inside if doesn't do anything useful.

I've looked into the code and I think I see the problem. It is a long
standing bug in __ext2_get_block() in fs/ext2/xip.c. It calls
ext2_get_block() asking for 0 blocks to map (while we really want 1 block).
ext2_get_block() just passes that request and ext2_get_blocks() actually
allocates 1 block. And that's were the commit you have identified makes a
difference because previously we returned that 1 block was allocated while
now we return that 0 blocks were allocated and thus allocation is repeated
until all free blocks are exhaused.

Attached patch should fix the problem.

Honza

-- 
Jan Kara j...@suse.cz
SUSE Labs, CR
From ce14b6595c9f23db4a3fbeccd921f0687c9c73d4 Mon Sep 17 00:00:00 2001
From: Jan Kara j...@suse.cz
Date: Tue, 5 Nov 2013 01:15:38 +0100
Subject: [PATCH] ext2: Fix fs corruption in ext2_get_xip_mem()

Commit 8e3dffc651cb Ext2: mark inode dirty after the function
dquot_free_block_nodirty is called unveiled a bug in __ext2_get_block()
called from ext2_get_xip_mem(). That function called ext2_get_block()
mistakenly asking it to map 0 blocks while 1 was intended. Before the
above mentioned commit things worked out fine by luck but after that commit
we started returning that we allocated 0 blocks while we in fact
allocated 1 block and thus allocation was looping until all blocks in
the filesystem were exhausted.

Fix the problem by properly asking for one block and also add assertion
in ext2_get_blocks() to catch similar problems.

Signed-off-by: Jan Kara j...@suse.cz
---
 fs/ext2/inode.c | 2 ++
 fs/ext2/xip.c   | 1 +
 2 files changed, 3 insertions(+)

diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index c260de6d7b6d..8a337640a46a 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -632,6 +632,8 @@ static int ext2_get_blocks(struct inode *inode,
 	int count = 0;
 	ext2_fsblk_t first_block = 0;
 
+	BUG_ON(maxblocks == 0);
+
 	depth = ext2_block_to_path(inode,iblock,offsets,blocks_to_boundary);
 
 	if (depth == 0)
diff --git a/fs/ext2/xip.c b/fs/ext2/xip.c
index 1c3312858fcf..e98171a11cfe 100644
--- a/fs/ext2/xip.c
+++ b/fs/ext2/xip.c
@@ -35,6 +35,7 @@ __ext2_get_block(struct inode *inode, pgoff_t pgoff, int create,
 	int rc;
 
 	memset(tmp, 0, sizeof(struct buffer_head));
+	tmp.b_size = 1  inode-i_blkbits;
 	rc = ext2_get_block(inode, pgoff, tmp, create);
 	*result = tmp.b_blocknr;
 
-- 
1.8.1.4



Re: [BUG][ext2] XIP does not work on ext2

2013-11-04 Thread Andiry Xu
Hi Jan,

On Mon, Nov 4, 2013 at 4:37 PM, Jan Kara j...@suse.cz wrote:
   Hello,

 On Mon 04-11-13 14:31:34, Andiry Xu wrote:
 When I'm trying XIP on ext2, I find that xip does not work on ext2
 with latest kernel.

 Reproduce steps:
 Compile kernel with following configs:
 CONFIG_BLK_DEV_XIP=y
 CONFIG_EXT2_FS_XIP=y

 And run following commands:
 # mke2fs -b 4096 /dev/ram0
 # mount -t ext2 -o xip /dev/ram0 /mnt/ramdisk/
 # dd if=/dev/zero of=/mnt/ramdisk/test1 bs=1M count=16

 And it shows:
 dd: writing `/mnt/ramdisk/test1': No space left on device

 df also shows /mnt/ramdisk is 100% full. Its default size is 64MB so a
 16MB write should only occupy 1/4 capacity.

 Criminal commit:
 After git bisect, it points to the following commit:
 8e3dffc651cb668e1ff4d8b89cc1c3dde7540d3b
 Ext2: mark inode dirty after the function dquot_free_block_nodirty is called
   Thanks for report and the bisection!

 Particularly, the following code:
 @@ -1412,9 +1415,11 @@ allocated:
 *errp = 0;
 brelse(bitmap_bh);
 -dquot_free_block_nodirty(inode, *count-num);
 -mark_inode_dirty(inode);
 -*count = num;
 +if (num  *count) {
 +dquot_free_block_nodirty(inode, *count-num);
 +mark_inode_dirty(inode);
 +*count = num;
 +}
   return ret_block;

 Not mark_inode_dirty() is called only when num is less than *count.
 However, I've seen
 with the dd command, there is case where num = *count.

 Fix:
 I've verified that the following patch fixes the issue:
 diff --git a/fs/ext2/balloc.c b/fs/ext2/balloc.c
 index 9f9992b..5446a52 100644
 --- a/fs/ext2/balloc.c
 +++ b/fs/ext2/balloc.c
 @@ -1406,11 +1406,10 @@ allocated:

 *errp = 0;
 brelse(bitmap_bh);
 -   if (num  *count) {
 +   if (num = *count)
 dquot_free_block_nodirty(inode, *count-num);
 -   mark_inode_dirty(inode);
 -   *count = num;
 -   }
 +   mark_inode_dirty(inode);
 +   *count = num;
 return ret_block;

  io_error:

 However, I'm not familiar with ext2 source code and cannot tell if
 this is the correct fix. At least it fixes my issue.
   With this, you have essentially reverted a hunk from commit
 8e3dffc651cb668e1ff4d8b89cc1c3dde7540d3b. But I don't see a reason why it
 should be reverted. num should never ever be greater than *count and when
 num == count, we the code inside if doesn't do anything useful.

 I've looked into the code and I think I see the problem. It is a long
 standing bug in __ext2_get_block() in fs/ext2/xip.c. It calls
 ext2_get_block() asking for 0 blocks to map (while we really want 1 block).
 ext2_get_block() just passes that request and ext2_get_blocks() actually
 allocates 1 block. And that's were the commit you have identified makes a
 difference because previously we returned that 1 block was allocated while
 now we return that 0 blocks were allocated and thus allocation is repeated
 until all free blocks are exhaused.

 Attached patch should fix the problem.


Thanks for the reply. I've verified that your patch fixes my issue.
And it's absolutely better than my solution.

Tested-by: Andiry Xu andiry...@gmail.com

I have another question about ext2 XIP performance, although it's not
quite related to this thread.

I'm testing xip with ext2 on a ram disk drive, the driver is brd.c.
The RAM disk size is 2GB and I pre-fill it to guarantee that all pages
reside in main memory.
Then I use two different applications to write to the ram disk. One is
open() with O_DIRECT flag, and writing with Posix write(). Another is
open() with O_DIRECT, mmap() it to user space, then use memcpy() to
write data. I use different request size to write data, from 512 bytes
to 64MB.

In my understanding, the mmap version bypasses the file system and
does not go to kernel space, hence it should have better performance
than the Posix-write version. However, my test result shows it's not
always true: when the request size is between 8KB and 1MB, the
Posix-write() version has bandwidth about 7GB/s and mmap version only
has 5GB/s. The test is performed on a i7-3770K machine with 8GB
memory, kernel 3.12. Also I have tested on kernel 3.2, in which mmap
has really bad performance, only 2GB/s for all request sizes.

Do you know the reason why write() outperforms mmap() in some cases? I
know it's not related the thread but I really appreciate if you can
answer my question.

Thanks,
Andiry
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/