Re: File corruption when using kernels 2.6.18+

2007-10-03 Thread Hiro Yoshioka
Hi,

From: Linus Torvalds <[EMAIL PROTECTED]>
> On Wed, 3 Oct 2007, Pekka Enberg wrote:
> > 
> > On 10/3/07, Linus Torvalds <[EMAIL PROTECTED]> wrote:
> > > I would bet that the reason the intel-optimized memcpy triggers this is
> > > that the non-temporal stores just means that you go out directly on the
> > > bus, and it probably just shows a weakness in the chipset or bus that
> > > doesn't show with the normal cacheline accesses.
> > 
> > But that should show up with memtest too, no?
> 
> Not unless memtest uses non-temporal stores with the same (or similar) 
> access patterns.
> 
> The thing is, the CPU cache hides a *lot* of activity from the chipset, 
> and changes the access patterns radically. 
> 
> With normal cached accesses, you'd normally see just the "fill cacheline" 
> and "write out cacheline" pattern. With movnt, you'd see non-cacheline 
> accesses to memory. If the chipset was tested under mostly normal loads, 
> the movnt cases have been getting a lot less coverage.

I'm not so sure whether it is a chipset bug or not.

The movnt instructions do have WC (write-combining) semantics and
bypass the hardware cache when storing the data.

http://www.intel.com/products/processor/manuals/index.htm

Intel 64 and IA-32 Architectures Software Developer's Manual
Volume 1: Basic Architecture

Intel 64 and IA-32 Architectures Software Developer's Manual
Volume 3A: System Programming Guide
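
For readers who want to see what a non-temporal store looks like outside the kernel, here is a minimal user-space sketch. It is not the code from the patch discussed in this thread. The helper name copy_words_nocache is hypothetical, and it assumes a compiler that provides the SSE2 intrinsics _mm_stream_si32 (which compiles to movnti) and _mm_sfence; on other targets it falls back to memcpy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>   /* _mm_stream_si32, _mm_sfence */
#endif

/*
 * Hypothetical helper, not taken from the kernel: copy `count` 32-bit
 * words. With SSE2 the stores are non-temporal, so the destination is
 * written through write-combining buffers instead of being pulled into
 * the cache.
 */
void copy_words_nocache(int32_t *dst, const int32_t *src, size_t count)
{
    assert(dst != NULL && src != NULL);
#if defined(__SSE2__)
    for (size_t i = 0; i < count; i++)
        _mm_stream_si32((int *)&dst[i], src[i]);   /* one movnti per word */
    _mm_sfence();   /* order the weakly-ordered stores before returning */
#else
    memcpy(dst, src, count * sizeof *dst);         /* portable fallback */
#endif
}
```

The loads still go through the cache in the normal way; only the stores bypass it, which is the access pattern the chipset sees as non-cacheline writes.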

Thanks in advance,
  Hiro
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: How innovative is Linux?

2007-06-25 Thread Hiro Yoshioka

On 6/24/07, Alan Cox <[EMAIL PROTECTED]> wrote:

On Sat, 23 Jun 2007 16:13:55 -0600
"David Kane" <[EMAIL PROTECTED]> wrote:

> The real innovation in Linux is that it is open source and yet popular
> enough that there are versions that even a windoze user could easily pick
> up.

I think that is more a product of its time than the software. It isn't
the first openly available Unix-like OS. The others such as UZI and OMU
died because there wasn't the internet in its modern form to keep them
going, share them and build communities.


Being developed by the community is itself very innovative.
Linux is the first OS developed by a very large community (the Bazaar model).

Regards,
 Hiro
--
Hiro Yoshioka
mailto:hyoshiok at miraclelinux.com


Re: [RFD] Documentation/HOWTO translated into Japanese

2007-06-11 Thread Hiro Yoshioka

Hi,

Shibata san's contribution was really great. I was impressed too.

I think the important point is that Shibata san let the
Linux kernel community know that the translation work has been done
and asked for feedback.

I think this two-way communication is very important.

On 6/11/07, Greg KH <[EMAIL PROTECTED]> wrote:

On Mon, Jun 11, 2007 at 11:55:57AM +0900, KAMEZAWA Hiroyuki wrote:
>
> Hi, thank you for your work. I was impressed.
>
> BTW, how about adding following lines (both in Japanese and English) ?
> ==
> This is translated "HOWTO" documentation. Original "HOWTO" is maintained by
> Greg Kroah-Hartman <[EMAIL PROTECTED]> and linux kernel mailing list.
> And this one is maintained by Tsugikazu Shibata <[EMAIL PROTECTED]>
> and JF Project <www.linux.or.jp/JF>. Because original HOWTO documentation itself
> is being updated day and night, this file may contain old sentences. Please contact
> JF project if you find problems in translation.
>
> It is guaranteed that this file is just a translation and doesn't contain any
> additional information, sentences. If you want to update "HOWTO", please update
> English version first. Don't fork from original if you update this file.
>
> Last Updated: 2007/06/04  Version: 2.6.21
> ==
>
> not worth writing ?

That sounds fine with me, and would be good to have.  I also have no
problem also CC:ing anyone who wants to translate this file with patches
that happen to modify the originals.  That way they can easily keep up
with the minor number of changes that happen over time.


I think that some members of YLUG (Yokohama Linux Users Group)
may help to review the translations. :-)

Regards,
 Hiro
--
Hiro Yoshioka
mailto:hyoshiok at miraclelinux.com


Re: SMP performance degradation with sysbench

2007-02-27 Thread Hiro Yoshioka
From: Robert Hancock <[EMAIL PROTECTED]>
Subject: Re: SMP performance degradation with sysbench
Date: Tue, 27 Feb 2007 18:20:25 -0600
Message-ID: <[EMAIL PROTECTED]>

> Hiro Yoshioka wrote:
> > Howdy,
> > 
> > MySQL 5.0.26 had some scalability issues, which have been solved since 5.0.32
> > http://ossipedia.ipa.go.jp/capacity/EV0612260303/
> > (written in Japanese but you may read the graph. We compared
> > 5.0.24 vs 5.0.32)
> > 
> > The following is oprofile data
> > ==> cpu=8-mysql=5.0.32-gcc=3.4/oprofile-eu=2200-op=default-none/opreport-l.txt <==
> > CPU: Core Solo / Duo, speed 2666.76 MHz (estimated)
> > Counted CPU_CLK_UNHALTED events (Unhalted clock cycles) with a unit
> > mask of 0x00 (Unhalted core cycles) count 10
> > samples  %        app name             symbol name
> > 47097502 16.8391  libpthread-2.3.4.so  pthread_mutex_trylock
> > 19636300  7.0207  libpthread-2.3.4.so  pthread_mutex_unlock
> > 18600010  6.6502  mysqld               rec_get_offsets_func
> > 18121328  6.4790  mysqld               btr_search_guess_on_hash
> > 11453095  4.0949  mysqld               row_search_for_mysql
> > 
> > MySQL spends about 16.8% of CPU time just trying to get a mutex on an
> > 8-core machine.
> 
> Curious that it calls pthread_mutex_trylock (as opposed to 
> pthread_mutex_lock) so often. Maybe they're doing some kind of mutex 
> lock busy-looping?

Yes, it is.
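
To make the busy-looping pattern concrete, here is a small sketch of a "spin with trylock, then block" lock. It is an illustration of the general technique only and is not MySQL's actual code. The function name spin_then_lock and the SPIN_LIMIT value are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>

/* Hypothetical spin limit; a real program would tune or adapt it. */
enum { SPIN_LIMIT = 100 };

/*
 * Try the lock a bounded number of times, then fall back to a blocking
 * lock. Returns the number of failed trylock attempts. Under contention
 * on many cores, those failed attempts are the time a profiler would
 * attribute to pthread_mutex_trylock.
 */
int spin_then_lock(pthread_mutex_t *m)
{
    int failed = 0;

    while (failed < SPIN_LIMIT) {
        if (pthread_mutex_trylock(m) == 0)
            return failed;           /* acquired while spinning */
        failed++;
        sched_yield();               /* give the holder a chance to run */
    }

    int rc = pthread_mutex_lock(m);  /* sleep until the lock is free */
    assert(rc == 0);
    (void)rc;
    return failed;
}
```

The caller releases the lock with pthread_mutex_unlock as usual. With an uncontended mutex the function returns 0 immediately; the cost only appears when many threads hit the same mutex.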

Regards,
  Hiro


Re: SMP performance degradation with sysbench

2007-02-26 Thread Hiro Yoshioka
Hi,

From: Rik van Riel <[EMAIL PROTECTED]>
> Hiro Yoshioka wrote:
> > Howdy,
> > 
> > MySQL 5.0.26 had some scalability issues, which have been solved since 5.0.32
> > http://ossipedia.ipa.go.jp/capacity/EV0612260303/
> > (written in Japanese but you may read the graph. We compared
> > 5.0.24 vs 5.0.32)
snip
> > MySQL spends about 16.8% of CPU time just trying to get a mutex on an
> > 8-core machine.
> > 
> > I think there is a lot of room for improvement in the MySQL implementation.
> 
> That's one aspect.
> 
> The other aspect of the problem is that when the number of
> threads exceeds the number of CPU cores, Linux no longer
> manages to keep the CPUs busy and we get a lot of idle time.
> 
> On the other hand, with the number of threads being equal to
> the number of CPU cores, we are 100% CPU bound...

I have a question. If so, what is the difference, from the kernel's
point of view, between SMP and CPU cores?

Another question. When the number of threads exceeds the number of
CPU cores, we may get a lot of idle time. Would a workaround for
MySQL then be not to create more threads than the number
of CPU cores? Is that right?
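
As a sketch of what such a workaround could look like in user space (whether it is the right fix is exactly the open question above), a program can clamp its worker count to the number of online CPUs. The helper name clamp_worker_count is hypothetical, and _SC_NPROCESSORS_ONLN is a common extension (glibc, the BSDs) rather than a POSIX-mandated name.

```c
#include <assert.h>
#include <unistd.h>

/*
 * Hypothetical helper: limit a requested worker-thread count to the
 * number of CPUs currently online. Always returns at least 1.
 */
long clamp_worker_count(long requested)
{
    long ncpu = sysconf(_SC_NPROCESSORS_ONLN);

    if (ncpu < 1)
        ncpu = 1;            /* sysconf failed or reported nothing usable */
    if (requested < 1)
        return 1;
    return requested < ncpu ? requested : ncpu;
}
```

A server would call this once at startup, for example clamp_worker_count(64), and size its thread pool from the result.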

Regards,
  Hiro
--
Hiro Yoshioka
CTO/Miracle Linux Corporation
http://blog.miraclelinux.com/yume/


Re: SMP performance degradation with sysbench

2007-02-26 Thread Hiro Yoshioka

Howdy,

MySQL 5.0.26 had some scalability issues, which have been solved since 5.0.32
http://ossipedia.ipa.go.jp/capacity/EV0612260303/
(written in Japanese but you may read the graph. We compared
5.0.24 vs 5.0.32)

The following is oprofile data
==> cpu=8-mysql=5.0.32-gcc=3.4/oprofile-eu=2200-op=default-none/opreport-l.txt <==
CPU: Core Solo / Duo, speed 2666.76 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Unhalted clock cycles) with a unit
mask of 0x00 (Unhalted core cycles) count 10
samples  %        app name             symbol name
47097502 16.8391  libpthread-2.3.4.so  pthread_mutex_trylock
19636300  7.0207  libpthread-2.3.4.so  pthread_mutex_unlock
18600010  6.6502  mysqld               rec_get_offsets_func
18121328  6.4790  mysqld               btr_search_guess_on_hash
11453095  4.0949  mysqld               row_search_for_mysql

MySQL spends about 16.8% of CPU time just trying to get a mutex on an
8-core machine.

I think there is a lot of room for improvement in the MySQL implementation.

On 2/27/07, Dave Jones <[EMAIL PROTECTED]> wrote:

On Mon, Feb 26, 2007 at 04:04:01PM -0600, Pete Harlan wrote:
 > On Tue, Feb 27, 2007 at 12:36:04AM +1100, Nick Piggin wrote:
 > > I found a couple of interesting issues so far. Firstly, the MySQL
 > > version that I'm using (5.0.26-Max) is making lots of calls to
 >
 > FYI, MySQL fixed some scalability problems in version 5.0.30, as
 > mentioned here:
 >
 > http://www.mysqlperformanceblog.com/2007/01/03/innodb-benchmarks/
 >
 > It may be worth using more recent sources than 5.0.26 if tracking down
 > scaling problems in MySQL.

The blog post that originated this discussion ran tests on 5.0.33
Not that the mysql version should really matter. The key point here
is that FreeBSD and Linux were running the *same* version, and
FreeBSD was able to handle the situation better somehow.

Dave

--
http://www.codemonkey.org.uk



Regards,
 Hiro
--
Hiro Yoshioka
mailto:hyoshiok at miraclelinux.com


Re: x86-cache-pollution-aware-__copy_from_user_ll.patch added to -mm tree

2005-09-04 Thread Hiro Yoshioka
From: Andrew Morton <[EMAIL PROTECTED]>

> Dave Jones <[EMAIL PROTECTED]> wrote:
> >
> > On Sun, Sep 04, 2005 at 01:16:00PM -0700, Andrew Morton wrote:
> >   >  unsigned long __copy_to_user_ll(void __user *to, const void *from, unsigned long n)
> >   >  {
> >   > BUG_ON((long) n < 0);
> > 
> >  Ehh? It's unsigned. This will never be true.
> 
> It's cast to long, so it'll trap if we try to copy >=2G.
> 
> It seems a strange thing to check though.   Do we really need it?

I don't know. I just cut and pasted it from the original __copy_from_user_ll().
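
For reference, the effect of the cast in BUG_ON((long) n < 0) can be modeled in user space as follows. This is only a sketch: it assumes a 32-bit kernel where long is 32 bits, uses fixed-width types so the result does not depend on the host, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Model of BUG_ON((long) n < 0) with a 32-bit long. Converting an
 * out-of-range unsigned value to a signed type is implementation-defined
 * in C; on the usual two's-complement compilers (GCC included) the bit
 * pattern is kept, so the result is negative exactly when bit 31 is set,
 * that is, when n >= 2 GiB.
 */
int size_would_trap(uint32_t n)
{
    return (int32_t)n < 0;
}

/* Usage sketch: sizes below 2 GiB pass, sizes at or above it trap. */
void check_examples(void)
{
    assert(!size_would_trap(0x7fffffffu));   /* 2 GiB - 1: fine  */
    assert(size_would_trap(0x80000000u));    /* exactly 2 GiB: trap */
}
```

So the check is not dead code even though n is unsigned; it catches absurdly large sizes, which usually come from a negative length that was converted to unsigned somewhere upstream.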

Regards,
  Hiro


Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()

2005-09-03 Thread Hiro Yoshioka
From: Hiro Yoshioka <[EMAIL PROTECTED]>
Subject: Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
Date: Fri, 02 Sep 2005 13:37:16 +0900 (JST)
Message-ID: <[EMAIL PROTECTED]>

> From: Andrew Morton <[EMAIL PROTECTED]>
> > Hiro Yoshioka <[EMAIL PROTECTED]> wrote:
> > >
> > > --- linux-2.6.12.4.orig/arch/i386/lib/usercopy.c  2005-08-05 
> > > 16:04:37.0 +0900
> > >  +++ linux-2.6.12.4.nt/arch/i386/lib/usercopy.c   2005-09-01 
> > > 17:09:41.0 +0900
> > 
> > Really.  Please redo and retest the patch against a current kernel.
> 
> Does it mean 2.6.13? I'll do it. 
> 
> Regards,
>   Hiro

Hi,

The following is the patch against 2.6.13

Hiro

diff -ur linux-2.6.13/Makefile linux-2.6.13.nt/Makefile
--- linux-2.6.13/Makefile   2005-08-29 08:41:01.0 +0900
+++ linux-2.6.13.nt/Makefile    2005-09-03 14:11:27.0 +0900
@@ -1,7 +1,7 @@
 VERSION = 2
 PATCHLEVEL = 6
 SUBLEVEL = 13
-EXTRAVERSION =
+EXTRAVERSION = .nt
 NAME=Woozy Numbat
 
 # *DOCUMENTATION*
diff -ur linux-2.6.13/arch/i386/lib/usercopy.c linux-2.6.13.nt/arch/i386/lib/usercopy.c
--- linux-2.6.13/arch/i386/lib/usercopy.c   2005-08-29 08:41:01.0 +0900
+++ linux-2.6.13.nt/arch/i386/lib/usercopy.c    2005-09-03 14:09:18.0 +0900
@@ -425,6 +425,107 @@
   : "eax", "edx", "memory");
return size;
 }
+
+/* Non Temporal Hint version of __copy_user_zeroing_intel */
+/* It is cache aware. */
+/* [EMAIL PROTECTED]  */
+static unsigned long
+__copy_user_zeroing_intel_nocache(void *to, const void __user *from, unsigned long size)
+{
+int d0, d1;
+
+   __asm__ __volatile__(
+  ".align 2,0x90\n"
+  "0:  movl 32(%4), %%eax\n"
+  "cmpl $67, %0\n"  
+  "jbe 2f\n"
+  "1:  movl 64(%4), %%eax\n"
+  ".align 2,0x90\n" 
+  "2:  movl 0(%4), %%eax\n" 
+  "21: movl 4(%4), %%edx\n" 
+  "movnti %%eax, 0(%3)\n" 
+  "movnti %%edx, 4(%3)\n" 
+  "3:  movl 8(%4), %%eax\n" 
+  "31: movl 12(%4),%%edx\n" 
+  "movnti %%eax, 8(%3)\n" 
+  "movnti %%edx, 12(%3)\n"
+  "4:  movl 16(%4), %%eax\n"
+  "41: movl 20(%4), %%edx\n"
+  "movnti %%eax, 16(%3)\n"
+  "movnti %%edx, 20(%3)\n"
+  "10: movl 24(%4), %%eax\n"
+  "51: movl 28(%4), %%edx\n"
+  "movnti %%eax, 24(%3)\n"
+  "movnti %%edx, 28(%3)\n"
+  "11: movl 32(%4), %%eax\n"
+  "61: movl 36(%4), %%edx\n"
+  "movnti %%eax, 32(%3)\n"
+  "movnti %%edx, 36(%3)\n"
+  "12: movl 40(%4), %%eax\n"
+  "71: movl 44(%4), %%edx\n"
+  "movnti %%eax, 40(%3)\n"
+  "movnti %%edx, 44(%3)\n"
+  "13: movl 48(%4), %%eax\n"
+  "81: movl 52(%4), %%edx\n"
+  "movnti %%eax, 48(%3)\n"
+  "movnti %%edx, 52(%3)\n"
+  "14: movl 56(%4), %%eax\n"
+  "91: movl 60(%4), %%edx\n"
+  "movnti %%eax, 56(%3)\n"
+  "movnti %%edx, 60(%3)\n"
+  "addl $-64, %0\n" 
+  "addl $64, %4\n"  
+  "addl $64, %3\n"  
+  "cmpl $63, %0\n"  
+  "ja  0b\n"
+  "sfence \n"
+  "5:  movl  %0, %%eax\n"   
+  "shrl  $2, %0\n"  
+  "andl $3, %%eax\n"
+  "cld\n"   
+  "6:  rep; movsl\n"   
+  "movl %%eax,%0\n"
+  "7:  rep; movsb\n"
+  "8:\n"
+  ".section .fixup,\"ax\"\n"
+  "9:  lea 0(%%eax,%0,4),%0\n"
+  "16: pushl %0\n"
+  "pushl %%eax\n"
+  "xorl %%eax,%%eax\n"
+  "rep; stosb\n"
+  "popl %%eax\n"
+  "popl %0\n"
+  "jmp 8b\n"
+  ".previous\n"
+  ".section __ex_table,\"a\"\n"
+  ".align 4\n"
+  ".long 0b,16b\n"

Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()

2005-09-01 Thread Hiro Yoshioka
From: Andrew Morton <[EMAIL PROTECTED]>
> Hiro Yoshioka <[EMAIL PROTECTED]> wrote:
> >
> > --- linux-2.6.12.4.orig/arch/i386/lib/usercopy.c2005-08-05 
> > 16:04:37.0 +0900
> >  +++ linux-2.6.12.4.nt/arch/i386/lib/usercopy.c 2005-09-01 
> > 17:09:41.0 +0900
> 
> Really.  Please redo and retest the patch against a current kernel.

Does it mean 2.6.13? I'll do it. 

Regards,
  Hiro


Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()

2005-09-01 Thread Hiro Yoshioka
Andrew,

From: Andrew Morton <[EMAIL PROTECTED]>
> Andi Kleen <[EMAIL PROTECTED]> wrote:
> >
> > On Friday 02 September 2005 04:08, Andrew Morton wrote:
> > 
> > > I suppose I'll queue it up in -mm for a while, although I'm a bit dubious
> > > about the whole idea...  We'll gain some and we'll lose some - how do we
> > > know it's a net gain?
> > 
> > I suspect it'll gain more than it loses. The only case where it might 
> > not gain is immediately someone reading the data from the page cache again
> > after the write.
> 
> That's a pretty common case - temporary files.
> 
> > But I suppose that's far less frequent than writing the data.
> 
> yup.
> 
> Hiro, could you please send through a summary of the performance testing
> results sometime?  Runtimes rather than oprofile output?

The iozone results are:

original 2.6.12.4  CPU time = 207.768 sec
cache aware        CPU time = 184.783 sec
(total of three runs)
184.783/207.768=88.94% (11.06% reduction)

original:
pattern9-0-cpu4-0-08191720/iozone.out:  CPU Utilization: Wall time   45.997  CPU time   64.527  CPU utilization 140.28 %
pattern9-0-cpu4-0-08191741/iozone.out:  CPU Utilization: Wall time   46.878  CPU time   71.933  CPU utilization 153.45 %
pattern9-0-cpu4-0-08191743/iozone.out:  CPU Utilization: Wall time   45.152  CPU time   71.308  CPU utilization 157.93 %

cache aware:
pattern9-0-cpu4-0-09011728/iozone.out:  CPU Utilization: Wall time   44.842  CPU time   62.465  CPU utilization 139.30 %
pattern9-0-cpu4-0-09011731/iozone.out:  CPU Utilization: Wall time   44.718  CPU time   59.273  CPU utilization 132.55 %
pattern9-0-cpu4-0-09011744/iozone.out:  CPU Utilization: Wall time   44.367  CPU time   63.045  CPU utilization 142.10 %
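
The totals and the reduction can be rechecked from the per-run CPU times listed above. The following is just a small helper sketch for that arithmetic; the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Sum of `n` CPU-time samples, in seconds. */
double total_cpu_time(const double *runs, size_t n)
{
    assert(runs != NULL);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += runs[i];
    return sum;
}

/* Percentage reduction going from `before` to `after`. */
double percent_reduction(double before, double after)
{
    assert(before > 0.0);
    return (1.0 - after / before) * 100.0;
}
```

Feeding in the three original runs (64.527, 71.933, 71.308) and the three cache-aware runs (62.465, 59.273, 63.045) gives the two totals quoted in the summary, and percent_reduction on those totals gives the quoted reduction of about 11%.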

Regards,
  Hiro


Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()

2005-09-01 Thread Hiro Yoshioka
From: Andi Kleen <[EMAIL PROTECTED]>
> On Thursday 01 September 2005 11:07, Hiro Yoshioka wrote:
> 
> > The following is the almost final version of the
> > cache pollution aware __copy_from_user_ll() patch.
> 
> Looks good to me.
> 
> Once the filemap.c hunk is in I'll probably do something
> similar for x86-64.

Thank you very much. What else should I do? Should I just
wait for the patch to be checked in?

Regards,
  Hiro


Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()

2005-09-01 Thread Hiro Yoshioka
Hi,

> From: Andi Kleen <[EMAIL PROTECTED]>
> > > Hi,
> > > 
> > > The following patch does not use MMX registers so that we don't have
> > > to worry about save/restore the FPU/MMX states.
> > > 
> > > What do you think?
> > 
> > Performance will probably be bad on K7 Athlons - those have a microcoded
> > movnti which is quite slow.
> > 
> > Also BTW I don't see any code anywhere that tests the CPUID bits,
> > so your code will fail spectacularly on a PII that didn't do SSE
> > (intel user copy used to be enabled on those) 
> > 
> > One way to solve this might be to use different code using
> > alternative()
> > 
> > -Andi

The following is the almost final version of the
cache pollution aware __copy_from_user_ll() patch.

1) use the sfence instruction to serialize all store-to-memory
instructions.
2) check whether the CPU has the xmm2 (SSE2) extensions, which provide movnti.
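
To illustrate point 1), here is a minimal user-space sketch of a copy that uses non-temporal 32-bit stores followed by sfence. It is not the kernel function from the patch: the name copy_nocache_sketch is hypothetical, it uses the SSE2 compiler intrinsics _mm_stream_si32 (movnti) and _mm_sfence instead of inline assembly, and it assumes SSE2 availability is decided at compile time through __SSE2__.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>  /* _mm_stream_si32 (movnti) and _mm_sfence */
#endif

/*
 * Hypothetical user-space helper (an assumption for illustration only).
 * Copies n bytes; when SSE2 is available, the 4-byte-aligned part of the
 * destination is written with non-temporal stores that bypass the cache.
 */
static void copy_nocache_sketch(void *to, const void *from, size_t n)
{
    unsigned char *d = to;
    const unsigned char *s = from;

    assert(to != NULL && from != NULL);

#if defined(__SSE2__)
    /* Byte-copy until the destination is 4-byte aligned. */
    while (n > 0 && ((uintptr_t)d & 3u) != 0) {
        *d++ = *s++;
        n--;
    }
    /* Main loop: one movnti per 32-bit word. */
    while (n >= 4) {
        int word;

        memcpy(&word, s, sizeof word);  /* the source may be unaligned */
        _mm_stream_si32((int *)d, word);
        d += 4;
        s += 4;
        n -= 4;
    }
    /* Non-temporal stores are weakly ordered; sfence orders them
     * before any store that follows. */
    _mm_sfence();
#endif
    /* Tail bytes, or the whole buffer when SSE2 is not available. */
    memcpy(d, s, n);
}
```

The fence is issued once after the loop rather than after every store, which is also where the patch description places it.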

I think it is good enough to be considered for
the mainline.

What do you think?

Some performance data are

Total of GLOBAL_POWER_EVENTS (CPU cycle samples)

2.6.12.4.orig  1921587
2.6.12.4.nt    1599424
1599424/1921587=83.23% (16.77% reduction)

BSQ_CACHE_REFERENCE (L3 cache miss)
2.6.12.4.orig  57427
2.6.12.4.nt    20858
20858/57427=36.32% (63.7% reduction)

L3 cache miss reduction of __copy_from_user_ll
samples  %
37408    65.1412  vmlinux  __copy_from_user_ll
23        0.1103  vmlinux  __copy_user_zeroing_intel_nocache
23/37408=0.061% (99.94% reduction)

Top 5 of 2.6.12.4.nt
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) 
with a unit mask of 0x01 (mandatory) count 10
samples  %        app name  symbol name
128392   8.0274   vmlinux   __copy_user_zeroing_intel_nocache
64206    4.0143   vmlinux   journal_add_journal_head
59746    3.7355   vmlinux   do_get_write_access
47674    2.9807   vmlinux   journal_put_journal_head
46021    2.8774   vmlinux   journal_dirty_metadata
pattern9-0-cpu4-0-09011728/summary.out

Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) with 
a unit mask of 0x3f (multiple flags) count 3000
samples  %        app name  symbol name
69755    4.2861   vmlinux   __copy_user_zeroing_intel_nocache
55685    3.4215   vmlinux   journal_add_journal_head
52371    3.2179   vmlinux   __find_get_block
45504    2.7960   vmlinux   journal_put_journal_head
36005    2.2123   vmlinux   journal_stop
pattern9-0-cpu4-0-09011744/summary.out

Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) with 
a unit mask of 0x200 (read 3rd level cache miss) count 3000
samples  %        app name  symbol name
1147     5.4994   vmlinux   journal_add_journal_head
881      4.2240   vmlinux   journal_dirty_data
872      4.1809   vmlinux   blk_rq_map_sg
734      3.5192   vmlinux   journal_commit_transaction
617      2.9582   vmlinux   radix_tree_delete
pattern9-0-cpu4-0-09011731/summary.out

diff -ur linux-2.6.12.4.orig/Makefile linux-2.6.12.4.nt/Makefile
--- linux-2.6.12.4.orig/Makefile	2005-08-12 14:37:59.0 +0900
+++ linux-2.6.12.4.nt/Makefile	2005-08-24 17:23:57.0 +0900
@@ -1,7 +1,7 @@
 VERSION = 2
 PATCHLEVEL = 6
 SUBLEVEL = 12
-EXTRAVERSION = .4.orig
+EXTRAVERSION = .4.nt
 NAME=Woozy Numbat
 
 # *DOCUMENTATION*
diff -ur linux-2.6.12.4.orig/arch/i386/lib/usercopy.c linux-2.6.12.4.nt/arch/i386/lib/usercopy.c
--- linux-2.6.12.4.orig/arch/i386/lib/usercopy.c	2005-08-05 16:04:37.0 +0900
+++ linux-2.6.12.4.nt/arch/i386/lib/usercopy.c	2005-09-01 17:09:41.0 +0900
@@ -421,6 +421,107 @@
   : "eax", "edx", "memory");
 	return size;
 }
+
+/* Non Temporal Hint version of __copy_user_zeroing_intel */
+/* It is cache aware. */
+/* [EMAIL PROTECTED]  */
+static unsigned long
+__copy_user_zeroing_intel_nocache(void *to, const void __user *from, unsigned long size)
+{
+	int d0, d1;
+
+   __asm__ __volatile__(
+  ".align 2,0x90\n"
+  "0:  movl 32(%4), %%eax\n"
+  "cmpl $67, %0\n"  
+  "jbe 2f\n"
+  "1:  movl 64(%4), %%eax\n"
+  ".align 2,0x90\n" 
+  "2:  movl 0(%4), %%eax\n" 
+  "21: movl 4(%4), %%edx\n" 
+  "movnti %%eax, 0(%3)\n" 
+  "movnti %%edx, 4(%3)\n" 
+  "3:  movl 8(%4), %%eax\n" 
+  "31: movl 12(%4),%%edx\n" 
+  "movnti %%eax, 8(%3)\n" 
+  "movnti %%edx, 12(%3)\n"
+  "4:  movl 16(%4), %%eax\n"


Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()

2005-09-01 Thread Hiro Yoshioka
From: Andrew Morton <[EMAIL PROTECTED]>
> Hiro Yoshioka <[EMAIL PROTECTED]> wrote:
> >
> > --- linux-2.6.12.4.orig/arch/i386/lib/usercopy.c	2005-08-05 16:04:37.0 +0900
> > +++ linux-2.6.12.4.nt/arch/i386/lib/usercopy.c	2005-09-01 17:09:41.0 +0900
>
> Really.  Please redo and retest the patch against a current kernel.

Does it mean 2.6.13? I'll do it. 

Regards,
  Hiro


Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()

2005-08-24 Thread Hiro Yoshioka
From: Andi Kleen <[EMAIL PROTECTED]>
> > Hi,
> > 
> > The following patch does not use MMX registers so that we don't have
> > to worry about save/restore the FPU/MMX states.
> > 
> > What do you think?
> 
> Performance will probably be bad on K7 Athlons - those have a microcoded
> movnti which is quite slow.
> 
> Also BTW I don't see any code anywhere that tests the CPUID bits,
> so your code will fail spectacularly on a PII that didn't do SSE
> (intel user copy used to be enabled on those) 
> 
> One way to solve this might be to use different code using
> alternative()
> 
> -Andi

Thanks for your comments. I'll consider it.
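
As a rough illustration of the CPUID point, here is a user-space sketch that tests for SSE2 once and then selects a copy routine. It is not the kernel's alternative() mechanism, which patches instructions at boot; a plain function pointer is swapped in here as a stand-in. All names (cpu_has_sse2, select_copy, copy_cached, copy_nontemporal_stub) are hypothetical, and __builtin_cpu_supports() is a GCC/Clang builtin for x86, so other targets fall back to "no SSE2".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef void (*copy_fn)(void *to, const void *from, size_t n);

/* Plain cached copy; safe on every CPU. */
static void copy_cached(void *to, const void *from, size_t n)
{
    memcpy(to, from, n);
}

/*
 * Placeholder for a movnti-based copy. In this sketch it is only a
 * stand-in that calls memcpy, so that the dispatch logic can be shown
 * without repeating the store loop.
 */
static void copy_nontemporal_stub(void *to, const void *from, size_t n)
{
    memcpy(to, from, n);
}

/* Returns nonzero when the running CPU reports SSE2 (needed for movnti). */
static int cpu_has_sse2(void)
{
#if (defined(__i386__) || defined(__x86_64__)) && defined(__GNUC__)
    __builtin_cpu_init();
    return __builtin_cpu_supports("sse2") != 0;
#else
    return 0;  /* unknown target: assume no SSE2 */
#endif
}

/* Pick the routine once; callers then go through the returned pointer. */
static copy_fn select_copy(int has_sse2)
{
    return has_sse2 ? copy_nontemporal_stub : copy_cached;
}
```

A caller would do something like `copy_fn copy = select_copy(cpu_has_sse2());` at start-up, so a CPU without SSE2 never reaches the movnti path.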

Regards,
  Hiro


Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()

2005-08-24 Thread Hiro Yoshioka
From: Hirokazu Takahashi <[EMAIL PROTECTED]>
> > The following patch does not use MMX registers so that we don't have
> > to worry about save/restore the FPU/MMX states.
> > 
> > What do you think?
> 
> I think __copy_user_zeroing_intel_nocache() should be followed by sfence
> or mfence instruction to flush the data.

Thanks. I'll implement it.

Regards,
  Hiro


Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()

2005-08-24 Thread Hiro Yoshioka
const void __user *from, unsigned long n);
 
 /*
  * Here we special-case 1, 2 and 4-byte copy_*_user invocations.  On a fault
@@ -502,11 +504,40 @@
 }
 
 static inline unsigned long
+__copy_from_user_inatomic_nocache(void *to, const void __user *from, unsigned long n)
+{
+	if (__builtin_constant_p(n)) {
+		unsigned long ret;
+
+		switch (n) {
+		case 1:
+			__get_user_size(*(u8 *)to, from, 1, ret, 1);
+			return ret;
+		case 2:
+			__get_user_size(*(u16 *)to, from, 2, ret, 2);
+			return ret;
+		case 4:
+			__get_user_size(*(u32 *)to, from, 4, ret, 4);
+			return ret;
+		}
+	}
+	return __copy_from_user_ll_nocache(to, from, n);
+}
+
+static inline unsigned long
 __copy_from_user(void *to, const void __user *from, unsigned long n)
 {
 	might_sleep();
 	return __copy_from_user_inatomic(to, from, n);
 }
+
+static inline unsigned long
+__copy_from_user_nocache(void *to, const void __user *from, unsigned long n)
+{
+	might_sleep();
+	return __copy_from_user_inatomic_nocache(to, from, n);
+}
+
 unsigned long __must_check copy_to_user(void __user *to,
 				const void *from, unsigned long n);
 unsigned long __must_check copy_from_user(void *to,
diff -ur linux-2.6.12.4.orig/mm/filemap.c linux-2.6.12.4.nt/mm/filemap.c
--- linux-2.6.12.4.orig/mm/filemap.c	2005-08-05 16:04:37.0 +0900
+++ linux-2.6.12.4.nt/mm/filemap.c	2005-08-16 10:16:06.0 +0900
@@ -1727,13 +1727,13 @@
 	int left;
 
 	kaddr = kmap_atomic(page, KM_USER0);
-	left = __copy_from_user_inatomic(kaddr + offset, buf, bytes);
+	left = __copy_from_user_inatomic_nocache(kaddr + offset, buf, bytes);
 	kunmap_atomic(kaddr, KM_USER0);
 
 	if (left != 0) {
 		/* Do it the slow way */
 		kaddr = kmap(page);
-		left = __copy_from_user(kaddr + offset, buf, bytes);
+		left = __copy_from_user_nocache(kaddr + offset, buf, bytes);
 		kunmap(page);
 	}
 	return bytes - left;
@@ -1750,7 +1750,7 @@
 		int copy = min(bytes, iov->iov_len - base);
 
 		base = 0;
-		left = __copy_from_user_inatomic(vaddr, buf, copy);
+		left = __copy_from_user_inatomic_nocache(vaddr, buf, copy);
 		copied += copy;
 		bytes -= copy;
 		vaddr += copy;
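
The first filemap hunk above follows a "try the atomic copy, fall back to the sleeping copy" pattern. A minimal user-space sketch of that control flow is below. Everything in it is an assumption for illustration: copy_fast_sketch and copy_slow_sketch are hypothetical stand-ins for the kernel helpers, and a page fault on the fast path is simulated with a byte limit. The one thing taken from the kernel is the return convention, where the copy helpers return the number of bytes that could NOT be copied.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for the atomic (non-faulting) copy: it copies at
 * most `limit` bytes and returns how many bytes were left uncopied.
 */
static size_t copy_fast_sketch(void *to, const void *from, size_t n,
                               size_t limit)
{
    size_t done = n < limit ? n : limit;

    assert(to != NULL && from != NULL);
    memcpy(to, from, done);
    return n - done;
}

/* Hypothetical stand-in for the sleeping slow path: copies everything. */
static size_t copy_slow_sketch(void *to, const void *from, size_t n)
{
    memcpy(to, from, n);
    return 0;
}

/*
 * Same shape as the hunk: try the fast path first, redo the whole copy on
 * the slow path if anything was left, and return the bytes copied.
 */
static size_t copy_with_fallback(void *to, const void *from, size_t bytes,
                                 size_t fast_limit)
{
    size_t left = copy_fast_sketch(to, from, bytes, fast_limit);

    if (left != 0)
        left = copy_slow_sketch(to, from, bytes);  /* do it the slow way */
    return bytes - left;
}
```

Because both paths share one return convention, swapping in the _nocache variants (as the patch does) leaves the caller's logic unchanged.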

Regards,
  Hiro
--
Hiro Yoshioka
CTO/Miracle Linux Corporation


Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()

2005-08-22 Thread Hiro Yoshioka
Hi,

It seems that this mail did not go out,
so I am resending it.

> On 8/18/05, Hiro Yoshioka <[EMAIL PROTECTED]> wrote:
> > 1) using stack to save/restore MMX registers
> 
> It seems to me that it has some regression.
> I'd like to rollback it and use kernel_fpu_begin() and kernel_fpu_end().

The following is the current version of the cache-aware copy_from_user_ll.

1) uses kernel_fpu_begin()/kernel_fpu_end()
2) low-latency version of the cache-aware copy
3) adds __copy_user*_nocache APIs, to be used only where you want them.
(There is no change to the current APIs.)

Some performance data are

Total of GLOBAL_POWER_EVENTS (CPU cycle samples)

2.6.12.4.orig     1921587
2.6.12.4.preempt  1634411
1634411/1921587=85.06% (15% reduction)

BSQ_CACHE_REFERENCE (L3 cache miss)
2.6.12.4.orig     57427
2.6.12.4.preempt  17398

samples  %
37408    65.1412  vmlinux  __copy_from_user_ll
51        0.2931  vmlinux  __copy_user_zeroing_inatomic_nocache
51/37408=0.136% (99.86% reduction)

Top 5 2.6.12.4.orig
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) 
with a unit mask of 0x01 (mandatory) count 10
samples  %        app name  symbol name
287643   14.9692  vmlinux   __copy_from_user_ll
72660     3.7813  vmlinux   journal_add_journal_head
65011     3.3832  vmlinux   do_get_write_access
50618     2.6342  vmlinux   journal_put_journal_head
48068     2.5015  vmlinux   journal_dirty_metadata
pattern9-0-cpu4-0-08191743/summary.out

Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) with 
a unit mask of 0x3f (multiple flags) count 3000
samples  %        app name  symbol name
134756    7.9364  vmlinux   __copy_from_user_ll
57735     3.4003  vmlinux   journal_add_journal_head
50653     2.9832  vmlinux   __find_get_block
44522     2.6221  vmlinux   journal_put_journal_head
38928     2.2927  vmlinux   journal_dirty_metadata
pattern9-0-cpu4-0-08191741/summary.out

Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) with 
a unit mask of 0x200 (read 3rd level cache miss) count 3000
samples  %        app name  symbol name
37408    65.1412  vmlinux   __copy_from_user_ll
953       1.6595  vmlinux   blk_rq_map_sg
886       1.5429  vmlinux   sub_preempt_count
680       1.1841  vmlinux   journal_add_journal_head
598       1.0413  vmlinux   journal_commit_transaction
pattern9-0-cpu4-0-08191720/summary.out

Top 5 2.6.12.4.preempt
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) 
with a unit mask of 0x01 (mandatory) count 10
samples  %        app name  symbol name
123531    7.5582  vmlinux   __copy_user_zeroing_inatomic_nocache
64820     3.9660  vmlinux   journal_add_journal_head
60460     3.6992  vmlinux   do_get_write_access
47172     2.8862  vmlinux   journal_put_journal_head
46753     2.8606  vmlinux   journal_dirty_metadata
pattern9-0-cpu4-0-08190838/summary.out

Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) with 
a unit mask of 0x3f (multiple flags) count 3000
samples  %        app name  symbol name
126762    6.7993  vmlinux   __copy_user_zeroing_inatomic_nocache
79803     4.2805  vmlinux   journal_add_journal_head
70271     3.7692  vmlinux   journal_dirty_metadata
66146     3.5480  vmlinux   __find_get_block
58082     3.1154  vmlinux   journal_put_journal_head
pattern9-0-cpu4-0-08190855/summary.out

Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) with 
a unit mask of 0x200 (read 3rd level cache miss) count 3000
samples  %        app name  symbol name
901       5.1788  vmlinux   blk_rq_map_sg
675       3.8798  vmlinux   journal_commit_transaction
637       3.6613  vmlinux   radix_tree_delete
605       3.4774  vmlinux   journal_add_journal_head
580       3.3337  vmlinux   release_pages
...
51        0.2931  vmlinux   __copy_user_zeroing_inatomic_nocache
...
1         0.0057  vmlinux   __copy_from_user_ll_inatomic_nocache
pattern9-0-cpu4-0-08190859/summary.out

2.6.12.4-usercopy.c.patch.050819
diff -ur linux-2.6.12.4.orig/Makefile linux-2.6.12.4.preempt/Makefile
--- linux-2.6.12.4.orig/Makefile	2005-08-12 14:37:59.0 +0900
+++ linux-2.6.12.4.preempt/Makefile	2005-08-18 18:47:07.0 +0900
@@ -1,7 +1,7 @@
 VERSION = 2
 PATCHLEVEL = 6
 SUBLEVEL = 12
-EXTRAVERSION = .4.orig
+EXTRAVERSION = .4.preempt
 NAME=Woozy Numbat
 
 # *DOCUMENTATION*
diff -ur linux-2.6.12.4.orig/arch/i386/lib/usercopy.c linux-2.6.12.4.preempt/

Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()

2005-08-22 Thread Hiro Yoshioka
Hi,

It seems to me this mail does not go out.
So resending it.

> On 8/18/05, Hiro Yoshioka <[EMAIL PROTECTED]> wrote:
> > 1) using stack to save/restore MMX registers
>
> It seems to me that it has some regression.
> I'd like to rollback it and use kernel_fpu_begin() and kernel_fpu_end().

The following is the current version of the cache-aware copy_from_user_ll:

1) it uses kernel_fpu_begin()/kernel_fpu_end()
2) it adds a low-latency version of the cache-aware copy
3) it adds __copy_user*_nocache APIs, so you can call them where you want
   the cache-aware copy.
(There is no change in the current APIs.)

Some performance data:

Total of GLOBAL_POWER_EVENTS (CPU cycle samples)

2.6.12.4.orig     1921587
2.6.12.4.preempt  1634411
1634411/1921587=85.06% (15% reduction)

BSQ_CACHE_REFERENCE (L3 cache miss)
2.6.12.4.orig     57427
2.6.12.4.preempt  17398

samples  %
37408    65.1412  vmlinux   __copy_from_user_ll
51       0.2931   vmlinux   __copy_user_zeroing_inatomic_nocache
51/37408=0.136% (99.86% reduction)
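
The ratios above can be re-derived from the raw sample counts. The following is only a small sketch for checking the arithmetic; the helper names percent_of(), percent_reduction() and print_summary() are made up for this illustration and are not part of the patch.

```c
#include <stdio.h>

/* `after` expressed as a percentage of `before`. */
double percent_of(double after, double before)
{
    return 100.0 * after / before;
}

/* Percentage reduction when going from `before` to `after`. */
double percent_reduction(double after, double before)
{
    return 100.0 - percent_of(after, before);
}

/* Print the ratios for the sample counts quoted in this mail. */
void print_summary(void)
{
    /* GLOBAL_POWER_EVENTS totals: 2.6.12.4.orig vs 2.6.12.4.preempt */
    printf("cycles:        %.2f%% of orig\n",
           percent_of(1634411.0, 1921587.0));

    /* BSQ_CACHE_REFERENCE (L3 miss) totals */
    printf("L3 misses:     %.2f%% of orig\n",
           percent_of(17398.0, 57427.0));

    /* L3 miss samples attributed to the copy routine itself */
    printf("copy routine:  %.2f%% reduction\n",
           percent_reduction(51.0, 37408.0));
}
```

Calling print_summary() from a main() prints the three ratios; the cycle ratio comes out at roughly 85% and the reduction for the copy routine at roughly 99.86%, matching the figures quoted above.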

Top 5 2.6.12.4.orig
Counted GLOBAL_POWER_EVENTS events (time during which processor is not
stopped) with a unit mask of 0x01 (mandatory) count 10
samples  %        app name  symbol name
287643   14.9692  vmlinux   __copy_from_user_ll
72660    3.7813   vmlinux   journal_add_journal_head
65011    3.3832   vmlinux   do_get_write_access
50618    2.6342   vmlinux   journal_put_journal_head
48068    2.5015   vmlinux   journal_dirty_metadata
pattern9-0-cpu4-0-08191743/summary.out

Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus
unit) with a unit mask of 0x3f (multiple flags) count 3000
samples  %        app name  symbol name
134756   7.9364   vmlinux   __copy_from_user_ll
57735    3.4003   vmlinux   journal_add_journal_head
50653    2.9832   vmlinux   __find_get_block
44522    2.6221   vmlinux   journal_put_journal_head
38928    2.2927   vmlinux   journal_dirty_metadata
pattern9-0-cpu4-0-08191741/summary.out

Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus
unit) with a unit mask of 0x200 (read 3rd level cache miss) count 3000
samples  %        app name  symbol name
37408    65.1412  vmlinux   __copy_from_user_ll
953      1.6595   vmlinux   blk_rq_map_sg
886      1.5429   vmlinux   sub_preempt_count
680      1.1841   vmlinux   journal_add_journal_head
598      1.0413   vmlinux   journal_commit_transaction
pattern9-0-cpu4-0-08191720/summary.out

Top 5 2.6.12.4.preempt
Counted GLOBAL_POWER_EVENTS events (time during which processor is not
stopped) with a unit mask of 0x01 (mandatory) count 10
samples  %        app name  symbol name
123531   7.5582   vmlinux   __copy_user_zeroing_inatomic_nocache
64820    3.9660   vmlinux   journal_add_journal_head
60460    3.6992   vmlinux   do_get_write_access
47172    2.8862   vmlinux   journal_put_journal_head
46753    2.8606   vmlinux   journal_dirty_metadata
pattern9-0-cpu4-0-08190838/summary.out

Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus
unit) with a unit mask of 0x3f (multiple flags) count 3000
samples  %        app name  symbol name
126762   6.7993   vmlinux   __copy_user_zeroing_inatomic_nocache
79803    4.2805   vmlinux   journal_add_journal_head
70271    3.7692   vmlinux   journal_dirty_metadata
66146    3.5480   vmlinux   __find_get_block
58082    3.1154   vmlinux   journal_put_journal_head
pattern9-0-cpu4-0-08190855/summary.out

Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus
unit) with a unit mask of 0x200 (read 3rd level cache miss) count 3000
samples  %        app name  symbol name
901      5.1788   vmlinux   blk_rq_map_sg
675      3.8798   vmlinux   journal_commit_transaction
637      3.6613   vmlinux   radix_tree_delete
605      3.4774   vmlinux   journal_add_journal_head
580      3.3337   vmlinux   release_pages
...
51       0.2931   vmlinux   __copy_user_zeroing_inatomic_nocache
...
1        0.0057   vmlinux   __copy_from_user_ll_inatomic_nocache
pattern9-0-cpu4-0-08190859/summary.out

2.6.12.4-usercopy.c.patch.050819
diff -ur linux-2.6.12.4.orig/Makefile linux-2.6.12.4.preempt/Makefile
--- linux-2.6.12.4.orig/Makefile 2005-08-12 14:37:59.0 +0900
+++ linux-2.6.12.4.preempt/Makefile 2005-08-18 18:47:07.0 +0900
@@ -1,7 +1,7 @@
 VERSION = 2
 PATCHLEVEL = 6
 SUBLEVEL = 12
-EXTRAVERSION = .4.orig
+EXTRAVERSION = .4.preempt
 NAME=Woozy Numbat
 
 # *DOCUMENTATION*
diff -ur linux-2.6.12.4.orig/arch/i386/lib/usercopy.c

Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()

2005-08-18 Thread Hiro Yoshioka
Hi,

On 8/18/05, Hiro Yoshioka <[EMAIL PROTECTED]> wrote:
> 1) using stack to save/restore MMX registers

It seems to me that it has some regression.
I'd like to rollback it and use kernel_fpu_begin() and kernel_fpu_end().

Regards,
  Hiro
-- 
Hiro Yoshioka
mailto:hyoshiok at miraclelinux.com


Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()

2005-08-18 Thread Hiro Yoshioka
+   : : "r" (from), "r" (to) : "memory");
+   from+=64;
+   to+=64;
+   }
+
+   for(; i>0; i--)
+   {
+   __asm__ __volatile__ (
+"  movq (%0), %%mm0\n"
+"  movq 8(%0), %%mm1\n"
+"  movq 16(%0), %%mm2\n"
+"  movq 24(%0), %%mm3\n"
+"  movntq %%mm0, (%1)\n"
+"  movntq %%mm1, 8(%1)\n"
+"  movntq %%mm2, 16(%1)\n"
+"  movntq %%mm3, 24(%1)\n"
+"  movq 32(%0), %%mm0\n"
+"  movq 40(%0), %%mm1\n"
+"  movq 48(%0), %%mm2\n"
+"  movq 56(%0), %%mm3\n"
+"  movntq %%mm0, 32(%1)\n"
+"  movntq %%mm1, 40(%1)\n"
+"  movntq %%mm2, 48(%1)\n"
+"  movntq %%mm3, 56(%1)\n"
+   : : "r" (from), "r" (to) : "memory");
+   from+=64;
+   to+=64;
+   }
+   /*
+*  Now do the tail of the block
+*/
+   /*  kernel_fpu_end();*/
+   MMX_RESTORE;
+   if(i=(len&63))
+ __copy_user_zeroing(to, from, i);
+   return i;
+}
 
 unsigned long __copy_to_user_ll(void __user *to, const void *from, unsigned 
long n)
 {
@@ -582,6 +831,36 @@
return n;
 }
 
+unsigned long
+__copy_from_user_ll_nocache(void *to, const void __user *from, unsigned long n)
+{
+   BUG_ON((long)n < 0);
+if (n < 512) {
+  if (movsl_is_ok(to, from, n))
+__copy_user_zeroing(to, from, n);
+  else
+n = __copy_user_zeroing_intel(to, from, n);
+}
+else
+  n = __copy_user_zeroing_nocache(to, from, n);
+   return n;
+}
+
+unsigned long
+__copy_from_user_ll_inatomic_nocache(void *to, const void __user *from, 
unsigned long n)
+{
+   BUG_ON((long)n < 0);
+if (n < 512) {
+  if (movsl_is_ok(to, from, n))
+__copy_user_zeroing(to, from, n);
+  else
+n = __copy_user_zeroing_intel(to, from, n);
+}
+else
+  n = __copy_user_zeroing_inatomic_nocache(to, from, n);
+   return n;
+}
+
 /**
  * copy_to_user: - Copy a block of data into user space.
  * @to:   Destination address, in user space.
diff -ur linux-2.6.12.4.orig/include/asm-i386/uaccess.h 
linux-2.6.12.4.preempt/include/asm-i386/uaccess.h
--- linux-2.6.12.4.orig/include/asm-i386/uaccess.h  2005-08-05 
16:04:37.0 +0900
+++ linux-2.6.12.4.preempt/include/asm-i386/uaccess.h   2005-08-18 
19:16:55.0 +0900
@@ -413,6 +413,10 @@
const void *from, unsigned long n);
 unsigned long __must_check __copy_from_user_ll(void *to,
const void __user *from, unsigned long n);
+unsigned long __must_check __copy_from_user_ll_nocache(void *to,
+   const void __user *from, unsigned long n);
+unsigned long __must_check __copy_from_user_ll_inatomic_nocache(void *to,
+   const void __user *from, unsigned long n);
 
 /*
  * Here we special-case 1, 2 and 4-byte copy_*_user invocations.  On a fault
@@ -502,11 +506,55 @@
 }
 
 static inline unsigned long
+__copy_from_user_inatomic_nocache(void *to, const void __user *from, unsigned 
long n)
+{
+   if (__builtin_constant_p(n)) {
+   unsigned long ret;
+
+   switch (n) {
+   case 1:
+   __get_user_size(*(u8 *)to, from, 1, ret, 1);
+   return ret;
+   case 2:
+   __get_user_size(*(u16 *)to, from, 2, ret, 2);
+   return ret;
+   case 4:
+   __get_user_size(*(u32 *)to, from, 4, ret, 4);
+   return ret;
+   }
+   }
+   return __copy_from_user_ll_inatomic_nocache(to, from, n);
+}
+
+static inline unsigned long
 __copy_from_user(void *to, const void __user *from, unsigned long n)
 {
might_sleep();
return __copy_from_user_inatomic(to, from, n);
 }
+
+static inline unsigned long
+__copy_from_user_nocache(void *to, const void __user *from, unsigned long n)
+{
+   might_sleep();
+   if (__builtin_constant_p(n)) {
+   unsigned long ret;
+
+   switch (n) {
+   case 1:
+   __get_user_size(*(u8 *)to, from, 1, ret, 1);
+   return ret;
+   case 2:
+   __get_user_size(*(u16 *)to, from, 2, ret, 2);
+   return ret;
+   case 4:
+   __get_user_size(*(u32 *)to, from, 4, ret, 4);
+   return ret;
+  

Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()

2005-08-18 Thread Hiro Yoshioka
On 16 Aug 2005 15:15:35 +0200, Andi Kleen <[EMAIL PROTECTED]> wrote:
> However it disables preemption, which especially for bigger
> copies will probably make the low latency people unhappy.

In the copy loop,
+#ifdef CONFIG_PREEMPT
+   if ( (i%64)==0 ) {
+   MMX_RESTORE;
+   MMX_SAVE;
+   };
+#endif

It costs several hundred clocks (wow) every 4KB copy.

It kills throughput but it makes the low latency people smile.

So I made two APIs:
__copy_user_zeroing_nocache()
__copy_user_zeroing_inatomic_nocache()

The former is the low-latency version and the latter is the throughput version.
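
The trade-off can be sketched in user space. This is only an illustration under stated assumptions: chunked_copy(), preempt_window() and the preempt_windows counter are hypothetical names, memcpy() stands in for the MMX copy loop, and preempt_window() stands in for the MMX_RESTORE; MMX_SAVE pair that briefly re-enables preemption.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical counter: how often the copy opened a preemption window. */
unsigned long preempt_windows;

/* Stand-in for the MMX_RESTORE; MMX_SAVE pair in the patch. */
static void preempt_window(void)
{
    preempt_windows++;
}

/*
 * Copy len bytes in 64-byte blocks. With low_latency set, open a
 * preemption window whenever the remaining block count is a multiple
 * of 64, that is, once per 4KB copied.
 */
void chunked_copy(void *to, const void *from, size_t len, int low_latency)
{
    unsigned char *dst = to;
    const unsigned char *src = from;
    size_t i;

    for (i = len >> 6; i > 0; i--) {
        if (low_latency && (i % 64) == 0)
            preempt_window();
        memcpy(dst, src, 64);
        dst += 64;
        src += 64;
    }
    memcpy(dst, src, len & 63);   /* tail of the block */
}
```

With low_latency set the sketch pays the window cost once per 4KB; with it cleared the loop never yields, which corresponds to the throughput (inatomic) variant.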

What do you think?

Regards,
  Hiro

-- 
Hiro Yoshioka
mailto:hyoshiok at miraclelinux.com


Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()

2005-08-18 Thread Hiro Yoshioka
Chuck,

On 8/18/05, Chuck Ebbert <[EMAIL PROTECTED]> wrote:
> On Wed, 17 Aug 2005 at 13:50:22 +0900 (JST), Hiro Yoshioka wrote:
> 
> > 3) page faults/exceptions/...
> > 3-1  TS flag is set by the CPU (Am I right?)
> 
>   TS will _not_ be set if a trap/fault or interrupt occurs.  The only
> way that could happen automatically would be to use a separate hardware
> task with its own TSS to handle those.

OK.

>   And since the kernel does not have any state information of its own
> (no task_struct) any attempt to save the kernel-mode FPU state would
> overwrite the current user-mode state anyway.
> 
>   Interrupt and fault handlers will not use FP instructions anyway.
> The only thing you have to worry about is getting scheduled away
> while your code is running, and I guess that's why you have to worry
> about page faults.  And as Arjan pointed out, if you are doing
> __copy_from_user_inatomic you cannot sleep (==switch to another task.)
> 
>   So I would try the code from include/asm-i386/xor.h, modify it to
> save as many registers as you plan to use and see what happens.  It will
> do all the right things. See the xor_sse_2() for how to save and restore
> properly -- you will need to put your xmm_save area on the stack.

My hack is the following. I just change from using kernel_fpu_begin()
and kernel_fpu_end() to using a stack.

My test does not find any regressions.

--- usercopy.c.orig 2005-08-05 16:04:37.0 +0900
+++ usercopy.c  2005-08-18 16:53:37.0 +0900
@@ -10,6 +10,7 @@
 #include <linux/highmem.h>
 #include <linux/blkdev.h>
 #include <linux/module.h>
+#include <asm/i387.h>
 #include <asm/uaccess.h>
 #include <asm/mmx.h>

@@ -511,6 +512,144 @@
: "memory");\
 } while (0)

+#define MMX_SAVE do {   \
+preempt_disable();  \
+__asm__ __volatile__ (  \
+"movl %%cr0,%0  ;\n\t"  \
+"clts   ;\n\t"  \
+"movq %%mm0,(%1) ;\n\t" \
+"movq %%mm1,8(%1) ;\n\t" \
+"movq %%mm2,16(%1) ;\n\t" \
+"movq %%mm3,24(%1) ;\n\t" \
+: "=r" (cr0)   \
+: "r" (mmx_save)\
+: "memory");\
+} while(0)
+
+#define MMX_RESTORE do {   \
+__asm__ __volatile__ (  \
+"sfence ;\n\t"  \
+"movq (%1),%%mm0 ;\n\t"  \
+"movq 8(%1),%%mm1 ;\n\t"  \
+"movq 16(%1),%%mm2 ;\n\t"  \
+"movq 24(%1),%%mm3 ;\n\t"  \
+"movl   %0,%%cr0;\n\t"  \
+:   \
+: "r" (cr0), "r" (mmx_save) \
+: "memory");\
+preempt_enable();   \
+} while(0)
+
+#define ALIGN8 __attribute__((aligned(8)))
+
+/* Non Temporal Hint version of mmx_memcpy */
+/* It is cache aware   */
+/* [EMAIL PROTECTED]   */
+static unsigned long
+__copy_user_zeroing_nocache(void *to, const void *from, size_t len)
+{
+/* Note! gcc doesn't seem to align stack variables properly, so we
+ * need to make use of unaligned loads and stores.
+ */
+   void *p;
+   int i;
+char mmx_save[8*4] ALIGN8;
+int cr0;
+
+   if (unlikely(in_interrupt())){
+   __copy_user_zeroing(to, from, len);
+   return len;
+   }
+
+   p = to;
+   i = len >> 6; /* len/64 */
+
+   /*kernel_fpu_begin();*/
+   MMX_SAVE;
+
+   __asm__ __volatile__ (
+   "1: prefetchnta (%0)\n" /* This set is 28 bytes */
+   "   prefetchnta 64(%0)\n"
+   "   prefetchnta 128(%0)\n"
+   "   prefetchnta 192(%0)\n"
+   "   prefetchnta 256(%0)\n"
+   "2:  \n"
+   ".section .fixup, \"ax\"\n"
+   "3: movw $0x1AEB, 1b\n" /* jmp on 26 bytes */
+   "   jmp 2b\n"
+   ".previous\n"
+   ".section __ex_table,\"a\"\n"
+   "   .align 4\n"
+   "   .long 1b, 3b\n"
+   ".previous"
+   : : "r" (from) );
+
+   for(; i>5; i--)
+   {
+   __asm__ __volatile__ (
+   "1:  prefetchnta 320(%0)\n"
+"2:  movq (%0), %%mm0\n"
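+"  movq 8(%0), %%mm1\n"
+"  movq 16(%0), %%mm2\n"
+"  movq 24(%0), %%mm3\n"
+"  movntq %%mm0, (%1)\n"
+"  movntq %%mm1, 8(%1)\n"
+"  movntq %%mm2, 16(%1)\n"
+"  movntq %%mm3, 24(%1)\n"
+"  movq 32(%0), %%mm0\n"
+"  movq 40(%0), %%mm1\n"
+"  movq 48(%0), %%mm2\n"
+"  movq 56(%0), %%mm3\n"
+"  movntq %%mm0, 32(%1)\n"
+"  movntq %%mm1, 40(%1)\n"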

Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()

2005-08-18 Thread Hiro Yoshioka
);
+  else
+n = __copy_user_zeroing_intel(to, from, n);
+}
+else
+  n = __copy_user_zeroing_nocache(to, from, n);
+   return n;
+}
+
+unsigned long
+__copy_from_user_ll_inatomic_nocache(void *to, const void __user *from, 
unsigned long n)
+{
+   BUG_ON((long)n < 0);
+if (n < 512) {
+  if (movsl_is_ok(to, from, n))
+__copy_user_zeroing(to, from, n);
+  else
+n = __copy_user_zeroing_intel(to, from, n);
+}
+else
+  n = __copy_user_zeroing_inatomic_nocache(to, from, n);
+   return n;
+}
+
 /**
  * copy_to_user: - Copy a block of data into user space.
  * @to:   Destination address, in user space.
diff -ur linux-2.6.12.4.orig/include/asm-i386/uaccess.h 
linux-2.6.12.4.preempt/include/asm-i386/uaccess.h
--- linux-2.6.12.4.orig/include/asm-i386/uaccess.h  2005-08-05 
16:04:37.0 +0900
+++ linux-2.6.12.4.preempt/include/asm-i386/uaccess.h   2005-08-18 
19:16:55.0 +0900
@@ -413,6 +413,10 @@
const void *from, unsigned long n);
 unsigned long __must_check __copy_from_user_ll(void *to,
const void __user *from, unsigned long n);
+unsigned long __must_check __copy_from_user_ll_nocache(void *to,
+   const void __user *from, unsigned long n);
+unsigned long __must_check __copy_from_user_ll_inatomic_nocache(void *to,
+   const void __user *from, unsigned long n);
 
 /*
  * Here we special-case 1, 2 and 4-byte copy_*_user invocations.  On a fault
@@ -502,11 +506,55 @@
 }
 
 static inline unsigned long
+__copy_from_user_inatomic_nocache(void *to, const void __user *from, unsigned 
long n)
+{
+   if (__builtin_constant_p(n)) {
+   unsigned long ret;
+
+   switch (n) {
+   case 1:
+   __get_user_size(*(u8 *)to, from, 1, ret, 1);
+   return ret;
+   case 2:
+   __get_user_size(*(u16 *)to, from, 2, ret, 2);
+   return ret;
+   case 4:
+   __get_user_size(*(u32 *)to, from, 4, ret, 4);
+   return ret;
+   }
+   }
+   return __copy_from_user_ll_inatomic_nocache(to, from, n);
+}
+
+static inline unsigned long
 __copy_from_user(void *to, const void __user *from, unsigned long n)
 {
might_sleep();
return __copy_from_user_inatomic(to, from, n);
 }
+
+static inline unsigned long
+__copy_from_user_nocache(void *to, const void __user *from, unsigned long n)
+{
+   might_sleep();
+   if (__builtin_constant_p(n)) {
+   unsigned long ret;
+
+   switch (n) {
+   case 1:
+   __get_user_size(*(u8 *)to, from, 1, ret, 1);
+   return ret;
+   case 2:
+   __get_user_size(*(u16 *)to, from, 2, ret, 2);
+   return ret;
+   case 4:
+   __get_user_size(*(u32 *)to, from, 4, ret, 4);
+   return ret;
+   }
+   }
+   return __copy_from_user_ll_nocache(to, from, n);
+}
+
 unsigned long __must_check copy_to_user(void __user *to,
const void *from, unsigned long n);
 unsigned long __must_check copy_from_user(void *to,
diff -ur linux-2.6.12.4.orig/mm/filemap.c linux-2.6.12.4.preempt/mm/filemap.c
--- linux-2.6.12.4.orig/mm/filemap.c2005-08-05 16:04:37.0 +0900
+++ linux-2.6.12.4.preempt/mm/filemap.c 2005-08-16 10:16:06.0 +0900
@@ -1727,13 +1727,13 @@
int left;
 
kaddr = kmap_atomic(page, KM_USER0);
-   left = __copy_from_user_inatomic(kaddr + offset, buf, bytes);
+   left = __copy_from_user_inatomic_nocache(kaddr + offset, buf, bytes);
kunmap_atomic(kaddr, KM_USER0);
 
if (left != 0) {
/* Do it the slow way */
kaddr = kmap(page);
-   left = __copy_from_user(kaddr + offset, buf, bytes);
+   left = __copy_from_user_nocache(kaddr + offset, buf, bytes);
kunmap(page);
}
return bytes - left;
@@ -1750,7 +1750,7 @@
int copy = min(bytes, iov->iov_len - base);
 
base = 0;
-   left = __copy_from_user_inatomic(vaddr, buf, copy);
+   left = __copy_from_user_inatomic_nocache(vaddr, buf, copy);
copied += copy;
bytes -= copy;
vaddr += copy;
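
The size cut-off used by the _nocache entry points above (ordinary copy below 512 bytes, non-temporal copy otherwise) can be tried out in user space. The following is a rough sketch under stated assumptions, not the patch itself: copy_nocache(), NOCACHE_THRESHOLD and last_copy_streamed are made-up names, there is no fault handling or zeroing, and SSE2 intrinsics (_mm_stream_si128) are swapped in for the MMX movntq stores. On builds without SSE2 it simply falls back to memcpy().

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>   /* _mm_loadu_si128, _mm_stream_si128, _mm_sfence */
#endif

#define NOCACHE_THRESHOLD 512   /* same cut-off as in the patch */

/* For illustration only: 1 if the last call used streaming stores. */
int last_copy_streamed;

void copy_nocache(void *to, const void *from, size_t n)
{
    unsigned char *dst = to;
    const unsigned char *src = from;

    last_copy_streamed = 0;
    if (n < NOCACHE_THRESHOLD) {
        memcpy(dst, src, n);    /* small copies stay on the cached path */
        return;
    }
#if defined(__SSE2__)
    /* Streaming stores need a 16-byte aligned destination. */
    while (((uintptr_t)dst & 15) != 0) {
        *dst++ = *src++;
        n--;
    }
    for (; n >= 16; n -= 16, dst += 16, src += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)src);
        _mm_stream_si128((__m128i *)dst, v);   /* bypasses the cache */
    }
    _mm_sfence();   /* order the streaming stores before later stores */
    last_copy_streamed = 1;
#endif
    memcpy(dst, src, n);   /* tail, or the whole copy without SSE2 */
}
```

The threshold keeps short copies, whose destination is likely to be read again soon, in the cache, while large copies avoid evicting other data.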


Regards,
  Hiro
--
Hiro Yoshioka
CTO/Miracle Linux Corporation


Re: math_state_restore() question

2005-08-17 Thread Hiro Yoshioka
> Just take a look at __switch_to(), where __unlazy_fpu() is called.

Thanks. Does an exception handler (like page_fault, etc) come 
from __switch_to()?

Regards,
  Hiro


math_state_restore() question

2005-08-17 Thread Hiro Yoshioka
Hi,

I have a quick question.

The math_state_restore() restores the FPU/MMX/XMM states.
However where do we save the previous task's states if it is necessary?

asmlinkage void math_state_restore(struct pt_regs regs)
{
struct thread_info *thread = current_thread_info();
struct task_struct *tsk = thread->task;

clts(); /* Allow maths ops (or we recurse) */
if (!tsk_used_math(tsk))
init_fpu(tsk);
restore_fpu(tsk);
thread->status |= TS_USEDFPU;   /* So we fnsave on switch_to() */
}
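
To make the question concrete, here is a toy user-space model of lazy FPU switching. It is only a sketch: every toy_* name is hypothetical, a single int stands in for the whole FPU/MMX/XMM register image, and the model assumes (as the reply in this thread says) that the save happens on the context-switch path, in __switch_to() via __unlazy_fpu().

```c
#include <stdbool.h>

/* Toy model; the toy_* names are hypothetical. */
struct toy_task {
    bool used_fpu;      /* mirrors TS_USEDFPU in thread->status */
    int  saved_state;   /* stands in for the saved FPU/MMX/XMM image */
};

struct toy_cpu {
    bool ts;                    /* mirrors CR0.TS: next FPU use traps */
    int  fpu_regs;              /* stands in for the live registers */
    struct toy_task *current;
};

/* Context switch: save the outgoing task's registers only if it used them. */
void toy_switch_to(struct toy_cpu *cpu, struct toy_task *next)
{
    struct toy_task *prev = cpu->current;

    if (prev && prev->used_fpu) {
        prev->saved_state = cpu->fpu_regs;   /* the save on switch_to() */
        prev->used_fpu = false;
        cpu->ts = true;                      /* arm the trap again */
    }
    cpu->current = next;
}

/* What math_state_restore() does in this model, on the trap. */
void toy_math_state_restore(struct toy_cpu *cpu)
{
    cpu->ts = false;                             /* clts() */
    cpu->fpu_regs = cpu->current->saved_state;   /* restore_fpu() */
    cpu->current->used_fpu = true;               /* TS_USEDFPU */
}

/* The current task executes an FPU instruction that writes `value`. */
void toy_fpu_write(struct toy_cpu *cpu, int value)
{
    if (cpu->ts)
        toy_math_state_restore(cpu);
    cpu->fpu_regs = value;
}

/* The current task executes an FPU instruction that reads the registers. */
int toy_fpu_read(struct toy_cpu *cpu)
{
    if (cpu->ts)
        toy_math_state_restore(cpu);
    return cpu->fpu_regs;
}
```

The model starts with ts set, so the first FPU use by any task goes through toy_math_state_restore(); the previous owner's image has already been written back at the switch, which is why the restore path itself never needs to save anything.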

Thanks in advance,
  Hiro
-- 
Hiro Yoshioka
mailto:hyoshiok at miraclelinux.com


Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()

2005-08-16 Thread Hiro Yoshioka
Akira,

Thanks for your suggestions.

On 8/17/05, Akira Tsukamoto <[EMAIL PROTECTED]> wrote:
> Anyway, going back to copy_user topic,
> big remaining issues are
>   1)store/restore floating point register (80/64bytes) twice every time by
>  surrounding with kernel_fpu_begin()/kernel_fpu_end() is big penalty

I don't know. If nobody uses MMX/XMM, then there is no need
to save and restore.

>   2)after pagefault not always come back to copy function and corrupts fp 
> register

I'm trying to understand this mechanism, but I don't
understand it very well yet.

>   3)disabling long preemption
> Please correct me if I am wrong.
> 
> I tried to implement fpsave inside pagefault handler once and here is my junk;
> http://www.suna-asobi.com/~akira-t/linux/k7-copy-user/K7-copy_47_with_fpusave_not_finished.patch
> never had a time to finish it. Hiro, does it help you?

Thanks. I'm reading your patch but I don't understand it very well yet.

I'll come back to you with questions.

Regards,
  Hiro
-- 
Hiro Yoshioka
mailto:hyoshiok at miraclelinux.com


Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()

2005-08-16 Thread Hiro Yoshioka
From: Hiro Yoshioka <[EMAIL PROTECTED]>
Subject: Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
Date: Wed, 17 Aug 2005 08:21:53 +0900 (JST)
Message-ID: <[EMAIL PROTECTED]>

> Chuck,
> 
> From: Chuck Ebbert <[EMAIL PROTECTED]>
> > On Tue, 16 Aug 2005 at 19:16:17 +0900 (JST), Hiro Yoshioka wrote:
> > > oh, really? Does the linux kernel take care of
> > > SSE save/restore on a task switch?
> > 
> >  Check out XMMS_SAVE and XMMS_RESTORE in include/asm-i386/xor.h
> 
> Thanks for your suggestion. But it seems to me it won't help
> when we have a page fault or other exeptions.

Hi,

Let me check my understanding of how the kernel saves/restores the
FPU/MMX/XMM registers. Please let me know if I'm wrong.

1) kernel_fpu_begin()
     preempt_disable()
     if TS_USEDFPU then
         __save_init_fpu()
             ... save to tsk->thread.i387.f*save
             clear TS_USEDFPU flag of tsk->thread_info->status
     else
         clts() --- clear TS flag of CR0

2) copy
     MMX/XMM registers are used.

3) page faults/exceptions/...
   3-1  TS flag is set by the CPU (Am I right?)
        if nobody uses MMX/XMM
   3-2      it's fine. we don't need save/restore
        else
   3-3      MMX/XMM is used

   When the TS flag is set, the CPU monitors the instruction stream
   for x87 FPU/MMX/SSE/SSE2 instructions. When the CPU detects one of
   these instructions, it raises a device-not-available exception (#NM)
   prior to executing the instruction. (IA32 Software Developer's Manual,
   Vol. 3, 12.5.1)

   math_state_restore() is the device-not-available exception handler
     clts()
     if (!tsk_used_math(tsk))
         init_fpu(tsk);
     restore_fpu(tsk);
     set TS_USEDFPU;

4) kernel_fpu_end()
     stts(); set TS flag of CR0
     preempt_enable();

It seems to me that the kernel automatically saves/restores the
FPU/MMX/XMM registers.

What's wrong with it? Do I misunderstand it?
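
The sequence in 1) to 4) can be played through in a small user-space toy
model. This is only a sketch under loud assumptions: one CPU, a single
integer standing in for the whole FPU/MMX/XMM register file, and no
preemption. All names ending in _model, plus fpu_access() and
lazy_fpu_demo(), are hypothetical stand-ins and not kernel APIs. It does
not cover the case raised elsewhere in this thread where the copy itself
page-faults and the task sleeps in the handler.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: one CPU, one "register", no preemption. */
struct task_model {
    bool ts_usedfpu;   /* stands in for TS_USEDFPU in thread_info->status */
    int  saved_regs;   /* stands in for the tsk->thread.i387 save area */
};

static bool cr0_ts = true;  /* stands in for the TS flag of CR0 */
static int  fpu_regs;       /* stands in for the live FPU/MMX/XMM registers */
static int  nm_count;       /* number of modelled #NM exceptions */

/* Models math_state_restore(), the device-not-available handler. */
static void math_state_restore_model(struct task_model *t)
{
    cr0_ts = false;             /* clts() */
    fpu_regs = t->saved_regs;   /* restore_fpu() */
    t->ts_usedfpu = true;       /* set TS_USEDFPU */
    nm_count++;
}

/* Any FPU/MMX/SSE instruction traps first when TS is set. */
static int *fpu_access(struct task_model *t)
{
    if (cr0_ts)
        math_state_restore_model(t);
    return &fpu_regs;
}

static void kernel_fpu_begin_model(struct task_model *t)
{
    if (t->ts_usedfpu) {
        t->saved_regs = fpu_regs;   /* __save_init_fpu() */
        t->ts_usedfpu = false;
    } else {
        cr0_ts = false;             /* clts() */
    }
}

static void kernel_fpu_end_model(void)
{
    cr0_ts = true;                  /* stts() */
}

/* The task loads 42, the kernel clobbers the register, the task reads back. */
static int lazy_fpu_demo(void)
{
    struct task_model t = { false, 0 };

    cr0_ts = true;
    fpu_regs = 0;
    nm_count = 0;

    *fpu_access(&t) = 42;           /* first use: #NM, then the store */
    kernel_fpu_begin_model(&t);     /* saves 42 into the task */
    fpu_regs = 7;                   /* kernel copy uses the registers */
    kernel_fpu_end_model();
    assert(cr0_ts);                 /* next FPU use will trap */
    return *fpu_access(&t);         /* #NM restores the saved value */
}
```

In this model the task sees its own value again after the kernel has used
the registers, at the cost of one extra #NM trap.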

Regards,
  Hiro


Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()

2005-08-16 Thread Hiro Yoshioka
Chuck,

From: Chuck Ebbert <[EMAIL PROTECTED]>
> On Tue, 16 Aug 2005 at 19:16:17 +0900 (JST), Hiro Yoshioka wrote:
> > oh, really? Does the linux kernel take care of
> > SSE save/restore on a task switch?
> 
>  Check out XMMS_SAVE and XMMS_RESTORE in include/asm-i386/xor.h

Thanks for your suggestion. But it seems to me it won't help
when we have a page fault or other exceptions.

Regards,
  Hiro


Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()

2005-08-16 Thread Hiro Yoshioka
From: Arjan van de Ven <[EMAIL PROTECTED]>
> > My code does nothing do it.
> > 
> > I need a volunteer to implement it.
> 
> it's actually not too hard; all you need is to use SSE and not MMX; and
> then just store sse register you're overwriting on the stack or so...

oh, really? Does the linux kernel take care of
SSE save/restore on a task switch?

Regards,
  Hiro


Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()

2005-08-15 Thread Hiro Yoshioka
Takahashi san,

I appreciate your comments.

> Hi,
> 
> BTW, what are you going to do with the page-faults which may happen
> during __copy_user_zeroing_nocache()? The current process may be blocked
> in the handler for a while and get FPU registers polluted.
> kernel_fpu_begin() won't help the case. This is another issue, though.

My code does nothing to handle it.

I need a volunteer to implement it.

Regards,
  Hiro


Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()

2005-08-15 Thread Hiro Yoshioka
From: Hiro Yoshioka <[EMAIL PROTECTED]>
Date: Tue, 16 Aug 2005 08:33:59 +0900

> Thanks.
> 
> filemap_copy_from_user() calls __copy_from_user_inatomic() calls
> __copy_from_user_ll().
> 
> I'll look at the code.

The following is a quick hack of a cache-aware implementation
of __copy_from_user_ll() and __copy_from_user_inatomic():

__copy_from_user_ll_nocache() and __copy_from_user_inatomic_nocache()

filemap_copy_from_user() calls __copy_from_user_inatomic_nocache()
instead of __copy_from_user_inatomic(), which reduces cache misses.

The first column is the cache reference (memory access) sample count
and the third column is the 3rd level cache miss sample count.

The following example shows that L3 cache misses in the copy routine
are reduced from 37410 to 107.
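
To put the two numbers in proportion, here is a small sketch that turns
the oprofile sample counts of the copy routine into a miss ratio and a
reduction factor. It assumes the two event counters are comparable
because both were sampled with the same count, and the helper names
miss_ratio() and reduction_factor() are made up for illustration.

```c
#include <assert.h>

/* Fraction of sampled cache references that also show up as L3 misses. */
static double miss_ratio(unsigned long references, unsigned long l3_misses)
{
    assert(references > 0);
    return (double)l3_misses / (double)references;
}

/* How many times fewer miss samples the second run recorded. */
static double reduction_factor(unsigned long before, unsigned long after)
{
    assert(after > 0);
    return (double)before / (double)after;
}
```

With the figures from the profiles in this message, miss_ratio(120646, 37410)
is about 0.31 for the original __copy_from_user_ll and
miss_ratio(120442, 107) is below 0.001 for __copy_user_zeroing_nocache;
reduction_factor(37410, 107) is roughly 350.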

2.6.12.4 nocache version
Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) with 
a unit mask of 0x3f (multiple flags) count 3000
Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) with 
a unit mask of 0x200 (read 3rd level cache miss) count 3000
samples  %        samples  %        app name  symbol name
120442   6.4106   107      0.5620   vmlinux   __copy_user_zeroing_nocache
80049    4.2606   578      3.0357   vmlinux   journal_add_journal_head
69194    3.6829   154      0.8088   vmlinux   journal_dirty_metadata
67059    3.5692   78       0.4097   vmlinux   __find_get_block
64145    3.4141   32       0.1681   vmlinux   journal_put_journal_head
pattern9-0-cpu4-0-08161154/summary.out

The 2.6.12.4 original version is
Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) with 
a unit mask of 0x3f (multiple flags) count 3000
Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) with 
a unit mask of 0x200 (read 3rd level cache miss) count 3000
samples  %        samples  %        app name  symbol name
120646   7.4680   37410    62.3355  vmlinux   __copy_from_user_ll
79508    4.9215   903      1.5046   vmlinux   _spin_lock
65526    4.0561   873      1.4547   vmlinux   journal_add_journal_head
59296    3.6704   129      0.2149   vmlinux   __find_get_block
58647    3.6302   215      0.3582   vmlinux   journal_dirty_metadata

What do you think?

Hiro

diff -ur linux-2.6.12.4.orig/Makefile linux-2.6.12.4.nocache/Makefile
--- linux-2.6.12.4.orig/Makefile 2005-08-12 14:37:59.0 +0900
+++ linux-2.6.12.4.nocache/Makefile 2005-08-16 10:22:31.0 +0900
@@ -1,7 +1,7 @@
 VERSION = 2
 PATCHLEVEL = 6
 SUBLEVEL = 12
-EXTRAVERSION = .4.orig
+EXTRAVERSION = .4.nocache
 NAME=Woozy Numbat
 
 # *DOCUMENTATION*
diff -ur linux-2.6.12.4.orig/arch/i386/lib/usercopy.c linux-2.6.12.4.nocache/arch/i386/lib/usercopy.c
--- linux-2.6.12.4.orig/arch/i386/lib/usercopy.c 2005-08-05 16:04:37.0 +0900
+++ linux-2.6.12.4.nocache/arch/i386/lib/usercopy.c 2005-08-16 10:49:59.0 +0900
@@ -10,6 +10,7 @@
 #include <linux/highmem.h>
 #include <linux/blkdev.h>
 #include <linux/module.h>
+#include <asm/i387.h>
 #include <asm/uaccess.h>
 #include <asm/mmx.h>
 
@@ -511,6 +512,110 @@
: "memory");\
 } while (0)
 
+/* Non Temporal Hint version of mmx_memcpy */
+/* It is cache aware   */
+/* [EMAIL PROTECTED]   */
+static unsigned long 
+__copy_user_zeroing_nocache(void *to, const void *from, size_t len)
+{
+/* Note! gcc doesn't seem to align stack variables properly, so we
+ * need to make use of unaligned loads and stores.
+ */
+   void *p;
+   int i;
+
+   if (unlikely(in_interrupt())){
+   __copy_user_zeroing(to, from, len);
+   return len;
+   }
+
+   p = to;
+   i = len >> 6; /* len/64 */
+
+kernel_fpu_begin();
+
+   __asm__ __volatile__ (
+   "1: prefetchnta (%0)\n" /* This set is 28 bytes */
+   "   prefetchnta 64(%0)\n"
+   "   prefetchnta 128(%0)\n"
+   "   prefetchnta 192(%0)\n"
+   "   prefetchnta 256(%0)\n"
+   "2:  \n"
+   ".section .fixup, \"ax\"\n"
+   "3: movw $0x1AEB, 1b\n" /* jmp on 26 bytes */
+   "   jmp 2b\n"
+   ".previous\n"
+   ".section __ex_table,\"a\"\n"
+   "   .align 4\n"
+   "   .long 1b, 3b\n"
+   ".previous"
+   : : "r" (from) );
+   
+   for(; i>5; i--)
+   {
+   __asm__ __volatile__ (
+   "1:  prefetchnta 320(%0)\n"
+   "2:  movq (%0), %%mm0\n"
+   "  movq 8(%0), %%mm1\n"
+   "  movq 16(%0), %%mm2\n"
+   "  movq 24(%0), %%mm3\n"
+   "  movntq %%mm0, (%1)\n"
+   "  movntq %%mm1, 8(%1)\n"



Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()

2005-08-15 Thread Hiro Yoshioka
On 8/15/05, Arjan van de Ven <[EMAIL PROTECTED]> wrote:
> > copy_from_user_nocache() is fine.
> >
> > But I don't know where I can use it. (I'm not so
> >  familiar with the linux kernel file system yet.)
> 
> I suspect the few cases where it will make the most difference will be
> in the VFS for the write() system call, and the AIO variants thereof.
> 
> generic_file_buffered_write() will be a good candidate to try first...

Thanks.

filemap_copy_from_user() calls __copy_from_user_inatomic(), which calls
__copy_from_user_ll().

I'll look at the code.

Hiro
--
Hiro Yoshioka
mailto:hyoshiok at miraclelinux.com


Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()

2005-08-15 Thread Hiro Yoshioka
Hi,

I appreciate your suggestion.

On 8/15/05, Arjan van de Ven <[EMAIL PROTECTED]> wrote:
> 
> > Anyway we could not find the cache aware version of __copy_from_user_ll
> > has a big regression yet.
> 
> 
> that is because you spread the cache misses out from one place to all
> over the place, so that no one single point sticks out anymore.
> 
> Do you agree that your copy is less optimal for the case where the
> kernel will (almost) immediately use the data?

Yes, I do.

My server has 8KB of L1 cache (512KB of L2, 2MB of L3).

If you move more than 4KB of data with __copy_from_user_ll(), the
data spills out of the L1 cache but stays in L2 (or L3).
When you move huge data (> 1MB), even the L3 cache will not help you.
(This is known as cache pollution.)
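
For readers who want to try the idea outside the kernel, here is a
minimal user-space sketch of a copy that avoids filling the cache with
the destination data. It is not the patch from this thread: it swaps in
SSE2 intrinsics (_mm_stream_si128) where the patch uses MMX
movq/movntq in inline assembly, it has no fault handling, and the names
copy_nocache_sketch() and copy_nocache_selftest() are hypothetical.
When SSE2 is unavailable or the destination is not 16-byte aligned it
simply falls back to memcpy().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* _mm_loadu_si128, _mm_stream_si128, _mm_sfence */
#endif

/*
 * Copy len bytes from src to dst. Whole 64-byte blocks are written with
 * non-temporal (streaming) stores when SSE2 is available and dst is
 * 16-byte aligned; the tail, or everything otherwise, goes via memcpy().
 */
static void copy_nocache_sketch(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(dst != NULL && src != NULL);

#if defined(__SSE2__)
    if (((uintptr_t)d & 15u) == 0) {
        while (len >= 64) {
            /* Unaligned loads: the source may have any alignment. */
            __m128i a = _mm_loadu_si128((const __m128i *)(s + 0));
            __m128i b = _mm_loadu_si128((const __m128i *)(s + 16));
            __m128i c = _mm_loadu_si128((const __m128i *)(s + 32));
            __m128i e = _mm_loadu_si128((const __m128i *)(s + 48));

            /* Streaming stores carry a non-temporal hint. */
            _mm_stream_si128((__m128i *)(d + 0), a);
            _mm_stream_si128((__m128i *)(d + 16), b);
            _mm_stream_si128((__m128i *)(d + 32), c);
            _mm_stream_si128((__m128i *)(d + 48), e);

            s += 64;
            d += 64;
            len -= 64;
        }
        _mm_sfence();   /* order the streaming stores before later stores */
    }
#endif
    memcpy(d, s, len);
}

/* Returns 0 when a 200-byte copy round-trips correctly. */
static int copy_nocache_selftest(void)
{
    static _Alignas(16) unsigned char dst[200];
    unsigned char src[200];
    size_t i;

    for (i = 0; i < sizeof src; i++)
        src[i] = (unsigned char)(i * 7u);
    copy_nocache_sketch(dst, src, sizeof src);
    return memcmp(dst, src, sizeof src);
}
```

Whether this is faster than a plain memcpy() depends on whether the
destination is read again soon, which is the trade-off discussed in this
thread.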

> I agree that your copy is really nice for places where the kernel will
> NOT use the data in the cpu, say for big write() system calls.
> 
> My suggestion is to realize there are basically 2 different use cases,
> and that in the code the first one is very common, while in your
> profiles the second one is very common. Based on that I suggest to make
> a special copy_from_user_nocache() API for the cases where the kernel
> will not use the data (and ignore software raid5 here) and use your
> excellent version for that API, while leaving the code for the cases
> where the kernel WILL use the data alone. Code wise the "will use" case
> is the vast majority, so only changing the few places that know they
> don't use the data will be very efficient, and will give immediate big
> improvement in your profile data, since those few places tend to get
> used a lot in the cases you benchmark.

copy_from_user_nocache() is fine.

But I don't know where I can use it. (I'm not so
familiar with the Linux kernel file system yet.)

Regards,
  Hiro


Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()

2005-08-15 Thread Hiro Yoshioka
Hi,

From: Arjan van de Ven <[EMAIL PROTECTED]>
Subject: Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
Date: Sun, 14 Aug 2005 12:35:43 +0200
Message-ID: <[EMAIL PROTECTED]>

> On Sun, 2005-08-14 at 19:22 +0900, Hiro Yoshioka wrote:
> > Thanks for your comments.
> > 
> > On 8/14/05, Arjan van de Ven <[EMAIL PROTECTED]> wrote:
> > > On Sun, 2005-08-14 at 18:16 +0900, Hiro Yoshioka wrote:
> > > > Hi,
> > > >
> > > > The following is a patch to reduce a cache pollution
> > > > of __copy_from_user_ll().
> > > >
> > > > When I run simple iozone benchmark to find a performance bottleneck of
> > > > the linux kernel, I found that __copy_from_user_ll() spent CPU cycle
> > > > most and it did many cache misses.
> > > 
> > > 
> > > however... you copy something from userspace... aren't you going to USE
> > > it? The non-termoral versions actually throw the data out of the
> > > cache... so while this part might be nice, you pay BIG elsewhere
> > 
> > The oprofile data does not give an evidence that we pay BIG elsewhere.
> 
> 
> the problem is that the pay elsewhere is far more spread out, but not
> less. At least generally
> 
> I can see the point of a copy_from_user_nocache() or something, for
> those cases where we *know* we are not going to use the copied data in
> the cpu (but say, only do DMA).
> But that should be explicit, not implicit, since the general case will
> be that the kernel WILL use the data. And if that's the case your change
> is a loss (just harder to see because the cost is spread out)

I understand that iozone is not a good benchmark and does not represent
any useful application, so I did a kernel build as a simple benchmark.

What I did is
cd /test/f1
tar xjf ${baseDir}/src/linux-2.6.12.4.tar.bz2
cd linux-2.6.12.4
cp -p ${baseDir}/src/config .config
make oldconfig
time make -j $CPUS

The following is the top 5 by CPU cycles:
Counted GLOBAL_POWER_EVENTS events (time during which processor is not
stopped) with a unit mask of 0x01 (mandatory) count 10

samples  %        app name          symbol name
7347544  72.8296  cc1               (no symbols)
532307   5.2763   libbz2.so.1.0.2   (no symbols)
241853   2.3973   vmlinux           buffered_rmqueue
128552   1.2742   libc-2.3.4.so     _int_malloc
107784   1.0684   vmlinux           page_fault
...
10749    0.1065   vmlinux           __copy_from_user_ll
pattern12-0-cpu4-0-08150920/summary.out

Since __copy_from_user_ll is not a hot spot here, we didn't see any big
performance difference. (The numbers are times in seconds for 5 runs.)

original 2.6.12.4        real    user     system
No profiling             532.27  1797.02  194.9
BSQ 0x200+0x3f           620.15  2094.21  212.38
GLOBAL_POWER_EVENTS:10:  586.01  1984.92  215.97

cache aware 2.6.12.4     real    user     system
No profiling             526.65  1792.22  190.05
BSQ 0x200+0x3f           615.51  2090.74  206.58
GLOBAL_POWER_EVENTS:10:  587.69  1978.66  209.18

Now the top 5 by memory access (2.6.12.4):
Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus
unit) with a unit mask of 0x3f (multiple flags) count 3000
Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus
unit) with a unit mask of 0x200 (read 3rd level cache miss) count 3000
samples   %        samples  %        app name          symbol name
11439689  82.2135  33906    27.9328  cc1               (no symbols)
277177    1.9920   347      0.2859   libc-2.3.4.so     _int_malloc
229593    1.6500   12946    10.6653  libbz2.so.1.0.2   (no symbols)
84348     0.6062   116      0.0956   libc-2.3.4.so     _int_free
83653     0.6012   438      0.3608   libc-2.3.4.so     calloc
...
8527      0.0613   1648     1.3577   vmlinux           __copy_from_user_ll

Top 5 by cache miss:
samples  %        app name          symbol name
33906    27.9328  cc1               (no symbols)
30849    25.4144  vmlinux           buffered_rmqueue
12946    10.6653  libbz2.so.1.0.2   (no symbols)
9178     7.5611   vmlinux           __copy_to_user_ll
2934     2.4171   oprofiled         (no symbols)
...
1648     1.3577   vmlinux           __copy_from_user_ll
pattern12-0-cpu4-0-08150917

Cache aware 2.6.12.4, top 5 by memory access:
samples   %        samples  %        app name          symbol name
11448487  82.8100  32786    28.1051  cc1               (no symbols)
276812    2.0023   256      0.2195   libc-2.3.4.so     _int_malloc
230177    1.6649   12371    10.6048  libbz2.so.1.0.2   (no symbols)
84485     0.6111   120      0.1029   libc-2.3.4.so     _int_free
84043     0.6079   473      0.4055   libc-2.3.4.so     calloc
...
18282     0.1322   9060     7.7665   vmlinux           __copy_from_user_ll

Top 5 by cache miss:
samples  %        app name          symbol name
32786    28.1051  cc1               (no symbols)
31175    26.7241  vmlinux           buffered_rmqueue
12371    10.6048  libbz2.so.1.0.2   (no symbols)
9060     7.7665   vmlinux           __copy_from_user_ll
2801     2.4011   oprofiled         (no symbols)
...


Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()

2005-08-15 Thread Hiro Yoshioka
From: Hiro Yoshioka [EMAIL PROTECTED]
Date: Tue, 16 Aug 2005 08:33:59 +0900

 Thanks.
 
 filemap_copy_from_user() calls __copy_from_user_inatomic() calls
 __copy_from_user_ll().
 
 I'll look at the code.

The following is a quick hack of cache aware implementation
of __copy_from_user_ll() and __copy_from_user_inatomic()

__copy_from_user_ll_nocache() and __copy_from_user_inatomic_nocache()

filemap_copy_from_user() calles __copy_from_user_inatomic_nocache()
instead of __copy_from_user_inatomic() and reduced cashe miss.

The first column is the cache reference (memory access) and the
third column is the 3rd level cache miss.

The following example shows the L3 cache miss is reduced from 37410 to 107.

2.6.12.4 nocache version
Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) with 
a unit mask of 0x3f (multiple flags) count 3000
Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) with 
a unit mask of 0x200 (read 3rd level cache miss) count 3000
samples  %samples  % app name   symbol name
1204426.4106  1070.5620  vmlinux__copy_user_zeroing_nocache
80049 4.2606  5783.0357  vmlinuxjournal_add_journal_head
69194 3.6829  1540.8088  vmlinuxjournal_dirty_metadata
67059 3.5692  78 0.4097  vmlinux__find_get_block
64145 3.4141  32 0.1681  vmlinuxjournal_put_journal_head
pattern9-0-cpu4-0-08161154/summary.out

The 2.6.12.4 original version is
Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) with 
a unit mask of 0x3f (multiple flags) count 3000
Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) with 
a unit mask of 0x200 (read 3rd level cache miss) count 3000
samples  %        samples  %        app name  symbol name
120646   7.4680   37410    62.3355  vmlinux   __copy_from_user_ll
79508    4.9215   903      1.5046   vmlinux   _spin_lock
65526    4.0561   873      1.4547   vmlinux   journal_add_journal_head
59296    3.6704   129      0.2149   vmlinux   __find_get_block
58647    3.6302   215      0.3582   vmlinux   journal_dirty_metadata

What do you think?

Hiro

diff -ur linux-2.6.12.4.orig/Makefile linux-2.6.12.4.nocache/Makefile
--- linux-2.6.12.4.orig/Makefile 2005-08-12 14:37:59.0 +0900
+++ linux-2.6.12.4.nocache/Makefile 2005-08-16 10:22:31.0 +0900
@@ -1,7 +1,7 @@
 VERSION = 2
 PATCHLEVEL = 6
 SUBLEVEL = 12
-EXTRAVERSION = .4.orig
+EXTRAVERSION = .4.nocache
 NAME=Woozy Numbat
 
 # *DOCUMENTATION*
diff -ur linux-2.6.12.4.orig/arch/i386/lib/usercopy.c linux-2.6.12.4.nocache/arch/i386/lib/usercopy.c
--- linux-2.6.12.4.orig/arch/i386/lib/usercopy.c 2005-08-05 16:04:37.0 +0900
+++ linux-2.6.12.4.nocache/arch/i386/lib/usercopy.c 2005-08-16 10:49:59.0 +0900
@@ -10,6 +10,7 @@
 #include <linux/highmem.h>
 #include <linux/blkdev.h>
 #include <linux/module.h>
+#include <asm/i387.h>
 #include <asm/uaccess.h>
 #include <asm/mmx.h>
 
@@ -511,6 +512,110 @@
: "memory");\
 } while (0)
 
+/* Non Temporal Hint version of mmx_memcpy */
+/* It is cache aware   */
+/* [EMAIL PROTECTED]   */
+static unsigned long 
+__copy_user_zeroing_nocache(void *to, const void *from, size_t len)
+{
+/* Note! gcc doesn't seem to align stack variables properly, so we
+ * need to make use of unaligned loads and stores.
+ */
+   void *p;
+   int i;
+
+   if (unlikely(in_interrupt())){
+   __copy_user_zeroing(to, from, len);
+   return len;
+   }
+
+   p = to;
+   i = len >> 6; /* len/64 */
+
+kernel_fpu_begin();
+
+   __asm__ __volatile__ (
+   "1: prefetchnta (%0)\n" /* This set is 28 bytes */
+   "   prefetchnta 64(%0)\n"
+   "   prefetchnta 128(%0)\n"
+   "   prefetchnta 192(%0)\n"
+   "   prefetchnta 256(%0)\n"
+   "2:  \n"
+   ".section .fixup, \"ax\"\n"
+   "3: movw $0x1AEB, 1b\n" /* jmp on 26 bytes */
+   "   jmp 2b\n"
+   ".previous\n"
+   ".section __ex_table,\"a\"\n"
+   "   .align 4\n"
+   "   .long 1b, 3b\n"
+   ".previous"
+   : : "r" (from) );
+   
+   for(; i>5; i--)
+   {
+   __asm__ __volatile__ (
+   "1:  prefetchnta 320(%0)\n"
+   "2:  movq (%0), %%mm0\n"
+   "  movq 8(%0), %%mm1\n"
+   "  movq 16(%0), %%mm2\n"
+   "  movq 24(%0), %%mm3\n"
+   "  movntq %%mm0, (%1)\n"
+   "  movntq %%mm1, 8(%1)\n"
+   "  movntq %%mm2, 16(%1)\n"
+   "  movntq %%mm3, 24(%1)\n"
+   "  movq 32(%0), %%mm0\n"
+   "  movq 40(%0), %%mm1\n"
+   "  movq 48(%0), %%mm2\n"
+   "  movq 56(%0), %%mm3\n"
+ movntq %%mm0

Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()

2005-08-15 Thread Hiro Yoshioka
Takahashi san,

I appreciate your comments.

> Hi,
> 
> BTW, what are you going to do with the page-faults which may happen
> during __copy_user_zeroing_nocache()? The current process may be blocked
> in the handler for a while and get FPU registers polluted.
> kernel_fpu_begin() won't help the case. This is another issue, though.

My code does nothing about it.

I need a volunteer to implement it.

Regards,
  Hiro


Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()

2005-08-14 Thread Hiro Yoshioka
Thanks for your comments.

On 8/14/05, Arjan van de Ven <[EMAIL PROTECTED]> wrote:
> On Sun, 2005-08-14 at 18:16 +0900, Hiro Yoshioka wrote:
> > Hi,
> >
> > The following is a patch to reduce a cache pollution
> > of __copy_from_user_ll().
> >
> > When I run simple iozone benchmark to find a performance bottleneck of
> > the linux kernel, I found that __copy_from_user_ll() spent CPU cycle
> > most and it did many cache misses.
> 
> 
> however... you copy something from userspace... aren't you going to USE
> it? The non-temporal versions actually throw the data out of the
> cache... so while this part might be nice, you pay BIG elsewhere

The oprofile data gives no evidence that we pay BIG elsewhere.

For example, the top 5 cache misses of the original 2.6.12.4 are the following:

37017    63.4603  vmlinux   __copy_from_user_ll
1049     1.7984   vmlinux   _spin_lock_irqsave
940      1.6115   vmlinux   blk_rq_map_sg
896      1.5361   vmlinux   generic_file_buffered_write
885      1.5172   vmlinux   _spin_lock
pattern9-0-cpu4-0-08141702

The top 5 cache misses of the cache-aware version are:

899      5.7305   vmlinux   blk_rq_map_sg
569      3.6270   vmlinux   journal_commit_transaction
531      3.3848   vmlinux   radix_tree_delete
514      3.2764   vmlinux   journal_add_journal_head
505      3.2190   vmlinux   release_pages
...
89       0.5673   vmlinux   _mmx_memcpy_nt
pattern9-0-cpu4-0-08141625

What do you think?

Regards,
  Hiro


[RFC] [PATCH] cache pollution aware __copy_from_user_ll()

2005-08-14 Thread Hiro Yoshioka
Hi,

The following is a patch to reduce the cache pollution
of __copy_from_user_ll().

When I ran a simple iozone benchmark to find a performance bottleneck in
the Linux kernel, I found that __copy_from_user_ll() consumed the most
CPU cycles and caused many cache misses.

The following is profiled by oprofile.

Top 5 CPU cycle
CPU: P4 / Xeon, speed 2200.91 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not
stopped) with a unit mask of 0x01 (mandatory) count 10
samples  %        app name symbol name
281538   15.2083  vmlinux  __copy_from_user_ll
81069 4.3792  vmlinux  _spin_lock
75523 4.0796  vmlinux  journal_add_journal_head
63674 3.4396  vmlinux  do_get_write_access
52634 2.8432  vmlinux  journal_put_journal_head
(pattern9-0-cpu4-0-08141700/summary.out)

Top 5 Memory Access and Cache miss
CPU: P4 / Xeon, speed 2200.91 MHz (estimated)
Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus
unit) with a unit mask of 0x3f (multiple flags) count 3000
Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus
unit) with a unit mask of 0x200 (read 3rd level cache miss) count 3000
samples  %        samples  %        app name  symbol name
120801   7.4379   37017    63.4603  vmlinux   __copy_from_user_ll
84139    5.1806   885      1.5172   vmlinux   _spin_lock
66027    4.0654   656      1.1246   vmlinux   journal_add_journal_head
60400    3.7189   250      0.4286   vmlinux   __find_get_block
60032    3.6963   120      0.2057   vmlinux   journal_dirty_metadata

__copy_from_user_ll accounted for 63.4603% of the L3 cache misses, though
it accounted for only 7.4379% of the memory accesses.

In order to reduce the cache misses in __copy_from_user_ll, I made
the following patch and confirmed the reduction.

Top 5 CPU cycle
CPU: P4 / Xeon, speed 2200.93 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not
stopped) with a unit mask of 0x01 (mandatory) count 10
samples  %        app name symbol name
120717    8.3454  vmlinux  _mmx_memcpy_nt
65955 4.5596  vmlinux  do_get_write_access
56088 3.8775  vmlinux  journal_put_journal_head
52550 3.6329  vmlinux  journal_dirty_metadata
38886 2.6883  vmlinux  journal_add_journal_head
pattern9-0-cpu4-0-08141627/summary.out

_mmx_memcpy_nt is the new function, which is called from
__copy_from_user_ll. It spent only 42.88% of the CPU cycles of the
original implementation (120717/281538 == 42.88%).

Top 5 Memory Access
CPU: P4 / Xeon, speed 2200.93 MHz (estimated)
Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus
unit) with a unit mask of 0x3f (multiple flags) count 3000
Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus
unit) with a unit mask of 0x200 (read 3rd level cache miss) count 3000
samples  %        samples  %        app name  symbol name
90918    6.3079   89       0.5673   vmlinux   _mmx_memcpy_nt
83654    5.8039   177      1.1283   vmlinux   journal_dirty_metadata
57836    4.0127   348      2.2183   vmlinux   journal_put_journal_head
48236    3.3466   165      1.0518   vmlinux   do_get_write_access
44546    3.0906   21       0.1339   vmlinux   __getblk

The cache misses were reduced from 37017 (63.4603%) to 89 (0.5673%),
which is 0.24% of the original implementation.

The actual elapsed times for five runs were 229.76 (sec) for the original
and 222.94 (sec) with the patch (229.76/222.94 = 3.06% gain).

iozone -CMR -i 0 -+n -+u -s 8000MB -t 4 
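
The ratios quoted above can be rechecked with a few lines of C. This is
only a small sketch; percent_of() and gain_percent() are hypothetical
helper names, and the inputs are the sample counts and elapsed times
reported in this mail.

```c
#include <assert.h>

/* Hypothetical helpers for rechecking the figures quoted in this mail. */

/* Return part as a percentage of whole. */
static double percent_of(double part, double whole)
{
    assert(whole != 0.0);
    return 100.0 * part / whole;
}

/* Return the gain in percent when elapsed time drops from before to after. */
static double gain_percent(double before, double after)
{
    assert(after != 0.0);
    return 100.0 * (before / after - 1.0);
}
```

With these, percent_of(120717, 281538) is about 42.88 (CPU cycle samples),
percent_of(89, 37017) is about 0.24 (L3 cache miss samples), and
gain_percent(229.76, 222.94) is about 3.06 (elapsed time).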

What do you think?

--- linux-2.6.12.4.orig/arch/i386/lib/usercopy.c 2005-08-05 16:04:37.0 +0900
+++ linux-2.6.12.4/arch/i386/lib/usercopy.c 2005-08-12 13:18:14.106916200 +0900
@@ -10,6 +10,7 @@
 #include <linux/highmem.h>
 #include <linux/blkdev.h>
 #include <linux/module.h>
+#include <asm/i387.h>
 #include <asm/uaccess.h>
 #include <asm/mmx.h>
 
@@ -511,6 +512,108 @@
: "memory");\
 } while (0)
 
+/* Non Temporal Hint version of mmx_memcpy */
+/* It is cache aware   */
+/* [EMAIL PROTECTED]   */
+static unsigned long _mmx_memcpy_nt(void *to, const void *from, size_t len)
+{
+/* Note! gcc doesn't seem to align stack variables properly, so we
+ * need to make use of unaligned loads and stores.
+ */
+   void *p;
+   int i;
+
+   if (unlikely(in_interrupt())){
+   __copy_user_zeroing(to, from, len);
+   return len;
+   }
+
+   p = to;
+   i = len >> 6; /* len/64 */
+
+kernel_fpu_begin();
+
+   __asm__ __volatile__ (
+   "1: prefetchnta (%0)\n" /* This set is 28 bytes */
+   "   prefetchnta 64(%0)\n"
+   "   prefetchnta 128(%0)\n"
+
