Re: [git pull] Please pull powerpc.git merge branch (updated)

2012-08-16 Thread Kumar Gala
Ben,

Poke.  :)

- k

On Aug 10, 2012, at 8:07 AM, Kumar Gala wrote:

> Ben,
> 
> Two updates from last week (one dts bug fix, one minor defconfig update)
> 
> - k
> 
> The following changes since commit 0d7614f09c1ebdbaa1599a5aba7593f147bf96ee:
> 
>  Linux 3.6-rc1 (2012-08-02 16:38:10 -0700)
> 
> are available in the git repository at:
> 
>  git://git.kernel.org/pub/scm/linux/kernel/git/galak/powerpc.git merge
> 
> for you to fetch changes up to 09a3017a585eb8567a7de15b426bb1dfb548bf0f:
> 
>  powerpc/p4080ds: dts - add usb controller version info and port0 (2012-08-10 07:47:02 -0500)
> 
> 
> Jia Hongtao (1):
>  powerpc/fsl-pci: Only scan PCI bus if configured as a host
> 
> Shengzhou Liu (1):
>  powerpc/p4080ds: dts - add usb controller version info and port0
> 
> Zhao Chenhui (1):
>  powerpc/85xx: mpc85xx_defconfig - add VIA PATA support for MPC85xxCDS
> 
> arch/powerpc/boot/dts/fsl/p4080si-post.dtsi |    7 +++++++
> arch/powerpc/configs/mpc85xx_defconfig      |    1 +
> arch/powerpc/sysdev/fsl_pci.c               |   13 ++++++++-----
> 3 files changed, 16 insertions(+), 5 deletions(-)
> 
> ___
> Linuxppc-dev mailing list
> Linuxppc-dev@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev



Re: [PATCH v3 2/2] powerpc: Uprobes port to powerpc

2012-08-16 Thread Ananth N Mavinakayanahalli
On Thu, Aug 16, 2012 at 05:21:12PM +0200, Oleg Nesterov wrote:

...

> > So, the arch agnostic code itself
> > takes care of this case...
> 
> Yes. I forgot about install_breakpoint()->is_swbp_insn() check which
> returns -ENOTSUPP, somehow I thought arch_uprobe_analyze_insn() does
> this.
> 
> > or am I missing something?
> 
> No, it is me.
> 
> > However, I see that we need a powerpc specific is_swbp_insn()
> > implementation since we will have to take care of all the trap variants.
> 
> Hmm, I am not sure. is_swbp_insn(insn), as it is used in the arch agnostic
> code, should only return true if insn == UPROBE_SWBP_INSN (just in case,
> this logic needs more fixes but this is offtopic).

I think it does...

> If powerpc has another insn(s) which can trigger powerpc's do_int3()
> counterpart, they should be rejected by arch_uprobe_analyze_insn().
> I think.

The insn that gets passed to arch_uprobe_analyze_insn() is copy_insn()'s
version, which is the file copy of the instruction. We should also take
care of the in-memory copy, in case gdb had inserted a breakpoint at the
same location, right? Updating is_swbp_insn() per-arch where needed will
take care of both cases, 'cos it gets called before
arch_uprobe_analyze_insn() too.
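As a concrete illustration of the per-arch override being proposed, here is a
userspace sketch of what a powerpc is_swbp_insn() might check. The opcode
values come from the Power ISA (twi is primary opcode 3, tw is primary opcode
31 with extended opcode 4); the function body is an assumption about the
eventual patch, not a quote from it:

```c
#include <stdbool.h>
#include <stdint.h>

#define UPROBE_SWBP_INSN 0x7fe00008u   /* unconditional "trap" on powerpc */

/* "twi" is primary opcode 3; "tw" is primary opcode 31, extended opcode 4.
 * Any of these variants can raise the trap exception that uprobes hooks. */
static bool is_trap_variant(uint32_t insn)
{
	uint32_t op = insn >> 26;
	return op == 3 || (op == 31 && ((insn >> 1) & 0x3ff) == 4);
}

/* A powerpc-specific is_swbp_insn() would return true for every trap
 * variant, so a gdb-inserted breakpoint in the in-memory copy is also
 * recognized, per the argument above. */
bool is_swbp_insn(uint32_t insn)
{
	return is_trap_variant(insn);
}
```
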

> > I will need to update the patches based on changes being made by Oleg
> > and Sebastien for the single-step issues.
> 
> Perhaps you can do this in a separate change?
> 
> We need some (simple) changes in the arch agnostic code first, they
> should not break powerpc. These changes are still under discussion.
> Once we have "__weak  arch_uprobe_step*" you can reimplement these
> hooks and fix the problems with single-stepping.

OK. Agreed.

Ananth



Re: powerpc/perf: hw breakpoints return ENOSPC

2012-08-16 Thread Michael Ellerman
On Thu, 2012-08-16 at 16:15 +0200, Peter Zijlstra wrote:
> On Fri, 2012-08-17 at 00:02 +1000, Michael Ellerman wrote:
> > You do want to guarantee that the task will always be subject to the
> > breakpoint, even if it moves cpus. So is there any way to guarantee that
> > other than reserving a breakpoint slot on every cpu ahead of time? 
> 
> That's not how regular perf works.. regular perf can overload hw
> resources at will and stuff is strictly per-cpu.
..
> For regular (!pinned) events, we'll RR the created events on the
> available hardware resources.

Yeah I know, but that isn't really the semantics you want for a
breakpoint. You don't want to sometimes have the breakpoint active and
sometimes not, it needs to be active at all times when the task is
running.

At the very least you want it to behave like a pinned event, ie. if it
can't be scheduled you get notified and can tell the user.
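To make the pinned semantics concrete, here is a hedged userspace sketch of
requesting a pinned hardware watchpoint through perf_event_open(2) (standard
perf ABI; whether the slot can actually be granted is exactly the constraint
problem this thread is about):

```c
#include <linux/perf_event.h>
#include <linux/hw_breakpoint.h>
#include <sys/syscall.h>
#include <string.h>
#include <unistd.h>

/* Ask for a write watchpoint pinned to the calling task: with attr.pinned
 * set, perf must keep the event scheduled whenever the task runs, and it
 * reports an error state instead of silently rotating the event out. */
static int open_pinned_watchpoint(void *addr)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_BREAKPOINT;
	attr.bp_type = HW_BREAKPOINT_W;
	attr.bp_addr = (unsigned long)addr;
	attr.bp_len = HW_BREAKPOINT_LEN_4;
	attr.pinned = 1;

	/* pid = 0: this task; cpu = -1: follow the task across CPUs */
	return syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}
```

The call may legitimately fail with ENOSPC (no free slot, the POWER7 case
above) or EACCES; a pinned event that later loses its slot shows up in error
state rather than being silently multiplexed.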

> HWBP does things completely different and reserves a slot over all CPUs
> for everything, thus stuff completely falls apart.

So it would seem :)

I guess my point was that reserving a slot on each cpu seems like a
reasonable way of guaranteeing that wherever the task goes we will be
able to install the breakpoint.

But obviously we need some way to make it play nice with perf.

cheers





Re: powerpc/perf: hw breakpoints return ENOSPC

2012-08-16 Thread Michael Neuling
> > > > On this second syscall, fetch_bp_busy_slots() sets slots.pinned to be 1,
> > > > despite there being no breakpoint on this CPU.  This is because the call
> > > > to task_bp_pinned() checks all CPUs, rather than just the current CPU.
> > > > POWER7 only has one hardware breakpoint per CPU (ie. HBP_NUM=1), so we
> > > > return ENOSPC.
> > > 
> > > I think this comes from the ptrace legacy, we register a breakpoint on
> > > all cpus because when we migrate a task it cannot fail to migrate the
> > > breakpoint.
> > > 
> > > Its one of the things I hate most about the hwbp stuff as it relates to
> > > perf.
> > > 
> > > Frederic knows more...
> > 
> > Maybe I should wait for Frederic to respond but I'm not sure I
> > understand what you're saying.
> > 
> > I can see how using ptrace hw breakpoints and perf hw breakpoints at the
> > same time could be a problem, but I'm not sure how this would stop it.
> 
> ptrace uses perf for hwbp support so we're stuck with all kinds of
> stupid ptrace constraints.. or somesuch.

OK

> > Are you saying that we need to keep at least 1 slot free at all times,
> > so that we can use it for ptrace?
> 
> No, I'm saying perf-hwbp is weird because of ptrace, maybe the ptrace
> weirdness shouldn't live in perf-hwpb but in the ptrace-perf glue
> however..

OK.

> > Is "perf record -e mem:0x1000 true" ever going to be able to work on
> > POWER7 with only one hw breakpoint resource per CPU?  
> 
> I think it should work... but I'm fairly sure it currently doesn't
> because of how things are done. 'perf record -ie mem:0x100... true'
> might just work.

Adding -i doesn't help. 

Mikey


Re: therm_pm72 units, interface

2012-08-16 Thread Benjamin Herrenschmidt

> If you have more things to print/offer via sysfs, I'm all for it.
> 
> The XsG5 really has (by looking into the casing): 1 PCI Fan,
> 6 center fans, 1 PSU intake and 1 PSU outblow fan (this last one
> seems rather slow-turning, but maybe that's normal).
> It is not quite clear which is which in the sysfs display.

The cpu intake & exhaust are the same, they are handled by groups of 3
ie, cpu0_* is the 3 fans on CPU 0, cpu1_* is the 3 fans on CPU 1.

Backside fan is supposed to blow on the U3 chip, I don't remember where
it's located, and slots fan is the PCI one afaik. The PSU's own fan
isn't under our direct control.

> What I did figure out: at the PROM, fans run at what seems
> to be full speed (some 8000-9000 rpm?). Once Linux and therm_pm72
> are loaded, the fans settle down towards 4000 rpm, and if the machine
> has warmed up, that is then when it powers off. (The kernel is indeed
> 3.4. I now need to figure out how to place a new kernel on it without
> it powering off in between.)

You can try netbooting... OF netboot is limited to 4M sized zImages
which can be a bit tough nowadays, but modern yaboot can netboot larger
files. Another option is USB sticks.

> >> $ cd /sys/devices/temperature; grep '' *;
> >> backside_fan_pwm:32
> >> backside_temperature:54.000
> >> cpu0_current:34.423
> >> cpu0_exhaust_fan_rpm:5340
> >> cpu0_intake_fan_rpm:5340
> >> cpu0_temperature:72.889
> >> cpu0_voltage:1.252
> >> cpu1_current:34.179
> >> cpu1_exhaust_fan_rpm:4584
> >> cpu1_intake_fan_rpm:4584
> >> cpu1_temperature:68.526
> >> cpu1_voltage:1.259
> >> dimms_temperature:53.000
> >> grep: driver: Er en filkatalog
> >> modalias:platform:temperature
> >> grep: power: Er en filkatalog
> >> slots_fan_pwm:20
> >> slots_temperature:38.500
> >> grep: subsystem: Er en filkatalog
> >> uevent:DRIVER=temperature
> >> uevent:OF_NAME=fan
> >> uevent:OF_FULLNAME=/u3@0,f800/i2c@f8001000/fan@15e
> >> uevent:OF_TYPE=fcu
> >> uevent:OF_COMPATIBLE_0=fcu
> >> uevent:OF_COMPATIBLE_N=1
> >> uevent:MODALIAS=of:NfanTfcuCfcu

Cheers,
Ben.




Re: [PATCH] scsi/ibmvscsi: /sys/class/scsi_host/hostX/config doesn't show any information

2012-08-16 Thread Robert Jennings
On Sun, Jul 29, 2012 at 8:33 PM, Benjamin Herrenschmidt
 wrote:
> On Wed, 2012-07-18 at 18:49 +0200, o...@aepfle.de wrote:
>> From: Linda Xie 
>>
>> Expected result:
>> It should show something like this:
>> x1521p4:~ # cat /sys/class/scsi_host/host1/config
>> PARTITIONNAME='x1521p4'
>> NWSDNAME='X1521P4'
>> HOSTNAME='X1521P4'
>> DOMAINNAME='RCHLAND.IBM.COM'
>> NAMESERVERS='9.10.244.100 9.10.244.200'
>>
>> Actual result:
>> x1521p4:~ # cat /sys/class/scsi_host/host0/config
>> x1521p4:~ #
>>
>> This patch changes the size of the buffer used for transferring config
>> data to 4K. It was tested against 2.6.19-rc2 tree.
>>
>> Reported by IBM during SLES11 beta testing:
>
> So this patch just seems to blindly replace all occurrences of PAGE_SIZE
> with HOST_PAGE_SIZE which is utterly wrong. Only one of those needs to
> be changed, the one passed to ibmvscsi_do_host_config() which is what's
> visible to the server, all the rest is just sysfs attributes and should
> remain as-is.
>
> Additionally (not even mentioning that there is no explanation as to
> what the real problem is anywhere in the changeset) I don't like the
> fix. The root of the problem is that the MAD header has a 16-bit length
> field, so writing 0x10000 (64K PAGE_SIZE) into it doesn't quite work.
>
> So in addition to a better comment, I would suggest a fix more like
> this:
>
> scsi/ibmvscsi: Fix host config length field overflow
>
> The length field in the host config packet is only 16 bits long, so
> passing it 0x10000 (64K, which is our standard PAGE_SIZE) doesn't
> work and results in an empty config from the server.
>
> Signed-off-by: Benjamin Herrenschmidt 
> CC: 

Acked-by: Robert Jennings 

Tested with an IBM i host and confirmed the fix.
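The wrap-around Ben describes is easy to demonstrate in isolation (a
standalone illustration, not code from the driver):

```c
#include <stdint.h>

/* Storing a 64K PAGE_SIZE length in the 16-bit MAD header field wraps to
 * zero, so the server sees an empty buffer; clamping first avoids that. */
static uint16_t pack_length(uint32_t length)
{
	return (uint16_t)length;                    /* 0x10000 wraps to 0 */
}

static uint32_t clamp_length(uint32_t length)
{
	return length > 0xffff ? 0xffff : length;   /* the fix's approach */
}
```
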

> ---
>
> diff --git a/drivers/scsi/ibmvscsi/ibmvscsi.c b/drivers/scsi/ibmvscsi/ibmvscsi.c
> index 3a6c474..337e8b3 100644
> --- a/drivers/scsi/ibmvscsi/ibmvscsi.c
> +++ b/drivers/scsi/ibmvscsi/ibmvscsi.c
> @@ -1541,6 +1541,9 @@ static int ibmvscsi_do_host_config(struct ibmvscsi_host_data *hostdata,
>
> host_config = &evt_struct->iu.mad.host_config;
>
> +   /* The transport length field is only 16-bit */
> +   length = min(0xffff, length);
> +
> /* Set up a lun reset SRP command */
> memset(host_config, 0x00, sizeof(*host_config));
> host_config->common.type = VIOSRP_HOST_CONFIG_TYPE;
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 6/7] mm: make clear_huge_page cache clear only around the fault address

2012-08-16 Thread Andrea Arcangeli
On Thu, Aug 16, 2012 at 09:37:25PM +0300, Kirill A. Shutemov wrote:
> On Thu, Aug 16, 2012 at 08:29:44PM +0200, Andrea Arcangeli wrote:
> > On Thu, Aug 16, 2012 at 07:43:56PM +0300, Kirill A. Shutemov wrote:
> > > Hm.. I think with static_key we can avoid cache overhead here. I'll try.
> > 
> > Could you elaborate on the static_key? Is it some sort of self
> > modifying code?
> 
> Runtime code patching. See Documentation/static-keys.txt. We can patch it
> on sysctl.

I guessed it had to be patching the code, thanks for the
pointer. It looks like a perfect fit for this one, agreed.
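For readers unfamiliar with the mechanism, a rough kernel-code sketch of how
such a static_key-gated path could look (hypothetical names; the eventual
patch may differ):

```c
/* Kernel-code sketch, not compilable in userspace; key and helper names
 * are made up for illustration. */
#include <linux/static_key.h>

static struct static_key clear_nocache_key = STATIC_KEY_INIT_FALSE;

static void clear_one_subpage(struct page *p, unsigned long vaddr)
{
	/* static_key_false() compiles to a nop until the key is enabled,
	 * so the disabled case costs no extra cacheline for a sysctl flag */
	if (static_key_false(&clear_nocache_key))
		clear_user_highpage_nocache(p, vaddr);
	else
		clear_user_highpage(p, vaddr);
}

/* The sysctl handler would patch the branch at runtime:
 *   static_key_slow_inc(&clear_nocache_key);   enable
 *   static_key_slow_dec(&clear_nocache_key);   disable
 */
```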


Re: [PATCH v3 6/7] mm: make clear_huge_page cache clear only around the fault address

2012-08-16 Thread Kirill A. Shutemov
On Thu, Aug 16, 2012 at 08:29:44PM +0200, Andrea Arcangeli wrote:
> On Thu, Aug 16, 2012 at 07:43:56PM +0300, Kirill A. Shutemov wrote:
> > Hm.. I think with static_key we can avoid cache overhead here. I'll try.
> 
> Could you elaborate on the static_key? Is it some sort of self
> modifying code?

Runtime code patching. See Documentation/static-keys.txt. We can patch it
on sysctl.

> 
> > Thanks for the review. Could you take a look at the huge zero page patchset? ;)
> 
> I've noticed that too, nice :). I'm checking some detail on the
> wrprotect fault behavior but I'll comment there.
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majord...@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: em...@kvack.org

-- 
 Kirill A. Shutemov


Re: [PATCH v3 6/7] mm: make clear_huge_page cache clear only around the fault address

2012-08-16 Thread Andrea Arcangeli
On Thu, Aug 16, 2012 at 07:43:56PM +0300, Kirill A. Shutemov wrote:
> Hm.. I think with static_key we can avoid cache overhead here. I'll try.

Could you elaborate on the static_key? Is it some sort of self
modifying code?

> Thanks for the review. Could you take a look at the huge zero page patchset? ;)

I've noticed that too, nice :). I'm checking some detail on the
wrprotect fault behavior but I'll comment there.


Re: [PATCH v3 6/7] mm: make clear_huge_page cache clear only around the fault address

2012-08-16 Thread Kirill A. Shutemov
On Thu, Aug 16, 2012 at 06:16:47PM +0200, Andrea Arcangeli wrote:
> Hi Kirill,
> 
> On Thu, Aug 16, 2012 at 06:15:53PM +0300, Kirill A. Shutemov wrote:
> > for (i = 0; i < pages_per_huge_page;
> >  i++, p = mem_map_next(p, page, i)) {
> 
> It may be more optimal to avoid a multiplication/shift-left before the
> add, and to do:
> 
>   for (i = 0, vaddr = haddr; i < pages_per_huge_page;
>i++, p = mem_map_next(p, page, i), vaddr += PAGE_SIZE) {
> 

Makes sense. I'll update it.

> > cond_resched();
> > -   clear_user_highpage(p, addr + i * PAGE_SIZE);
> > +   vaddr = haddr + i*PAGE_SIZE;
> 
> Not sure if gcc can optimize it away because of the external calls.
> 
> > +   if (!ARCH_HAS_USER_NOCACHE || i == target)
> > +   clear_user_highpage(page + i, vaddr);
> > +   else
> > +   clear_user_highpage_nocache(page + i, vaddr);
> > }
> 
> 
> My only worry overall is if there can be some workload where this may
> actually slow down userland if the CPU cache is very large and
> userland would access most of the faulted in memory after the first
> fault.
> 
> So I wouldn't mind adding one more check in addition to the
> !ARCH_HAS_USER_NOCACHE above, to check a runtime sysctl variable. It'll
> waste a cacheline, yes, but I doubt it's measurable compared to the time
> it takes to do a >=2M hugepage copy.

Hm.. I think with static_key we can avoid cache overhead here. I'll try.
 
> Furthermore it would allow people to benchmark its effect without
> having to rebuild the kernel themselves.
> 
> All other patches look fine to me.

Thanks for the review. Could you take a look at the huge zero page patchset? ;)

-- 
 Kirill A. Shutemov


Re: [PATCH v3 0/7] mv643xx.c: Add basic device tree support.

2012-08-16 Thread Ian Molton
Ping :)

Can we get some consensus on the right approach here? I'm loath to code
this if it's going to be rejected.

I'd prefer the driver to be properly split so we don't have the MDIO
driver mapping the ethernet drivers' address spaces, but if that's not
going to be merged, I don't feel like doing the work for nothing.

If the driver is to use the overlapping-address, mapped-by-the-MDIO
scheme, then so be it, but I could do with knowing.

Another point against the latter scheme is that the MDIO driver could
sensibly be used (the block is identical) on the Armada XP, which has
four ethernet blocks rather than two, grouped in two pairs with a
discontiguous address range.

I'd like to get this moved along as soon as possible though.

-Ian


Re: [PATCH v3 6/7] mm: make clear_huge_page cache clear only around the fault address

2012-08-16 Thread Andrea Arcangeli
Hi Kirill,

On Thu, Aug 16, 2012 at 06:15:53PM +0300, Kirill A. Shutemov wrote:
>   for (i = 0; i < pages_per_huge_page;
>i++, p = mem_map_next(p, page, i)) {

It may be more optimal to avoid a multiplication/shift-left before the
add, and to do:

for (i = 0, vaddr = haddr; i < pages_per_huge_page;
 i++, p = mem_map_next(p, page, i), vaddr += PAGE_SIZE) {

>   cond_resched();
> - clear_user_highpage(p, addr + i * PAGE_SIZE);
> + vaddr = haddr + i*PAGE_SIZE;

Not sure if gcc can optimize it away because of the external calls.

> + if (!ARCH_HAS_USER_NOCACHE || i == target)
> + clear_user_highpage(page + i, vaddr);
> + else
> + clear_user_highpage_nocache(page + i, vaddr);
>   }


My only worry overall is if there can be some workload where this may
actually slow down userland if the CPU cache is very large and
userland would access most of the faulted in memory after the first
fault.

So I wouldn't mind adding one more check in addition to the
!ARCH_HAS_USER_NOCACHE above, to check a runtime sysctl variable. It'll
waste a cacheline, yes, but I doubt it's measurable compared to the time
it takes to do a >=2M hugepage copy.

Furthermore it would allow people to benchmark its effect without
having to rebuild the kernel themselves.

All other patches look fine to me.

Thanks!
Andrea
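The bookkeeping under discussion boils down to one index computation, shown
here as a userspace sketch (PAGE_SHIFT value assumed; names mirror the patch
later in this digest):

```c
#define PAGE_SHIFT 12              /* assumed 4K base pages */
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

/* Index of the 4K subpage containing the fault address within the huge
 * page starting at haddr; only this subpage is cleared through the cache,
 * the rest can use the cache-avoiding clear. */
static unsigned long target_index(unsigned long haddr, unsigned long fault)
{
	return (fault - haddr) >> PAGE_SHIFT;
}
```
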


Re: [PATCH v3 2/2] powerpc: Uprobes port to powerpc

2012-08-16 Thread Oleg Nesterov
On 08/16, Ananth N Mavinakayanahalli wrote:
>
> On Thu, Aug 16, 2012 at 07:41:53AM +1000, Benjamin Herrenschmidt wrote:
> > On Wed, 2012-08-15 at 18:59 +0200, Oleg Nesterov wrote:
> > > On 07/26, Ananth N Mavinakayanahalli wrote:
> > > >
> > > > From: Ananth N Mavinakayanahalli 
> > > >
> > > > This is the port of uprobes to powerpc. Usage is similar to x86.
> > >
> > > I am just curious why this series was ignored by powerpc maintainers...
> >
> > Because it arrived too late for the previous merge window considering my
> > limited bandwidth for reviewing things and that nobody else seems to
> > have reviewed it :-)
> >
> > It's still on track for the next one, and I'm hoping to dedicate most of
> > next week going through patches & doing a powerpc -next.
>
> Thanks Ben!

Great!

> > > Just one question... Shouldn't arch_uprobe_pre_xol() forbid to probe
> > > UPROBE_SWBP_INSN (at least) ?
> > >
> > > (I assume that emulate_step() can't handle this case but of course I
> > >  do not understand arch/powerpc/lib/sstep.c)
> > >
> > > Note that uprobe_pre_sstep_notifier() sets utask->state = UTASK_BP_HIT
> > > without any checks. This doesn't look right if it was UTASK_SSTEP...
> > >
> > > But again, I do not know what powepc will actually do if we try to
> > > single-step over UPROBE_SWBP_INSN.
> >
> > Ananth ?
>
> set_swbp() will return -EEXIST to install_breakpoint if we are trying to
> put a breakpoint on UPROBE_SWBP_INSN.

Not really: this -EEXIST (already removed by recent changes) means that
the bp was already installed.

But this doesn't matter,

> So, the arch agnostic code itself
> takes care of this case...

Yes. I forgot about install_breakpoint()->is_swbp_insn() check which
returns -ENOTSUPP, somehow I thought arch_uprobe_analyze_insn() does
this.

> or am I missing something?

No, it is me.

> However, I see that we need a powerpc specific is_swbp_insn()
> implementation since we will have to take care of all the trap variants.

Hmm, I am not sure. is_swbp_insn(insn), as it is used in the arch agnostic
code, should only return true if insn == UPROBE_SWBP_INSN (just in case,
this logic needs more fixes but this is offtopic).

If powerpc has another insn(s) which can trigger powerpc's do_int3()
counterpart, they should be rejected by arch_uprobe_analyze_insn().
I think.

> I will need to update the patches based on changes being made by Oleg
> and Sebastien for the single-step issues.

Perhaps you can do this in a separate change?

We need some (simple) changes in the arch agnostic code first, they
should not break powerpc. These changes are still under discussion.
Once we have "__weak  arch_uprobe_step*" you can reimplement these
hooks and fix the problems with single-stepping.

Oleg.



Re: therm_pm72 units, interface

2012-08-16 Thread Jan Engelhardt

On Wednesday 2012-08-15 23:35, Benjamin Herrenschmidt wrote:
>> XServe G5 of mine started powering off more or less 
>> randomly
>
>BTW. There's a new windfarm driver for these in recent kernels...
>
>Appart from that, the trip points are coming from a calibration EEPROM,
>you may want to tweak the driver to warn a bit earlier or that sort of
>things ? (Or just to print more things out ?)

If you have more things to print/offer via sysfs, I'm all for it.

The XsG5 really has (by looking into the casing): 1 PCI Fan,
6 center fans, 1 PSU intake and 1 PSU outblow fan (this last one
seems rather slow-turning, but maybe that's normal).
It is not quite clear which is which in the sysfs display.

What I did figure out: at the PROM, fans run at what seems
to be full speed (some 8000-9000 rpm?). Once Linux and therm_pm72
are loaded, the fans settle down towards 4000 rpm, and if the machine
has warmed up, that is then when it powers off. (The kernel is indeed
3.4. I now need to figure out how to place a new kernel on it without
it powering off in between.)

>> $ cd /sys/devices/temperature; grep '' *;
>> backside_fan_pwm:32
>> backside_temperature:54.000
>> cpu0_current:34.423
>> cpu0_exhaust_fan_rpm:5340
>> cpu0_intake_fan_rpm:5340
>> cpu0_temperature:72.889
>> cpu0_voltage:1.252
>> cpu1_current:34.179
>> cpu1_exhaust_fan_rpm:4584
>> cpu1_intake_fan_rpm:4584
>> cpu1_temperature:68.526
>> cpu1_voltage:1.259
>> dimms_temperature:53.000
>> grep: driver: Er en filkatalog
>> modalias:platform:temperature
>> grep: power: Er en filkatalog
>> slots_fan_pwm:20
>> slots_temperature:38.500
>> grep: subsystem: Er en filkatalog
>> uevent:DRIVER=temperature
>> uevent:OF_NAME=fan
>> uevent:OF_FULLNAME=/u3@0,f800/i2c@f8001000/fan@15e
>> uevent:OF_TYPE=fcu
>> uevent:OF_COMPATIBLE_0=fcu
>> uevent:OF_COMPATIBLE_N=1
>> uevent:MODALIAS=of:NfanTfcuCfcu


[PATCH v3 6/7] mm: make clear_huge_page cache clear only around the fault address

2012-08-16 Thread Kirill A. Shutemov
From: Andi Kleen 

Clearing a 2MB huge page will typically blow away several levels
of CPU caches. To avoid this, only cache-clear the 4K area
around the fault address and use cache-avoiding clears
for the rest of the 2MB area.

Signed-off-by: Andi Kleen 
Signed-off-by: Kirill A. Shutemov 
---
 mm/memory.c |   34 +-
 1 files changed, 29 insertions(+), 5 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index dfc179b..d4626b9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3969,18 +3969,34 @@ EXPORT_SYMBOL(might_fault);
 #endif
 
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
+
+#ifndef ARCH_HAS_USER_NOCACHE
+#define ARCH_HAS_USER_NOCACHE 0
+#endif
+
+#if ARCH_HAS_USER_NOCACHE == 0
+#define clear_user_highpage_nocache clear_user_highpage
+#endif
+
 static void clear_gigantic_page(struct page *page,
-   unsigned long addr,
-   unsigned int pages_per_huge_page)
+   unsigned long haddr, unsigned long fault_address,
+   unsigned int pages_per_huge_page)
 {
int i;
struct page *p = page;
+   unsigned long vaddr;
+   int target = (fault_address - haddr) >> PAGE_SHIFT;
 
might_sleep();
+   vaddr = haddr;
for (i = 0; i < pages_per_huge_page;
 i++, p = mem_map_next(p, page, i)) {
cond_resched();
-   clear_user_highpage(p, addr + i * PAGE_SIZE);
+   vaddr = haddr + i*PAGE_SIZE;
+   if (!ARCH_HAS_USER_NOCACHE  || i == target)
+   clear_user_highpage(p, vaddr);
+   else
+   clear_user_highpage_nocache(p, vaddr);
}
 }
 void clear_huge_page(struct page *page,
@@ -3988,16 +4004,24 @@ void clear_huge_page(struct page *page,
 unsigned int pages_per_huge_page)
 {
int i;
+   unsigned long vaddr;
+   int target = (fault_address - haddr) >> PAGE_SHIFT;
 
if (unlikely(pages_per_huge_page > MAX_ORDER_NR_PAGES)) {
-   clear_gigantic_page(page, haddr, pages_per_huge_page);
+   clear_gigantic_page(page, haddr, fault_address,
+   pages_per_huge_page);
return;
}
 
might_sleep();
+   vaddr = haddr;
for (i = 0; i < pages_per_huge_page; i++) {
cond_resched();
-   clear_user_highpage(page + i, haddr + i * PAGE_SIZE);
+   vaddr = haddr + i*PAGE_SIZE;
+   if (!ARCH_HAS_USER_NOCACHE || i == target)
+   clear_user_highpage(page + i, vaddr);
+   else
+   clear_user_highpage_nocache(page + i, vaddr);
}
 }
 
-- 
1.7.7.6



[PATCH v3 5/7] x86: Add clear_page_nocache

2012-08-16 Thread Kirill A. Shutemov
From: Andi Kleen 

Add a cache-avoiding version of clear_page. Straightforward integer variant
of the existing 64-bit clear_page, for both 32-bit and 64-bit.

Also add the necessary glue for highmem, including a layer that
non-cache-coherent architectures (which use the virtual address for
flushing) can hook into. This is not needed on x86, of course.

If an architecture wants to provide a cache-avoiding version of clear_page
it should define ARCH_HAS_USER_NOCACHE to 1 and implement
clear_page_nocache() and clear_user_highpage_nocache().
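As a userspace analogue of what the assembly below does, SSE2 non-temporal
stores can zero a page while bypassing the cache (an illustration with
intrinsics, not the patch's actual .S code):

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>

/* Zero a 4K, 16-byte-aligned buffer with streaming (movntdq-style) stores,
 * which write around the cache hierarchy, then fence to order them. */
static void clear_page_nocache_demo(void *page)
{
	__m128i zero = _mm_setzero_si128();
	char *p = page;
	size_t i;

	for (i = 0; i < 4096; i += 16)
		_mm_stream_si128((__m128i *)(p + i), zero);
	_mm_sfence();
}
```
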

Signed-off-by: Andi Kleen 
Signed-off-by: Kirill A. Shutemov 
---
 arch/x86/include/asm/page.h  |2 +
 arch/x86/include/asm/string_32.h |5 +++
 arch/x86/include/asm/string_64.h |5 +++
 arch/x86/lib/Makefile|3 +-
 arch/x86/lib/clear_page_32.S |   72 ++
 arch/x86/lib/clear_page_64.S |   29 +++
 arch/x86/mm/fault.c  |7 
 7 files changed, 122 insertions(+), 1 deletions(-)
 create mode 100644 arch/x86/lib/clear_page_32.S

diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
index 8ca8283..aa83a1b 100644
--- a/arch/x86/include/asm/page.h
+++ b/arch/x86/include/asm/page.h
@@ -29,6 +29,8 @@ static inline void copy_user_page(void *to, void *from, unsigned long vaddr,
copy_page(to, from);
 }
 
+void clear_user_highpage_nocache(struct page *page, unsigned long vaddr);
+
 #define __alloc_zeroed_user_highpage(movableflags, vma, vaddr) \
alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO | movableflags, vma, vaddr)
 #define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
diff --git a/arch/x86/include/asm/string_32.h b/arch/x86/include/asm/string_32.h
index 3d3e835..3f2fbcf 100644
--- a/arch/x86/include/asm/string_32.h
+++ b/arch/x86/include/asm/string_32.h
@@ -3,6 +3,8 @@
 
 #ifdef __KERNEL__
 
+#include 
+
 /* Let gcc decide whether to inline or use the out of line functions */
 
 #define __HAVE_ARCH_STRCPY
@@ -337,6 +339,9 @@ void *__constant_c_and_count_memset(void *s, unsigned long pattern,
 #define __HAVE_ARCH_MEMSCAN
 extern void *memscan(void *addr, int c, size_t size);
 
+#define ARCH_HAS_USER_NOCACHE 1
+asmlinkage void clear_page_nocache(void *page);
+
 #endif /* __KERNEL__ */
 
 #endif /* _ASM_X86_STRING_32_H */
diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
index 19e2c46..ca23d1d 100644
--- a/arch/x86/include/asm/string_64.h
+++ b/arch/x86/include/asm/string_64.h
@@ -3,6 +3,8 @@
 
 #ifdef __KERNEL__
 
+#include 
+
 /* Written 2002 by Andi Kleen */
 
 /* Only used for special circumstances. Stolen from i386/string.h */
@@ -63,6 +65,9 @@ char *strcpy(char *dest, const char *src);
 char *strcat(char *dest, const char *src);
 int strcmp(const char *cs, const char *ct);
 
+#define ARCH_HAS_USER_NOCACHE 1
+asmlinkage void clear_page_nocache(void *page);
+
 #endif /* __KERNEL__ */
 
 #endif /* _ASM_X86_STRING_64_H */
diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile
index b00f678..14e47a2 100644
--- a/arch/x86/lib/Makefile
+++ b/arch/x86/lib/Makefile
@@ -23,6 +23,7 @@ lib-y += memcpy_$(BITS).o
 lib-$(CONFIG_SMP) += rwlock.o
 lib-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem.o
 lib-$(CONFIG_INSTRUCTION_DECODER) += insn.o inat.o
+lib-y += clear_page_$(BITS).o
 
 obj-y += msr.o msr-reg.o msr-reg-export.o
 
@@ -40,7 +41,7 @@ endif
 else
 obj-y += iomap_copy_64.o
 lib-y += csum-partial_64.o csum-copy_64.o csum-wrappers_64.o
-lib-y += thunk_64.o clear_page_64.o copy_page_64.o
+lib-y += thunk_64.o copy_page_64.o
 lib-y += memmove_64.o memset_64.o
 lib-y += copy_user_64.o copy_user_nocache_64.o
lib-y += cmpxchg16b_emu.o
diff --git a/arch/x86/lib/clear_page_32.S b/arch/x86/lib/clear_page_32.S
new file mode 100644
index 000..9592161
--- /dev/null
+++ b/arch/x86/lib/clear_page_32.S
@@ -0,0 +1,72 @@
+#include 
+#include 
+#include 
+#include 
+
+/*
+ * Fallback version if SSE2 is not available.
+ */
+ENTRY(clear_page_nocache)
+   CFI_STARTPROC
+   mov%eax,%edx
+   xorl   %eax,%eax
+   movl   $4096/32,%ecx
+   .p2align 4
+.Lloop:
+   decl%ecx
+#define PUT(x) mov %eax,x*4(%edx)
+   PUT(0)
+   PUT(1)
+   PUT(2)
+   PUT(3)
+   PUT(4)
+   PUT(5)
+   PUT(6)
+   PUT(7)
+#undef PUT
+   lea 32(%edx),%edx
+   jnz .Lloop
+   nop
+   ret
+   CFI_ENDPROC
+ENDPROC(clear_page_nocache)
+
+   .section .altinstr_replacement,"ax"
+1:  .byte 0xeb /* jmp  */
+   .byte (clear_page_nocache_sse2 - clear_page_nocache) - (2f - 1b)
+   /* offset */
+2:
+   .previous
+   .section .altinstructions,"a"
+   altinstruction_entry clear_page_nocache,1b,X86_FEATURE_XMM2,\
+   16, 2b-1b
+   .previous
+
+/*
+ * Zero a page avoiding the caches
+ * eax page
+ */
+ENTRY(clear_page_nocache_sse2)
+   CFI_STARTPROC
+   mov%eax,%edx
+   xorl   %eax,%eax
+   

[PATCH v3 3/7] hugetlb: pass fault address to hugetlb_no_page()

2012-08-16 Thread Kirill A. Shutemov
From: "Kirill A. Shutemov" 

Signed-off-by: Kirill A. Shutemov 
---
 mm/hugetlb.c |   38 +++---
 1 files changed, 19 insertions(+), 19 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index bc72712..3c86d3d 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2672,7 +2672,8 @@ static bool hugetlbfs_pagecache_present(struct hstate *h,
 }
 
 static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
-   unsigned long address, pte_t *ptep, unsigned int flags)
+   unsigned long haddr, unsigned long fault_address,
+   pte_t *ptep, unsigned int flags)
 {
struct hstate *h = hstate_vma(vma);
int ret = VM_FAULT_SIGBUS;
@@ -2696,7 +2697,7 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
}
 
mapping = vma->vm_file->f_mapping;
-   idx = vma_hugecache_offset(h, vma, address);
+   idx = vma_hugecache_offset(h, vma, haddr);
 
/*
 * Use page lock to guard against racing truncation
@@ -2708,7 +2709,7 @@ retry:
size = i_size_read(mapping->host) >> huge_page_shift(h);
if (idx >= size)
goto out;
-   page = alloc_huge_page(vma, address, 0);
+   page = alloc_huge_page(vma, haddr, 0);
if (IS_ERR(page)) {
ret = PTR_ERR(page);
if (ret == -ENOMEM)
@@ -2717,7 +2718,7 @@ retry:
ret = VM_FAULT_SIGBUS;
goto out;
}
-   clear_huge_page(page, address, pages_per_huge_page(h));
+   clear_huge_page(page, haddr, pages_per_huge_page(h));
__SetPageUptodate(page);
 
if (vma->vm_flags & VM_MAYSHARE) {
@@ -2763,7 +2764,7 @@ retry:
 * the spinlock.
 */
if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED))
-   if (vma_needs_reservation(h, vma, address) < 0) {
+   if (vma_needs_reservation(h, vma, haddr) < 0) {
ret = VM_FAULT_OOM;
goto backout_unlocked;
}
@@ -2778,16 +2779,16 @@ retry:
goto backout;
 
if (anon_rmap)
-   hugepage_add_new_anon_rmap(page, vma, address);
+   hugepage_add_new_anon_rmap(page, vma, haddr);
else
page_dup_rmap(page);
new_pte = make_huge_pte(vma, page, ((vma->vm_flags & VM_WRITE)
&& (vma->vm_flags & VM_SHARED)));
-   set_huge_pte_at(mm, address, ptep, new_pte);
+   set_huge_pte_at(mm, haddr, ptep, new_pte);
 
if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
/* Optimization, do the COW without a second fault */
-   ret = hugetlb_cow(mm, vma, address, ptep, new_pte, page);
+   ret = hugetlb_cow(mm, vma, haddr, ptep, new_pte, page);
}
 
spin_unlock(&mm->page_table_lock);
@@ -2813,21 +2814,20 @@ int hugetlb_fault(struct mm_struct *mm, struct 
vm_area_struct *vma,
struct page *pagecache_page = NULL;
static DEFINE_MUTEX(hugetlb_instantiation_mutex);
struct hstate *h = hstate_vma(vma);
+   unsigned long haddr = address & huge_page_mask(h);
 
-   address &= huge_page_mask(h);
-
-   ptep = huge_pte_offset(mm, address);
+   ptep = huge_pte_offset(mm, haddr);
if (ptep) {
entry = huge_ptep_get(ptep);
if (unlikely(is_hugetlb_entry_migration(entry))) {
-   migration_entry_wait(mm, (pmd_t *)ptep, address);
+   migration_entry_wait(mm, (pmd_t *)ptep, haddr);
return 0;
} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
return VM_FAULT_HWPOISON_LARGE |
VM_FAULT_SET_HINDEX(hstate_index(h));
}
 
-   ptep = huge_pte_alloc(mm, address, huge_page_size(h));
+   ptep = huge_pte_alloc(mm, haddr, huge_page_size(h));
if (!ptep)
return VM_FAULT_OOM;
 
@@ -2839,7 +2839,7 @@ int hugetlb_fault(struct mm_struct *mm, struct 
vm_area_struct *vma,
mutex_lock(&hugetlb_instantiation_mutex);
entry = huge_ptep_get(ptep);
if (huge_pte_none(entry)) {
-   ret = hugetlb_no_page(mm, vma, address, ptep, flags);
+   ret = hugetlb_no_page(mm, vma, haddr, address, ptep, flags);
goto out_mutex;
}
 
@@ -2854,14 +2854,14 @@ int hugetlb_fault(struct mm_struct *mm, struct 
vm_area_struct *vma,
 * consumed.
 */
if ((flags & FAULT_FLAG_WRITE) && !pte_write(entry)) {
-   if (vma_needs_reservation(h, vma, address) < 0) {
+   if (vma_needs_reservation(h, vma, haddr) < 0) {
ret = VM_FAULT_OOM;
  

[PATCH v3 2/7] THP: Pass fault address to __do_huge_pmd_anonymous_page()

2012-08-16 Thread Kirill A. Shutemov
From: Andi Kleen 

Signed-off-by: Andi Kleen 
Signed-off-by: Kirill A. Shutemov 
---
 mm/huge_memory.c |7 ---
 1 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 70737ec..6f0825b611 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -633,7 +633,8 @@ static inline pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct 
vm_area_struct *vma)
 
 static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
struct vm_area_struct *vma,
-   unsigned long haddr, pmd_t *pmd,
+   unsigned long haddr,
+   unsigned long address, pmd_t *pmd,
struct page *page)
 {
pgtable_t pgtable;
@@ -720,8 +721,8 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
put_page(page);
goto out;
}
-   if (unlikely(__do_huge_pmd_anonymous_page(mm, vma, haddr, pmd,
- page))) {
+   if (unlikely(__do_huge_pmd_anonymous_page(mm, vma, haddr,
+   address, pmd, page))) {
mem_cgroup_uncharge_page(page);
put_page(page);
goto out;
-- 
1.7.7.6

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


[PATCH v3 0/7] Avoid cache trashing on clearing huge/gigantic page

2012-08-16 Thread Kirill A. Shutemov
From: "Kirill A. Shutemov" 

Clearing a 2MB huge page will typically blow away several levels of CPU
caches.  To avoid this, only the 4K area around the fault address is
cleared through the cache, and cache-avoiding clears are used for the
rest of the 2MB area.

This patchset implements a cache-avoiding version of clear_page, for
x86 only. An architecture that wants to provide a cache-avoiding version
of clear_page should define ARCH_HAS_USER_NOCACHE to 1 and implement
clear_page_nocache() and clear_user_highpage_nocache().
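The approach above can be sketched in plain C. This is a simplified
userspace model of the patched clear_huge_page() logic, not the kernel
code: the *_nocache helper is a stand-in (the real one uses non-temporal
stores), and both helpers are plain memset here purely for illustration.

```c
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096u

/* Stand-ins for the kernel primitives: a cached clear for the faulting
 * 4K subpage and a cache-avoiding (non-temporal) clear for the rest.
 * Both are plain memset here, purely for illustration. */
static void clear_user_highpage(uint8_t *p)         { memset(p, 0, PAGE_SIZE); }
static void clear_user_highpage_nocache(uint8_t *p) { memset(p, 0, PAGE_SIZE); }

/* Only the 4K subpage containing fault_address goes through the cache;
 * every other subpage of the huge page takes the nocache path. */
static void clear_huge_page(uint8_t *page, unsigned long haddr,
                            unsigned long fault_address,
                            unsigned int pages_per_huge_page)
{
    unsigned int fault_idx = (unsigned int)((fault_address - haddr) / PAGE_SIZE);
    unsigned int i;

    for (i = 0; i < pages_per_huge_page; i++) {
        if (i == fault_idx)
            clear_user_highpage(page + i * PAGE_SIZE);
        else
            clear_user_highpage_nocache(page + i * PAGE_SIZE);
    }
}
```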

v3:
  - Rebased to current Linus' tree. kmap_atomic() build issue is fixed;
  - Pass fault address to clear_huge_page(). v2 had a problem with clearing
for sizes other than HPAGE_SIZE;
  - x86: fix 32bit variant. Fallback version of clear_page_nocache() has
been added for non-SSE2 systems;
  - x86: clear_page_nocache() moved to clear_page_{32,64}.S;
  - x86: use pushq_cfi/popq_cfi instead of push/pop;
v2:
  - No code change. Only commit messages are updated.
  - RFC mark is dropped.

Andi Kleen (5):
  THP: Use real address for NUMA policy
  THP: Pass fault address to __do_huge_pmd_anonymous_page()
  x86: Add clear_page_nocache
  mm: make clear_huge_page cache clear only around the fault address
  x86: switch the 64bit uncached page clear to SSE/AVX v2

Kirill A. Shutemov (2):
  hugetlb: pass fault address to hugetlb_no_page()
  mm: pass fault address to clear_huge_page()

 arch/x86/include/asm/page.h  |2 +
 arch/x86/include/asm/string_32.h |5 ++
 arch/x86/include/asm/string_64.h |5 ++
 arch/x86/lib/Makefile|3 +-
 arch/x86/lib/clear_page_32.S |   72 +++
 arch/x86/lib/clear_page_64.S |   78 ++
 arch/x86/mm/fault.c  |7 +++
 include/linux/mm.h   |2 +-
 mm/huge_memory.c |   17 
 mm/hugetlb.c |   39 ++-
 mm/memory.c  |   37 +++---
 11 files changed, 232 insertions(+), 35 deletions(-)
 create mode 100644 arch/x86/lib/clear_page_32.S

-- 
1.7.7.6



[PATCH v3 4/7] mm: pass fault address to clear_huge_page()

2012-08-16 Thread Kirill A. Shutemov
From: "Kirill A. Shutemov" 

Signed-off-by: Kirill A. Shutemov 
---
 include/linux/mm.h |2 +-
 mm/huge_memory.c   |2 +-
 mm/hugetlb.c   |3 ++-
 mm/memory.c|7 ---
 4 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 311be90..2858723 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1638,7 +1638,7 @@ extern void dump_page(struct page *page);
 
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
 extern void clear_huge_page(struct page *page,
-   unsigned long addr,
+   unsigned long haddr, unsigned long fault_address,
unsigned int pages_per_huge_page);
 extern void copy_user_huge_page(struct page *dst, struct page *src,
unsigned long addr, struct vm_area_struct *vma,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 6f0825b611..070bf89 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -644,7 +644,7 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct 
*mm,
if (unlikely(!pgtable))
return VM_FAULT_OOM;
 
-   clear_huge_page(page, haddr, HPAGE_PMD_NR);
+   clear_huge_page(page, haddr, address, HPAGE_PMD_NR);
__SetPageUptodate(page);
 
spin_lock(&mm->page_table_lock);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3c86d3d..5182192 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2718,7 +2718,8 @@ retry:
ret = VM_FAULT_SIGBUS;
goto out;
}
-   clear_huge_page(page, haddr, pages_per_huge_page(h));
+   clear_huge_page(page, haddr, fault_address,
+   pages_per_huge_page(h));
__SetPageUptodate(page);
 
if (vma->vm_flags & VM_MAYSHARE) {
diff --git a/mm/memory.c b/mm/memory.c
index 5736170..dfc179b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3984,19 +3984,20 @@ static void clear_gigantic_page(struct page *page,
}
 }
 void clear_huge_page(struct page *page,
-unsigned long addr, unsigned int pages_per_huge_page)
+unsigned long haddr, unsigned long fault_address,
+unsigned int pages_per_huge_page)
 {
int i;
 
if (unlikely(pages_per_huge_page > MAX_ORDER_NR_PAGES)) {
-   clear_gigantic_page(page, addr, pages_per_huge_page);
+   clear_gigantic_page(page, haddr, pages_per_huge_page);
return;
}
 
might_sleep();
for (i = 0; i < pages_per_huge_page; i++) {
cond_resched();
-   clear_user_highpage(page + i, addr + i * PAGE_SIZE);
+   clear_user_highpage(page + i, haddr + i * PAGE_SIZE);
}
 }
 
-- 
1.7.7.6



[PATCH v3 7/7] x86: switch the 64bit uncached page clear to SSE/AVX v2

2012-08-16 Thread Kirill A. Shutemov
From: Andi Kleen 

With multiple threads, vector stores are more efficient, so use them.
This will cause the page clear to run non-preemptibly and add some
overhead. However, on 32-bit it was already non-preemptible (due to
kmap_atomic) and there is a preemption opportunity every 4K unit.

On an NPB (NAS Parallel Benchmarks) 128GB run on a Westmere this improves
the performance regression of enabling transparent huge pages
by ~2% (2.81% to 0.81%), near the runtime variability now.
On a system with AVX support more is expected.

Signed-off-by: Andi Kleen 
[kirill.shute...@linux.intel.com: Properly save/restore arguments]
Signed-off-by: Kirill A. Shutemov 
---
 arch/x86/lib/clear_page_64.S |   79 ++
 1 files changed, 64 insertions(+), 15 deletions(-)

diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
index 9d2f3c2..b302cff 100644
--- a/arch/x86/lib/clear_page_64.S
+++ b/arch/x86/lib/clear_page_64.S
@@ -73,30 +73,79 @@ ENDPROC(clear_page)
 .Lclear_page_end-clear_page,3b-2b
.previous
 
+#define SSE_UNROLL 128
+
 /*
  * Zero a page avoiding the caches
  * rdi page
  */
 ENTRY(clear_page_nocache)
CFI_STARTPROC
-   xorl   %eax,%eax
-   movl   $4096/64,%ecx
+   pushq_cfi %rdi
+   call   kernel_fpu_begin
+   popq_cfi  %rdi
+   sub$16,%rsp
+   CFI_ADJUST_CFA_OFFSET 16
+   movdqu %xmm0,(%rsp)
+   xorpd  %xmm0,%xmm0
+   movl   $4096/SSE_UNROLL,%ecx
.p2align 4
 .Lloop_nocache:
decl%ecx
-#define PUT(x) movnti %rax,x*8(%rdi)
-   movnti %rax,(%rdi)
-   PUT(1)
-   PUT(2)
-   PUT(3)
-   PUT(4)
-   PUT(5)
-   PUT(6)
-   PUT(7)
-#undef PUT
-   leaq64(%rdi),%rdi
+   .set x,0
+   .rept SSE_UNROLL/16
+   movntdq %xmm0,x(%rdi)
+   .set x,x+16
+   .endr
+   leaqSSE_UNROLL(%rdi),%rdi
jnz .Lloop_nocache
-   nop
-   ret
+   movdqu (%rsp),%xmm0
+   addq   $16,%rsp
+   CFI_ADJUST_CFA_OFFSET -16
+   jmp   kernel_fpu_end
CFI_ENDPROC
 ENDPROC(clear_page_nocache)
+
+#ifdef CONFIG_AS_AVX
+
+   .section .altinstr_replacement,"ax"
+1: .byte 0xeb  /* jmp  */
+   .byte (clear_page_nocache_avx - clear_page_nocache) - (2f - 1b)
+   /* offset */
+2:
+   .previous
+   .section .altinstructions,"a"
+   altinstruction_entry clear_page_nocache,1b,X86_FEATURE_AVX,\
+16, 2b-1b
+   .previous
+
+#define AVX_UNROLL 256 /* TUNE ME */
+
+ENTRY(clear_page_nocache_avx)
+   CFI_STARTPROC
+   pushq_cfi %rdi
+   call   kernel_fpu_begin
+   popq_cfi  %rdi
+   sub$32,%rsp
+   CFI_ADJUST_CFA_OFFSET 32
+   vmovdqu %ymm0,(%rsp)
+   vxorpd  %ymm0,%ymm0,%ymm0
+   movl   $4096/AVX_UNROLL,%ecx
+   .p2align 4
+.Lloop_avx:
+   decl%ecx
+   .set x,0
+   .rept AVX_UNROLL/32
+   vmovntdq %ymm0,x(%rdi)
+   .set x,x+32
+   .endr
+   leaqAVX_UNROLL(%rdi),%rdi
+   jnz .Lloop_avx
+   vmovdqu (%rsp),%ymm0
+   addq   $32,%rsp
+   CFI_ADJUST_CFA_OFFSET -32
+   jmp   kernel_fpu_end
+   CFI_ENDPROC
+ENDPROC(clear_page_nocache_avx)
+
+#endif
-- 
1.7.7.6



[PATCH v3 1/7] THP: Use real address for NUMA policy

2012-08-16 Thread Kirill A. Shutemov
From: Andi Kleen 

Use the fault address, not the rounded-down hpage address, for NUMA
policy purposes. In some circumstances this can give a more exact
NUMA policy.

Signed-off-by: Andi Kleen 
Signed-off-by: Kirill A. Shutemov 
---
 mm/huge_memory.c |8 
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 57c4b93..70737ec 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -681,11 +681,11 @@ static inline gfp_t alloc_hugepage_gfpmask(int defrag, 
gfp_t extra_gfp)
 
 static inline struct page *alloc_hugepage_vma(int defrag,
  struct vm_area_struct *vma,
- unsigned long haddr, int nd,
+ unsigned long address, int nd,
  gfp_t extra_gfp)
 {
return alloc_pages_vma(alloc_hugepage_gfpmask(defrag, extra_gfp),
-  HPAGE_PMD_ORDER, vma, haddr, nd);
+  HPAGE_PMD_ORDER, vma, address, nd);
 }
 
 #ifndef CONFIG_NUMA
@@ -710,7 +710,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
if (unlikely(khugepaged_enter(vma)))
return VM_FAULT_OOM;
page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
- vma, haddr, numa_node_id(), 0);
+ vma, address, numa_node_id(), 0);
if (unlikely(!page)) {
count_vm_event(THP_FAULT_FALLBACK);
goto out;
@@ -944,7 +944,7 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
if (transparent_hugepage_enabled(vma) &&
!transparent_hugepage_debug_cow())
new_page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
- vma, haddr, numa_node_id(), 0);
+ vma, address, numa_node_id(), 0);
else
new_page = NULL;
 
-- 
1.7.7.6



Re: [PATCH] Powerpc 8xx CPM_UART delay in receive

2012-08-16 Thread Alan Cox
> MAX_IDL: Maximum idle characters. When a character is received, the 
> receiver begins counting idle characters. If MAX_IDL idle characters
> are received before the next data character, an idle timeout occurs
> and the buffer is closed,
> generating a maskable interrupt request to the core to receive the
> data from the buffer. Thus, MAX_IDL offers a way to demarcate frames.
> To disable the feature, clear MAX_IDL. The bit length of an idle
> character is calculated as follows: 1 + data length (5–9) + 1 (if
> parity is used) 
> + number of stop bits (1–2). For 8 data bits, no parity, and 1 stop
> bit, the character length is 10 bits


So if you have slightly bursty high-speed data, as is quite typical,
then before your change you would get one interrupt per 32-byte buffer;
with it you'll get a lot more interrupts.

You have two available hints about the way to set this - one of them is
the baud rate (low baud rates mean the fifo isn't a big win and the
latency is high), the other is the low_latency flag if the driver
supports the low latency feature (and arguably you can still use a
request for it as a hint even if you refuse the actual feature).

So I think a reasonable approach would be to set the idle timeout down
for low baud rates or if low_latency is requested.
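One way to express that heuristic in code (a hedged sketch only - the
helper name and the 9600 baud cut-off are my assumptions, not something
from the driver or the datasheet):

```c
/* Pick the CPM MAX_IDL value from the line speed and the low_latency
 * hint: keep the full 32-character idle timeout when the buffering is
 * a win, drop to 1 idle character when latency matters more.
 * Hypothetical helper; the 9600 baud threshold is an assumption. */
static unsigned int cpm_uart_pick_max_idl(unsigned int baud, int low_latency)
{
    if (low_latency || baud <= 9600)
        return 1;   /* close the RX buffer after one idle character */
    return 32;      /* current behaviour: one idle char per buffer byte */
}
```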

> generated if there is at least one word in the FIFO and for a time 
> equivalent to the transmission of four characters

Which is a bit more reasonable than one, although problematic at low
speed (hence the fifo on/off).




Re: [PATCH] Powerpc 8xx CPM_UART delay in receive

2012-08-16 Thread leroy christophe

On 16/08/2012 16:29, Alan Cox wrote:

>> The PowerPC CPM is working differently. It doesn't use a fifo but
>> buffers. Buffers are handed to the microprocessor only when they are
>> full or after a timeout period which is adjustable. In the driver, the
>
> Which is different how - remembering we empty the FIFO on an IRQ
>
>> buffers are configured with a size of 32 bytes. And the timeout is set
>> to the size of the buffer. That is this timeout that I'm reducing to 1
>> byte in my proposed patch. I can't see what it would break for high
>> speed I/O.
>
> How can a timeout be measured in "bytes"? Can we have a bit more clarity
> on how the hardware works and take it from there?
>
> Alan

The reference manual of MPC885 says the following about the MAX_IDL 
parameter:


MAX_IDL: Maximum idle characters. When a character is received, the 
receiver begins counting idle characters. If MAX_IDL idle characters are 
received before the next data character, an idle timeout occurs and the 
buffer is closed,
generating a maskable interrupt request to the core to receive the data 
from the buffer. Thus, MAX_IDL offers a way to demarcate frames. To 
disable the feature, clear MAX_IDL. The bit length of an idle character 
is calculated as follows: 1 + data length (5–9) + 1 (if parity is used) 
+ number of stop bits (1–2). For 8 data bits, no parity, and 1 stop bit, 
the character length is 10 bits


If the UART is receiving data and gets an idle character (all ones), the 
channel begins counting consecutive idle characters received. If MAX_IDL 
is reached, the buffer is closed and an RX interrupt is generated if not 
masked. If no buffer is open, this event does not generate an interrupt 
or any status information. The internal idle counter (IDLC) is reset 
every time a character is received. To disable the idle sequence 
function, clear MAX_IDL
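Plugging numbers into that definition shows where the reported delay
comes from: with 8N1 (10-bit characters) at 300 baud, MAX_IDL = 32 keeps
the buffer open for roughly a second, while MAX_IDL = 1 closes it after
about 33 ms. A small worked example:

```c
/* Timeout implied by MAX_IDL, in milliseconds: MAX_IDL idle characters
 * of char_bits bits each at the given baud rate. For 8N1, char_bits is
 * 1 start + 8 data + 1 stop = 10 bits, as the manual describes. */
static unsigned int max_idl_timeout_ms(unsigned int max_idl,
                                       unsigned int char_bits,
                                       unsigned int baud)
{
    return max_idl * char_bits * 1000u / baud;
}
```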



The datasheet of the 16550 UART says:

Besides, for FIFO mode operation a time out mechanism is implemented. 
Independently of the trigger level of the FIFO, an interrupt will be 
generated if there is at least one word in the FIFO and for a time 
equivalent to the transmission of four characters

- no new character has been received and
- the microprocessor has not read the RHR
To compute the time out, the current total number of bits (start, data, 
parity and stop(s)) is used, together with the current baud rate (i.e., 
it depends on the contents of the LCR, DLL, DLM and PSD registers).



Christophe


Re: [PATCH] Powerpc 8xx CPM_UART delay in receive

2012-08-16 Thread Alan Cox
> The PowerPC CPM is working differently. It doesn't use a fifo but 
> buffers. Buffers are handed to the microprocessor only when they are 
> full or after a timeout period which is adjustable. In the driver, the 

Which is different how - remembering we empty the FIFO on an IRQ

> buffers are configured with a size of 32 bytes. And the timeout is set 
> to the size of the buffer. That is this timeout that I'm reducing to 1 
> byte in my proposed patch. I can't see what it would break for high 
> speed I/O.

How can a timeout be measured in "bytes"? Can we have a bit more clarity
on how the hardware works and take it from there?

Alan


Re: powerpc/perf: hw breakpoints return ENOSPC

2012-08-16 Thread Peter Zijlstra
On Fri, 2012-08-17 at 00:02 +1000, Michael Ellerman wrote:
> You do want to guarantee that the task will always be subject to the
> breakpoint, even if it moves cpus. So is there any way to guarantee that
> other than reserving a breakpoint slot on every cpu ahead of time? 

That's not how regular perf works.. regular perf can overload hw
resources at will and stuff is strictly per-cpu.

So the regular perf record has perf_event_attr::inherit enabled by
default, this will result in it creating a per-task-per-cpu event for
each cpu and this will succeed because there's no strict reservation to
avoid/detect starvation against perf_event_attr::pinned events.

For regular (!pinned) events, we'll RR the created events on the
available hardware resources.

HWBP does things completely differently and reserves a slot over all CPUs
for everything, thus stuff completely falls apart.
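For reference, the kind of event being discussed looks roughly like this
on the perf_event_open() side: a per-task hardware write breakpoint with
perf_event_attr::inherit set, as perf record enables by default. Only the
attr construction is shown; this is a sketch for orientation, and the
actual syscall plus the POWER7 slot accounting are what the thread is
debating.

```c
#include <string.h>
#include <linux/perf_event.h>
#include <linux/hw_breakpoint.h>

/* Build the attr for a per-task HW write breakpoint at addr.
 * With .inherit = 1 the event follows the task's children, which is
 * what drives the per-CPU slot reservation behaviour discussed above. */
static struct perf_event_attr task_breakpoint_attr(unsigned long addr)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.size    = sizeof(attr);
    attr.type    = PERF_TYPE_BREAKPOINT;
    attr.bp_type = HW_BREAKPOINT_W;       /* write watchpoint */
    attr.bp_addr = addr;
    attr.bp_len  = HW_BREAKPOINT_LEN_4;
    attr.inherit = 1;                     /* per-task, inherited by children */
    return attr;
}
```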




Re: powerpc/perf: hw breakpoints return ENOSPC

2012-08-16 Thread Michael Ellerman
On Thu, 2012-08-16 at 13:44 +0200, Peter Zijlstra wrote:
> On Thu, 2012-08-16 at 21:17 +1000, Michael Neuling wrote:
> > Peter,
> > 
> > > > On this second syscall, fetch_bp_busy_slots() sets slots.pinned to be 1,
> > > > despite there being no breakpoint on this CPU.  This is because the call
> > > > to task_bp_pinned() checks all CPUs, rather than just the current CPU.
> > > > POWER7 only has one hardware breakpoint per CPU (ie. HBP_NUM=1), so we
> > > > return ENOSPC.
> > > 
> > > I think this comes from the ptrace legacy, we register a breakpoint on
> > > all cpus because when we migrate a task it cannot fail to migrate the
> > > breakpoint.
> > > 
> > > Its one of the things I hate most about the hwbp stuff as it relates to
> > > perf.
> > > 
> > > Frederic knows more...
> > 
> > Maybe I should wait for Frederic to respond but I'm not sure I
> > understand what you're saying.
> > 
> > I can see how using ptrace hw breakpoints and perf hw breakpoints at the
> > same time could be a problem, but I'm not sure how this would stop it.
> 
> ptrace uses perf for hwbp support so we're stuck with all kinds of
> stupid ptrace constraints.. or somesuch.
> 
> > Are you saying that we need to keep at least 1 slot free at all times,
> > so that we can use it for ptrace?
> 
> No, I'm saying perf-hwbp is weird because of ptrace, maybe the ptrace
> weirdness shouldn't live in perf-hwbp but in the ptrace-perf glue
> however..

But how else would it work, even if ptrace wasn't in the picture?

You do want to guarantee that the task will always be subject to the
breakpoint, even if it moves cpus. So is there any way to guarantee that
other than reserving a breakpoint slot on every cpu ahead of time?

Or can a hwbp event go into error state if it can't be installed on the
new cpu, like a pinned event does? I can't see any code that does that.

cheers






Re: [PATCH] Powerpc 8xx CPM_UART delay in receive

2012-08-16 Thread leroy christophe


On 14/08/2012 16:52, Alan Cox wrote:

> On Tue, 14 Aug 2012 16:26:28 +0200
> Christophe Leroy  wrote:
>
>> Hello,
>>
>> I'm not sure who to address this Patch to either
>>
>> It fixes a delay issue with the CPM UART driver on PowerPC MPC8xx.
>> The problem is that with the current code, the driver waits 32 IDLE patterns
>> before returning the received data to the upper level. That means, for instance,
>> about 1 second at 300 bauds.
>> This fix limits the waiting period to one byte.
>
> Take a look at how the 8250 does it - I think you want to set the value
> based upon the data rate. Your patch will break it for everyone doing
> high speed I/O.
>
> Alan

I'm not sure I understand what you mean. As far as I can see, the
8250/16550 works a bit differently, as it is based on a FIFO and
triggers an interrupt as soon as a given number of bytes is received. I
also see that in case this amount is not reached, there is a receive
timeout which fires after no byte has been received for a duration of
more than 4 characters.


The PowerPC CPM works differently. It doesn't use a FIFO but
buffers. Buffers are handed to the microprocessor only when they are
full or after a timeout period which is adjustable. In the driver, the
buffers are configured with a size of 32 bytes, and the timeout is set
to the size of the buffer. It is this timeout that I'm reducing to 1
byte in my proposed patch. I can't see what it would break for high
speed I/O.


Christophe


Re: powerpc/perf: hw breakpoints return ENOSPC

2012-08-16 Thread Peter Zijlstra
On Thu, 2012-08-16 at 21:17 +1000, Michael Neuling wrote:
> Peter,
> 
> > > On this second syscall, fetch_bp_busy_slots() sets slots.pinned to be 1,
> > > despite there being no breakpoint on this CPU.  This is because the call
> > > to task_bp_pinned() checks all CPUs, rather than just the current CPU.
> > > POWER7 only has one hardware breakpoint per CPU (ie. HBP_NUM=1), so we
> > > return ENOSPC.
> > 
> > I think this comes from the ptrace legacy, we register a breakpoint on
> > all cpus because when we migrate a task it cannot fail to migrate the
> > breakpoint.
> > 
> > Its one of the things I hate most about the hwbp stuff as it relates to
> > perf.
> > 
> > Frederic knows more...
> 
> Maybe I should wait for Frederic to respond but I'm not sure I
> understand what you're saying.
> 
> I can see how using ptrace hw breakpoints and perf hw breakpoints at the
> same time could be a problem, but I'm not sure how this would stop it.

ptrace uses perf for hwbp support so we're stuck with all kinds of
stupid ptrace constraints.. or somesuch.

> Are you saying that we need to keep at least 1 slot free at all times,
> so that we can use it for ptrace?

No, I'm saying perf-hwbp is weird because of ptrace, maybe the ptrace
weirdness shouldn't live in perf-hwbp but in the ptrace-perf glue
however..

> Is "perf record -e mem:0x1000 true" ever going to be able to work on
> POWER7 with only one hw breakpoint resource per CPU?  

I think it should work... but I'm fairly sure it currently doesn't
because of how things are done. 'perf record -ie mem:0x100... true'
might just work.

I always forget all the ptrace details but I am forever annoyed at the
mess that is perf-hwbp.. Frederic is there really nothing we can do
about this?

The fact that ptrace hwbp semantics are different per architecture
doesn't help of course.


Re: powerpc/perf: hw breakpoints return ENOSPC

2012-08-16 Thread Michael Neuling
Peter,

> > On this second syscall, fetch_bp_busy_slots() sets slots.pinned to be 1,
> > despite there being no breakpoint on this CPU.  This is because the call
> > to task_bp_pinned() checks all CPUs, rather than just the current CPU.
> > POWER7 only has one hardware breakpoint per CPU (ie. HBP_NUM=1), so we
> > return ENOSPC.
> 
> I think this comes from the ptrace legacy, we register a breakpoint on
> all cpus because when we migrate a task it cannot fail to migrate the
> breakpoint.
> 
> Its one of the things I hate most about the hwbp stuff as it relates to
> perf.
> 
> Frederic knows more...

Maybe I should wait for Frederic to respond but I'm not sure I
understand what you're saying.

I can see how using ptrace hw breakpoints and perf hw breakpoints at the
same time could be a problem, but I'm not sure how this would stop it.

Are you saying that we need to keep at least 1 slot free at all times,
so that we can use it for ptrace?

Is "perf record -e mem:0x1000 true" ever going to be able to work on
POWER7 with only one hw breakpoint resource per CPU?  

Thanks,
Mikey


Re: powerpc/perf: hw breakpoints return ENOSPC

2012-08-16 Thread Peter Zijlstra
On Thu, 2012-08-16 at 14:23 +1000, Michael Neuling wrote:
> 
> On this second syscall, fetch_bp_busy_slots() sets slots.pinned to be 1,
> despite there being no breakpoint on this CPU.  This is because the call
> to task_bp_pinned() checks all CPUs, rather than just the current CPU.
> POWER7 only has one hardware breakpoint per CPU (ie. HBP_NUM=1), so we
> return ENOSPC.

I think this comes from the ptrace legacy, we register a breakpoint on
all cpus because when we migrate a task it cannot fail to migrate the
breakpoint.

Its one of the things I hate most about the hwbp stuff as it relates to
perf.

Frederic knows more...