Re: [PATCH v2 1/5] pinctrl: qcom: Add ipq8074 pinctrl driver

2017-05-19 Thread Bjorn Andersson
On Thu 18 May 01:39 PDT 2017, Varadarajan Narayanan wrote:

> 
> 
> On 5/18/2017 1:03 AM, Bjorn Andersson wrote:
> > On Mon 15 May 02:05 PDT 2017, Varadarajan Narayanan wrote:
> > 
> > > On 5/14/2017 9:53 AM, Bjorn Andersson wrote:
> > > > On Thu 11 May 03:33 PDT 2017, Varadarajan Narayanan wrote:
> > > > 
> > > > > On 5/11/2017 4:13 AM, Bjorn Andersson wrote:
> > > > > > On Thu 04 May 04:53 PDT 2017, Varadarajan Narayanan wrote:
> > [..]
> > > > > > > + msm_mux_qpic_pad4,
> > > > > > 
> > > > > > What are qpic_pad and qpic_pad0 through qpic_pad8? Different 
> > > > > > functions,
> > > > > > alternative muxings...?
> > > > > 
> > > > > This is for the NAND and LCD display. The pins listed are the 9 data 
> > > > > pins.
> > > > > 
> > > > 
> > > > Then you can describe them all as "qpic_pad" (or simply "qpic"?). (It's
> > > > possible to reference a partial group in the DTS, if that's necessary)
> > > 
> > > There are two sets of 9 pins, either of which can go to NAND or LCD.
> > > Will rename qpic_pad as qpic_a and qpic_pad[0-8] as qpic_b.
> > > Is that ok?
> > > 
> > 
> > So you have NAND and LCD hardware muxed to either "a" or "b" and then
> > you mux either "a" or "b" out onto actual pins?
> > 
> > How is this first mux configured?
> > 
> > I think the a/b scheme sounds reasonable, if above is how it works.
> 
> Sorry, I was wrong. I had misread the documentation.
> 
> There are 18 pins. 15 pins are common between LCD and NAND. The QPIC
> controller arbitrates between LCD and NAND. Of the remaining 4, 2 are for
> NAND and 2 are for LCD exclusively. We plan to group the qpic pins into 3
> groups namely, qpic_common, qpic_nand and qpic_lcd. Is that ok?
> 

If you consider that you are defining the available functions for this
pinmuxer, and then defining the sets of pins that expose those available
functions, it does make sense to just name it "qpic".

I think that naming them _common, _lcd and _nand is just adding
confusion when it comes to writing the dts files.
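
For illustration, a consumer node in a board dts would then just look
something like this (the pin names here are made up, not taken from the
driver):

	qpic_default: qpic-default {
		pins = "gpio1", "gpio2", "gpio3";	/* any subset of the qpic pins */
		function = "qpic";
		bias-disable;
	};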


@Linus, do you have a different preference here?

Regards,
Bjorn


Re: [4.12 regression] Thinkpad X250 Touchpad and Trackpoint not recognized anymore; commit e839ffa: "Input: synaptics - add support for Intertouch devices"

2017-05-19 Thread Pascal Wichmann
> Looks like you're running your patched kernel?
That's right.


>>> CONFIG_RMI4_CORE=m
>>> CONFIG_RMI4_I2C=m
>>> CONFIG_RMI4_SPI=m
>>> # CONFIG_RMI4_SMB is not set
>
> This is your issue I believe.

Indeed, enabling that configuration solves that issue.
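That is, also building the SMBus transport:

	CONFIG_RMI4_SMB=m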

However, I think it is quite unintuitive that a module (psmouse) chooses
a default mode which requires another driver that is not necessarily
built; though explicitly checking for that as well would probably not be
a very clean solution.

Is this behaviour, where one module requires another without clearly
communicating it, intended?

Thanks,
Pascal




Re: [PATCH 7/7] DWARF: add the config option

2017-05-19 Thread Andy Lutomirski
On Fri, May 19, 2017 at 2:35 PM, Josh Poimboeuf  wrote:
> On Fri, May 19, 2017 at 04:29:13PM -0500, Josh Poimboeuf wrote:
>> > How are you handling control flow?
>>
>> Control flow of what?
>>
>> > > Here's the struct in its current state:
>> > >
>> > >   #define UNDWARF_REG_UNDEFINED   0
>> > >   #define UNDWARF_REG_CFA 1
>> > >   #define UNDWARF_REG_SP  2
>> > >   #define UNDWARF_REG_FP  3
>> > >   #define UNDWARF_REG_SP_INDIRECT 4
>> > >   #define UNDWARF_REG_FP_INDIRECT 5
>> > >   #define UNDWARF_REG_R10 6
>> > >   #define UNDWARF_REG_DI  7
>> > >   #define UNDWARF_REG_DX  8
>> > >
>> >
>> > Why only those registers?  Also, if you have the option I would really
>> > suggest using the actual x86 register numbers (ax, cx, dx, bx, sp, bp,
>> > si, di, r8-r15 in that order.)
>>
>> Those are the only registers which are ever needed as the base for
>> finding the previous stack frame.  99% of the time it's sp or bp, the
>> other registers are needed for aligned stacks and entry code.
>>
>> Using the actual register numbers isn't an option because I don't need
>> them all and they need to fit in a small number of bits.
>>
>> This construct might be useful for other arches, which is why I called
>> it "FP" instead of "BP".  But then I ruined that with the last 3 :-)
>
> BTW, here's the link to the unwinder code if you're interested:
>
>   
> https://github.com/jpoimboe/linux/blob/undwarf/arch/x86/kernel/unwind_undwarf.c

At the risk of potentially overcomplicating matters, here's a
suggestion.  As far as I know, all (or most all?) unwinders
effectively do the following in a loop:

1. Look up the IP to figure out how to unwind from that IP.
2. Use the results of that lookup to compute the previous frame state.

The results of step 1 could perhaps be expressed like this:

struct reg_formula {
  unsigned int source_reg :4;
  long offset;
  bool dereference;  /* true: *(reg + offset); false: (reg + offset) */
  /* For DWARF, I think this can be considerably more complicated, but
I doubt it's useful. */
};

struct unwind_step {
  u16 available_regs;  /* mask of the caller frame regs that we are
able to recover */
  struct reg_formula regs[16];
};

The CFA computation is just reg_formula[UNWIND_REG_SP] (or that plus
or minus sizeof(unsigned long) or whatever -- I can never remember
exactly what CFA refers to.)  For a frame pointer-based unwinder, the
entire unwind_step is a foregone conclusion independent of IP: SP = BP
+ 8 (or whatever), BP = *(BP + whatever), all other regs unknown.
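
To make step 2 concrete, a minimal sketch (mine, not from the patch set)
of recovering a single caller-frame register with the structs above,
using the "regs[]" member name added to unwind_step:

static bool recover_reg(const struct unwind_step *step, unsigned int reg,
			const unsigned long *cur_regs, unsigned long *out)
{
	const struct reg_formula *f = &step->regs[reg];
	unsigned long val;

	if (!(step->available_regs & (1U << reg)))
		return false;		/* caller's value is not recoverable */

	val = cur_regs[f->source_reg] + f->offset;
	if (f->dereference)
		val = *(unsigned long *)val;	/* *(reg + offset) */

	*out = val;
	return true;
}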

Could it make sense to actually structure the code this way?  I can
see a few advantages.  It would make the actual meat of the unwind
loop totally independent of the unwinding algorithm in use, it would
make the meat be dead simple (and thus easy to verify for
non-crashiness), and I think it just might open the door for a real
in-kernel DWARF unwinder that Linus would be okay with.  Specifically,
write a function like:

bool get_dwarf_step(struct unwind_step *step, unsigned long ip);

Put this function in its own file and make it buildable as kernel code
or as user code.  Write a test case that runs it on every single
address on the kernel image (in user mode!) with address-sanitizer
enabled (or in Valgrind or both) and make sure that (a) it doesn't
blow up and (b) that the results are credible (e.g. by comparing to
objtool output).  Heck, you could even fuzz-test it where the fuzzer
is allowed to corrupt the actual DWARF data.  You could do the same
thing with whatever crazy super-compacted undwarf scheme someone comes
up with down the road, too.
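
Roughly, such a user-mode harness could be as dumb as this (text_start,
text_end and the assertion are placeholders for however the test maps
the image and validates the result):

#include <assert.h>

extern unsigned long text_start, text_end;	/* bounds of the mapped image */

int main(void)
{
	struct unwind_step step;
	unsigned long ip;

	for (ip = text_start; ip < text_end; ip++) {
		if (!get_dwarf_step(&step, ip))
			continue;	/* no unwind info for this ip */
		/* sanity-check the result, e.g. against objtool output */
		assert(step.available_regs != 0);
	}
	return 0;
}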

I personally like the idea of using real DWARF annotations in the
entry code because it makes gdb work better (not kgdb -- real gdb
attached to KVM).  I bet that we could get entry asm annotations into
good shape if we really wanted to.  OTOH, getting DWARF to work well
for inline asm is really nasty IIRC.

(H.J., could we get a binutils feature that allows us to do:

pushq %whatever
.cfi_adjust_sp -8
...
popq %whatever
.cfi_adjust_sp 8

that will emit the right DWARF instructions regardless of whether
there's a frame pointer or not?  .cfi_adjust_cfa_offset is not
particularly helpful here because it's totally wrong if the CFA is
currently being computed based on BP.)


Also, you read the stack like this:

*val = *(unsigned long *)addr;

how about probe_kernel_read() instead?
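
i.e. something like:

	if (probe_kernel_read(val, (void *)addr, sizeof(*val)))
		return false;	/* bad stack address, abort the unwind */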

--Andy


RE: [PATCH] PCI: Make SR-IOV capable GPU working on the SR-IOV incapable platform

2017-05-19 Thread Cheng, Collins
Hi Alex,

Yes, I hope the kernel can disable SR-IOV and the related VF resource 
allocation if the system BIOS is not SR-IOV capable.

Adding the parameter "pci=nosriov" sounds like a doable solution, but it would 
need the user to add this parameter manually, right? I think automatic 
detection would be better. My patch is trying to auto-detect and bypass VF 
resource allocation.


-Collins Cheng


-Original Message-
From: Alexander Duyck [mailto:alexander.du...@gmail.com] 
Sent: Friday, May 19, 2017 11:44 PM
To: Alex Williamson 
Cc: Cheng, Collins ; Bjorn Helgaas 
; linux-...@vger.kernel.org; linux-kernel@vger.kernel.org; 
Deucher, Alexander ; Zytaruk, Kelly 
; Yinghai Lu 
Subject: Re: [PATCH] PCI: Make SR-IOV capable GPU working on the SR-IOV 
incapable platform

On Mon, May 15, 2017 at 10:53 AM, Alex Williamson  
wrote:
> On Mon, 15 May 2017 08:19:28 +
> "Cheng, Collins"  wrote:
>
>> Hi Williamson,
>>
>> We cannot assume the BIOS supports SR-IOV; actually only newer server 
>> motherboard BIOSes support SR-IOV. A normal desktop motherboard BIOS or an 
>> older server motherboard BIOS doesn't. This issue would happen if 
>> a user plugs our AMD SR-IOV capable GPU card into a normal desktop 
>> motherboard.
>
> Servers should be supporting SR-IOV for a long time now.  What really 
> is there to a BIOS supporting SR-IOV anyway, it's simply reserving 
> sufficient bus number and MMIO resources such that we can enable the 
> VFs.  This process isn't exclusively reserved for the BIOS.  Some 
> platforms may choose to only initialize boot devices, leaving the rest 
> for the OS to program.  The initial proposal here to disable SR-IOV if 
> not programmed at OS hand-off disables even the possibility of the OS 
> reallocating resources for this device.

There are differences between supporting SR-IOV and supporting SR-IOV on 
devices with massive resources. I know I have seen NICs that will keep a system 
from completing POST if SR-IOV is enabled and MMIO beyond 4G is not. My guess 
is that in such a setup they disable SR-IOV in the BIOS, and then run into 
issues when they try to boot into the Linux kernel, as it goes through and 
tries to allocate resources for SR-IOV even though it was disabled in the BIOS.

It might make sense to add a kernel parameter something like a "pci=nosriov" 
that would allow for disabling SR-IOV and related resource allocation if that 
is what we are talking about. That way you could plug in these types of devices 
into a system with a legacy bios or that doesn't wan to allocate addresses 
above 32b for MMIO, and this parameter would be all that is needed to disable 
SR-IOV so you could plug in a NIC that has SR-IOV associated with it.
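
Just to sketch the idea (the "nosriov" string and the flag below are
hypothetical, but pci_setup() is the existing early_param handler for
pci= options):

static bool pci_sriov_disabled;

	/* in pci_setup(), drivers/pci/pci.c */
	} else if (!strcmp(str, "nosriov")) {
		pci_sriov_disabled = true;

	/* sriov_init() could then bail out before touching any VF BARs */
	if (pci_sriov_disabled)
		return 0;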

>> I agree that failure to allocate VF resources should leave the device in no 
>> worse condition than before it tried. I hope kernel could allocate PF device 
>> resource before allocating VF device resource, and keep PF device resource 
>> valid and functional if failed to allocate VF device resource.
>>
>> I will send out the dmesg log and lspci info tomorrow. Thanks.
>
> Thanks,
> Alex
>
>> -Original Message-
>> From: Alex Williamson [mailto:alex.william...@redhat.com]
>> Sent: Friday, May 12, 2017 10:43 PM
>> To: Cheng, Collins 
>> Cc: Bjorn Helgaas ; linux-...@vger.kernel.org; 
>> linux-kernel@vger.kernel.org; Deucher, Alexander 
>> ; Zytaruk, Kelly ; 
>> Yinghai Lu 
>> Subject: Re: [PATCH] PCI: Make SR-IOV capable GPU working on the 
>> SR-IOV incapable platform
>>
>> On Fri, 12 May 2017 04:51:43 +
>> "Cheng, Collins"  wrote:
>>
>> > Hi Williamson,
>> >
>> > I verified the patch is working for both the AMD SR-IOV GPU and the Intel 
>> > SR-IOV NIC. I don't think it is redundant to check that the VF BARs are 
>> > valid before calling sriov_init(); it is safe and saves boot time, and 
>> > there is no better method to know whether the system BIOS has correctly 
>> > initialized the SR-IOV capability or not.
>>
>> It also masks an underlying bug and creates a maintenance issue, as we 
>> won't know when it's safe to remove this workaround.  I don't think faster 
>> boot is a valid rationale: in one case SR-IOV is completely disabled, in the 
>> other we attempt to allocate the resources the BIOS failed to provide.  I 
>> expect this is also a corner case; the BIOS should typically support SR-IOV, 
>> therefore this situation should be an exception.
>>
>> > I did not try to fix the issue from the kernel resource allocation 
>> > perspective, because:
>> > 1. I am not very familiar with the PCI resource allocation scheme in the 
>> > kernel. For example, in sriov_init(), the kernel will re-assign the PCI 
>> > resources for both VF and PF. I don't understand why the kernel allocates 
>> > resources for the VF first, then the PF. If it were PF first, this issue 
>> > could be avoided.
>> > 2. I am not sure if the kernel has an error handler if PCI resource 
>> > allocation failed. In this case, the kernel cannot allocate enough

Re: [PATCH v3] jbd2: preserve original nofs flag during journal restart

2017-05-19 Thread Theodore Ts'o
On Thu, May 18, 2017 at 09:28:50AM -0700, Tahsin Erdogan wrote:
> When a transaction starts, start_this_handle() saves current
> PF_MEMALLOC_NOFS value so that it can be restored at journal stop time.
> Journal restart is a special case that calls start_this_handle() without
> stopping the transaction. start_this_handle() isn't aware that the
> original value is already stored so it overwrites it with current value.
> 
> For instance, a call sequence like below leaves PF_MEMALLOC_NOFS flag set
> at the end:
> 
>   jbd2_journal_start()
>   jbd2__journal_restart()
>   jbd2_journal_stop()
> 
> Make jbd2__journal_restart() restore the original value before calling
> start_this_handle().
> 
> Fixes: 81378da64de6 ("jbd2: mark the transaction context with the scope 
> GFP_NOFS context")
> Signed-off-by: Tahsin Erdogan 
> Reviewed-by: Jan Kara 

Thanks, applied.

- Ted
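
For reference, the fix boils down to something like the following in
jbd2__journal_restart(), before it re-enters start_this_handle(); the
field name holding the saved flags is an assumption here, not quoted
from the patch:

	/* restore the context saved at jbd2_journal_start() time, so that
	 * start_this_handle() saves the caller's original flags again
	 * instead of the NOFS-marked ones */
	memalloc_nofs_restore(handle->saved_alloc_context);
	return start_this_handle(journal, handle, gfp_mask);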


Re: mm, something wrong in page_lock_anon_vma_read()?

2017-05-19 Thread zhong jiang
On 2017/5/20 10:40, Hugh Dickins wrote:
> On Sat, 20 May 2017, Xishi Qiu wrote:
>> Here is a bug report from Red Hat: 
>> https://bugzilla.redhat.com/show_bug.cgi?id=1305620
>> And I met the bug too. However it is hard to reproduce, and 
>> 624483f3ea82598 ("mm: rmap: fix use-after-free in __put_anon_vma") does not 
>> help.
>>
>> From the vmcore, it seems that the page is still mapped (_mapcount=0 and 
>> _count=2),
>> and the value of mapping is a valid address (mapping = 0xffff8801b3e2a101),
>> but anon_vma has been corrupted.
>>
>> Any ideas?
> Sorry, no.  I assume that _mapcount has been misaccounted, for example
> a pte mapped in on top of another pte; but cannot begin to tell you where
> in Red Hat's kernel-3.10.0-229.4.2.el7 that might happen.
>
> Hugh
>
> .
>
Hi, Hugh

I found the following message in dmesg.

[26068.316592] BUG: Bad rss-counter state mm:ffff8800a7de2d80 idx:1 val:1

I can prove that the _mapcount is misaccounted: when the task exits, the rmap
still exists.

Thanks
zhongjiang



Re: [RESEND: PATCH v4 2/4] remoteproc: qcom: refactor mss fw image loading sequence

2017-05-19 Thread Sricharan R
Hi Bjorn/Avaneesh,

On 5/16/2017 11:32 PM, Avaneesh Kumar Dwivedi wrote:
> This patch refactors the code to first load all firmware blobs
> and then update the modem proc to authenticate and boot the firmware.
> It also makes a trivial change in an error log.
> 
> Signed-off-by: Avaneesh Kumar Dwivedi 
> ---
>  drivers/remoteproc/qcom_q6v5_pil.c | 25 -
>  1 file changed, 12 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/remoteproc/qcom_q6v5_pil.c 
> b/drivers/remoteproc/qcom_q6v5_pil.c
> index 8fd697a..2626954 100644
> --- a/drivers/remoteproc/qcom_q6v5_pil.c
> +++ b/drivers/remoteproc/qcom_q6v5_pil.c
> @@ -466,9 +466,10 @@ static int q6v5_mpss_init_image(struct q6v5 *qproc, 
> const struct firmware *fw)
>  
>   ret = q6v5_rmb_mba_wait(qproc, RMB_MBA_META_DATA_AUTH_SUCCESS, 1000);
>   if (ret == -ETIMEDOUT)
> - dev_err(qproc->dev, "MPSS header authentication timed out\n");
> + dev_err(qproc->dev, "metadata authentication timed out\n");
>   else if (ret < 0)
> - dev_err(qproc->dev, "MPSS header authentication failed: %d\n", 
> ret);
> + dev_err(qproc->dev,
> + "metadata authentication failed: %d\n", ret);
>  
>   dma_free_attrs(qproc->dev, fw->size, ptr, phys, dma_attrs);
>  
> @@ -503,7 +504,7 @@ static int q6v5_mpss_load(struct q6v5 *qproc)
>   bool relocate = false;
>   char seg_name[10];
>   ssize_t offset;
> - size_t size;
> + size_t size = 0;
>   void *ptr;
>   int ret;
>   int i;
> @@ -541,7 +542,9 @@ static int q6v5_mpss_load(struct q6v5 *qproc)
>   }
>  
>   mpss_reloc = relocate ? min_addr : qproc->mpss_phys;
> -
> + /* Load firmware completely before letting mss to start
> +  * authentication and then boot firmware
> +  */
>   for (i = 0; i < ehdr->e_phnum; i++) {
>   phdr = [i];
>  
> @@ -574,17 +577,13 @@ static int q6v5_mpss_load(struct q6v5 *qproc)
>   memset(ptr + phdr->p_filesz, 0,
>  phdr->p_memsz - phdr->p_filesz);
>   }
> -
> - size = readl(qproc->rmb_base + RMB_PMI_CODE_LENGTH_REG);
> - if (!size) {
> - boot_addr = relocate ? qproc->mpss_phys : min_addr;
> - writel(boot_addr, qproc->rmb_base + 
> RMB_PMI_CODE_START_REG);
> - writel(RMB_CMD_LOAD_READY, qproc->rmb_base + 
> RMB_MBA_COMMAND_REG);
> - }
> -
>   size += phdr->p_memsz;
> - writel(size, qproc->rmb_base + RMB_PMI_CODE_LENGTH_REG);
>   }

So while moving this down, can we use qcom_mdt_load() instead for the mpss
image loading part above?

> + /* Transfer ownership of modem region with modem fw */
> + boot_addr = relocate ? qproc->mpss_phys : min_addr;
> + writel(boot_addr, qproc->rmb_base + RMB_PMI_CODE_START_REG);
> + writel(RMB_CMD_LOAD_READY, qproc->rmb_base + RMB_MBA_COMMAND_REG);
> + writel(size, qproc->rmb_base + RMB_PMI_CODE_LENGTH_REG);

For ipq8074 [1], the wcnss core has a Q6V5 version of the IP, for which the
initialization/boot sequence is pretty much the same as what has been added
for msm8996 in this series. So I wanted to understand whether it is better to
reuse this remoteproc driver, keeping the Q6 and mpss parts separate, or to
add a new remoteproc?

[1] https://patchwork.kernel.org/patch/9711725/

Regards,
 Sricharan

>  
>   ret = q6v5_rmb_mba_wait(qproc, RMB_MBA_AUTH_COMPLETE, 1);
>   if (ret == -ETIMEDOUT)
> 

-- 
"QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of 
Code Aurora Forum, hosted by The Linux Foundation


Re: mm, something wrong in page_lock_anon_vma_read()?

2017-05-19 Thread Hugh Dickins
On Sat, 20 May 2017, Xishi Qiu wrote:
> 
> Here is a bug report from Red Hat: 
> https://bugzilla.redhat.com/show_bug.cgi?id=1305620
> And I met the bug too. However it is hard to reproduce, and 
> 624483f3ea82598 ("mm: rmap: fix use-after-free in __put_anon_vma") does not 
> help.
> 
> From the vmcore, it seems that the page is still mapped (_mapcount=0 and 
> _count=2),
> and the value of mapping is a valid address (mapping = 0xffff8801b3e2a101),
> but anon_vma has been corrupted.
> 
> Any ideas?

Sorry, no.  I assume that _mapcount has been misaccounted, for example
a pte mapped in on top of another pte; but cannot begin to tell you where
in Red Hat's kernel-3.10.0-229.4.2.el7 that might happen.

Hugh


Re: next-20170515: WARNING: CPU: 0 PID: 1 at arch/x86/mm/dump_pagetables.c:236 note_page+0x630/0x7e0

2017-05-19 Thread Masami Hiramatsu
> > > > 0xffffffff81c00000-0xffffffff82200000    6M RW PSE  GLB NX pmd
> > > > 0xffffffff82200000-0xffffffff82400000    2M RW      GLB NX pte
> > > > 0xffffffff82400000-0xffffffffc0000000  988M                pmd
> > > > ---[ Modules ]---
> > > > 
> > > > 0xffffffffc0000000-0xffffffffc0001000    4K RW      GLB x  pte
> > > > 0xffffffffc0001000-0xffffffffc0002000    4K                pte
> > > > 0xffffffffc0002000-0xffffffffc0039000  220K RW      GLB x  pte
> > > > 
> > > > root@piggy:~# cat /proc/modules | sort -k 6 | head -3   
> > > > 
> > > > scsi_mod 221979 4 sg,sd_mod,sr_mod,libata, Live 0xffffffffc0002000 (E)
> > > > 
> > > > e1000 127757 0 - Live 0xffffffffc004d000 (E)
> > > > 
> > > > libata 229931 2 ata_generic,ata_piix, Live 0xffffffffc0076000 (E)
> > > > 
> > > > So that 4K RW page seems suspect of getting used for allocation 
> > > > purposes right on the edge, and it also happens to be on the edge of 
> > > > the high kernel mapping. Could it be a boundary semantics issue?
> > > > 
> > > > For instance, can it be that since 0xffffffffc0002000 is given to the 
> > > > first module by the allocator, scsi_mod, and since that address is 
> > > > *technically* part of two boundaries, we get a splat?
> > > > 
> > > > 0xffffffffc0001000-0xffffffffc0002000    4K                pte
> > > > 0xffffffffc0002000-0xffffffffc0039000  220K RW      GLB x  pte
> > > 
> > > Note that on the latest linux-next, and on the commit that introduced 
> > > this, the same config and kernel yield only *one* page:
> > > 
> > > x86/mm: Checked W+X mappings: FAILED, 1 W+X pages found.
> > > 
> > > I believe this is a further indication that my suspicion might be right.
> > 
> > If the following is a legit, forced way to query the kernel about who owns 
> > a page, then perhaps this technique can be used in the future to
> > figure out who the hell caused this. Catalin, can you confirm? In this
> > case this is perhaps not a leaked page, but I am trying to abuse the
> > kmemleak debugfs API to query who allocated the page. Is that fine?
> > 
> > [0.916771] WARNING: CPU: 0 PID: 1 at arch/x86/mm/dump_pagetables.c:235 
> > note_page+0x63c/0x7e0
> > [0.917636] x86/mm: Found insecure W+X mapping at address 
> > ffffffffc03d5000/0xffffffffc03d5000
> > [0.918502] Modules linked in:
> > [0.918819] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 
> > 4.11.0-mcgrof-force-config #340
> > [0.919631] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> > rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
> > [0.920011] Call Trace:
> > [0.920011]  dump_stack+0x63/0x81
> > [0.920011]  __warn+0xcb/0xf0
> > [0.920011]  warn_slowpath_fmt+0x5a/0x80
> > [0.920011]  note_page+0x63c/0x7e0
> > [0.920011]  ptdump_walk_pgd_level_core+0x3b1/0x460
> > [0.920011]  ? 0x86c0
> > [0.920011]  ptdump_walk_pgd_level_checkwx+0x17/0x20
> > [0.920011]  mark_rodata_ro+0xf4/0x100
> > [0.920011]  ? rest_init+0x80/0x80
> > [0.920011]  kernel_init+0x2a/0x100
> > [0.920011]  ret_from_fork+0x2c/0x40
> > [0.925474] ---[ end trace dca00cd779490a2b ]---
> > [0.925959] x86/mm: Checked W+X mappings: FAILED, 1 W+X pages found.
> > 
> > echo dump=0xffffffffc03d5000 > /sys/kernel/debug/kmemleak
> > dmesg | tail
> > 
> > [   49.209565] kmemleak: Object 0xffffffffc03d5000 (size 335):
> > [   49.210814] kmemleak:   comm "swapper/0", pid 1, jiffies 4294892440
> > [   49.212148] kmemleak:   min_count = 2
> > [   49.212852] kmemleak:   count = 0
> > [   49.213363] kmemleak:   flags = 0x1
> > [   49.213363] kmemleak:   checksum = 0
> > [   49.213363] kmemleak:   backtrace:
> > [   49.213363]  kmemleak_alloc+0x4a/0xa0
> > [   49.213363]  __vmalloc_node_range+0x20a/0x2b0
> > [   49.213363]  module_alloc+0x67/0xc0
> > [   49.213363]  arch_ftrace_update_trampoline+0xba/0x260
> > [   49.213363]  ftrace_startup+0x90/0x210
> > [   49.213363]  register_ftrace_function+0x4b/0x60
> > [   49.213363]  arm_kprobe+0x84/0xe0
> > [   49.213363]  register_kprobe+0x56e/0x5b0
> > [   49.213363]  init_test_probes+0x61/0x560
> > [   49.213363]  init_kprobes+0x1e3/0x206
> > [   49.213363]  do_one_initcall+0x52/0x1a0
> > [   49.213363]  kernel_init_freeable+0x178/0x200
> > [   49.213363]  kernel_init+0xe/0x100
> > [   49.213363]  ret_from_fork+0x2c/0x40
> > [   49.213363]  0x
> 
> Aha! And the winner is:
> 
> CONFIG_KPROBES_SANITY_TEST
> 
> I confirm disabling it on 4.3.0-rc3 and on linux-next next-20170519 avoids 
> the WARN.
> I also can confirm using the 'echo dump=mem-area > 
> /sys/kernel/debug/kmemleak' yields
> the same trace for both of these kernels.
> 
> So -- the above kmemleak hack seems to actually work for finding out who 
> owns that page.
> 
> Now to figure out how the hell kernel/test_kprobes.c screws around with 
> things.

Ah, that was fixed recently;

https://marc.info/?l=linux-kernel&m=149076389011850

Note that this patch depends on another patch in the series;

https://marc.info/?l=linux-kernel&m=149076370111812&w=2

Thank you,

-- 
Masami Hiramatsu <mhira...@kernel.org>


[PATCH 1/1] xilinx ps uart: Add a Kconfig option for the number of xilinx ps uarts

2017-05-19 Thread Sam Povilus
The number of xilinx ps uarts should be set by a Kconfig option instead of
using a #define. This allows the user to set the number of xilinx ps uarts
using only kconfig, without modifying kernel source.

The ps uart is used in Xilinx Zynq chips, usually in quantities maxing out
at two, but there may be other chips that use more in the future or that I
don't know about.

Signed-off-by: Sam Povilus 
---
 drivers/tty/serial/Kconfig | 9 +
 drivers/tty/serial/xilinx_uartps.c | 2 +-
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/tty/serial/Kconfig b/drivers/tty/serial/Kconfig
index 5c8850f7a2a0..fef25f17a4cc 100644
--- a/drivers/tty/serial/Kconfig
+++ b/drivers/tty/serial/Kconfig
@@ -1461,6 +1461,15 @@ config SERIAL_XILINX_PS_UART_CONSOLE
help
  Enable a Cadence UART port to be the system console.
 
+config SERIAL_XILINX_PS_UART_NR_UARTS
+   int "Maximum number of Cadence UART ports"
+   depends on SERIAL_XILINX_PS_UART
+   range 1 64
+   default 2
+   help
+ Set this to the number of Cadence UARTs in your system, or the number
+you think you might implement.
+
 config SERIAL_AR933X
tristate "AR933X serial port support"
depends on HAVE_CLK && SOC_AR933X
diff --git a/drivers/tty/serial/xilinx_uartps.c 
b/drivers/tty/serial/xilinx_uartps.c
index c0539950f8d7..a2c51c35da65 100644
--- a/drivers/tty/serial/xilinx_uartps.c
+++ b/drivers/tty/serial/xilinx_uartps.c
@@ -36,7 +36,7 @@
 #define CDNS_UART_NAME "xuartps"
#define CDNS_UART_MAJOR    0   /* use dynamic node allocation */
#define CDNS_UART_MINOR    0   /* works best with devtmpfs */
-#define CDNS_UART_NR_PORTS 2
+#define CDNS_UART_NR_PORTS CONFIG_SERIAL_XILINX_PS_UART_NR_UARTS
#define CDNS_UART_FIFO_SIZE    64  /* FIFO size */
 #define CDNS_UART_REGISTER_SPACE   0x1000
 
-- 
2.11.0
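
With the option wired up this way, a board configuration would simply set,
for example:

	CONFIG_SERIAL_XILINX_PS_UART_NR_UARTS=4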



Re: mm, something wrong in page_lock_anon_vma_read()?

2017-05-19 Thread Xishi Qiu
On 2017/5/20 10:02, Hugh Dickins wrote:

> On Sat, 20 May 2017, Xishi Qiu wrote:
>> On 2017/5/20 6:00, Hugh Dickins wrote:
>>>
>>> You're ignoring the rcu_read_lock() on entry to page_lock_anon_vma_read(),
>>> and the SLAB_DESTROY_BY_RCU (recently renamed SLAB_TYPESAFE_BY_RCU) nature
>>> of the anon_vma_cachep kmem cache.  It is not safe to muck with anon_vma->
>>> root in anon_vma_free(), others could still be looking at it.
>>>
>>> Hugh
>>>
>>
>> Hi Hugh,
>>
>> Thanks for your reply.
>>
>> SLAB_DESTROY_BY_RCU will let it call call_rcu() in free_slab(), but if the
>> anon_vma is *reused* by someone again, accessing root_anon_vma is not safe, right?
> 
> That is safe, on reuse it is still a struct anon_vma; then the test for
> !page_mapped(page) will show that it's no longer a reliable anon_vma for
> this page, so page_lock_anon_vma_read() returns NULL.
> 
> But of course, if page->_mapcount has been corrupted or misaccounted,
> it may think page_mapped(page) when actually page is not mapped,
> and the anon_vma is not good for it.
> 

Hi Hugh,

Here is a bug report from Red Hat: 
https://bugzilla.redhat.com/show_bug.cgi?id=1305620
And I met the bug too. However it is hard to reproduce, and 
624483f3ea82598 ("mm: rmap: fix use-after-free in __put_anon_vma") does not help.

From the vmcore, it seems that the page is still mapped (_mapcount=0 and 
_count=2),
and the value of mapping is a valid address (mapping = 0xffff8801b3e2a101),
but anon_vma has been corrupted.

Any ideas?

Thanks,
Xishi Qiu

>>
>> e.g. if I clean the root pointer before freeing it, then accessing root_anon_vma
>> in page_lock_anon_vma_read() is a NULL pointer access, right?
> 
> Yes, cleaning the root pointer before free may result in a NULL pointer access.
> 
> Hugh
> 
>>
>> anon_vma_free()
>>  ...
>>  anon_vma->root = NULL;
>>  kmem_cache_free(anon_vma_cachep, anon_vma);
>>  ...
>>
>> Thanks,
>> Xishi Qiu
> 
> .
> 





Re: [RFC PATCH v2 12/17] cgroup: Remove cgroup v2 no internal process constraint

2017-05-19 Thread Mike Galbraith
On Fri, 2017-05-19 at 16:38 -0400, Tejun Heo wrote:
> Hello, Waiman.
> 
> On Mon, May 15, 2017 at 09:34:11AM -0400, Waiman Long wrote:
> > The rationale behind the cgroup v2 no internal process constraint is
> > to avoid resource competition between internal processes and child
> > cgroups. However, not all controllers have problems with internal
> > process competition. Enforcing this rule may lead to unnatural process
> > hierarchies and unneeded levels for those controllers.
> 
> This isn't necessarily something we can determine by looking at the
> current state of controllers.  It's true that some controllers - pid
> and perf - inherently only care about membership of each task but at
> the same time neither really suffers from the constraint either.  CPU
> which is the problematic one here...

(+ cpuacct + cpuset)


Re: mm, something wrong in page_lock_anon_vma_read()?

2017-05-19 Thread Hugh Dickins
On Sat, 20 May 2017, Xishi Qiu wrote:
> On 2017/5/20 6:00, Hugh Dickins wrote:
> > 
> > You're ignoring the rcu_read_lock() on entry to page_lock_anon_vma_read(),
> > and the SLAB_DESTROY_BY_RCU (recently renamed SLAB_TYPESAFE_BY_RCU) nature
> > of the anon_vma_cachep kmem cache.  It is not safe to muck with anon_vma->
> > root in anon_vma_free(), others could still be looking at it.
> > 
> > Hugh
> > 
> 
> Hi Hugh,
> 
> Thanks for your reply.
> 
> SLAB_DESTROY_BY_RCU will make it call call_rcu() in free_slab(), but if the
> anon_vma is *reused* by someone again, accessing root_anon_vma is not safe, right?

That is safe, on reuse it is still a struct anon_vma; then the test for
!page_mapped(page) will show that it's no longer a reliable anon_vma for
this page, so page_lock_anon_vma_read() returns NULL.

But of course, if page->_mapcount has been corrupted or misaccounted,
it may think page_mapped(page) when actually page is not mapped,
and the anon_vma is not good for it.
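
In outline, the pattern that makes this safe is something like the
following - a simplified sketch of page_lock_anon_vma_read(), not the
exact mm/rmap.c code:

struct anon_vma *lookup_anon_vma(struct page *page)
{
	struct anon_vma *anon_vma = NULL;
	unsigned long mapping;

	rcu_read_lock();
	mapping = (unsigned long)READ_ONCE(page->mapping);
	if ((mapping & PAGE_MAPPING_ANON) == 0)
		goto out;
	/*
	 * The object may have been freed and reused meanwhile, but with
	 * SLAB_TYPESAFE_BY_RCU it is still a struct anon_vma, so its
	 * fields can be read safely under rcu_read_lock().
	 */
	anon_vma = (struct anon_vma *)(mapping - PAGE_MAPPING_ANON);
	if (!page_mapped(page)) {
		/* No longer a reliable anon_vma for this page. */
		anon_vma = NULL;
		goto out;
	}
	/* Pin it, unless it is already on its way to being freed. */
	if (!atomic_inc_not_zero(&anon_vma->refcount))
		anon_vma = NULL;
out:
	rcu_read_unlock();
	return anon_vma;
}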

> 
> e.g. if I clean the root pointer before free it, then access root_anon_vma
> in page_lock_anon_vma_read() is NULL pointer access, right?

Yes, cleaning root pointer before free may result in NULL pointer access.

Hugh

> 
> anon_vma_free()
>   ...
>   anon_vma->root = NULL;
>   kmem_cache_free(anon_vma_cachep, anon_vma);
>   ...
> 
> Thanks,
> Xishi Qiu


Re: [linux-sunxi] Re: [RFC PATCH 01/11] dt-bindings: update the binding for Allwinner H3 TVE support

2017-05-19 Thread Chen-Yu Tsai
On Sat, May 20, 2017 at 2:06 AM, Icenowy Zheng  wrote:
>
>
> On May 20, 2017 at 2:02:15 AM GMT+08:00, Maxime Ripard wrote:
>>On Thu, May 18, 2017 at 12:43:44AM +0800, Icenowy Zheng wrote:
>>> -On SoCs other than the A33 and V3s, there is one more clock
>>required:
>>> +For the following compatibles:
>>> +   * allwinner,sun5i-a13-tcon
>>> +   * allwinner,sun6i-a31-tcon
>>> +   * allwinner,sun6i-a31s-tcon
>>> +   * allwinner,sun8i-a33-tcon
>>> +   * allwinner,sun8i-v3s-tcon
>>> +there is one more clock and one more property required:
>>> + - clocks:
>>> +   - 'tcon-ch0': The clock driving the TCON channel 0
>>> + - clock-output-names: Name of the pixel clock created
>>> +
>>> +For the following compatibles:
>>> +   * allwinner,sun5i-a13-tcon
>>> +   * allwinner,sun6i-a31-tcon
>>> +   * allwinner,sun6i-a31s-tcon
>>> +   * allwinner,sun8i-h3-tcon0
>>> +there is one more clock required:
>>> - 'tcon-ch1': The clock driving the TCON channel 1
>>
>>Putting IDs in the compatible name is usually a bad idea. What is the
>>difference between the two? Only that the second one doesn't have a
>>clock?
>
> Yes.
>
>>
>>That seems highly unlikely. How does it generate the pixel clock
>>frequency?
>
> Yes, it seems impossible, but it is nevertheless the fact.
>
> There's only one CLK_TCON in H3/5, which is for TCON0.
>
> It's possible that lcd-ch1 clk is CLK_TVE, but it's still a weird situation --
> Although we have a lcd-ch1 clock, we cannot touch it, otherwise
> the TVE will refuse to work (the TVE can only work under 216MHz).

Assuming the TV encoder is like the old one, then it never had a
separate module clock. Instead its timing signals are fed from the
TCON. So CLK_TVE is likely the clock for TCON1 here.

ChenYu


Re: Q. drm/i915 shrinker, synchronize_rcu_expedited() from handlers

2017-05-19 Thread J. R. Okajima
"J. R. Okajima":
> I don't know whether the fix is good to me or not yet. I will test your
> fix, but I am busy now and my test will be a few weeks later. Other
> people may want the fix soon. So I'd suggest you reproduce the
> problem on your side. I guess "mem=1G" or "mem=512M" will make it easier
> to reproduce the problem.
> Of course, if you are sure the fix is correct, then you don't have to
> wait for my test. Release it soon for other people.

Now I am able to run my local test.
But V3 patch failed to apply onto v4.11.0.

Would you provide us two versions of the patch, one for the v4.11 series and
the other for v4.12-rcN?


J. R. Okajima


Re: [PATCH v2 1/2] Documentation: dt: Update ti,emif bindings

2017-05-19 Thread Rob Herring
On Fri, May 19, 2017 at 12:57:07PM -0500, Dave Gerlach wrote:
> Update the Texas Instruments EMIF binding document to include the device
> tree bindings for ti,emif-am3352 and ti,emif-am4372 which are used by
> the ti-emif-sram driver to provide low-level PM functionality.
> 
> Signed-off-by: Dave Gerlach 
> ---
>  .../devicetree/bindings/memory-controllers/ti/emif.txt | 18 +-
>  1 file changed, 17 insertions(+), 1 deletion(-)

I acked v1. Please add acks when posting new versions.

Rob


Re: [linux-sunxi] Re: [RFC PATCH 07/11] drm: sun4i: add support for the TV encoder in H3 SoC

2017-05-19 Thread Chen-Yu Tsai
On Sat, May 20, 2017 at 2:23 AM, Jernej Škrabec  wrote:
> Hi,
>
> On Friday, 19 May 2017 at 20:08:18 CEST, Icenowy Zheng wrote:
>> On May 20, 2017 at 2:03:30 AM GMT+08:00, Maxime Ripard wrote:
>> >On Thu, May 18, 2017 at 12:43:50AM +0800, Icenowy Zheng wrote:
>> >> Allwinner H3 features a TV encoder similar to the one in earlier
>> >
>> >SoCs,
>> >
>> >> but with some different points about clocks:
>> >> - It has a mod clock and a bus clock.
>> >> - The mod clock must be at a fixed rate to generate signal.
>> >
>> >Why?
>>
>> It's experiment result by Jernej.
>>
>> The clock rates in BSP kernel is also specially designed
>> (PLL_DE at 432MHz) in order to be able to feed the TVE.
>
> My experiments and a search through the BSP code showed that the TVE seems
> to have an additional fixed predivider of 8. So if you want to generate a
> 27 MHz clock, the unit has to be fed with 216 MHz.
>
> The TVE has only one PLL source, PLL_DE. And since 216 MHz is a bit low for
> DE2, the BSP defaults to 432 MHz for PLL_DE and uses a divider of 2 to
> generate 216 MHz. This clock is then divided by 8 internally to get the final 27 MHz.
>
> Please note that I don't have any hard evidence to support that, only
> experimental data. However, only that explanation makes sense to me.
>
> BTW, the BSP H3/H5 TV driver supports only PAL and NTSC, which both use a
> 27 MHz base clock. Further experiments are needed to check if there is any
> possibility of other resolutions by manipulating the clocks and giving
> other proper settings. I plan to do that, but not in the very near future.

You only have composite video output, and those are the only 2 standard
resolutions that make any sense.

ChenYu
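
As a quick sanity check of the numbers above (my own arithmetic, assuming
Jernej's fixed /8 predivider, which is itself only an experimental finding):

#include <stdio.h>

int main(void)
{
	unsigned long pll_de = 432000000UL;	/* BSP default for PLL_DE */
	unsigned long mod_clk = pll_de / 2;	/* 216 MHz fed to the TVE */
	unsigned long pixel_clk = mod_clk / 8;	/* internal predivider */

	/* Prints 27000000, the PAL/NTSC base clock. */
	printf("%lu\n", pixel_clk);
	return 0;
}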


[for-next][PATCH] tracing: Make sure RCU is watching before calling a stack trace

2017-05-19 Thread Steven Rostedt

I'll probably just push this to Linus tomorrow along with my other
changes.

-- Steve


  git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace.git
for-next

Head SHA1: a33d7d94eed92b23fbbc7b0de06a41b2bbaa49e3


Steven Rostedt (VMware) (1):
  tracing: Make sure RCU is watching before calling a stack trace


 kernel/trace/trace.c | 31 ++-
 1 file changed, 30 insertions(+), 1 deletion(-)
---
commit a33d7d94eed92b23fbbc7b0de06a41b2bbaa49e3
Author: Steven Rostedt (VMware) 
Date:   Fri May 12 13:15:45 2017 -0400

tracing: Make sure RCU is watching before calling a stack trace

As stack tracing now requires "rcu watching", force RCU to be watching when
recording a stack trace.

Link: http://lkml.kernel.org/r/20170512172449.879684...@goodmis.org

Acked-by: Paul E. McKenney 
Signed-off-by: Steven Rostedt (VMware) 

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index fcc9a2d774c3..1122f151466f 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -2568,7 +2568,36 @@ static inline void ftrace_trace_stack(struct trace_array *tr,
 void __trace_stack(struct trace_array *tr, unsigned long flags, int skip,
   int pc)
 {
-   __ftrace_trace_stack(tr->trace_buffer.buffer, flags, skip, pc, NULL);
+   struct ring_buffer *buffer = tr->trace_buffer.buffer;
+
+   if (rcu_is_watching()) {
+   __ftrace_trace_stack(buffer, flags, skip, pc, NULL);
+   return;
+   }
+
+   /*
+* When an NMI triggers, RCU is enabled via rcu_nmi_enter(),
+* but if the above rcu_is_watching() failed, then the NMI
+* triggered someplace critical, and rcu_irq_enter() should
+* not be called from NMI.
+*/
+   if (unlikely(in_nmi()))
+   return;
+
+   /*
+* It is possible that a function is being traced in a
+* location that RCU is not watching. A call to
+* rcu_irq_enter() will make sure that it is, but there's
+* a few internal rcu functions that could be traced
+* where that wont work either. In those cases, we just
+* do nothing.
+*/
+   if (unlikely(rcu_irq_enter_disabled()))
+   return;
+
+   rcu_irq_enter_irqson();
+   __ftrace_trace_stack(buffer, flags, skip, pc, NULL);
+   rcu_irq_exit_irqson();
 }
 
 /**


Re: mm, something wring in page_lock_anon_vma_read()?

2017-05-19 Thread Xishi Qiu
On 2017/5/20 6:00, Hugh Dickins wrote:

> On Fri, 19 May 2017, Xishi Qiu wrote:
>> On 2017/5/19 16:52, Xishi Qiu wrote:
>>> On 2017/5/18 17:46, Xishi Qiu wrote:
>>>
>>>> Hi, my system triggers this bug, and the vmcore shows the anon_vma seems
>>>> to be freed. The kernel is RHEL 7.2, and the bug is hard to reproduce, so
>>>> I don't know if it exists in mainline; any reply is welcome!

>>>
>>> When we alloc anon_vma, we will init the value of anon_vma->root,
>>> so can we set anon_vma->root to NULL when calling
>>> anon_vma_free -> kmem_cache_free(anon_vma_cachep, anon_vma);
>>>
>>> anon_vma_free()
>>> ...
>>> anon_vma->root = NULL;
>>> kmem_cache_free(anon_vma_cachep, anon_vma);
>>>
>>> I find that if we do the above, the system fails to boot - why?
>>>
>>
>> If the anon_vma was freed, we should not access the root_anon_vma, because
>> it may also be freed (e.g. anon_vma == root_anon_vma), right?
>>
>> page_lock_anon_vma_read()
>>  ...
>>  anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
>>  root_anon_vma = ACCESS_ONCE(anon_vma->root);
>>  if (down_read_trylock(&root_anon_vma->rwsem)) {  // it's not safe
>>  ...
>>  if (!atomic_inc_not_zero(&anon_vma->refcount)) {  // check anon_vma was not freed
>>  ...
>>  anon_vma_lock_read(anon_vma);  // it's safe
>>  ...
> 
> You're ignoring the rcu_read_lock() on entry to page_lock_anon_vma_read(),
> and the SLAB_DESTROY_BY_RCU (recently renamed SLAB_TYPESAFE_BY_RCU) nature
> of the anon_vma_cachep kmem cache.  It is not safe to muck with anon_vma->
> root in anon_vma_free(), others could still be looking at it.
> 
> Hugh
> 

Hi Hugh,

Thanks for your reply.

SLAB_DESTROY_BY_RCU will make it call call_rcu() in free_slab(), but if the
anon_vma is *reused* by someone again, accessing root_anon_vma is not safe, right?

e.g. if I clear the root pointer before freeing it, then accessing root_anon_vma
in page_lock_anon_vma_read() is a NULL pointer access, right?

anon_vma_free()
...
anon_vma->root = NULL;
kmem_cache_free(anon_vma_cachep, anon_vma);
...

Thanks,
Xishi Qiu

> .
> 





Re: [PATCH] mm: clarify why we want kmalloc before falling backto vmallock

2017-05-19 Thread John Hubbard

On 05/17/2017 01:09 AM, Michal Hocko wrote:

From: Michal Hocko 

While converting drm_[cm]alloc* helpers to kvmalloc* variants Chris
Wilson has wondered why we want to try kmalloc before vmalloc fallback
even for larger allocations requests. Let's clarify that one larger
physically contiguous block is less likely to fragment memory than many
scattered pages which can prevent more large blocks from being created.

Suggested-by: Chris Wilson 
Signed-off-by: Michal Hocko 
---
  mm/util.c | 5 -
  1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/util.c b/mm/util.c
index 464df3489903..87499f8119f2 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -357,7 +357,10 @@ void *kvmalloc_node(size_t size, gfp_t flags, int node)
WARN_ON_ONCE((flags & GFP_KERNEL) != GFP_KERNEL);
  
 	/*
-	 * Make sure that larger requests are not too disruptive - no OOM
+	 * We want to attempt a large physically contiguous block first because
+	 * it is less likely to fragment multiple larger blocks and therefore
+	 * contribute to a long term fragmentation less than vmalloc fallback.
+	 * However make sure that larger requests are not too disruptive - no OOM
 	 * killer and no allocation failure warnings as we have a fallback
 	 */


Thanks for adding this, it's great to have. Here's a slightly polished version of your words, if you 
like:


/*
 * We want to attempt a large physically contiguous block first because
 * it is less likely to fragment multiple larger blocks. This approach
 * therefore contributes less to long term fragmentation than a vmalloc
 * fallback would. However, make sure that larger requests are not too
 * disruptive: no OOM killer and no allocation failure warnings, as we
 * have a fallback.
 */

thanks,
john h


if (size > PAGE_SIZE) {
--
2.11.0
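
For context, the fallback strategy this comment documents looks roughly
like the following - a condensed sketch of kvmalloc_node(), not the
verbatim mm/util.c source:

void *kvmalloc_node_sketch(size_t size, gfp_t flags, int node)
{
	gfp_t kmalloc_flags = flags;
	void *ret;

	/*
	 * For larger requests, do not let the page allocator retry hard
	 * or warn on failure - the vmalloc fallback handles those cases.
	 */
	if (size > PAGE_SIZE)
		kmalloc_flags |= __GFP_NOWARN | __GFP_NORETRY;

	/* Try one physically contiguous block first; it fragments less. */
	ret = kmalloc_node(size, kmalloc_flags, node);
	if (ret || size <= PAGE_SIZE)
		return ret;

	/* Simplified: the real code goes through __vmalloc_node_flags_caller(). */
	return __vmalloc(size, flags, PAGE_KERNEL);
}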




Re: [PATCH RFC 0/3] Improve stability of system clock

2017-05-19 Thread John Stultz
On Wed, May 17, 2017 at 9:54 PM, Richard Cochran
 wrote:
> On Wed, May 17, 2017 at 04:06:07PM -0700, John Stultz wrote:
>> On Wed, May 17, 2017 at 10:22 AM, Miroslav Lichvar  
>> wrote:
>> > Is there a better way to run the timekeeping code in a userspace
>> > application? I suspect it would need something like the Linux Kernel
>> > Library project.
>>
>> I dunno. There's probably a cleaner way to go about it, but I also
>> feel like the benefit of just having the test in the kernel tree is
>> that it can be managed as a unified whole, rather then the test being
>> a separate thing and always playing catchup to kernel changes.
>
> I vaguely recall a rant on the list years ago from a Linux bigwig
> saying how we don't support that kind of thing.  But maybe it is my
> imagination.  In any case, IMHO running user space tests for chunks of
> kernel code can be quite useful.

So a few years ago I mentioned this at a testing session at, I think,
Linux Plumbers', and Rusty (CC'ed) commented that he had some netfilter
(or iptables?) simulator code that never made it upstream. However,
now that kselftests are integrated with the kernel this could change.
At least that's my memory of the discussion.

Anyway, I still think it's worth trying to submit. Worst case it's a
huge pain and we pull it back out?

thanks
-john


Re: [PATCH] x86/mm: synchronize pgd in vmemmap_free()

2017-05-19 Thread John Hubbard

Hi Jerome,

On 05/19/2017 11:01 AM, Jérôme Glisse wrote:

When we free kernel virtual map we should synchronize p4d/pud for
all the pgds to avoid any stall entry in non canonical pgd.


"any stale entry in the non-canonical pgd", is what I think you meant to type 
there.

Also, it would be nice to clarify that commit description a bit: I'm not sure what is meant here by 
a "non-canonical pgd".


Also, it seems like the reshuffling of the internals of sync_global_pgds() deserves at least some 
mention here. More below.




Signed-off-by: Jérôme Glisse 
Cc: Kirill A. Shutemov 
Cc: Andrew Morton 
Cc: Ingo Molnar 
Cc: Michal Hocko 
Cc: Mel Gorman 
---
  arch/x86/mm/init_64.c | 17 ++---
  1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index ff95fe8..df753f8 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -108,8 +108,6 @@ void sync_global_pgds(unsigned long start, unsigned long end)
 		BUILD_BUG_ON(pgd_none(*pgd_ref));
 		p4d_ref = p4d_offset(pgd_ref, address);
 
-		if (p4d_none(*p4d_ref))
-			continue;
 
 		spin_lock(&pgd_lock);
 		list_for_each_entry(page, &pgd_list, lru) {
@@ -123,12 +121,16 @@ void sync_global_pgds(unsigned long start, unsigned long end)
 			pgt_lock = &pgd_page_get_mm(page)->page_table_lock;
 			spin_lock(pgt_lock);
 
-			if (!p4d_none(*p4d_ref) && !p4d_none(*p4d))
-				BUG_ON(p4d_page_vaddr(*p4d)
-				       != p4d_page_vaddr(*p4d_ref));
-
-			if (p4d_none(*p4d))
+			if (p4d_none(*p4d_ref)) {
 				set_p4d(p4d, *p4d_ref);


Is the intention really to set p4d to a zeroed *p4d_ref, or is that a mistake?


+			} else {
+				if (!p4d_none(*p4d_ref) && !p4d_none(*p4d))


I think the code needs to be somewhat restructured, but as it stands, the above !p4d_none(*p4d_ref) 
will always be true, because first part of the if/else checked for the opposite case: 
p4d_none(*p4d_ref).  This is a side effect of moving that block of code.
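
Concretely, keeping the patch's intended semantics but dropping the dead
test, the inner block could be restructured along these lines (an untested
sketch, just to illustrate the point):

			if (p4d_none(*p4d_ref)) {
				/* Propagate the (possibly cleared) entry. */
				set_p4d(p4d, *p4d_ref);
			} else if (p4d_none(*p4d)) {
				set_p4d(p4d, *p4d_ref);
			} else {
				/* Both present: they must already agree. */
				BUG_ON(p4d_page_vaddr(*p4d) !=
				       p4d_page_vaddr(*p4d_ref));
			}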



+					BUG_ON(p4d_page_vaddr(*p4d)
+					       != p4d_page_vaddr(*p4d_ref));
+
+				if (p4d_none(*p4d))
+					set_p4d(p4d, *p4d_ref);
+			}
 
 			spin_unlock(pgt_lock);
 		}
@@ -1024,6 +1026,7 @@ remove_pagetable(unsigned long start, unsigned long end, bool direct)
  void __ref vmemmap_free(unsigned long start, unsigned long end)
  {
remove_pagetable(start, end, false);
+   sync_global_pgds(start, end - 1);


This does fix the HMM crash that I was seeing in hmm-next.

thanks,
John Hubbard
NVIDIA


  }
  
  #ifdef CONFIG_MEMORY_HOTREMOVE

--
2.4.11




[PATCH v2 2/7] arch/sparc: Remove the check #ifndef __LINUX_SPINLOCK_TYPES_H

2017-05-19 Thread Babu Moger
Remove the unnecessary "ifndef __LINUX_SPINLOCK_TYPES_H" stanza from SPARC.

Signed-off-by: Babu Moger 
Suggested-by: David Miller 
---
 arch/sparc/include/asm/spinlock_types.h |4 
 1 files changed, 0 insertions(+), 4 deletions(-)

diff --git a/arch/sparc/include/asm/spinlock_types.h 
b/arch/sparc/include/asm/spinlock_types.h
index 9c454fd..019c085 100644
--- a/arch/sparc/include/asm/spinlock_types.h
+++ b/arch/sparc/include/asm/spinlock_types.h
@@ -1,10 +1,6 @@
 #ifndef __SPARC_SPINLOCK_TYPES_H
 #define __SPARC_SPINLOCK_TYPES_H
 
-#ifndef __LINUX_SPINLOCK_TYPES_H
-# error "please don't include this file directly"
-#endif
-
 typedef struct {
volatile unsigned char lock;
 } arch_spinlock_t;
-- 
1.7.1



[PATCH v2 0/7] Enable queued rwlock and queued spinlock for SPARC

2017-05-19 Thread Babu Moger
This series of patches enables queued rwlock and queued spinlock support
for SPARC. These features were introduced some time ago in upstream.
Here are some of the earlier discussions.
https://lwn.net/Articles/572765/
https://lwn.net/Articles/582200/
https://lwn.net/Articles/561775/
https://lwn.net/Articles/590243/

Tests: Ran AIM7 benchmark to verify the performance on various workloads.
https://github.com/davidlohr/areaim. Same benchmark was used when this
feature was introduced and enabled on x86. Here are the test results.

Kernel                          4.11.0-rc6         4.11.0-rc6 +       Change
                                baseline           queued locks
                                (Avg No. of jobs)  (Avg No. of jobs)
Workload
High systime 10-100 users       17290.48           17295.18           +0.02
High systime 200-1000 users     109814.95          110248.87          +0.39
High systime 1200-2000 users    107912.40          127923.16          +18.54

Disk IO 10-100 users            168910.16          158834.17          -5.96
Disk IO 200-1000 users          242781.74          281285.80          +15.85
Disk IO 1200-2000 users         228518.23          218421.23          -4.41

Disk IO 10-100 users            183933.77          207928.67          +13.04
Disk IO 200-1000 users          491981.56          500162.33          +1.66
Disk IO 1200-2000 users         463395.66          467312.70          +0.84

fserver 10-100 users            254177.53          270283.08          +6.33
fserver IO 200-1000 users       269017.35          324812.2           +20.74
fserver IO 1200-2000 users      229538.87          284713.77          +24.03

The Disk I/O results are a little bit in negative territory, but the majority
of the performance changes are positive, and significantly so in some cases.

Changes:
v1->v2:
Addressed the comments from David Miller.
1. Added CPU_BIG_ENDIAN for all SPARC
2. Removed #ifndef __LINUX_SPINLOCK_TYPES_H guard from spinlock_types.h
3. Removed the check for CONFIG_QUEUED_RWLOCKS in SPARC64 as it is the
   default definition for SPARC64 now. Cleaned up the previous arch_read_xxx
   and arch_write_xxx definitions as they are now defined in qrwlock.h.
4. Removed the check for CONFIG_QUEUED_SPINLOCKS in SPARC64 as it is now the
   default definition for SPARC64. Cleaned up the previous arch_spin_xxx
   definitions as they are defined in qspinlock.h.

v1: Initial version

Babu Moger (7):
  kernel/locking: Fix compile error with qrwlock.c
  arch/sparc: Remove the check #ifndef __LINUX_SPINLOCK_TYPES_H
  arch/sparc: Define config parameter CPU_BIG_ENDIAN
  arch/sparc: Introduce cmpxchg_u8 SPARC
  arch/sparc: Enable queued rwlocks for SPARC
  arch/sparc: Introduce xchg16 for SPARC
  arch/sparc: Enable queued spinlock support for SPARC

 arch/sparc/Kconfig  |6 +
 arch/sparc/include/asm/cmpxchg_64.h |   76 ++--
 arch/sparc/include/asm/qrwlock.h|7 +
 arch/sparc/include/asm/qspinlock.h  |7 +
 arch/sparc/include/asm/spinlock_64.h|  208 +--
 arch/sparc/include/asm/spinlock_types.h |   12 ++-
 include/asm-generic/qrwlock_types.h |6 +-
 kernel/locking/qrwlock.c|1 +
 8 files changed, 101 insertions(+), 222 deletions(-)
 create mode 100644 arch/sparc/include/asm/qrwlock.h
 create mode 100644 arch/sparc/include/asm/qspinlock.h



[PATCH v2 7/7] arch/sparc: Enable queued spinlock support for SPARC

2017-05-19 Thread Babu Moger
This patch makes the necessary changes in SPARC architecture to enable
queued spinlock support. Here are some of the earlier discussions about
this feature.
https://lwn.net/Articles/561775/
https://lwn.net/Articles/590243/

Cleaned up spinlock_64.h. The definitions of arch_spin_xxx are
replaced by the functions in asm-generic/qspinlock.h.

Signed-off-by: Babu Moger 
Reviewed-by: Håkon Bugge 
Reviewed-by: Jane Chu 
Reviewed-by: Shannon Nelson 
Reviewed-by: Vijay Kumar 
---
 arch/sparc/Kconfig  |1 +
 arch/sparc/include/asm/qspinlock.h  |7 +++
 arch/sparc/include/asm/spinlock_64.h|   84 +--
 arch/sparc/include/asm/spinlock_types.h |5 ++
 4 files changed, 14 insertions(+), 83 deletions(-)
 create mode 100644 arch/sparc/include/asm/qspinlock.h

diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index 9ec1d2f..d4a24ea 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -83,6 +83,7 @@ config SPARC64
select ARCH_SUPPORTS_ATOMIC_RMW
select HAVE_NMI
select ARCH_USE_QUEUED_RWLOCKS
+   select ARCH_USE_QUEUED_SPINLOCKS
 
 config ARCH_DEFCONFIG
string
diff --git a/arch/sparc/include/asm/qspinlock.h 
b/arch/sparc/include/asm/qspinlock.h
new file mode 100644
index 000..5ae9a28
--- /dev/null
+++ b/arch/sparc/include/asm/qspinlock.h
@@ -0,0 +1,7 @@
+#ifndef _ASM_SPARC_QSPINLOCK_H
+#define _ASM_SPARC_QSPINLOCK_H
+
+#include <asm-generic/qspinlock_types.h>
+#include <asm-generic/qspinlock.h>
+
+#endif /* _ASM_SPARC_QSPINLOCK_H */
diff --git a/arch/sparc/include/asm/spinlock_64.h 
b/arch/sparc/include/asm/spinlock_64.h
index 8901c2d..f7028f5 100644
--- a/arch/sparc/include/asm/spinlock_64.h
+++ b/arch/sparc/include/asm/spinlock_64.h
@@ -11,89 +11,7 @@
 #include <asm/processor.h>
 #include <asm/barrier.h>
 #include <asm/qrwlock.h>
-
-/* To get debugging spinlocks which detect and catch
- * deadlock situations, set CONFIG_DEBUG_SPINLOCK
- * and rebuild your kernel.
- */
-
-/* Because we play games to save cycles in the non-contention case, we
- * need to be extra careful about branch targets into the "spinning"
- * code.  They live in their own section, but the newer V9 branches
- * have a shorter range than the traditional 32-bit sparc branch
- * variants.  The rule is that the branches that go into and out of
- * the spinner sections must be pre-V9 branches.
- */
-
-#define arch_spin_is_locked(lp)((lp)->lock != 0)
-
-static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
-{
-	smp_cond_load_acquire(&lock->lock, !VAL);
-}
-
-static inline void arch_spin_lock(arch_spinlock_t *lock)
-{
-   unsigned long tmp;
-
-   __asm__ __volatile__(
-"1:ldstub  [%1], %0\n"
-"  brnz,pn %0, 2f\n"
-"   nop\n"
-"  .subsection 2\n"
-"2:ldub[%1], %0\n"
-"  brnz,pt %0, 2b\n"
-"   nop\n"
-"  ba,a,pt %%xcc, 1b\n"
-"  .previous"
-   : "=" (tmp)
-   : "r" (lock)
-   : "memory");
-}
-
-static inline int arch_spin_trylock(arch_spinlock_t *lock)
-{
-   unsigned long result;
-
-   __asm__ __volatile__(
-"  ldstub  [%1], %0\n"
-   : "=r" (result)
-   : "r" (lock)
-   : "memory");
-
-   return (result == 0UL);
-}
-
-static inline void arch_spin_unlock(arch_spinlock_t *lock)
-{
-   __asm__ __volatile__(
-"  stb %%g0, [%0]"
-   : /* No outputs */
-   : "r" (lock)
-   : "memory");
-}
-
-static inline void arch_spin_lock_flags(arch_spinlock_t *lock, unsigned long flags)
-{
-   unsigned long tmp1, tmp2;
-
-   __asm__ __volatile__(
-"1:ldstub  [%2], %0\n"
-"  brnz,pn %0, 2f\n"
-"   nop\n"
-"  .subsection 2\n"
-"2:rdpr%%pil, %1\n"
-"  wrpr%3, %%pil\n"
-"3:ldub[%2], %0\n"
-"  brnz,pt %0, 3b\n"
-"   nop\n"
-"  ba,pt   %%xcc, 1b\n"
-"   wrpr   %1, %%pil\n"
-"  .previous"
-   : "=" (tmp1), "=" (tmp2)
-   : "r"(lock), "r"(flags)
-   : "memory");
-}
+#include <asm/qspinlock.h>
 
 #define arch_read_lock_flags(p, f) arch_read_lock(p)
 #define arch_write_lock_flags(p, f) arch_write_lock(p)
diff --git a/arch/sparc/include/asm/spinlock_types.h 
b/arch/sparc/include/asm/spinlock_types.h
index 64fce21..bce8ef4 100644
--- a/arch/sparc/include/asm/spinlock_types.h
+++ b/arch/sparc/include/asm/spinlock_types.h
@@ -1,11 +1,16 @@
 #ifndef __SPARC_SPINLOCK_TYPES_H
 #define __SPARC_SPINLOCK_TYPES_H
 
+#ifdef CONFIG_QUEUED_SPINLOCKS
+#include <asm-generic/qspinlock_types.h>
+#else
+
 typedef struct {
volatile unsigned char lock;
 } arch_spinlock_t;
 
 #define __ARCH_SPIN_LOCK_UNLOCKED  { 0 }
+#endif /* CONFIG_QUEUED_SPINLOCKS */
 
 #ifdef CONFIG_QUEUED_RWLOCKS
 #include <asm-generic/qrwlock_types.h>
-- 
1.7.1



[PATCH v2 6/7] arch/sparc: Introduce xchg16 for SPARC

2017-05-19 Thread Babu Moger
SPARC currently supports 32 bit and 64 bit xchg. Add support for
16 bit (2 byte) xchg. This is required by the queued spinlock
feature, which uses a 2 byte xchg. It is achieved using 4 byte cas
instructions with byte manipulations.

Also rearranged the code to call __cmpxchg_u32 inside xchg16.
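
To illustrate the byte-selection math (a worked example of my own, not part
of the patch): SPARC is big-endian, so the halfword at the lower address
occupies the high 16 bits of the containing aligned 32-bit word.

#include <stdio.h>

int main(void)
{
	unsigned long m0 = 0x1000;	/* (m0 & 2) == 0 -> upper halfword */
	unsigned long m2 = 0x1002;	/* (m2 & 2) == 2 -> lower halfword */

	/* The same expression the patch uses for bit_shift. */
	printf("%d\n", (int)(((m0 & 2) ^ 2) << 3));	/* prints 16 */
	printf("%d\n", (int)(((m2 & 2) ^ 2) << 3));	/* prints 0 */
	return 0;
}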

Signed-off-by: Babu Moger 
Reviewed-by: Håkon Bugge 
Reviewed-by: Steven Sistare 
Reviewed-by: Shannon Nelson 
Reviewed-by: Jane Chu 
Reviewed-by: Vijay Kumar 
---
 arch/sparc/include/asm/cmpxchg_64.h |   49 +++---
 1 files changed, 39 insertions(+), 10 deletions(-)

diff --git a/arch/sparc/include/asm/cmpxchg_64.h 
b/arch/sparc/include/asm/cmpxchg_64.h
index 000f7d7..4028f4f 100644
--- a/arch/sparc/include/asm/cmpxchg_64.h
+++ b/arch/sparc/include/asm/cmpxchg_64.h
@@ -6,6 +6,17 @@
 #ifndef __ARCH_SPARC64_CMPXCHG__
 #define __ARCH_SPARC64_CMPXCHG__
 
+static inline unsigned long
+__cmpxchg_u32(volatile int *m, int old, int new)
+{
+   __asm__ __volatile__("cas [%2], %3, %0"
+: "=" (new)
+: "0" (new), "r" (m), "r" (old)
+: "memory");
+
+   return new;
+}
+
 static inline unsigned long xchg32(__volatile__ unsigned int *m, unsigned int val)
 {
unsigned long tmp1, tmp2;
@@ -44,10 +55,38 @@ static inline unsigned long xchg64(__volatile__ unsigned long *m, unsigned long val)
 
 void __xchg_called_with_bad_pointer(void);
 
+/*
+ * Use 4 byte cas instruction to achieve 2 byte xchg. Main logic
+ * here is to get the bit shift of the byte we are interested in.
+ * The XOR is handy for reversing the bits for big-endian byte order.
+ */
+static inline unsigned long
+xchg16(__volatile__ unsigned short *m, unsigned short val)
+{
+   unsigned long maddr = (unsigned long)m;
+   int bit_shift = (((unsigned long)m & 2) ^ 2) << 3;
+	unsigned int mask = 0xffff << bit_shift;
+   unsigned int *ptr = (unsigned int  *) (maddr & ~2);
+   unsigned int old32, new32, load32;
+
+   /* Read the old value */
+   load32 = *ptr;
+
+   do {
+   old32 = load32;
+   new32 = (load32 & (~mask)) | val << bit_shift;
+   load32 = __cmpxchg_u32(ptr, old32, new32);
+   } while (load32 != old32);
+
+   return (load32 & mask) >> bit_shift;
+}
+
 static inline unsigned long __xchg(unsigned long x, __volatile__ void * ptr,
   int size)
 {
switch (size) {
+   case 2:
+   return xchg16(ptr, x);
case 4:
return xchg32(ptr, x);
case 8:
@@ -65,16 +104,6 @@ static inline unsigned long __xchg(unsigned long x, __volatile__ void * ptr,
 
 #include <asm-generic/cmpxchg-local.h>
 
-static inline unsigned long
-__cmpxchg_u32(volatile int *m, int old, int new)
-{
-   __asm__ __volatile__("cas [%2], %3, %0"
-: "=" (new)
-: "0" (new), "r" (m), "r" (old)
-: "memory");
-
-   return new;
-}
 
 static inline unsigned long
 __cmpxchg_u64(volatile long *m, unsigned long old, unsigned long new)
-- 
1.7.1



[PATCH v2 5/7] arch/sparc: Enable queued rwlocks for SPARC

2017-05-19 Thread Babu Moger
Enable queued rwlocks for SPARC. Here are the discussions on this feature
when this was introduced.
https://lwn.net/Articles/572765/
https://lwn.net/Articles/582200/

Cleaned up the arch_read_xxx and arch_write_xxx definitions in spinlock_64.h.
These routines are replaced by the functions in include/asm-generic/qrwlock.h

Signed-off-by: Babu Moger 
Reviewed-by: Håkon Bugge 
Reviewed-by: Jane Chu 
Reviewed-by: Shannon Nelson 
Reviewed-by: Vijay Kumar 
---
 arch/sparc/Kconfig  |1 +
 arch/sparc/include/asm/qrwlock.h|7 ++
 arch/sparc/include/asm/spinlock_64.h|  124 +--
 arch/sparc/include/asm/spinlock_types.h |5 +-
 4 files changed, 13 insertions(+), 124 deletions(-)
 create mode 100644 arch/sparc/include/asm/qrwlock.h

diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index 0f5813b..9ec1d2f 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -82,6 +82,7 @@ config SPARC64
select HAVE_ARCH_AUDITSYSCALL
select ARCH_SUPPORTS_ATOMIC_RMW
select HAVE_NMI
+   select ARCH_USE_QUEUED_RWLOCKS
 
 config ARCH_DEFCONFIG
string
diff --git a/arch/sparc/include/asm/qrwlock.h b/arch/sparc/include/asm/qrwlock.h
new file mode 100644
index 000..d68a4b1
--- /dev/null
+++ b/arch/sparc/include/asm/qrwlock.h
@@ -0,0 +1,7 @@
+#ifndef _ASM_SPARC_QRWLOCK_H
+#define _ASM_SPARC_QRWLOCK_H
+
+#include 
+#include 
+
+#endif /* _ASM_SPARC_QRWLOCK_H */
diff --git a/arch/sparc/include/asm/spinlock_64.h 
b/arch/sparc/include/asm/spinlock_64.h
index 07c9f2e..8901c2d 100644
--- a/arch/sparc/include/asm/spinlock_64.h
+++ b/arch/sparc/include/asm/spinlock_64.h
@@ -10,6 +10,7 @@
 
 #include 
 #include 
+#include 
 
 /* To get debugging spinlocks which detect and catch
  * deadlock situations, set CONFIG_DEBUG_SPINLOCK
@@ -94,132 +95,9 @@ static inline void arch_spin_lock_flags(arch_spinlock_t 
*lock, unsigned long fla
: "memory");
 }
 
-/* Multi-reader locks, these are much saner than the 32-bit Sparc ones... */
-
-static inline void arch_read_lock(arch_rwlock_t *lock)
-{
-   unsigned long tmp1, tmp2;
-
-   __asm__ __volatile__ (
-"1:ldsw[%2], %0\n"
-"  brlz,pn %0, 2f\n"
-"4: add%0, 1, %1\n"
-"  cas [%2], %0, %1\n"
-"  cmp %0, %1\n"
-"  bne,pn  %%icc, 1b\n"
-"   nop\n"
-"  .subsection 2\n"
-"2:ldsw[%2], %0\n"
-"  brlz,pt %0, 2b\n"
-"   nop\n"
-"  ba,a,pt %%xcc, 4b\n"
-"  .previous"
-   : "=" (tmp1), "=" (tmp2)
-   : "r" (lock)
-   : "memory");
-}
-
-static inline int arch_read_trylock(arch_rwlock_t *lock)
-{
-   int tmp1, tmp2;
-
-   __asm__ __volatile__ (
-"1:ldsw[%2], %0\n"
-"  brlz,a,pn   %0, 2f\n"
-"   mov0, %0\n"
-"  add %0, 1, %1\n"
-"  cas [%2], %0, %1\n"
-"  cmp %0, %1\n"
-"  bne,pn  %%icc, 1b\n"
-"   mov1, %0\n"
-"2:"
-   : "=" (tmp1), "=" (tmp2)
-   : "r" (lock)
-   : "memory");
-
-   return tmp1;
-}
-
-static inline void arch_read_unlock(arch_rwlock_t *lock)
-{
-   unsigned long tmp1, tmp2;
-
-   __asm__ __volatile__(
-"1:lduw[%2], %0\n"
-"  sub %0, 1, %1\n"
-"  cas [%2], %0, %1\n"
-"  cmp %0, %1\n"
-"  bne,pn  %%xcc, 1b\n"
-"   nop"
-   : "=" (tmp1), "=" (tmp2)
-   : "r" (lock)
-   : "memory");
-}
-
-static inline void arch_write_lock(arch_rwlock_t *lock)
-{
-   unsigned long mask, tmp1, tmp2;
-
-   mask = 0x8000UL;
-
-   __asm__ __volatile__(
-"1:lduw[%2], %0\n"
-"  brnz,pn %0, 2f\n"
-"4: or %0, %3, %1\n"
-"  cas [%2], %0, %1\n"
-"  cmp %0, %1\n"
-"  bne,pn  %%icc, 1b\n"
-"   nop\n"
-"  .subsection 2\n"
-"2:lduw[%2], %0\n"
-"  brnz,pt %0, 2b\n"
-"   nop\n"
-"  ba,a,pt %%xcc, 4b\n"
-"  .previous"
-   : "=" (tmp1), "=" (tmp2)
-   : "r" (lock), "r" (mask)
-   : "memory");
-}
-
-static inline void arch_write_unlock(arch_rwlock_t *lock)
-{
-   __asm__ __volatile__(
-"  stw %%g0, [%0]"
-   : /* no outputs */
-   : "r" (lock)
-   : "memory");
-}
-
-static inline int arch_write_trylock(arch_rwlock_t *lock)
-{
-   unsigned long mask, tmp1, tmp2, result;
-
-   mask = 0x8000UL;
-
-   __asm__ __volatile__(
-"  mov 0, %2\n"
-"1:lduw[%3], %0\n"
-"  brnz,pn %0, 2f\n"
-"   or %0, %4, %1\n"
-"  cas [%3], %0, %1\n"
-"  cmp %0, %1\n"
-"  bne,pn  %%icc, 1b\n"
-"   nop\n"
-"  mov 1, %2\n"
-"2:"
-   : 

[PATCH v2 3/7] arch/sparc: Define config parameter CPU_BIG_ENDIAN

2017-05-19 Thread Babu Moger
Found this problem while enabling queued rwlocks on SPARC.
The parameter CONFIG_CPU_BIG_ENDIAN is used to clear the
specific byte in the qrwlock structure. Without this parameter,
we clear the wrong byte. Here is the code.

static inline u8 *__qrwlock_write_byte(struct qrwlock *lock)
 {
return (u8 *)lock + 3 * IS_BUILTIN(CONFIG_CPU_BIG_ENDIAN);
 }

Define CPU_BIG_ENDIAN for SPARC to fix it.

Signed-off-by: Babu Moger 
Reviewed-by: Håkon Bugge 
Reviewed-by: Jane Chu 
Reviewed-by: Shannon Nelson 
Reviewed-by: Vijay Kumar 
---
 arch/sparc/Kconfig | 4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index a2ad946..0f5813b 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -91,6 +91,10 @@ config ARCH_DEFCONFIG
 config ARCH_PROC_KCORE_TEXT
def_bool y
 
+config CPU_BIG_ENDIAN
+   bool
+   default y if SPARC
+
 config ARCH_ATU
bool
default y if SPARC64
-- 
1.7.1
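
A quick way to see why the offset matters (hypothetical user-space C, not
kernel code): the writer byte of the qrwlock word is its least significant
byte, which lands at in-memory offset 0 on little-endian but offset 3 on
big-endian machines such as SPARC, hence the "+ 3" selected by
IS_BUILTIN(CONFIG_CPU_BIG_ENDIAN).

#include <stdio.h>
#include <string.h>

int main(void)
{
	unsigned int cnts = 0xff;	/* writer byte set in the lock word */
	unsigned char bytes[4];
	int i;

	memcpy(bytes, &cnts, sizeof(cnts));
	/* big-endian prints the 0xff at offset 3, little-endian at offset 0 */
	for (i = 0; i < 4; i++)
		printf("offset %d: 0x%02x\n", i, bytes[i]);
	return 0;
}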



[PATCH v2 4/7] arch/sparc: Introduce cmpxchg_u8 SPARC

2017-05-19 Thread Babu Moger
SPARC supports 32 bit and 64 bit cmpxchg right now. Add support
for 8 bit (1 byte) cmpxchg. This is required to support the queued
rwlocks feature, which uses a 1 byte cmpxchg.

The function __cmpxchg_u8 here uses the 4 byte cas instruction with a
byte manipulation to achieve 1 byte cmpxchg.

Signed-off-by: Babu Moger 
Reviewed-by: Håkon Bugge 
Reviewed-by: Steve Sistare 
Reviewed-by: Shannon Nelson 
Reviewed-by: Jane Chu 
Reviewed-by: Vijay Kumar 
---
 arch/sparc/include/asm/cmpxchg_64.h |   29 +++++++++++++++++++++++++++++
 1 files changed, 29 insertions(+), 0 deletions(-)

diff --git a/arch/sparc/include/asm/cmpxchg_64.h b/arch/sparc/include/asm/cmpxchg_64.h
index faa2f61..000f7d7 100644
--- a/arch/sparc/include/asm/cmpxchg_64.h
+++ b/arch/sparc/include/asm/cmpxchg_64.h
@@ -87,6 +87,33 @@ static inline unsigned long __xchg(unsigned long x, __volatile__ void * ptr,
return new;
 }
 
+/*
+ * Use 4 byte cas instruction to achieve 1 byte cmpxchg. Main logic
+ * here is to get the bit shift of the byte we are interested in.
+ * The XOR is handy for reversing the bits for big-endian byte order
+ */
+static inline unsigned long
+__cmpxchg_u8(volatile unsigned char *m, unsigned char old, unsigned char new)
+{
+   unsigned long maddr = (unsigned long)m;
+   int bit_shift = (((unsigned long)m & 3) ^ 3) << 3;
+   unsigned int mask = 0xff << bit_shift;
+   unsigned int *ptr = (unsigned int *) (maddr & ~3);
+   unsigned int old32, new32, load;
+   unsigned int load32 = *ptr;
+
+   do {
+   new32 = (load32 & ~mask) | (new << bit_shift);
+   old32 = (load32 & ~mask) | (old << bit_shift);
+   load32 = __cmpxchg_u32(ptr, old32, new32);
+   if (load32 == old32)
+   return old;
+   load = (load32 & mask) >> bit_shift;
+   } while (load == old);
+
+   return load;
+}
+
 /* This function doesn't exist, so you'll get a linker error
if something tries to do an invalid cmpxchg().  */
 void __cmpxchg_called_with_bad_pointer(void);
@@ -95,6 +122,8 @@ static inline unsigned long __xchg(unsigned long x, __volatile__ void * ptr,
 __cmpxchg(volatile void *ptr, unsigned long old, unsigned long new, int size)
 {
switch (size) {
+   case 1:
+   return __cmpxchg_u8(ptr, old, new);
case 4:
return __cmpxchg_u32(ptr, old, new);
case 8:
-- 
1.7.1
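
The same technique can be sketched portably with C11 atomics (hypothetical
user-space code, not the kernel implementation; the XOR with 3 assumes the
big-endian layout used in the patch, and the pointer casts cut corners on
aliasing for brevity):

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

static uint8_t cmpxchg_u8_emul(uint8_t *m, uint8_t old, uint8_t new)
{
	uintptr_t addr = (uintptr_t)m;
	int shift = ((addr & 3) ^ 3) << 3;	/* big-endian byte position */
	uint32_t mask = 0xffu << shift;
	_Atomic uint32_t *word = (_Atomic uint32_t *)(addr & ~(uintptr_t)3);
	uint32_t seen = atomic_load(word);

	for (;;) {
		uint32_t old32 = (seen & ~mask) | ((uint32_t)old << shift);
		uint32_t new32 = (seen & ~mask) | ((uint32_t)new << shift);

		if (atomic_compare_exchange_strong(word, &old32, new32))
			return old;	/* success: our byte matched */
		seen = old32;		/* old32 now holds the observed word */
		if ((uint8_t)((seen & mask) >> shift) != old)
			return (seen & mask) >> shift;	/* byte differs: fail */
		/* our byte still matched, a neighbouring byte raced: retry */
	}
}

int main(void)
{
	_Atomic uint32_t word = 0;
	uint8_t *b = (uint8_t *)&word;
	uint8_t prev = cmpxchg_u8_emul(&b[1], 0, 0x5a);

	printf("prev=0x%02x word=0x%08x\n", prev, atomic_load(&word));
	return 0;
}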



[PATCH v2 1/7] kernel/locking: Fix compile error with qrwlock.c

2017-05-19 Thread Babu Moger
Some architectures use the following guard in the include file
"asm/spinlock_types.h" to discourage including the file directly.

Saw these compile errors on SPARC when queued rwlock feature is enabled.

  CC  kernel/locking/qrwlock.o
In file included from ./include/asm-generic/qrwlock_types.h:5,
 from ./arch/sparc/include/asm/qrwlock.h:4,
 from kernel/locking/qrwlock.c:24:
./arch/sparc/include/asm/spinlock_types.h:5:3: error:
#error "please don't include this file directly"

Re-arrange the includes in qrwlock_types.h and include spinlock.h
in qrwlock.c to fix it.

Also will be removing this stanza from SPARC. Stay tuned.

Signed-off-by: Babu Moger 
Reviewed-by: Håkon Bugge 
Reviewed-by: Jane Chu 
Reviewed-by: Shannon Nelson 
Reviewed-by: Vijay Kumar 
---
 include/asm-generic/qrwlock_types.h |6 +++---
 kernel/locking/qrwlock.c|1 +
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/include/asm-generic/qrwlock_types.h b/include/asm-generic/qrwlock_types.h
index 0abc6b6..3988c7d 100644
--- a/include/asm-generic/qrwlock_types.h
+++ b/include/asm-generic/qrwlock_types.h
@@ -1,9 +1,6 @@
 #ifndef __ASM_GENERIC_QRWLOCK_TYPES_H
 #define __ASM_GENERIC_QRWLOCK_TYPES_H
 
-#include <linux/types.h>
-#include <asm/spinlock_types.h>
-
 /*
  * The queue read/write lock data structure
  */
@@ -18,4 +15,7 @@
.wait_lock = __ARCH_SPIN_LOCK_UNLOCKED, \
 }
 
+#include <linux/types.h>
+#include <asm/spinlock_types.h>
+
 #endif /* __ASM_GENERIC_QRWLOCK_TYPES_H */
diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
index cc3ed0c..2655f26 100644
--- a/kernel/locking/qrwlock.c
+++ b/kernel/locking/qrwlock.c
@@ -20,6 +20,7 @@
 #include <linux/cpumask.h>
 #include <linux/percpu.h>
 #include <linux/hardirq.h>
+#include <linux/spinlock.h>
 #include <asm/qrwlock.h>
 
 /*
-- 
1.7.1
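
For context, the guard stanza in question sits at the top of the
architecture's spinlock_types.h and looks roughly like this (a sketch; the
exact macro name varies per architecture). Any translation unit that pulls
the header in directly, instead of via linux/spinlock_types.h which defines
the guard macro first, trips the #error at build time:

#ifndef __LINUX_SPINLOCK_TYPES_H
# error "please don't include this file directly"
#endif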



[PATCH v2 5/7] arch/sparc: Enable queued rwlocks for SPARC

2017-05-19 Thread Babu Moger
Enable queued rwlocks for SPARC. Here are the discussions on this feature
when this was introduced.
https://lwn.net/Articles/572765/
https://lwn.net/Articles/582200/

Cleaned up the arch_read_xxx and arch_write_xxx definitions in spinlock_64.h.
These routines are replaced by the functions in include/asm-generic/qrwlock.h.

Signed-off-by: Babu Moger 
Reviewed-by: Håkon Bugge 
Reviewed-by: Jane Chu 
Reviewed-by: Shannon Nelson 
Reviewed-by: Vijay Kumar 
---
 arch/sparc/Kconfig                      |    1 +
 arch/sparc/include/asm/qrwlock.h        |    7 ++
 arch/sparc/include/asm/spinlock_64.h    |  124 +--
 arch/sparc/include/asm/spinlock_types.h |    5 +-
 4 files changed, 13 insertions(+), 124 deletions(-)
 create mode 100644 arch/sparc/include/asm/qrwlock.h

diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index 0f5813b..9ec1d2f 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -82,6 +82,7 @@ config SPARC64
select HAVE_ARCH_AUDITSYSCALL
select ARCH_SUPPORTS_ATOMIC_RMW
select HAVE_NMI
+   select ARCH_USE_QUEUED_RWLOCKS
 
 config ARCH_DEFCONFIG
string
diff --git a/arch/sparc/include/asm/qrwlock.h b/arch/sparc/include/asm/qrwlock.h
new file mode 100644
index 000..d68a4b1
--- /dev/null
+++ b/arch/sparc/include/asm/qrwlock.h
@@ -0,0 +1,7 @@
+#ifndef _ASM_SPARC_QRWLOCK_H
+#define _ASM_SPARC_QRWLOCK_H
+
+#include <asm-generic/qrwlock_types.h>
+#include <asm-generic/qrwlock.h>
+
+#endif /* _ASM_SPARC_QRWLOCK_H */
diff --git a/arch/sparc/include/asm/spinlock_64.h b/arch/sparc/include/asm/spinlock_64.h
index 07c9f2e..8901c2d 100644
--- a/arch/sparc/include/asm/spinlock_64.h
+++ b/arch/sparc/include/asm/spinlock_64.h
@@ -10,6 +10,7 @@
 
 #include <asm/processor.h>
 #include <asm/barrier.h>
+#include <asm/qrwlock.h>
 
 /* To get debugging spinlocks which detect and catch
  * deadlock situations, set CONFIG_DEBUG_SPINLOCK
@@ -94,132 +95,9 @@ static inline void arch_spin_lock_flags(arch_spinlock_t *lock, unsigned long fla
: "memory");
 }
 
-/* Multi-reader locks, these are much saner than the 32-bit Sparc ones... */
-
-static inline void arch_read_lock(arch_rwlock_t *lock)
-{
-	unsigned long tmp1, tmp2;
-
-	__asm__ __volatile__ (
-"1:	ldsw	[%2], %0\n"
-"	brlz,pn	%0, 2f\n"
-"4:	 add	%0, 1, %1\n"
-"	cas	[%2], %0, %1\n"
-"	cmp	%0, %1\n"
-"	bne,pn	%%icc, 1b\n"
-"	 nop\n"
-"	.subsection	2\n"
-"2:	ldsw	[%2], %0\n"
-"	brlz,pt	%0, 2b\n"
-"	 nop\n"
-"	ba,a,pt	%%xcc, 4b\n"
-"	.previous"
-	: "=&r" (tmp1), "=&r" (tmp2)
-	: "r" (lock)
-	: "memory");
-}
-
-static inline int arch_read_trylock(arch_rwlock_t *lock)
-{
-	int tmp1, tmp2;
-
-	__asm__ __volatile__ (
-"1:	ldsw	[%2], %0\n"
-"	brlz,a,pn	%0, 2f\n"
-"	 mov	0, %0\n"
-"	add	%0, 1, %1\n"
-"	cas	[%2], %0, %1\n"
-"	cmp	%0, %1\n"
-"	bne,pn	%%icc, 1b\n"
-"	 mov	1, %0\n"
-"2:"
-	: "=&r" (tmp1), "=&r" (tmp2)
-	: "r" (lock)
-	: "memory");
-
-	return tmp1;
-}
-
-static inline void arch_read_unlock(arch_rwlock_t *lock)
-{
-	unsigned long tmp1, tmp2;
-
-	__asm__ __volatile__(
-"1:	lduw	[%2], %0\n"
-"	sub	%0, 1, %1\n"
-"	cas	[%2], %0, %1\n"
-"	cmp	%0, %1\n"
-"	bne,pn	%%xcc, 1b\n"
-"	 nop"
-	: "=&r" (tmp1), "=&r" (tmp2)
-	: "r" (lock)
-	: "memory");
-}
-
-static inline void arch_write_lock(arch_rwlock_t *lock)
-{
-	unsigned long mask, tmp1, tmp2;
-
-	mask = 0x80000000UL;
-
-	__asm__ __volatile__(
-"1:	lduw	[%2], %0\n"
-"	brnz,pn	%0, 2f\n"
-"4:	 or	%0, %3, %1\n"
-"	cas	[%2], %0, %1\n"
-"	cmp	%0, %1\n"
-"	bne,pn	%%icc, 1b\n"
-"	 nop\n"
-"	.subsection	2\n"
-"2:	lduw	[%2], %0\n"
-"	brnz,pt	%0, 2b\n"
-"	 nop\n"
-"	ba,a,pt	%%xcc, 4b\n"
-"	.previous"
-	: "=&r" (tmp1), "=&r" (tmp2)
-	: "r" (lock), "r" (mask)
-	: "memory");
-}
-
-static inline void arch_write_unlock(arch_rwlock_t *lock)
-{
-   __asm__ __volatile__(
-"  stw %%g0, [%0]"
-   : /* no outputs */
-   : "r" (lock)
-   : "memory");
-}
-
-static inline int arch_write_trylock(arch_rwlock_t *lock)
-{
-	unsigned long mask, tmp1, tmp2, result;
-
-	mask = 0x80000000UL;
-
-	__asm__ __volatile__(
-"	mov	0, %2\n"
-"1:	lduw	[%3], %0\n"
-"	brnz,pn	%0, 2f\n"
-"	 or	%0, %4, %1\n"
-"	cas	[%3], %0, %1\n"
-"	cmp	%0, %1\n"
-"	bne,pn	%%icc, 1b\n"
-"	 nop\n"
-"	mov	1, %2\n"
-"2:"
-	: "=&r" (tmp1), "=&r" (tmp2), "=&r" (result)
-	: "r" (lock), "r" (mask)
-	: "memory");
-
-	return result;
-}
-
 




Re: [PATCH] dt-bindings: Document optional "reserved-names" property

2017-05-19 Thread Florian Fainelli
On 05/12/2017 04:55 PM, Rob Herring wrote:
> On Tue, May 09, 2017 at 10:18:47AM -0700, Florian Fainelli wrote:
>> Define an optional string property: "reserved-names" which can be used
>> by the client program to tag/identify reserved memory regions.
>>
>> Signed-off-by: Florian Fainelli 
>> ---
>>  Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt | 4 ++++
>>  1 file changed, 4 insertions(+)
>>
>> diff --git a/Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt b/Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt
>> index 3da0ebdba8d9..bd3c9485f637 100644
>> --- a/Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt
>> +++ b/Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt
>> @@ -64,6 +64,10 @@ reusable (optional) - empty property
>>system can use that region to store volatile or cached data that
>>can be otherwise regenerated or migrated elsewhere.
>>  
>> +reserved-names (optional)
>> +- Provides a named tag to the client program to help pretty print/identify
>> +  the reserved memory region.
> 
> *-names is normally on the client side. I'd like to keep that 
> consistent.
> 
> Second, I don't see the need for this. The compatible is not enough?

Sounds like you are right, compatible is good enough, thanks
-- 
Florian




Re: stackprotector: ascii armor the stack canary

2017-05-19 Thread Daniel Micay
On Fri, 2017-05-19 at 17:26 -0400, r...@redhat.com wrote:
> Zero out the first byte of the stack canary value on 64 bit systems,
> in order to prevent unterminated C string overflows from being able
> to successfully overwrite the canary, even if an attacker somehow
> guessed or obtained the canary value.
> 
> Inspired by execshield ascii-armor and PaX/grsecurity.
> 
> Thanks to Daniel Micay for extracting code of similar functionality
> from PaX/grsecurity and making it easy to find in his linux-hardened
> git tree on https://github.com/thestinger/linux-hardened/

To clarify something: this part isn't from PaX / grsecurity. I marked the
commits from PaX / grsecurity as such in their commit messages and these
are among the ones that aren't from there.

This is from a set of changes that I did for CopperheadOS and forward
ported to mainline recently to start linux-hardened. It was only arm64
for CopperheadOS. The overlap with PaX is that when adding the leading
zero byte for x86, I needed to first fix get_random_int being used for
the per-task canary value. I didn't know PaX fixed it way back in 2007.

I implemented heap canaries for our userspace malloc implementation and
then later did the same thing for slub in the kernel. I added a leading
zero byte to both of those heap canaries later on and then did the SSP
implementation in Bionic and the kernel's arm64 code. I took the idea
from glibc but limited it to 64-bit where there's entropy to spare. The
glibc leading zero might have come from execshield, but I don't know.
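
A minimal sketch of the ascii-armor idea under discussion (hypothetical
user-space C, not the kernel code): force the leading in-memory byte of a
64-bit canary to zero, which on a little-endian machine is the low byte, so
an unterminated C string copy cannot reproduce the full value.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	uint64_t canary = ((uint64_t)rand() << 32) | (uint32_t)rand();

	canary &= ~(uint64_t)0xff;	/* zero the first byte in memory */
	printf("canary = 0x%016llx\n", (unsigned long long)canary);
	return 0;
}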




[PATCH v2 3/3] drm/radeon: Cleanup pageflipping IRQ handling for evergreen, si

2017-05-19 Thread Lyude
Same as the previous patch, but for pageflipping now. This also lets us
clear up the copy paste for vblank/vline IRQs.

Changes since v1:
- Preserve the order all registers are written back

Signed-off-by: Lyude 
---
 drivers/gpu/drm/radeon/evergreen.c | 105 ---
 drivers/gpu/drm/radeon/radeon.h    |   7 +--
 drivers/gpu/drm/radeon/si.c        | 111 +
 3 files changed, 51 insertions(+), 172 deletions(-)

diff --git a/drivers/gpu/drm/radeon/evergreen.c b/drivers/gpu/drm/radeon/evergreen.c
index 507a773..44527e6 100644
--- a/drivers/gpu/drm/radeon/evergreen.c
+++ b/drivers/gpu/drm/radeon/evergreen.c
@@ -4467,17 +4467,8 @@ void evergreen_disable_interrupt_state(struct radeon_device *rdev)
WREG32(SRBM_INT_CNTL, 0);
for (i = 0; i < rdev->num_crtc; i++)
WREG32(INT_MASK + crtc_offsets[i], 0);
-
-   WREG32(GRPH_INT_CONTROL + EVERGREEN_CRTC0_REGISTER_OFFSET, 0);
-   WREG32(GRPH_INT_CONTROL + EVERGREEN_CRTC1_REGISTER_OFFSET, 0);
-   if (rdev->num_crtc >= 4) {
-   WREG32(GRPH_INT_CONTROL + EVERGREEN_CRTC2_REGISTER_OFFSET, 0);
-   WREG32(GRPH_INT_CONTROL + EVERGREEN_CRTC3_REGISTER_OFFSET, 0);
-   }
-   if (rdev->num_crtc >= 6) {
-   WREG32(GRPH_INT_CONTROL + EVERGREEN_CRTC4_REGISTER_OFFSET, 0);
-   WREG32(GRPH_INT_CONTROL + EVERGREEN_CRTC5_REGISTER_OFFSET, 0);
-   }
+   for (i = 0; i < rdev->num_crtc; i++)
+   WREG32(GRPH_INT_CONTROL + crtc_offsets[i], 0);
 
/* only one DAC on DCE5 */
if (!ASIC_IS_DCE5(rdev))
@@ -4581,22 +4572,8 @@ int evergreen_irq_set(struct radeon_device *rdev)
atomic_read(>irq.pflip[i]), "vblank", i);
}
 
-   WREG32(GRPH_INT_CONTROL + EVERGREEN_CRTC0_REGISTER_OFFSET,
-  GRPH_PFLIP_INT_MASK);
-   WREG32(GRPH_INT_CONTROL + EVERGREEN_CRTC1_REGISTER_OFFSET,
-  GRPH_PFLIP_INT_MASK);
-   if (rdev->num_crtc >= 4) {
-   WREG32(GRPH_INT_CONTROL + EVERGREEN_CRTC2_REGISTER_OFFSET,
-  GRPH_PFLIP_INT_MASK);
-   WREG32(GRPH_INT_CONTROL + EVERGREEN_CRTC3_REGISTER_OFFSET,
-  GRPH_PFLIP_INT_MASK);
-   }
-   if (rdev->num_crtc >= 6) {
-   WREG32(GRPH_INT_CONTROL + EVERGREEN_CRTC4_REGISTER_OFFSET,
-  GRPH_PFLIP_INT_MASK);
-   WREG32(GRPH_INT_CONTROL + EVERGREEN_CRTC5_REGISTER_OFFSET,
-  GRPH_PFLIP_INT_MASK);
-   }
+   for (i = 0; i < rdev->num_crtc; i++)
+   WREG32(GRPH_INT_CONTROL + crtc_offsets[i], GRPH_PFLIP_INT_MASK);
 
for (i = 0; i < 6; i++) {
radeon_irq_kms_set_irq_n_enabled(
@@ -4626,68 +4603,34 @@ int evergreen_irq_set(struct radeon_device *rdev)
 /* Note that the order we write back regs here is important */
 static void evergreen_irq_ack(struct radeon_device *rdev)
 {
-   int i;
+   int i, j;
+   u32 *grph_int = rdev->irq.stat_regs.evergreen.grph_int;
u32 *disp_int = rdev->irq.stat_regs.evergreen.disp_int;
u32 *afmt_status = rdev->irq.stat_regs.evergreen.afmt_status;
 
for (i = 0; i < 6; i++) {
disp_int[i] = RREG32(evergreen_disp_int_status[i]);
afmt_status[i] = RREG32(AFMT_STATUS + crtc_offsets[i]);
+   if (i < rdev->num_crtc)
+   grph_int[i] = RREG32(GRPH_INT_STATUS + crtc_offsets[i]);
}
 
-	rdev->irq.stat_regs.evergreen.d1grph_int = RREG32(GRPH_INT_STATUS + EVERGREEN_CRTC0_REGISTER_OFFSET);
-	rdev->irq.stat_regs.evergreen.d2grph_int = RREG32(GRPH_INT_STATUS + EVERGREEN_CRTC1_REGISTER_OFFSET);
-	if (rdev->num_crtc >= 4) {
-		rdev->irq.stat_regs.evergreen.d3grph_int = RREG32(GRPH_INT_STATUS + EVERGREEN_CRTC2_REGISTER_OFFSET);
-		rdev->irq.stat_regs.evergreen.d4grph_int = RREG32(GRPH_INT_STATUS + EVERGREEN_CRTC3_REGISTER_OFFSET);
-	}
-	if (rdev->num_crtc >= 6) {
-		rdev->irq.stat_regs.evergreen.d5grph_int = RREG32(GRPH_INT_STATUS + EVERGREEN_CRTC4_REGISTER_OFFSET);
-		rdev->irq.stat_regs.evergreen.d6grph_int = RREG32(GRPH_INT_STATUS + EVERGREEN_CRTC5_REGISTER_OFFSET);
-	}
-
-
-	if (rdev->irq.stat_regs.evergreen.d1grph_int & GRPH_PFLIP_INT_OCCURRED)
-		WREG32(GRPH_INT_STATUS + EVERGREEN_CRTC0_REGISTER_OFFSET, GRPH_PFLIP_INT_CLEAR);
-	if (rdev->irq.stat_regs.evergreen.d2grph_int & GRPH_PFLIP_INT_OCCURRED)
-		WREG32(GRPH_INT_STATUS + EVERGREEN_CRTC1_REGISTER_OFFSET, GRPH_PFLIP_INT_CLEAR);
-	if (disp_int[0] & LB_D1_VBLANK_INTERRUPT)
-		WREG32(VBLANK_STATUS + crtc_offsets[0], VBLANK_ACK);
-	if (disp_int[0] & LB_D1_VLINE_INTERRUPT)
-		WREG32(VLINE_STATUS + crtc_offsets[0], VLINE_ACK);
-	if (disp_int[1] & LB_D1_VBLANK_INTERRUPT)
-		WREG32(VBLANK_STATUS + crtc_offsets[1], 

[PATCH v2 0/3] Cleanup evergreen/si IRQ handling code

2017-05-19 Thread Lyude
This is the first part of me going through and cleaning up the IRQ handling
code for radeon, since after taking a look at it the other day while trying to
debug something I realized basically all of the code was copy pasted
everywhere, and quite difficult to actually read through.

Will come up with something for r600 and cik once I've got the chipsets on hand
to test with.

Lyude (3):
  drm/radeon: Cleanup display interrupt handling for evergreen, si
  drm/radeon: Cleanup HDMI audio interrupt handling for evergreen
  drm/radeon: Cleanup pageflipping IRQ handling for evergreen, si

 drivers/gpu/drm/radeon/evergreen.c  | 943 ++--
 drivers/gpu/drm/radeon/radeon.h |  27 +-
 drivers/gpu/drm/radeon/radeon_irq_kms.c |  35 ++
 drivers/gpu/drm/radeon/si.c | 655 +-
 4 files changed, 344 insertions(+), 1316 deletions(-)

-- 
2.9.4






[PATCH v2 2/3] drm/radeon: Cleanup HDMI audio interrupt handling for evergreen

2017-05-19 Thread Lyude
Same as the previous patch, but now for handling HDMI audio interrupts.

Changes since v1:
- Preserve the order we write back all registers

Signed-off-by: Lyude 
---
 drivers/gpu/drm/radeon/evergreen.c | 153 +++--
 drivers/gpu/drm/radeon/radeon.h    |   7 +-
 2 files changed, 27 insertions(+), 133 deletions(-)

diff --git a/drivers/gpu/drm/radeon/evergreen.c b/drivers/gpu/drm/radeon/evergreen.c
index 44ac6d3..507a773 100644
--- a/drivers/gpu/drm/radeon/evergreen.c
+++ b/drivers/gpu/drm/radeon/evergreen.c
@@ -4495,7 +4495,6 @@ int evergreen_irq_set(struct radeon_device *rdev)
u32 cp_int_cntl = CNTX_BUSY_INT_ENABLE | CNTX_EMPTY_INT_ENABLE;
u32 cp_int_cntl1 = 0, cp_int_cntl2 = 0;
u32 grbm_int_cntl = 0;
-   u32 afmt1 = 0, afmt2 = 0, afmt3 = 0, afmt4 = 0, afmt5 = 0, afmt6 = 0;
u32 dma_cntl, dma_cntl1 = 0;
u32 thermal_int = 0;
 
@@ -4518,13 +4517,6 @@ int evergreen_irq_set(struct radeon_device *rdev)
thermal_int = RREG32(CG_THERMAL_INT) &
~(THERM_INT_MASK_HIGH | THERM_INT_MASK_LOW);
 
-	afmt1 = RREG32(AFMT_AUDIO_PACKET_CONTROL + EVERGREEN_CRTC0_REGISTER_OFFSET) & ~AFMT_AZ_FORMAT_WTRIG_MASK;
-	afmt2 = RREG32(AFMT_AUDIO_PACKET_CONTROL + EVERGREEN_CRTC1_REGISTER_OFFSET) & ~AFMT_AZ_FORMAT_WTRIG_MASK;
-	afmt3 = RREG32(AFMT_AUDIO_PACKET_CONTROL + EVERGREEN_CRTC2_REGISTER_OFFSET) & ~AFMT_AZ_FORMAT_WTRIG_MASK;
-	afmt4 = RREG32(AFMT_AUDIO_PACKET_CONTROL + EVERGREEN_CRTC3_REGISTER_OFFSET) & ~AFMT_AZ_FORMAT_WTRIG_MASK;
-	afmt5 = RREG32(AFMT_AUDIO_PACKET_CONTROL + EVERGREEN_CRTC4_REGISTER_OFFSET) & ~AFMT_AZ_FORMAT_WTRIG_MASK;
-	afmt6 = RREG32(AFMT_AUDIO_PACKET_CONTROL + EVERGREEN_CRTC5_REGISTER_OFFSET) & ~AFMT_AZ_FORMAT_WTRIG_MASK;
-
dma_cntl = RREG32(DMA_CNTL) & ~TRAP_ENABLE;
 
if (rdev->family >= CHIP_CAYMAN) {
@@ -4567,31 +4559,6 @@ int evergreen_irq_set(struct radeon_device *rdev)
thermal_int |= THERM_INT_MASK_HIGH | THERM_INT_MASK_LOW;
}
 
-   if (rdev->irq.afmt[0]) {
-   DRM_DEBUG("evergreen_irq_set: hdmi 0\n");
-   afmt1 |= AFMT_AZ_FORMAT_WTRIG_MASK;
-   }
-   if (rdev->irq.afmt[1]) {
-   DRM_DEBUG("evergreen_irq_set: hdmi 1\n");
-   afmt2 |= AFMT_AZ_FORMAT_WTRIG_MASK;
-   }
-   if (rdev->irq.afmt[2]) {
-   DRM_DEBUG("evergreen_irq_set: hdmi 2\n");
-   afmt3 |= AFMT_AZ_FORMAT_WTRIG_MASK;
-   }
-   if (rdev->irq.afmt[3]) {
-   DRM_DEBUG("evergreen_irq_set: hdmi 3\n");
-   afmt4 |= AFMT_AZ_FORMAT_WTRIG_MASK;
-   }
-   if (rdev->irq.afmt[4]) {
-   DRM_DEBUG("evergreen_irq_set: hdmi 4\n");
-   afmt5 |= AFMT_AZ_FORMAT_WTRIG_MASK;
-   }
-   if (rdev->irq.afmt[5]) {
-   DRM_DEBUG("evergreen_irq_set: hdmi 5\n");
-   afmt6 |= AFMT_AZ_FORMAT_WTRIG_MASK;
-   }
-
if (rdev->family >= CHIP_CAYMAN) {
cayman_cp_int_cntl_setup(rdev, 0, cp_int_cntl);
cayman_cp_int_cntl_setup(rdev, 1, cp_int_cntl1);
@@ -4643,12 +4610,12 @@ int evergreen_irq_set(struct radeon_device *rdev)
else
WREG32(CG_THERMAL_INT, thermal_int);
 
-	WREG32(AFMT_AUDIO_PACKET_CONTROL + EVERGREEN_CRTC0_REGISTER_OFFSET, afmt1);
-	WREG32(AFMT_AUDIO_PACKET_CONTROL + EVERGREEN_CRTC1_REGISTER_OFFSET, afmt2);
-	WREG32(AFMT_AUDIO_PACKET_CONTROL + EVERGREEN_CRTC2_REGISTER_OFFSET, afmt3);
-	WREG32(AFMT_AUDIO_PACKET_CONTROL + EVERGREEN_CRTC3_REGISTER_OFFSET, afmt4);
-	WREG32(AFMT_AUDIO_PACKET_CONTROL + EVERGREEN_CRTC4_REGISTER_OFFSET, afmt5);
-	WREG32(AFMT_AUDIO_PACKET_CONTROL + EVERGREEN_CRTC5_REGISTER_OFFSET, afmt6);
+   for (i = 0; i < 6; i++) {
+   radeon_irq_kms_set_irq_n_enabled(
+   rdev, AFMT_AUDIO_PACKET_CONTROL + crtc_offsets[i],
+   AFMT_AZ_FORMAT_WTRIG_MASK,
+   rdev->irq.afmt[i], "HDMI", i);
+   }
 
/* posting read */
RREG32(SRBM_STATUS);
@@ -4661,10 +4628,12 @@ static void evergreen_irq_ack(struct radeon_device *rdev)
 {
int i;
u32 *disp_int = rdev->irq.stat_regs.evergreen.disp_int;
-   u32 tmp;
+   u32 *afmt_status = rdev->irq.stat_regs.evergreen.afmt_status;
 
-   for (i = 0; i < 6; i++)
+   for (i = 0; i < 6; i++) {
disp_int[i] = RREG32(evergreen_disp_int_status[i]);
+   afmt_status[i] = RREG32(AFMT_STATUS + crtc_offsets[i]);
+   }
 
 	rdev->irq.stat_regs.evergreen.d1grph_int = RREG32(GRPH_INT_STATUS + EVERGREEN_CRTC0_REGISTER_OFFSET);
 	rdev->irq.stat_regs.evergreen.d2grph_int = RREG32(GRPH_INT_STATUS + EVERGREEN_CRTC1_REGISTER_OFFSET);
@@ -4677,12 +4646,6 @@ static void evergreen_irq_ack(struct radeon_device *rdev)
rdev->irq.stat_regs.evergreen.d6grph_int = 

[PATCH v2 1/3] drm/radeon: Cleanup display interrupt handling for evergreen, si

2017-05-19 Thread Lyude
The current code here is really, really bad. A huge amount of it looks
to be copy pasted, it has some weird hatred of arrays and code sharing,
switch cases everywhere for things that really don't need them, and it
makes the file seem immensely more complex than it actually is. This is
a pain for maintenance, and is vulnerable to more weird irq handling
bugs.

So, let's start cleaning this up a bit. Modify all of the IRQ handlers
for evergreen/si so that they just use for loops. As well, we add a
helper function radeon_irq_kms_set_irq_n_enabled(), whose purpose is
just to update the state of registers that enable/disable interrupts
while printing any changes to the set of enabled interrupts to the
kernel log.
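
For illustration, a hedged sketch of what a helper with that purpose can
look like (the name and parameter list follow the diff below; the body is an
assumption about the intent, not a quote of the actual implementation):

void radeon_irq_kms_set_irq_n_enabled(struct radeon_device *rdev,
				      u32 reg, u32 mask,
				      bool enable, const char *name,
				      unsigned int n)
{
	u32 tmp = RREG32(reg);

	/* nothing to log or write if the state already matches */
	if (!!(tmp & mask) == enable)
		return;

	DRM_DEBUG("%s%d interrupts %s\n", name, n,
		  enable ? "enabled" : "disabled");
	WREG32(reg, enable ? (tmp | mask) : (tmp & ~mask));
}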

Note in this commit, since vblank/vline irq acking is intertwined with
page flip irq acking, we can't cut out all of the copy paste in
evergreen/si_irq_ack() just yet.

Changes since v1:
- Preserve order we write back all registers

Signed-off-by: Lyude 
---
 drivers/gpu/drm/radeon/evergreen.c  | 729 +++-
 drivers/gpu/drm/radeon/radeon.h |  13 +-
 drivers/gpu/drm/radeon/radeon_irq_kms.c |  35 ++
 drivers/gpu/drm/radeon/si.c | 596 ++
 4 files changed, 314 insertions(+), 1059 deletions(-)

diff --git a/drivers/gpu/drm/radeon/evergreen.c b/drivers/gpu/drm/radeon/evergreen.c
index 0bf1035..44ac6d3 100644
--- a/drivers/gpu/drm/radeon/evergreen.c
+++ b/drivers/gpu/drm/radeon/evergreen.c
@@ -35,6 +35,10 @@
 #include "evergreen_blit_shaders.h"
 #include "radeon_ucode.h"
 
+#define DC_HPDx_CONTROL(x)        (DC_HPD1_CONTROL     + (x * 0xc))
+#define DC_HPDx_INT_CONTROL(x)    (DC_HPD1_INT_CONTROL + (x * 0xc))
+#define DC_HPDx_INT_STATUS_REG(x) (DC_HPD1_INT_STATUS  + (x * 0xc))
+
 /*
  * Indirect registers accessor
  */
@@ -1714,38 +1718,10 @@ void evergreen_pm_finish(struct radeon_device *rdev)
  */
 bool evergreen_hpd_sense(struct radeon_device *rdev, enum radeon_hpd_id hpd)
 {
-   bool connected = false;
-
-   switch (hpd) {
-   case RADEON_HPD_1:
-   if (RREG32(DC_HPD1_INT_STATUS) & DC_HPDx_SENSE)
-   connected = true;
-   break;
-   case RADEON_HPD_2:
-   if (RREG32(DC_HPD2_INT_STATUS) & DC_HPDx_SENSE)
-   connected = true;
-   break;
-   case RADEON_HPD_3:
-   if (RREG32(DC_HPD3_INT_STATUS) & DC_HPDx_SENSE)
-   connected = true;
-   break;
-   case RADEON_HPD_4:
-   if (RREG32(DC_HPD4_INT_STATUS) & DC_HPDx_SENSE)
-   connected = true;
-   break;
-   case RADEON_HPD_5:
-   if (RREG32(DC_HPD5_INT_STATUS) & DC_HPDx_SENSE)
-   connected = true;
-   break;
-   case RADEON_HPD_6:
-   if (RREG32(DC_HPD6_INT_STATUS) & DC_HPDx_SENSE)
-   connected = true;
-   break;
-   default:
-   break;
-   }
+   if (hpd == RADEON_HPD_NONE)
+   return false;
 
-   return connected;
+   return !!(RREG32(DC_HPDx_INT_STATUS_REG(hpd)) & DC_HPDx_SENSE);
 }
 
 /**
@@ -1759,61 +1735,15 @@ bool evergreen_hpd_sense(struct radeon_device *rdev, enum radeon_hpd_id hpd)
 void evergreen_hpd_set_polarity(struct radeon_device *rdev,
enum radeon_hpd_id hpd)
 {
-   u32 tmp;
bool connected = evergreen_hpd_sense(rdev, hpd);
 
-   switch (hpd) {
-   case RADEON_HPD_1:
-   tmp = RREG32(DC_HPD1_INT_CONTROL);
-   if (connected)
-   tmp &= ~DC_HPDx_INT_POLARITY;
-   else
-   tmp |= DC_HPDx_INT_POLARITY;
-   WREG32(DC_HPD1_INT_CONTROL, tmp);
-   break;
-   case RADEON_HPD_2:
-   tmp = RREG32(DC_HPD2_INT_CONTROL);
-   if (connected)
-   tmp &= ~DC_HPDx_INT_POLARITY;
-   else
-   tmp |= DC_HPDx_INT_POLARITY;
-   WREG32(DC_HPD2_INT_CONTROL, tmp);
-   break;
-   case RADEON_HPD_3:
-   tmp = RREG32(DC_HPD3_INT_CONTROL);
-   if (connected)
-   tmp &= ~DC_HPDx_INT_POLARITY;
-   else
-   tmp |= DC_HPDx_INT_POLARITY;
-   WREG32(DC_HPD3_INT_CONTROL, tmp);
-   break;
-   case RADEON_HPD_4:
-   tmp = RREG32(DC_HPD4_INT_CONTROL);
-   if (connected)
-   tmp &= ~DC_HPDx_INT_POLARITY;
-   else
-   tmp |= DC_HPDx_INT_POLARITY;
-   WREG32(DC_HPD4_INT_CONTROL, tmp);
-   break;
-   case RADEON_HPD_5:
-   tmp = RREG32(DC_HPD5_INT_CONTROL);
-   if (connected)
-   tmp &= ~DC_HPDx_INT_POLARITY;
-   else
-   tmp |= 


[PATCH net v2] bonding: fix accounting of active ports in 3ad

2017-05-19 Thread Jarod Wilson
As of 7bb11dc9f59d and 0622cab0341c, bond slaves in a 3ad bond are not
removed from the aggregator when they are down, and the active slave count
is NOT equal to number of ports in the aggregator, but rather the number
of ports in the aggregator that are still enabled. The sysfs spew for
bonding_show_ad_num_ports() has a comment that says "Show number of active
802.3ad ports.", but it's currently showing total number of ports, both
active and inactive. Remedy it by using the same logic introduced in
0622cab0341c in __bond_3ad_get_active_agg_info(), so sysfs, procfs and
netlink all report the number of active ports. Note that this means that
IFLA_BOND_AD_INFO_NUM_PORTS really means NUM_ACTIVE_PORTS instead of
NUM_PORTS, and thus perhaps should be renamed for clarity.

Lightly tested on a dual i40e lacp bond, simulating link downs with an ip
link set dev  down; I was able to produce the state where I could
see both slaves in the same aggregator, but a number of ports count of 1.

MII Status: up
Active Aggregator Info:
Aggregator ID: 1
Number of ports: 2 <---
Slave Interface: ens10
MII Status: up <---
Aggregator ID: 1
Slave Interface: ens11
MII Status: up
Aggregator ID: 1

MII Status: up
Active Aggregator Info:
Aggregator ID: 1
Number of ports: 1 <---
Slave Interface: ens10
MII Status: down <---
Aggregator ID: 1
Slave Interface: ens11
MII Status: up
Aggregator ID: 1

CC: Jay Vosburgh 
CC: Veaceslav Falico 
CC: Andy Gospodarek 
CC: net...@vger.kernel.org
Signed-off-by: Jarod Wilson 
---
v2: fix incorrect git sha reference, add more testing data

 drivers/net/bonding/bond_3ad.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/bonding/bond_3ad.c b/drivers/net/bonding/bond_3ad.c
index c5fd4259da33..b44a6aeb346d 100644
--- a/drivers/net/bonding/bond_3ad.c
+++ b/drivers/net/bonding/bond_3ad.c
@@ -2577,7 +2577,7 @@ int __bond_3ad_get_active_agg_info(struct bonding *bond,
return -1;
 
ad_info->aggregator_id = aggregator->aggregator_identifier;
-   ad_info->ports = aggregator->num_of_ports;
+   ad_info->ports = __agg_active_ports(aggregator);
ad_info->actor_key = aggregator->actor_oper_aggregator_key;
ad_info->partner_key = aggregator->partner_oper_aggregator_key;
ether_addr_copy(ad_info->partner_system,
-- 
2.12.1
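
For reference, the counting logic that __agg_active_ports() (introduced in
0622cab0341c) performs amounts to walking the aggregator's port list and
counting only the ports that are still enabled; a sketch reconstructed from
the description above, not a verbatim quote:

static int __agg_active_ports(struct aggregator *agg)
{
	struct port *port;
	int active = 0;

	for (port = agg->lag_ports; port;
	     port = port->next_port_in_aggregator) {
		if (port->is_enabled)
			active++;
	}
	return active;
}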



Re: [RFC PATCH 2/2] mm, oom: do not trigger out_of_memory from the #PF

2017-05-19 Thread Tetsuo Handa
Michal Hocko wrote:
> On Sat 20-05-17 00:22:30, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Fri 19-05-17 22:02:44, Tetsuo Handa wrote:
> > > > Michal Hocko wrote:
> > > > > Any allocation failure during the #PF path will return with 
> > > > > VM_FAULT_OOM
> > > > > which in turn results in pagefault_out_of_memory. This can happen for
> > > > > 2 different reasons. a) Memcg is out of memory and we rely on
> > > > > mem_cgroup_oom_synchronize to perform the memcg OOM handling or b)
> > > > > normal allocation fails.
> > > > > 
> > > > > The later is quite problematic because allocation paths already 
> > > > > trigger
> > > > > out_of_memory and the page allocator tries really hard to not fail
> > > > 
> > > > We made many memory allocation requests from page fault path (e.g. XFS)
> > > > __GFP_FS some time ago, didn't we? But if I recall correctly (I couldn't
> > > > find the message), there are some allocation requests from page fault 
> > > > path
> > > > which cannot use __GFP_FS. Then, not all allocation requests can call
> > > > oom_kill_process() and reaching pagefault_out_of_memory() will be
> > > > inevitable.
> > > 
> > > Even if such an allocation fail without the OOM killer then we simply
> > > retry the PF and will do that the same way how we keep retrying the
> > > allocation inside the page allocator. So how is this any different?
> > 
> > You are trying to remove out_of_memory() from pagefault_out_of_memory()
> > by this patch. But you also want to make !__GFP_FS allocations not to
> > keep retrying inside the page allocator in future kernels, don't you?
> 
> I would _love_ to but I am much less optimistic this is achiveable
> 
> > Then, a thread which need to allocate memory from page fault path but
> > cannot call oom_kill_process() will spin forever (unless somebody else
> > calls oom_kill_process() via a __GFP_FS allocation request). I consider
> > that introducing such possibility is a problem.
> 
> What I am trying to say is that this is already happening. The
> difference with the VM_FAULT_OOM would only be that the whole PF path
> would be unwinded back to the PF, all locks dropped and then the PF
> retries so in principle this would be safer.

I can't understand why retrying the PF helps. If the PF has to be retried due to
failing to allocate memory, the situation will not improve until memory is
allocated.
And I don't like an assumption that somebody else calls oom_kill_process() via
a __GFP_FS allocation request so that the PF will succeed without calling
oom_kill_process().

> 
> > > > > allocations. Anyway, if the OOM killer has been already invoked there
> > > > > is no reason to invoke it again from the #PF path. Especially when the
> > > > > OOM condition might be gone by that time and we have no way to find 
> > > > > out
> > > > > other than allocate.
> > > > > 
> > > > > Moreover if the allocation failed and the OOM killer hasn't been
> > > > > invoked then we are unlikely to do the right thing from the #PF 
> > > > > context
> > > > > because we have already lost the allocation context and restrictions 
> > > > > and
> > > > > therefore might oom kill a task from a different NUMA domain.
> > > > 
> > > > If we carry a flag via task_struct that indicates whether it is a 
> > > > memory
> > > > allocation request from page fault and allocation failure is not 
> > > > acceptable,
> > > > we can call out_of_memory() from page allocator path.
> > > 
> > > I do not understand
> > 
> > We need to allocate memory from page fault path in order to avoid spinning 
> > forever
> > (unless somebody else calls oom_kill_process() via a __GFP_FS allocation 
> > request),
> > don't we? Then, memory allocation requests from the page fault path can pass 
> > flags
> > like __GFP_NOFAIL | __GFP_KILLABLE because retrying the page fault without
> > allocating memory is pointless. That is what I meant by carrying a flag via 
> > task_struct.
> > 
> > > > By the way, can page fault occur after reaching do_exit()? When a thread
> > > > reached do_exit(), fatal_signal_pending(current) becomes false, doesn't 
> > > > it?
> > > 
> > > yes fatal_signal_pending will be false at the time and I believe we can
> > > perform a page fault past that moment and go via the allocation path which 
> > > would
> > > trigger the OOM or give this task access to reserves but it is more
> > > likely that the oom reaper will push to kill another task by that time
> > > if the situation didn't get resolved. Or did I miss your concern?
> > 
> > How does checking fatal_signal_pending() here help?
> 
> It just skips the warning because we know that we would handle the
> signal before retrying the page fault and go to the exit path. Those that do
> not have such a signal should warn, just so that we know that such a
> situation happens. With the current allocator semantics it shouldn't
> 
> > It only suppresses printk().
> > If current thread needs to allocate memory because not all allocation 
> > requests
> > can call 

[PATCH] ipc: Fail build if IPC structures change layout

2017-05-19 Thread Kees Cook
Since the struct layout can be checked at build time, turn the runtime
report into a build failure so mistakes can be caught more quickly.
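
(BUILD_BUG_ON() turns the offset check into a compile-time assertion. A
minimal userspace analogue of the mechanism, for illustration only --
BUILD_BUG_ON_DEMO and struct demo are made-up names:)

#include <stddef.h>

/* a negative array size forces a compile error when cond is true */
#define BUILD_BUG_ON_DEMO(cond) ((void)sizeof(char[1 - 2 * !!(cond)]))

struct demo { long perm; int other; };

int main(void)
{
	BUILD_BUG_ON_DEMO(offsetof(struct demo, perm) != 0); /* ok: perm is first */
	return 0;
}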

Cc: Manfred Spraul 
Signed-off-by: Kees Cook 
---
Should be applied on top of the -mm tree's IPC changes
---
 ipc/msg.c | 5 +----
 ipc/sem.c | 5 +----
 ipc/shm.c | 5 +----
 3 files changed, 3 insertions(+), 12 deletions(-)

diff --git a/ipc/msg.c b/ipc/msg.c
index e9785c4d2e0d..0ed7dae7d4e8 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -118,10 +118,7 @@ static int newque(struct ipc_namespace *ns, struct ipc_params *params)
key_t key = params->key;
int msgflg = params->flg;
 
-   if (offsetof(struct msg_queue, q_perm) != 0) {
-   pr_err("Invalid struct sem_perm, failing msgget().\n");
-   return -ENOMEM;
-   }
+   BUILD_BUG_ON(offsetof(struct msg_queue, q_perm) != 0);
 
msq = container_of(ipc_rcu_alloc(sizeof(*msq)), struct msg_queue,
q_perm);
diff --git a/ipc/sem.c b/ipc/sem.c
index 18241c16d07c..2109fad750b5 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -469,10 +469,7 @@ static int newary(struct ipc_namespace *ns, struct ipc_params *params)
if (ns->used_sems + nsems > ns->sc_semmns)
return -ENOSPC;
 
-   if (offsetof(struct sem_array, sem_perm) != 0) {
-   pr_err("Invalid struct sem_perm, failing semget().\n");
-   return -ENOMEM;
-   }
+   BUILD_BUG_ON(offsetof(struct sem_array, sem_perm) != 0);
 
size = sizeof(*sma) + nsems * sizeof(sma->sems[0]);
sma = container_of(ipc_rcu_alloc(size), struct sem_array, sem_perm);
diff --git a/ipc/shm.c b/ipc/shm.c
index cec6df186050..2eb85bd5b855 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -543,10 +543,7 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
ns->shm_tot + numpages > ns->shm_ctlall)
return -ENOSPC;
 
-   if (offsetof(struct shmid_kernel, shm_perm) != 0) {
-   pr_err("Invalid struct sem_perm, failing msgget().\n");
-   return -ENOMEM;
-   }
+   BUILD_BUG_ON(offsetof(struct shmid_kernel, shm_perm) != 0);
 
shp = container_of(ipc_rcu_alloc(sizeof(*shp)), struct shmid_kernel,
shm_perm);
-- 
2.7.4


-- 
Kees Cook
Pixel Security


Re: [PATCH] dm ioctl: Restore __GFP_HIGH in copy_params()

2017-05-19 Thread Mikulas Patocka


On Fri, 19 May 2017, Michal Hocko wrote:

> On Thu 18-05-17 19:50:46, Junaid Shahid wrote:
> > (Adding back the correct linux-mm email address and also adding 
> > linux-kernel.)
> > 
> > On Thursday, May 18, 2017 01:41:33 PM David Rientjes wrote:
> [...]
> > > Let's ask Mikulas, who changed this from PF_MEMALLOC to __GFP_HIGH, 
> > > assuming there was a reason to do it in the first place in two different 
> > > ways.
> 
> Hmm, the old PF_MEMALLOC used to have the following comment
> /*
>  * Trying to avoid low memory issues when a device is
>  * suspended. 
>  */
> 
> I am not really sure what that means but __GFP_HIGH certainly has a
> different semantic than PF_MEMALLOC. The latter grants full access to
> the memory reserves while the former only partial access. If this is _really_
> needed then it deserves a comment explaining why.
> -- 
> Michal Hocko
> SUSE Labs

Sometimes, I/O to a device mapper device is blocked until the userspace 
daemon dmeventd does some action (for example, when dm-mirror leg fails, 
dmeventd needs to mark the leg as failed in the lvm metadata and then 
reload the device).

The dmeventd daemon mlocks itself in memory so that it doesn't generate 
any I/O. But it must be able to call ioctls. __GFP_HIGH is there so that 
the ioctls issued by dmeventd have a higher chance of succeeding if some I/O 
is blocked, waiting for dmeventd action. It reduces the possibility of 
low-memory-deadlock, though it doesn't eliminate it entirely.
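
(The allocation in question is the ioctl parameter buffer; a sketch of the
relevant line, assuming the kvmalloc-based copy_params() of this period,
details may differ:)

	/* GFP_NOIO avoids recursing into I/O; __GFP_HIGH allows dipping
	 * into part of the memory reserves so dmeventd's ioctls can make
	 * progress while I/O is blocked waiting on dmeventd */
	dmi = kvmalloc(param_kernel->data_size, GFP_NOIO | __GFP_HIGH);
	if (!dmi)
		return -ENOMEM;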

Mikulas


Re: [PATCH] ARM64: defconfig: enable IR core, decoders and Meson IR device

2017-05-19 Thread Kevin Hilman
Neil Armstrong  writes:

> This patch enables the MEDIA Infrared RC Decoders and Meson Infrared
> decoder for ARM64 defconfig.
> These drivers are selected as modules by default.
>
> Signed-off-by: Neil Armstrong 
> ---
>  arch/arm64/configs/defconfig | 5 +++++
>  1 file changed, 5 insertions(+)
>
> diff --git a/arch/arm64/configs/defconfig b/arch/arm64/configs/defconfig
> index c021aefa..59c400f 100644
> --- a/arch/arm64/configs/defconfig
> +++ b/arch/arm64/configs/defconfig
> @@ -321,6 +321,11 @@ CONFIG_MEDIA_CAMERA_SUPPORT=y
>  CONFIG_MEDIA_ANALOG_TV_SUPPORT=y
>  CONFIG_MEDIA_DIGITAL_TV_SUPPORT=y
>  CONFIG_MEDIA_CONTROLLER=y
> +CONFIG_MEDIA_RC_SUPPORT=y
> +CONFIG_RC_CORE=y

This one should be a module too.
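
(i.e., presumably the line becomes CONFIG_RC_CORE=m.)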

With that fixed, applied to v4.13/defconfig

Kevin

> +CONFIG_RC_DEVICES=y
> +CONFIG_RC_DECODERS=y
> +CONFIG_IR_MESON=m
>  CONFIG_VIDEO_V4L2_SUBDEV_API=y
>  # CONFIG_DVB_NET is not set
>  CONFIG_V4L_MEM2MEM_DRIVERS=y


Re: [PATCH 04/29] printk-formats.txt: standardize document format

2017-05-19 Thread Randy Dunlap
On 05/19/17 13:28, Mauro Carvalho Chehab wrote:
> Em Fri, 19 May 2017 03:26:08 -0700
> Joe Perches  escreveu:
> 
>> On Thu, 2017-05-18 at 22:25 -0300, Mauro Carvalho Chehab wrote:
>>> Each text file under Documentation follows a different
>>> format. Some don't even have titles!
>>>
>>> Change its representation to follow the adopted standard,
>>> using ReST markups for it to be parseable by Sphinx:
>>>
>>> - add a title for the document;
>>> - add markups for section titles;
>>> - move authorship to the beginning and use :Author:;
>>> - use right markup for tables;
>>> - mark literals and literal blocks.  
>>
>> I think the .rst markup is far uglier to read as text
>> and harder to modify for new additions to %p
>> given column alignments.
>>
>> For instance below, but other than that, the .rst
>> is easier to read.
>>
>>> diff --git a/Documentation/printk-formats.txt b/Documentation/printk-formats.txt
>> []
>>> @@ -1,139 +1,180 @@
>>> -If variable is of Type,        use printk format specifier:
>>> ---------------------------------------------------------
>>> -   int                     %d or %x
>>> -   unsigned int            %u or %x
>>> -   long                    %ld or %lx
>>> -   unsigned long           %lu or %lx
>>> -   long long               %lld or %llx
>>> -   unsigned long long      %llu or %llx
>>> -   size_t                  %zu or %zx
>>> -   ssize_t                 %zd or %zx
>>> -   s32                     %d or %x
>>> -   u32                     %u or %x
>>> -   s64                     %lld or %llx
>>> -   u64                     %llu or %llx
>> []
>>> +====================== ===========================
>>> +If variable is of Type use printk format specifier
>>> +====================== ===========================
>>> +``int``                ``%d or %x``
>>> +``unsigned int``       ``%u or %x``
>>> +``long``               ``%ld or %lx``
>>> +``unsigned long``      ``%lu or %lx``
>>> +``long long``          ``%lld or %llx``
>>> +``unsigned long long`` ``%llu or %llx``
>>> +``size_t``             ``%zu or %zx``
>>> +``ssize_t``            ``%zd or %zx``
>>> +``s32``                ``%d or %x``
>>> +``u32``                ``%u or %x``
>>> +``s64``                ``%lld or %llx``
>>> +``u64``                ``%llu or %llx``
>>> +====================== ===========================
> 
> Well, we could, instead, use literal blocks, e.g. something like:
> 
> ::
> 
>  -If variable is of Type,        use printk format specifier:
>  ---------------------------------------------------------
>  -    int                     %d or %x
>  -    unsigned int            %u or %x
>  -    long                    %ld or %lx
>  -    unsigned long           %lu or %lx
>  -    long long               %lld or %llx
>  -    unsigned long long      %llu or %llx
>  -    size_t                  %zu or %zx
>  -    ssize_t                 %zd or %zx
>  -    s32                     %d or %x
>  -    u32                     %u or %x
>  -    s64                     %lld or %llx
>  -    u64                     %llu or %llx
> 

I like that latter option.

thanks.
-- 
~Randy
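
(For reference, a few of the specifiers from the table above in use; a
minimal kernel-style sketch:)

	size_t count = 42;
	s64 delta = -7;
	u64 bytes = 1ULL << 32;

	pr_info("count=%zu delta=%lld bytes=%llu (0x%llx)\n",
		count, delta, bytes, bytes);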


[PATCH v2 05/18] xen/pvcalls: connect to a frontend

2017-05-19 Thread Stefano Stabellini
Introduce a per-frontend data structure named pvcalls_back_priv. It
contains pointers to the command ring, its event channel, a list of
active sockets and a tree of passive sockets (passive sockets need to be
looked up from the id on listen, accept and poll commands, while active
sockets only on release).

It also has an unbound workqueue to schedule the work of parsing and
executing commands on the command ring. socket_lock protects the two
lists. In pvcalls_back_global, keep a list of connected frontends.

Signed-off-by: Stefano Stabellini 
CC: boris.ostrov...@oracle.com
CC: jgr...@suse.com
---
 drivers/xen/pvcalls-back.c | 95 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 95 insertions(+)

diff --git a/drivers/xen/pvcalls-back.c b/drivers/xen/pvcalls-back.c
index b4da138..a48b0d9 100644
--- a/drivers/xen/pvcalls-back.c
+++ b/drivers/xen/pvcalls-back.c
@@ -33,9 +33,104 @@ struct pvcalls_back_global {
struct semaphore frontends_lock;
 } pvcalls_back_global;
 
+/*
+ * Per-frontend data structure. It contains pointers to the command
+ * ring, its event channel, a list of active sockets and a tree of
+ * passive sockets.
+ */
+struct pvcalls_back_priv {
+   struct list_head list;
+   struct xenbus_device *dev;
+   struct xen_pvcalls_sring *sring;
+   struct xen_pvcalls_back_ring ring;
+   int irq;
+   struct list_head socket_mappings;
+   struct radix_tree_root socketpass_mappings;
+   struct semaphore socket_lock;
+   atomic_t work;
+   struct workqueue_struct *wq;
+   struct work_struct register_work;
+};
+
+static void pvcalls_back_work(struct work_struct *work)
+{
+}
+
+static irqreturn_t pvcalls_back_event(int irq, void *dev_id)
+{
+   return IRQ_HANDLED;
+}
+
 static int backend_connect(struct xenbus_device *dev)
 {
+   int err, evtchn;
+   grant_ref_t ring_ref;
+   void *addr = NULL;
+   struct pvcalls_back_priv *priv = NULL;
+
+   priv = kzalloc(sizeof(struct pvcalls_back_priv), GFP_KERNEL);
+   if (!priv)
+   return -ENOMEM;
+
+   err = xenbus_scanf(XBT_NIL, dev->otherend, "port", "%u",
+  &evtchn);
+   if (err != 1) {
+   err = -EINVAL;
+   xenbus_dev_fatal(dev, err, "reading %s/event-channel",
+dev->otherend);
+   goto error;
+   }
+
+   err = xenbus_scanf(XBT_NIL, dev->otherend, "ring-ref", "%u", _ref);
+   if (err != 1) {
+   err = -EINVAL;
+   xenbus_dev_fatal(dev, err, "reading %s/ring-ref",
+dev->otherend);
+   goto error;
+   }
+
+   err = bind_interdomain_evtchn_to_irqhandler(dev->otherend_id, evtchn,
+   pvcalls_back_event, 0,
+   "pvcalls-backend", dev);
+   if (err < 0)
+   goto error;
+   priv->irq = err;
+
+   priv->wq = alloc_workqueue("pvcalls_back_wq", WQ_UNBOUND, 1);
+   if (!priv->wq) {
+   err = -ENOMEM;
+   goto error;
+   }
+
+   err = xenbus_map_ring_valloc(dev, &ring_ref, 1, &addr);
+   if (err < 0)
+   goto error;
+   priv->sring = addr;
+
+   BACK_RING_INIT(&priv->ring, priv->sring, XEN_PAGE_SIZE * 1);
+   priv->dev = dev;
+
+   INIT_WORK(&priv->register_work, pvcalls_back_work);
+   INIT_LIST_HEAD(&priv->socket_mappings);
+   INIT_RADIX_TREE(&priv->socketpass_mappings, GFP_KERNEL);
+   sema_init(&priv->socket_lock, 1);
+   dev_set_drvdata(&dev->dev, priv);
+
+   down(&pvcalls_back_global.frontends_lock);
+   list_add_tail(&priv->list, &pvcalls_back_global.frontends);
+   up(&pvcalls_back_global.frontends_lock);
+   queue_work(priv->wq, &priv->register_work);
+
return 0;
+
+ error:
+   if (addr != NULL)
+   xenbus_unmap_ring_vfree(dev, addr);
+   if (priv->wq)
+   destroy_workqueue(priv->wq);
+   unbind_from_irqhandler(priv->irq, dev);
+   kfree(priv);
+   return err;
 }
 
 static int backend_disconnect(struct xenbus_device *dev)
-- 
1.9.1



[PATCH v2 04/18] xen/pvcalls: xenbus state handling

2017-05-19 Thread Stefano Stabellini
Introduce the code to handle xenbus state changes.

Implement the probe function for the pvcalls backend. Write the
supported versions, max-page-order and function-calls nodes to xenstore,
as required by the protocol.

Introduce stub functions for disconnecting/connecting to a frontend.

Signed-off-by: Stefano Stabellini 
CC: boris.ostrov...@oracle.com
CC: jgr...@suse.com
---
 drivers/xen/pvcalls-back.c | 135 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 135 insertions(+)

diff --git a/drivers/xen/pvcalls-back.c b/drivers/xen/pvcalls-back.c
index 9044cf2..b4da138 100644
--- a/drivers/xen/pvcalls-back.c
+++ b/drivers/xen/pvcalls-back.c
@@ -25,20 +25,155 @@
 #include 
 #include 
 
+#define PVCALLS_VERSIONS "1"
+#define MAX_RING_ORDER XENBUS_MAX_RING_GRANT_ORDER
+
 struct pvcalls_back_global {
struct list_head frontends;
struct semaphore frontends_lock;
 } pvcalls_back_global;
 
+static int backend_connect(struct xenbus_device *dev)
+{
+   return 0;
+}
+
+static int backend_disconnect(struct xenbus_device *dev)
+{
+   return 0;
+}
+
 static int pvcalls_back_probe(struct xenbus_device *dev,
  const struct xenbus_device_id *id)
 {
+   int err;
+
+   err = xenbus_printf(XBT_NIL, dev->nodename, "versions", "%s",
+   PVCALLS_VERSIONS);
+   if (err) {
+   pr_warn("%s write out 'version' failed\n", __func__);
+   return -EINVAL;
+   }
+
+   err = xenbus_printf(XBT_NIL, dev->nodename, "max-page-order", "%u",
+   MAX_RING_ORDER);
+   if (err) {
+   pr_warn("%s write out 'max-page-order' failed\n", __func__);
+   return err;
+   }
+
+   /* "1" means socket, connect, release, bind, listen, accept and poll*/
+   err = xenbus_printf(XBT_NIL, dev->nodename, "function-calls",
+   XENBUS_FUNCTIONS_CALLS);
+   if (err) {
+   pr_warn("%s write out 'function-calls' failed\n", __func__);
+   return err;
+   }
+
+   err = xenbus_switch_state(dev, XenbusStateInitWait);
+   if (err)
+   return err;
+
return 0;
 }
 
+static void set_backend_state(struct xenbus_device *dev,
+ enum xenbus_state state)
+{
+   while (dev->state != state) {
+   switch (dev->state) {
+   case XenbusStateClosed:
+   switch (state) {
+   case XenbusStateInitWait:
+   case XenbusStateConnected:
+   xenbus_switch_state(dev, XenbusStateInitWait);
+   break;
+   case XenbusStateClosing:
+   xenbus_switch_state(dev, XenbusStateClosing);
+   break;
+   default:
+   __WARN();
+   }
+   break;
+   case XenbusStateInitWait:
+   case XenbusStateInitialised:
+   switch (state) {
+   case XenbusStateConnected:
+   backend_connect(dev);
+   xenbus_switch_state(dev, XenbusStateConnected);
+   break;
+   case XenbusStateClosing:
+   case XenbusStateClosed:
+   xenbus_switch_state(dev, XenbusStateClosing);
+   break;
+   default:
+   __WARN();
+   }
+   break;
+   case XenbusStateConnected:
+   switch (state) {
+   case XenbusStateInitWait:
+   case XenbusStateClosing:
+   case XenbusStateClosed:
+   down(&pvcalls_back_global.frontends_lock);
+   backend_disconnect(dev);
+   up(&pvcalls_back_global.frontends_lock);
+   xenbus_switch_state(dev, XenbusStateClosing);
+   break;
+   default:
+   __WARN();
+   }
+   break;
+   case XenbusStateClosing:
+   switch (state) {
+   case XenbusStateInitWait:
+   case XenbusStateConnected:
+   case XenbusStateClosed:
+   xenbus_switch_state(dev, XenbusStateClosed);
+   break;
+   default:
+   __WARN();
+   }
+   break;
+   default:
+   __WARN();
+   }
+   }
+}
+
+static void pvcalls_back_changed(struct xenbus_device *dev,

[PATCH v2 01/18] xen: introduce the pvcalls interface header

2017-05-19 Thread Stefano Stabellini
Introduce the C header file which defines the PV Calls interface. It is
imported from xen/include/public/io/pvcalls.h.

Signed-off-by: Stefano Stabellini 
CC: konrad.w...@oracle.com
CC: boris.ostrov...@oracle.com
CC: jgr...@suse.com
---
 include/xen/interface/io/pvcalls.h | 120 +++++++++++++++++++++++++++++++++
 1 file changed, 120 insertions(+)
 create mode 100644 include/xen/interface/io/pvcalls.h

diff --git a/include/xen/interface/io/pvcalls.h b/include/xen/interface/io/pvcalls.h
new file mode 100644
index 000..0d41959
--- /dev/null
+++ b/include/xen/interface/io/pvcalls.h
@@ -0,0 +1,120 @@
+#ifndef __XEN_PUBLIC_IO_XEN_PVCALLS_H__
+#define __XEN_PUBLIC_IO_XEN_PVCALLS_H__
+
+#include 
+#include "xen/interface/io/ring.h"
+
+/* "1" means socket, connect, release, bind, listen, accept and poll */
+#define XENBUS_FUNCTIONS_CALLS "1"
+
+/*
+ * See docs/misc/pvcalls.markdown in xen.git for the full specification:
+ * https://xenbits.xen.org/docs/unstable/misc/pvcalls.html
+ */
+struct pvcalls_data_intf {
+RING_IDX in_cons, in_prod, in_error;
+
+uint8_t pad1[52];
+
+RING_IDX out_cons, out_prod, out_error;
+
+uint8_t pad2[52];
+
+RING_IDX ring_order;
+grant_ref_t ref[];
+};
+DEFINE_XEN_FLEX_RING(pvcalls);
+
+#define PVCALLS_SOCKET 0
+#define PVCALLS_CONNECT        1
+#define PVCALLS_RELEASE        2
+#define PVCALLS_BIND   3
+#define PVCALLS_LISTEN 4
+#define PVCALLS_ACCEPT 5
+#define PVCALLS_POLL   6
+
+struct xen_pvcalls_request {
+uint32_t req_id; /* private to guest, echoed in response */
+uint32_t cmd;/* command to execute */
+union {
+struct xen_pvcalls_socket {
+uint64_t id;
+uint32_t domain;
+uint32_t type;
+uint32_t protocol;
+} socket;
+struct xen_pvcalls_connect {
+uint64_t id;
+uint8_t addr[28];
+uint32_t len;
+uint32_t flags;
+grant_ref_t ref;
+uint32_t evtchn;
+} connect;
+struct xen_pvcalls_release {
+uint64_t id;
+uint8_t reuse;
+} release;
+struct xen_pvcalls_bind {
+uint64_t id;
+uint8_t addr[28];
+uint32_t len;
+} bind;
+struct xen_pvcalls_listen {
+uint64_t id;
+uint32_t backlog;
+} listen;
+struct xen_pvcalls_accept {
+uint64_t id;
+uint64_t id_new;
+grant_ref_t ref;
+uint32_t evtchn;
+} accept;
+struct xen_pvcalls_poll {
+uint64_t id;
+} poll;
+/* dummy member to force sizeof(struct xen_pvcalls_request)
+ * to match across archs */
+struct xen_pvcalls_dummy {
+uint8_t dummy[56];
+} dummy;
+} u;
+};
+
+struct xen_pvcalls_response {
+uint32_t req_id;
+uint32_t cmd;
+int32_t ret;
+uint32_t pad;
+union {
+struct _xen_pvcalls_socket {
+uint64_t id;
+} socket;
+struct _xen_pvcalls_connect {
+uint64_t id;
+} connect;
+struct _xen_pvcalls_release {
+uint64_t id;
+} release;
+struct _xen_pvcalls_bind {
+uint64_t id;
+} bind;
+struct _xen_pvcalls_listen {
+uint64_t id;
+} listen;
+struct _xen_pvcalls_accept {
+uint64_t id;
+} accept;
+struct _xen_pvcalls_poll {
+uint64_t id;
+} poll;
+struct _xen_pvcalls_dummy {
+uint8_t dummy[8];
+} dummy;
+} u;
+};
+
+DEFINE_RING_TYPES(xen_pvcalls, struct xen_pvcalls_request,
+  struct xen_pvcalls_response);
+
+#endif
-- 
1.9.1
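
(To show how the ring types generated by DEFINE_RING_TYPES() above are meant
to be used, a hedged sketch of a frontend queuing a PVCALLS_SOCKET request;
"example_queue_socket", "front_ring" and the evtchn parameter are assumptions
for illustration, not part of this patch:)

static void example_queue_socket(struct xen_pvcalls_front_ring *front_ring,
				 int evtchn, uint32_t req_id, uint64_t sock_id)
{
	struct xen_pvcalls_request *req;
	int notify;

	/* claim the next request slot on the shared command ring */
	req = RING_GET_REQUEST(front_ring, front_ring->req_prod_pvt++);
	req->req_id = req_id;
	req->cmd = PVCALLS_SOCKET;
	req->u.socket.id = sock_id;
	req->u.socket.domain = AF_INET;
	req->u.socket.type = SOCK_STREAM;
	req->u.socket.protocol = 0;

	/* publish the request and kick the backend if it needs a wakeup */
	RING_PUSH_REQUESTS_AND_CHECK_NOTIFY(front_ring, notify);
	if (notify)
		notify_remote_via_evtchn(evtchn);
}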



[PATCH v2 06/18] xen/pvcalls: handle commands from the frontend

2017-05-19 Thread Stefano Stabellini
When the other end notifies us that there are commands to be read
(pvcalls_back_event), wake up the backend thread to parse the command.

The command ring works like most other Xen rings, so use the usual
ring macros to read and write to it. The functions implementing the
commands are empty stubs for now.

Signed-off-by: Stefano Stabellini 
CC: boris.ostrov...@oracle.com
CC: jgr...@suse.com
---
 drivers/xen/pvcalls-back.c | 115 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 115 insertions(+)

diff --git a/drivers/xen/pvcalls-back.c b/drivers/xen/pvcalls-back.c
index a48b0d9..9dc8a28 100644
--- a/drivers/xen/pvcalls-back.c
+++ b/drivers/xen/pvcalls-back.c
@@ -52,12 +52,127 @@ struct pvcalls_back_priv {
struct work_struct register_work;
 };
 
+static int pvcalls_back_socket(struct xenbus_device *dev,
+   struct xen_pvcalls_request *req)
+{
+   return 0;
+}
+
+static int pvcalls_back_connect(struct xenbus_device *dev,
+   struct xen_pvcalls_request *req)
+{
+   return 0;
+}
+
+static int pvcalls_back_release(struct xenbus_device *dev,
+   struct xen_pvcalls_request *req)
+{
+   return 0;
+}
+
+static int pvcalls_back_bind(struct xenbus_device *dev,
+struct xen_pvcalls_request *req)
+{
+   return 0;
+}
+
+static int pvcalls_back_listen(struct xenbus_device *dev,
+  struct xen_pvcalls_request *req)
+{
+   return 0;
+}
+
+static int pvcalls_back_accept(struct xenbus_device *dev,
+  struct xen_pvcalls_request *req)
+{
+   return 0;
+}
+
+static int pvcalls_back_poll(struct xenbus_device *dev,
+struct xen_pvcalls_request *req)
+{
+   return 0;
+}
+
+static int pvcalls_back_handle_cmd(struct xenbus_device *dev,
+  struct xen_pvcalls_request *req)
+{
+   int ret = 0;
+
+   switch (req->cmd) {
+   case PVCALLS_SOCKET:
+   ret = pvcalls_back_socket(dev, req);
+   break;
+   case PVCALLS_CONNECT:
+   ret = pvcalls_back_connect(dev, req);
+   break;
+   case PVCALLS_RELEASE:
+   ret = pvcalls_back_release(dev, req);
+   break;
+   case PVCALLS_BIND:
+   ret = pvcalls_back_bind(dev, req);
+   break;
+   case PVCALLS_LISTEN:
+   ret = pvcalls_back_listen(dev, req);
+   break;
+   case PVCALLS_ACCEPT:
+   ret = pvcalls_back_accept(dev, req);
+   break;
+   case PVCALLS_POLL:
+   ret = pvcalls_back_poll(dev, req);
+   break;
+   default:
+   ret = -ENOTSUPP;
+   break;
+   }
+   return ret;
+}
+
 static void pvcalls_back_work(struct work_struct *work)
 {
+   struct pvcalls_back_priv *priv = container_of(work,
+   struct pvcalls_back_priv, register_work);
+   int notify, notify_all = 0, more = 1;
+   struct xen_pvcalls_request req;
+   struct xenbus_device *dev = priv->dev;
+
+   atomic_set(&priv->work, 1);
+
+   while (more || !atomic_dec_and_test(&priv->work)) {
+   while (RING_HAS_UNCONSUMED_REQUESTS(&priv->ring)) {
+   RING_COPY_REQUEST(&priv->ring,
+ priv->ring.req_cons++,
+ &req);
+
+   if (!pvcalls_back_handle_cmd(dev, &req)) {
+   RING_PUSH_RESPONSES_AND_CHECK_NOTIFY(
+   &priv->ring, notify);
+   notify_all += notify;
+   }
+   }
+
+   if (notify_all)
+   notify_remote_via_irq(priv->irq);
+
+   RING_FINAL_CHECK_FOR_REQUESTS(&priv->ring, more);
+   }
 }
 
 static irqreturn_t pvcalls_back_event(int irq, void *dev_id)
 {
+   struct xenbus_device *dev = dev_id;
+   struct pvcalls_back_priv *priv = NULL;
+
+   if (dev == NULL)
+   return IRQ_HANDLED;
+
+   priv = dev_get_drvdata(&dev->dev);
+   if (priv == NULL)
+   return IRQ_HANDLED;
+
+   atomic_inc(&priv->work);
+   queue_work(priv->wq, &priv->register_work);
+
return IRQ_HANDLED;
 }
 
-- 
1.9.1



[PATCH v2 10/18] xen/pvcalls: implement listen command

2017-05-19 Thread Stefano Stabellini
Call inet_listen to implement the listen command.

Signed-off-by: Stefano Stabellini 
CC: boris.ostrov...@oracle.com
CC: jgr...@suse.com
---
 drivers/xen/pvcalls-back.c | 21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/drivers/xen/pvcalls-back.c b/drivers/xen/pvcalls-back.c
index d3278bd..de82bf5 100644
--- a/drivers/xen/pvcalls-back.c
+++ b/drivers/xen/pvcalls-back.c
@@ -347,7 +347,26 @@ static int pvcalls_back_bind(struct xenbus_device *dev,
 static int pvcalls_back_listen(struct xenbus_device *dev,
   struct xen_pvcalls_request *req)
 {
-   return 0;
+   struct pvcalls_back_priv *priv;
+   int ret = -EINVAL;
+   struct sockpass_mapping *map;
+   struct xen_pvcalls_response *rsp;
+
+   priv = dev_get_drvdata(&dev->dev);
+
+   map = radix_tree_lookup(&priv->socketpass_mappings, req->u.listen.id);
+   if (map == NULL)
+   goto out;
+
+   ret = inet_listen(map->sock, req->u.listen.backlog);
+
+out:
+   rsp = RING_GET_RESPONSE(&priv->ring, priv->ring.rsp_prod_pvt++);
+   rsp->req_id = req->req_id;
+   rsp->cmd = req->cmd;
+   rsp->u.listen.id = req->u.listen.id;
+   rsp->ret = ret;
+   return ret;
 }
 
 static int pvcalls_back_accept(struct xenbus_device *dev,
-- 
1.9.1



[PATCH v2 07/18] xen/pvcalls: implement socket command

2017-05-19 Thread Stefano Stabellini
Just reply with success to the other end for now. Delay the allocation
of the actual socket to bind and/or connect.

Signed-off-by: Stefano Stabellini 
CC: boris.ostrov...@oracle.com
CC: jgr...@suse.com
---
 drivers/xen/pvcalls-back.c | 29 ++++++++++++++++++++++++++++-
 1 file changed, 28 insertions(+), 1 deletion(-)

diff --git a/drivers/xen/pvcalls-back.c b/drivers/xen/pvcalls-back.c
index 9dc8a28..fed54bf 100644
--- a/drivers/xen/pvcalls-back.c
+++ b/drivers/xen/pvcalls-back.c
@@ -12,12 +12,17 @@
  * GNU General Public License for more details.
  */
 
+#include 
 #include 
 #include 
 #include 
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
+#include 
 
 #include 
 #include 
@@ -55,7 +60,29 @@ struct pvcalls_back_priv {
 static int pvcalls_back_socket(struct xenbus_device *dev,
struct xen_pvcalls_request *req)
 {
-   return 0;
+   struct pvcalls_back_priv *priv;
+   int ret;
+   struct xen_pvcalls_response *rsp;
+
+   priv = dev_get_drvdata(&dev->dev);
+
+   if (req->u.socket.domain != AF_INET ||
+   req->u.socket.type != SOCK_STREAM ||
+   (req->u.socket.protocol != 0 &&
+req->u.socket.protocol != AF_INET))
+   ret = -EAFNOSUPPORT;
+   else
+   ret = 0;
+
+   /* leave the actual socket allocation for later */
+
+   rsp = RING_GET_RESPONSE(&priv->ring, priv->ring.rsp_prod_pvt++);
+   rsp->req_id = req->req_id;
+   rsp->cmd = req->cmd;
+   rsp->u.socket.id = req->u.socket.id;
+   rsp->ret = ret;
+
+   return ret;
 }
 
 static int pvcalls_back_connect(struct xenbus_device *dev,
-- 
1.9.1



[PATCH v2 11/18] xen/pvcalls: implement accept command

2017-05-19 Thread Stefano Stabellini
Implement the accept command by calling inet_accept. To avoid blocking
in the kernel, call inet_accept(O_NONBLOCK) from a workqueue, which get
scheduled on sk_data_ready (for a passive socket, it means that there
are connections to accept).

Use the reqcopy field to store the request. Accept the new socket from
the delayed work function, create a new sock_mapping for it, map
the indexes page and data ring, and reply to the other end. Allocate an
ioworker for the socket.

Only support one outstanding blocking accept request for every socket at
any time.

Add a field to sock_mapping to remember the passive socket from which an
active socket was created.

Signed-off-by: Stefano Stabellini 
CC: boris.ostrov...@oracle.com
CC: jgr...@suse.com
---
 drivers/xen/pvcalls-back.c | 161 -
 1 file changed, 160 insertions(+), 1 deletion(-)

diff --git a/drivers/xen/pvcalls-back.c b/drivers/xen/pvcalls-back.c
index de82bf5..bc641a8 100644
--- a/drivers/xen/pvcalls-back.c
+++ b/drivers/xen/pvcalls-back.c
@@ -66,6 +66,7 @@ struct pvcalls_ioworker {
 struct sock_mapping {
struct list_head list;
struct pvcalls_back_priv *priv;
+   struct sockpass_mapping *sockpass;
struct socket *sock;
uint64_t id;
grant_ref_t ref;
@@ -267,10 +268,131 @@ static int pvcalls_back_release(struct xenbus_device 
*dev,
 
 static void __pvcalls_back_accept(struct work_struct *work)
 {
+   struct sockpass_mapping *mappass = container_of(
+   work, struct sockpass_mapping, register_work);
+   struct sock_mapping *map;
+   struct pvcalls_ioworker *iow;
+   struct pvcalls_back_priv *priv;
+   struct xen_pvcalls_response *rsp;
+   struct xen_pvcalls_request *req;
+   void *page = NULL;
+   int notify;
+   int ret = -EINVAL;
+   unsigned long flags;
+
+   priv = mappass->priv;
+   /* We only need to check the value of "cmd" atomically on read. */
+   spin_lock_irqsave(>copy_lock, flags);
+   req = >reqcopy;
+   if (req->cmd != PVCALLS_ACCEPT) {
+   spin_unlock_irqrestore(>copy_lock, flags);
+   return;
+   }
+   spin_unlock_irqrestore(>copy_lock, flags);
+
+   map = kzalloc(sizeof(*map), GFP_KERNEL);
+   if (map == NULL) {
+   ret = -ENOMEM;
+   goto out_error;
+   }
+
+   map->sock = sock_alloc();
+   if (!map->sock)
+   goto out_error;
+
+   map->ref = req->u.accept.ref;
+
+   map->priv = priv;
+   map->sockpass = mappass;
+   map->sock->type = mappass->sock->type;
+   map->sock->ops = mappass->sock->ops;
+   map->id = req->u.accept.id_new;
+
+   ret = xenbus_map_ring_valloc(priv->dev, >u.accept.ref, 1, );
+   if (ret < 0)
+   goto out_error;
+   map->ring = page;
+   map->ring_order = map->ring->ring_order;
+   /* first read the order, then map the data ring */
+   virt_rmb();
+   if (map->ring_order > MAX_RING_ORDER) {
+   ret = -EFAULT;
+   goto out_error;
+   }
+   ret = xenbus_map_ring_valloc(priv->dev, map->ring->ref,
+(1 << map->ring_order), );
+   if (ret < 0)
+   goto out_error;
+   map->bytes = page;
+
+   ret = bind_interdomain_evtchn_to_irqhandler(priv->dev->otherend_id,
+   req->u.accept.evtchn,
+   pvcalls_back_conn_event,
+   0,
+   "pvcalls-backend",
+   map);
+   if (ret < 0)
+   goto out_error;
+   map->irq = ret;
+
+   map->data.in = map->bytes;
+   map->data.out = map->bytes + XEN_FLEX_RING_SIZE(map->ring_order);
+
+   map->ioworker.wq = alloc_workqueue("pvcalls_io", WQ_UNBOUND, 1);
+   if (!map->ioworker.wq) {
+   ret = -ENOMEM;
+   goto out_error;
+   }
+   map->ioworker.cpu = get_random_int() % num_online_cpus();
+   atomic_set(>io, 1);
+   INIT_WORK(>ioworker.register_work, pvcalls_back_ioworker);
+
+   down(>socket_lock);
+   list_add_tail(>list, >socket_mappings);
+   up(>socket_lock);
+
+   ret = inet_accept(mappass->sock, map->sock, O_NONBLOCK, true);
+   if (ret == -EAGAIN)
+   goto out_error;
+
+   write_lock_bh(>sock->sk->sk_callback_lock);
+   map->saved_data_ready = map->sock->sk->sk_data_ready;
+   map->sock->sk->sk_user_data = map;
+   map->sock->sk->sk_data_ready = pvcalls_sk_data_ready;
+   map->sock->sk->sk_state_change = pvcalls_sk_state_change;
+   write_unlock_bh(>sock->sk->sk_callback_lock);
+
+   iow = >ioworker;
+   atomic_inc(>read);
+   atomic_inc(>io);
+   queue_work_on(iow->cpu, iow->wq, >register_work);
+

[PATCH v2 11/18] xen/pvcalls: implement accept command

2017-05-19 Thread Stefano Stabellini
Implement the accept command by calling inet_accept. To avoid blocking
in the kernel, call inet_accept(O_NONBLOCK) from a workqueue, which gets
scheduled on sk_data_ready (for a passive socket, this means that there
are connections to accept).

Use the reqcopy field to store the request. Accept the new socket from
the delayed work function, create a new sock_mapping for it, map
the indexes page and data ring, and reply to the other end. Allocate an
ioworker for the socket.

Only support one outstanding blocking accept request for every socket at
any time.

Add a field to sock_mapping to remember the passive socket from which an
active socket was created.

Signed-off-by: Stefano Stabellini 
CC: boris.ostrov...@oracle.com
CC: jgr...@suse.com
---
 drivers/xen/pvcalls-back.c | 161 -
 1 file changed, 160 insertions(+), 1 deletion(-)

diff --git a/drivers/xen/pvcalls-back.c b/drivers/xen/pvcalls-back.c
index de82bf5..bc641a8 100644
--- a/drivers/xen/pvcalls-back.c
+++ b/drivers/xen/pvcalls-back.c
@@ -66,6 +66,7 @@ struct pvcalls_ioworker {
 struct sock_mapping {
struct list_head list;
struct pvcalls_back_priv *priv;
+   struct sockpass_mapping *sockpass;
struct socket *sock;
uint64_t id;
grant_ref_t ref;
@@ -267,10 +268,131 @@ static int pvcalls_back_release(struct xenbus_device *dev,
 
 static void __pvcalls_back_accept(struct work_struct *work)
 {
+   struct sockpass_mapping *mappass = container_of(
+   work, struct sockpass_mapping, register_work);
+   struct sock_mapping *map;
+   struct pvcalls_ioworker *iow;
+   struct pvcalls_back_priv *priv;
+   struct xen_pvcalls_response *rsp;
+   struct xen_pvcalls_request *req;
+   void *page = NULL;
+   int notify;
+   int ret = -EINVAL;
+   unsigned long flags;
+
+   priv = mappass->priv;
+   /* We only need to check the value of "cmd" atomically on read. */
+   spin_lock_irqsave(&mappass->copy_lock, flags);
+   req = &mappass->reqcopy;
+   if (req->cmd != PVCALLS_ACCEPT) {
+   spin_unlock_irqrestore(&mappass->copy_lock, flags);
+   return;
+   }
+   spin_unlock_irqrestore(&mappass->copy_lock, flags);
+
+   map = kzalloc(sizeof(*map), GFP_KERNEL);
+   if (map == NULL) {
+   ret = -ENOMEM;
+   goto out_error;
+   }
+
+   map->sock = sock_alloc();
+   if (!map->sock)
+   goto out_error;
+
+   map->ref = req->u.accept.ref;
+
+   map->priv = priv;
+   map->sockpass = mappass;
+   map->sock->type = mappass->sock->type;
+   map->sock->ops = mappass->sock->ops;
+   map->id = req->u.accept.id_new;
+
+   ret = xenbus_map_ring_valloc(priv->dev, &req->u.accept.ref, 1, &page);
+   if (ret < 0)
+   goto out_error;
+   map->ring = page;
+   map->ring_order = map->ring->ring_order;
+   /* first read the order, then map the data ring */
+   virt_rmb();
+   if (map->ring_order > MAX_RING_ORDER) {
+   ret = -EFAULT;
+   goto out_error;
+   }
+   ret = xenbus_map_ring_valloc(priv->dev, map->ring->ref,
+(1 << map->ring_order), &page);
+   if (ret < 0)
+   goto out_error;
+   map->bytes = page;
+
+   ret = bind_interdomain_evtchn_to_irqhandler(priv->dev->otherend_id,
+   req->u.accept.evtchn,
+   pvcalls_back_conn_event,
+   0,
+   "pvcalls-backend",
+   map);
+   if (ret < 0)
+   goto out_error;
+   map->irq = ret;
+
+   map->data.in = map->bytes;
+   map->data.out = map->bytes + XEN_FLEX_RING_SIZE(map->ring_order);
+
+   map->ioworker.wq = alloc_workqueue("pvcalls_io", WQ_UNBOUND, 1);
+   if (!map->ioworker.wq) {
+   ret = -ENOMEM;
+   goto out_error;
+   }
+   map->ioworker.cpu = get_random_int() % num_online_cpus();
+   atomic_set(&map->io, 1);
+   INIT_WORK(>ioworker.register_work, pvcalls_back_ioworker);
+
+   down(&priv->socket_lock);
+   list_add_tail(&map->list, &priv->socket_mappings);
+   up(&priv->socket_lock);
+
+   ret = inet_accept(mappass->sock, map->sock, O_NONBLOCK, true);
+   if (ret == -EAGAIN)
+   goto out_error;
+
+   write_lock_bh(&map->sock->sk->sk_callback_lock);
+   map->saved_data_ready = map->sock->sk->sk_data_ready;
+   map->sock->sk->sk_user_data = map;
+   map->sock->sk->sk_data_ready = pvcalls_sk_data_ready;
+   map->sock->sk->sk_state_change = pvcalls_sk_state_change;
+   write_unlock_bh(&map->sock->sk->sk_callback_lock);
+
+   iow = &map->ioworker;
+   atomic_inc(&map->read);
+   atomic_inc(&map->io);
+   queue_work_on(iow->cpu, iow->wq, &iow->register_work);
+
+out_error:
+   if (ret < 
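
The error path above pairs with the front half of the handshake in
pvcalls_back_accept(), which stores the request and defers the real
work to the workqueue; a rough sketch, assuming it mirrors the reqcopy
pattern of pvcalls_back_poll() below:

	spin_lock_irqsave(&mappass->copy_lock, flags);
	if (mappass->reqcopy.cmd != 0) {
		/* only one outstanding accept or poll per passive socket */
		ret = -EINTR;
		spin_unlock_irqrestore(&mappass->copy_lock, flags);
		goto out;	/* hypothetical error label */
	}
	mappass->reqcopy = *req;	/* stash the request ... */
	spin_unlock_irqrestore(&mappass->copy_lock, flags);
	queue_work(mappass->wq, &mappass->register_work);	/* ... defer it */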

Re: Widespread crashes in -next, bisected to 'mm: drop HASH_ADAPT'

2017-05-19 Thread Kevin Hilman
On Fri, May 19, 2017 at 9:46 AM, Guenter Roeck <li...@roeck-us.net> wrote:
> Hi,
>
> my qemu tests of next-20170519 show the following results:
> total: 122 pass: 30 fail: 92
>
> I won't bother listing all of the failures; they are available at
> http://kerneltests.org/builders. I bisected one (openrisc, because
> it gives me some console output before dying). It points to
> 'mm: drop HASH_ADAPT' as the culprit. Bisect log is attached.
>
> A quick glance suggests that 64 bit kernels pass and 32 bit kernels fail.
> 32-bit x86 images fail and should provide an easy test case.

32-bit ARM platforms also failing.

I also noticed widespread boot failures in kernelci.org, and bisected
one of them (32-bit ARM, Beaglebone Black) and it pointed at the same
patch.

Kevin

> Guenter
>
> ---
> # bad: [5666af8ae4a18b5ea6758d0c7c42ea765de216d2] Add linux-next specific files for 20170519
> # good: [2ea659a9ef488125eb46da6eb571de5eae5c43f6] Linux 4.12-rc1
> git bisect start 'HEAD' 'v4.12-rc1'
> # good: [c7115270d8cc333801b11e541ddbc43e320a88ef] Merge remote-tracking branch 'drm/drm-next'
> git bisect good c7115270d8cc333801b11e541ddbc43e320a88ef
> # good: [6bf84ee44e057051577be7edf306dc595b8d3c0f] Merge remote-tracking branch 'tip/auto-latest'
> git bisect good 6bf84ee44e057051577be7edf306dc595b8d3c0f
> # good: [8def67a06d65a1b08c87a65a8ef4fd6e71b6745c] Merge remote-tracking branch 'staging/staging-next'
> git bisect good 8def67a06d65a1b08c87a65a8ef4fd6e71b6745c
> # good: [0d538a750eaab91fc3f6dffe4c0e7d98d6252b81] Merge remote-tracking branch 'userns/for-next'
> git bisect good 0d538a750eaab91fc3f6dffe4c0e7d98d6252b81
> # good: [eb64959cd8c405de533122dc72b64d6ca197cee1] powerpc/mm/hugetlb: remove follow_huge_addr for powerpc
> git bisect good eb64959cd8c405de533122dc72b64d6ca197cee1
> # bad: [eb520e759caf124ba1c64e277939ff379d0ca8bd] procfs: fdinfo: extend information about epoll target files
> git bisect bad eb520e759caf124ba1c64e277939ff379d0ca8bd
> # bad: [45f5e427d6326ca1c44cd6897b9939441063fb96] lib/kstrtox.c: use "unsigned int" more
> git bisect bad 45f5e427d6326ca1c44cd6897b9939441063fb96
> # bad: [d75db247b8f204bfa2e6d2b40afcae74f3b4c7fd] mm: drop NULL return check of pte_offset_map_lock()
> git bisect bad d75db247b8f204bfa2e6d2b40afcae74f3b4c7fd
> # good: [d4c9af9111d483efd5f302916639a0e9a626f90f] mm: adaptive hash table scaling
> git bisect good d4c9af9111d483efd5f302916639a0e9a626f90f
> # bad: [90d2d8d8960a1b2ed868ce3bfd91e2ac8d4ff9aa] mm/hugetlb: clean up ARCH_HAS_GIGANTIC_PAGE
> git bisect bad 90d2d8d8960a1b2ed868ce3bfd91e2ac8d4ff9aa
> # bad: [67d0687224a93ef2adae7a2ed10f25b275f2ee91] mm: drop HASH_ADAPT
> git bisect bad 67d0687224a93ef2adae7a2ed10f25b275f2ee91
> # first bad commit: [67d0687224a93ef2adae7a2ed10f25b275f2ee91] mm: drop HASH_ADAPT


[PATCH v2 12/18] xen/pvcalls: implement poll command

2017-05-19 Thread Stefano Stabellini
Implement poll on passive sockets by requesting a delayed response with
mappass->reqcopy, and replying when there is data on the passive
socket.

Poll on an active socket is left unimplemented, as per the spec: the
frontend should just wait for events and check the indexes on the
indexes page.

Only support one outstanding poll (or accept) request for every passive
socket at any given time.

Signed-off-by: Stefano Stabellini 
CC: boris.ostrov...@oracle.com
CC: jgr...@suse.com
---
 drivers/xen/pvcalls-back.c | 75 --
 1 file changed, 73 insertions(+), 2 deletions(-)

diff --git a/drivers/xen/pvcalls-back.c b/drivers/xen/pvcalls-back.c
index bc641a8..5ff1504 100644
--- a/drivers/xen/pvcalls-back.c
+++ b/drivers/xen/pvcalls-back.c
@@ -388,11 +388,33 @@ static void __pvcalls_back_accept(struct work_struct *work)
 static void pvcalls_pass_sk_data_ready(struct sock *sock)
 {
struct sockpass_mapping *mappass = sock->sk_user_data;
+   struct pvcalls_back_priv *priv;
+   struct xen_pvcalls_response *rsp;
+   unsigned long flags;
+   int notify;
 
if (mappass == NULL)
return;
 
-   queue_work(mappass->wq, &mappass->register_work);
+   priv = mappass->priv;
+   spin_lock_irqsave(&mappass->copy_lock, flags);
+   if (mappass->reqcopy.cmd == PVCALLS_POLL) {
+   rsp = RING_GET_RESPONSE(&priv->ring, priv->ring.rsp_prod_pvt++);
+   rsp->req_id = mappass->reqcopy.req_id;
+   rsp->u.poll.id = mappass->reqcopy.u.poll.id;
+   rsp->cmd = mappass->reqcopy.cmd;
+   rsp->ret = 0;
+
+   mappass->reqcopy.cmd = 0;
+   spin_unlock_irqrestore(>copy_lock, flags);
+
+   RING_PUSH_RESPONSES_AND_CHECK_NOTIFY(>ring, notify);
+   if (notify)
+   notify_remote_via_irq(mappass->priv->irq);
+   } else {
+   spin_unlock_irqrestore(>copy_lock, flags);
+   queue_work(mappass->wq, >register_work);
+   }
 }
 
 static int pvcalls_back_bind(struct xenbus_device *dev,
@@ -537,7 +559,56 @@ static int pvcalls_back_accept(struct xenbus_device *dev,
 static int pvcalls_back_poll(struct xenbus_device *dev,
 struct xen_pvcalls_request *req)
 {
-   return 0;
+   struct pvcalls_back_priv *priv;
+   struct sockpass_mapping *mappass;
+   struct xen_pvcalls_response *rsp;
+   struct inet_connection_sock *icsk;
+   struct request_sock_queue *queue;
+   unsigned long flags;
+   int ret;
+   bool data;
+
+   priv = dev_get_drvdata(&dev->dev);
+
+   mappass = radix_tree_lookup(&priv->socketpass_mappings, req->u.poll.id);
+   if (mappass == NULL)
+   return -EINVAL;
+
+   /*
+* Limitation of the current implementation: only support one
+* concurrent accept or poll call on one socket.
+*/
+   spin_lock_irqsave(&mappass->copy_lock, flags);
+   if (mappass->reqcopy.cmd != 0) {
+   ret = -EINTR;
+   goto out;
+   }
+
+   mappass->reqcopy = *req;
+   icsk = inet_csk(mappass->sock->sk);
+   queue = &icsk->icsk_accept_queue;
+   spin_lock(&queue->rskq_lock);
+   data = queue->rskq_accept_head != NULL;
+   spin_unlock(&queue->rskq_lock);
+   if (data) {
+   mappass->reqcopy.cmd = 0;
+   ret = 0;
+   goto out;
+   }
+   spin_unlock_irqrestore(&mappass->copy_lock, flags);
+
+   /* Tell the caller we don't need to send back a notification yet */
+   return -1;
+
+out:
+   spin_unlock_irqrestore(&mappass->copy_lock, flags);
+
+   rsp = RING_GET_RESPONSE(&priv->ring, priv->ring.rsp_prod_pvt++);
+   rsp->req_id = req->req_id;
+   rsp->cmd = req->cmd;
+   rsp->u.poll.id = req->u.poll.id;
+   rsp->ret = ret;
+   return ret;
 }
 
 static int pvcalls_back_handle_cmd(struct xenbus_device *dev,
-- 
1.9.1
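
The -1 return feeds back into pvcalls_back_handle_cmd(): the dispatcher
must skip the notification for this request and let
pvcalls_pass_sk_data_ready() send the response later. A sketch of how a
dispatcher might consume it (assumed shape, not the literal code from
this series):

	ret = pvcalls_back_poll(dev, req);
	if (ret == -1) {
		/* response deferred: pvcalls_pass_sk_data_ready() will
		 * fill in the ring slot and notify the frontend once the
		 * passive socket has pending connections */
		return 0;
	}
	/* otherwise the handler queued a response; push and notify now */
	RING_PUSH_RESPONSES_AND_CHECK_NOTIFY(&priv->ring, notify);
	if (notify)
		notify_remote_via_irq(priv->irq);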


