[PATCH] net: ll_temac: Fix DMA map size bug

2015-05-11 Thread Michal Simek
The DMA mapping for the first descriptor was sized with skb->len, but
only the linear part of the skb (skb_headlen()) is transferred through
it; fragments are mapped separately. Map exactly the head length.

Signed-off-by: Michal Simek 
---

 drivers/net/ethernet/xilinx/ll_temac_main.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/xilinx/ll_temac_main.c b/drivers/net/ethernet/xilinx/ll_temac_main.c
index ca640d04fd93..cfb6bdb37fdc 100644
--- a/drivers/net/ethernet/xilinx/ll_temac_main.c
+++ b/drivers/net/ethernet/xilinx/ll_temac_main.c
@@ -705,8 +705,8 @@ static int temac_start_xmit(struct sk_buff *skb, struct net_device *ndev)
 
cur_p->app0 |= STS_CTRL_APP0_SOP;
cur_p->len = skb_headlen(skb);
-   cur_p->phys = dma_map_single(ndev->dev.parent, skb->data, skb->len,
-DMA_TO_DEVICE);
+   cur_p->phys = dma_map_single(ndev->dev.parent, skb->data,
+   skb_headlen(skb), DMA_TO_DEVICE);
cur_p->app4 = (unsigned long)skb;
 
for (ii = 0; ii < num_frag; ii++) {
-- 
2.3.5
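
For context, a minimal sketch of the pattern the fix restores (illustrative
only, not part of the patch): the linear area at skb->data is mapped with
skb_headlen(), and each paged fragment gets its own mapping, so sizing the
first mapping with skb->len would map past the end of the linear buffer.

	/* Sketch: map only the linear part of the skb; frags follow. */
	static dma_addr_t map_skb_head(struct device *dev, struct sk_buff *skb)
	{
		/* skb_headlen(skb) == skb->len - skb->data_len */
		return dma_map_single(dev, skb->data, skb_headlen(skb),
				      DMA_TO_DEVICE);
	}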



Re: [PATCH v9 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel

2015-05-11 Thread Dave Young
On 05/11/15 at 12:11pm, Joerg Roedel wrote:
> On Thu, May 07, 2015 at 09:56:00PM +0800, Dave Young wrote:
> > Joerg, I cannot find the last reply from you, so just reply here about
> > my worries here.
> > 
> > I said that the patchset will cause more problems, let me explain about
> > it more here:
> > 
> > Suppose the page table was corrupted, i.e. the original mapping iova1 -> page 1
> > was accidentally changed to iova1 -> page 2 while the crash was happening;
> > thus future dma will read/write page 2 instead of page 1, right?
> 
> When the page-table is corrupted then it is a left-over from the old
> kernel. When the kdump kernel boots the situation can at least not get
> worse. For the page tables it is also hard to detect wrong mappings (if
> this would be possible the hardware could already do it), so any checks
> we could do there are of limited use anyway.

Joerg, since both of you do not think it is a problem, I will not object
to it any more, though I still do not like reusing the old page tables.
So let's leave it as a future issue.

Thanks
Dave


Re: [PATCH v6 0/6] arm64: Add kernel probes (kprobes) support

2015-05-11 Thread David Long

On 05/05/15 11:48, Will Deacon wrote:

On Tue, May 05, 2015 at 06:14:51AM +0100, David Long wrote:

On 05/01/15 21:44, William Cohen wrote:

Dave Long and I did some additional experimentation to better
understand what condition causes the kernel to sometimes spew:

Unexpected kernel single-step exception at EL1

The functioncallcount.stp test instruments the entry and return of
every function in the mm files, including kfree.  In most cases the
arm64 trampoline_probe_handler just determines which return probe
instance matches the current conditions, runs the associated handler,
and recycles the return probe instance for another use by placing it
on an hlist.  However, it is possible that a return probe instance has
been set up on function entry and the return probe is unregistered
before the return probe instance fires.  In this case kfree is called
by the trampoline handler to remove the return probe instances related
to the unregistered kretprobe.  This case, where the kprobed kfree is
called within the arm64 trampoline_probe_handler function, triggers
the problem.

The kprobe breakpoint for the kfree call from within the
trampoline_probe_handler is encountered and started, but things go
wrong when attempting the single step on the instruction.

It took a while to trigger this problem with the systemtap testsuite.
Dave Long came up with steps that reproduce this more quickly with a
probed function that is always called within the trampoline handler.
Trying the same on x86_64 doesn't trigger the problem.  It appears
that the x86_64 code can handle a single step from within the
trampoline_handler.



I'm assuming there are no plans for supporting software breakpoint debug
exceptions during processing of single-step exceptions any time soon on
arm64.  Given that, the only solution I can come up with is this: instead
of making this orphaned kretprobe instance list exist only temporarily
(in the scope of the kretprobe trampoline handler), make it always exist,
and kfree any items found on it as part of a periodic cleanup running
outside of the handler context (see the sketch below).  I think these
changes would still all be in architecture-specific code.  This doesn't
feel to me like a bad solution.  Does anyone think there is a simpler way
out of this?
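
To make that concrete, a rough sketch of the idea (names, locking and the
delayed-work plumbing are invented for illustration; this is not actual
arm64 code): orphaned instances go on a permanent list and are freed from
workqueue context, where hitting the kprobe on kfree is harmless.

	/* Sketch only: permanent orphan list, reaped outside handler context. */
	static HLIST_HEAD(orphaned_ri);
	static DEFINE_SPINLOCK(orphan_lock);

	static void orphan_reaper(struct work_struct *work)
	{
		struct kretprobe_instance *ri;
		struct hlist_node *tmp;

		spin_lock_irq(&orphan_lock);
		hlist_for_each_entry_safe(ri, tmp, &orphaned_ri, hlist) {
			hlist_del(&ri->hlist);
			kfree(ri);	/* safe: not inside a kprobe handler */
		}
		spin_unlock_irq(&orphan_lock);
	}
	static DECLARE_DELAYED_WORK(orphan_work, orphan_reaper);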


Just to clarify, is the problem here the software breakpoint exception,
or trying to step the faulting instruction whilst we were already handling
a step?



Sorry for the delay, I got tripped up with some global optimizations 
that happened when I made more testing changes.  When the kprobes 
software breakpoint handler for kretprobes is reentered it sets up the 
single-step and that ends up hitting inside entry.S, apparently in 
el1_undef.



I think I'd be inclined to keep the code run in debug context to a minimum.
We already can't block there, and the more code we add the more black spots
we end up with in the kernel itself. The alternative would be to make your
kprobes code re-entrant, but that sounds like a nightmare.

You say this works on x86. How do they handle it? Is the nested probe
on kfree ignored or handled?



Will Cohen's email pointing out that x86 does not use a breakpoint for 
the trampoline handler explains a lot.  I'm experimenting with his 
proposed new trampoline code.  I can't see a reason this can't be made 
to work, so given everything it doesn't seem interesting to try to 
understand the failure in reentering the kprobe break handler in any 
more detail.


-dave long




Re: VERIFY_READ/WRITE in uaccess.h?

2015-05-11 Thread H. Peter Anvin
On 05/11/2015 02:42 PM, Linus Torvalds wrote:
> 
> That one - for the same reasons - also checked the actual accesses,
> not just that the range was in user mode. Exactly because it needed to
> pre-COW the pages (even if that was then obviously racy in threaded
> environments - in practice it worked, and we tried to support the
> fundamentally broken i386 hardware protection model for a long time).
> 

It worked in part because we never supported SMP on i386.

-hpa




Re: [PATCH v3 05/11] scatterlist: use sg_phys()

2015-05-11 Thread Dan Williams
On Mon, May 11, 2015 at 10:24 PM, Julia Lawall  wrote:
>
>
> On Tue, 12 May 2015, Dan Williams wrote:
>
>> Coccinelle cleanup to replace open coded sg to physical address
>> translations.  This is in preparation for introducing scatterlists that
>> reference pfn(s) without a backing struct page.
>>
>> // sg_phys.cocci: convert usage page_to_phys(sg_page(sg)) to sg_phys(sg)
>> // usage: make coccicheck COCCI=sg_phys.cocci MODE=patch
>>
>> virtual patch
>> virtual report
>> virtual org
>
> Just for information, you don't need the three lines above.  They are
> only useful when you want the semantic patch to support several kinds
> of output.
>

Ok, I think I added them from copying a coccicheck script, and if I
delete virtual patch I get

"virtual rule patch not supported"

when running:

make coccicheck COCCI=sg_phys.cocci MODE=patch

I suspect I am invoking it wrong.
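
For reference, the helper the script converts to is just the open-coded
form, packaged as an inline in include/linux/scatterlist.h:

	static inline dma_addr_t sg_phys(struct scatterlist *sg)
	{
		return page_to_phys(sg_page(sg)) + sg->offset;
	}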


Re: [v2 0/5] arm64: add kdump support

2015-05-11 Thread Dave Young
On 05/11/15 at 03:16pm, AKASHI Takahiro wrote:
> Hi
> 
> Sorry for late response. I was on vacation.
> 
> On 04/24/2015 06:53 PM, Mark Rutland wrote:
> >Hi,
> >
> >On Fri, Apr 24, 2015 at 08:53:03AM +0100, AKASHI Takahiro wrote:
> >>This patch set enables kdump (crash dump kernel) support on arm64 on top of
> >>Geoff's kexec patchset.
> >>
> >>In this version, there are some arm64-specific usage/constraints:
> >>1) "mem=" boot parameter must be specified on crash dump kernel
> >>if the system starts on uefi.
> >
> >This sounds very painful. Why is this the case, and how do x86 and/or
> >ia64 get around that?
> 
> As Dave (Young) said, x86 uses "memmap=XX" kernel commandline parameters
> to specify usable memory for crash dump kernel.

Originally x86 used memmap=exactmap plus memmap=XX entries to specify
each section of memory for the 2nd kernel. But later, because a lot of
reserved-type ranges need to be passed as well (e.g. for pci mmconfig)
and the kernel cmdline buffer is limited, kexec-tools switched to
passing these directly in the x86 boot params as E820 memory ranges.
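
(For illustration, such a command line looked roughly like

	memmap=exactmap memmap=640K@0 memmap=128M@768M

with invented values: exactmap discards the firmware-provided map and
each memmap=size@addr adds back one usable range.)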

> On my arm64 implementation, "linux,usable-memory" property is added
> to device tree blob by kexec-tools for this purpose.
> This is because, when I first implemented kdump on arm64, ppc was the only
> architecture that supported kdump AND utilized device trees.
> Since kexec-tools as well as the kernel already has this framework,
> I believed that device-tree approach was smarter than a commandline
> parameter.
> 
> However, uefi-based kernel ignores all the memory-related properties
> in a device tree and so this "mem=" workaround was added.

The kdump kernel reuses the memmap info obtained from firmware during the
1st kernel boot; I do not think that memmap info can be cooked down to
just the crash kernel's usable memory. But it might be a better way to
use a special fdt node for crash kernel memory even for UEFI..

Another way is introducing a similar memmap=, but maybe considering only
system-ram type ranges; for other memory areas, still use the UEFI memmap.

Thanks
Dave


Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)

2015-05-11 Thread Dave Chinner
On Mon, May 11, 2015 at 07:34:34PM -0700, Daniel Phillips wrote:
> Anyway, everybody but you loves competitive benchmarks, that is why I

I think Ted and I are on the same page here. "Competitive
benchmarks" only matter to the people who are trying to sell
something. You're trying to sell Tux3, but

> post them. They are not only useful for tracking down performance bugs,
> but as you point out, they help us advertise the reasons why Tux3 is
> interesting and ought to be merged.

 benchmarks won't get tux3 merged.

Addressing the significant issues that have been raised during
previous code reviews is what will get it merged.  I posted that
list elsewhere in this thread, to which you replied that they were all
"on the list of things to do except for the page forking design".

The "except page forking design" statement is your biggest hurdle
for getting tux3 merged, not performance. Without page forking, tux3
cannot be merged at all. But it's not filesystem developers you need
to convince about the merits of the page forking design and
implementation - it's the mm and core kernel developers that need to
review and accept that code *before* we can consider merging tux3.

IOWs, you need to focus on the important things needed to achieve
your stated goal of getting tux3 merged. New filesystems should be
faster than those based on 20-25 year old designs, so you don't need
to waste time trying to convince people that tux3, when complete,
will be fast.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC PATCH] iommu: add ARM short descriptor page table allocator.

2015-05-11 Thread Yong Wu
On Mon, 2015-05-11 at 11:40 +0100, Will Deacon wrote:
> On Mon, May 11, 2015 at 11:13:09AM +0100, Joerg Roedel wrote:
> > On Tue, May 05, 2015 at 06:05:41PM +0100, Will Deacon wrote:
> > > I think the MT8173 IOMMU [1] uses this format and, since it's part of
> > > the ARM architecture, the ARM SMMU can make use of it too if the
> > > implementation supports it.
> > 
> > I think it makes more sense to merge this with a driver actually using
> > it. I guess a driver for the above mentioned IOMMU will be submitted
> > soon?
> 
> I completely agree; since there are already patches floating around for
> that IOMMU, it's probably worth them being put into a single series.
> 
> Will
Hi Joerg, Will, 
 Thanks. We will fold this into the mtk-iommu patch series and
plan to send it out this week.




Re: [PATCH v3 05/11] scatterlist: use sg_phys()

2015-05-11 Thread Julia Lawall


On Tue, 12 May 2015, Dan Williams wrote:

> Coccinelle cleanup to replace open coded sg to physical address
> translations.  This is in preparation for introducing scatterlists that
> reference pfn(s) without a backing struct page.
> 
> // sg_phys.cocci: convert usage page_to_phys(sg_page(sg)) to sg_phys(sg)
> // usage: make coccicheck COCCI=sg_phys.cocci MODE=patch
> 
> virtual patch
> virtual report
> virtual org

Just for information, you don't need the three lines above.  They are 
only useful when you want the semantic patch to support several kinds of 
output.

julia

> @@
> struct scatterlist *sg;
> @@
> 
> - page_to_phys(sg_page(sg)) + sg->offset
> + sg_phys(sg)
> 
> @@
> struct scatterlist *sg;
> @@
> 
> - page_to_phys(sg_page(sg))
> + sg_phys(sg) - sg->offset
> 
> Cc: Julia Lawall 
> Signed-off-by: Dan Williams 
> ---
>  arch/arm/mm/dma-mapping.c|2 +-
>  arch/microblaze/kernel/dma.c |2 +-
>  drivers/iommu/intel-iommu.c  |4 ++--
>  drivers/iommu/iommu.c|2 +-
>  drivers/staging/android/ion/ion_chunk_heap.c |4 ++--
>  5 files changed, 7 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
> index 09c5fe3d30c2..43cc6a8fdacc 100644
> --- a/arch/arm/mm/dma-mapping.c
> +++ b/arch/arm/mm/dma-mapping.c
> @@ -1502,7 +1502,7 @@ static int __map_sg_chunk(struct device *dev, struct scatterlist *sg,
>   return -ENOMEM;
>  
>   for (count = 0, s = sg; count < (size >> PAGE_SHIFT); s = sg_next(s)) {
> - phys_addr_t phys = page_to_phys(sg_page(s));
> + phys_addr_t phys = sg_phys(s) - s->offset;
>   unsigned int len = PAGE_ALIGN(s->offset + s->length);
>  
>   if (!is_coherent &&
> diff --git a/arch/microblaze/kernel/dma.c b/arch/microblaze/kernel/dma.c
> index ed7ba8a11822..dcb3c594d626 100644
> --- a/arch/microblaze/kernel/dma.c
> +++ b/arch/microblaze/kernel/dma.c
> @@ -61,7 +61,7 @@ static int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl,
>   /* FIXME this part of code is untested */
>   for_each_sg(sgl, sg, nents, i) {
>   sg->dma_address = sg_phys(sg);
> - __dma_sync(page_to_phys(sg_page(sg)) + sg->offset,
> + __dma_sync(sg_phys(sg),
>   sg->length, direction);
>   }
>  
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index 68d43beccb7e..9b9ada71e0d3 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -1998,7 +1998,7 @@ static int __domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
>   sg_res = aligned_nrpages(sg->offset, sg->length);
>   sg->dma_address = ((dma_addr_t)iov_pfn << 
> VTD_PAGE_SHIFT) + sg->offset;
>   sg->dma_length = sg->length;
> - pteval = page_to_phys(sg_page(sg)) | prot;
> + pteval = (sg_phys(sg) - sg->offset) | prot;
>   phys_pfn = pteval >> VTD_PAGE_SHIFT;
>   }
>  
> @@ -3302,7 +3302,7 @@ static int intel_nontranslate_map_sg(struct device *hddev,
>  
>   for_each_sg(sglist, sg, nelems, i) {
>   BUG_ON(!sg_page(sg));
> - sg->dma_address = page_to_phys(sg_page(sg)) + sg->offset;
> + sg->dma_address = sg_phys(sg);
>   sg->dma_length = sg->length;
>   }
>   return nelems;
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index d4f527e56679..59808fc9110d 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -1147,7 +1147,7 @@ size_t default_iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
>   min_pagesz = 1 << __ffs(domain->ops->pgsize_bitmap);
>  
>   for_each_sg(sg, s, nents, i) {
> - phys_addr_t phys = page_to_phys(sg_page(s)) + s->offset;
> + phys_addr_t phys = sg_phys(s);
>  
>   /*
>* We are mapping on IOMMU page boundaries, so offset within
> diff --git a/drivers/staging/android/ion/ion_chunk_heap.c b/drivers/staging/android/ion/ion_chunk_heap.c
> index 3e6ec2ee6802..b7da5d142aa9 100644
> --- a/drivers/staging/android/ion/ion_chunk_heap.c
> +++ b/drivers/staging/android/ion/ion_chunk_heap.c
> @@ -81,7 +81,7 @@ static int ion_chunk_heap_allocate(struct ion_heap *heap,
>  err:
>   sg = table->sgl;
>   for (i -= 1; i >= 0; i--) {
> - gen_pool_free(chunk_heap->pool, page_to_phys(sg_page(sg)),
> + gen_pool_free(chunk_heap->pool, sg_phys(sg) - sg->offset,
> sg->length);
>   sg = sg_next(sg);
>   }
> @@ -109,7 +109,7 @@ static void ion_chunk_heap_free(struct ion_buffer *buffer)
>   DMA_BIDIRECTIONAL);
>  
>   for_each_sg(table->sgl, sg, table->nents, i) {
> 

Re: [PATCH v2] net: ll_temac: Use one return statement instead of two

2015-05-11 Thread Julia Lawall


On Mon, 11 May 2015, Joe Perches wrote:

> On Mon, 2015-05-11 at 17:48 +0200, Julia Lawall wrote:
> > > > A coccinelle script might be rather more complicated
> > > > than the simpler grep above, but perhaps the script
> > > > could be a bit more complete as it could likely look
> > > > at more code indentation styles.
> > >
> > > Julia: Any comment?
> > 
> > Here is what I had in mind:
> > 
> > if (...) {
> >   ... when != goto l;
> >   return C;
> > }
> > return C;
> > 
> > C is a constant, to avoid that its value depends on the code in the ...
> 
> Sure but I think that would miss several instances like:
> 
>   switch () {
>   ...
>   default:
>   return ;
>   }
>   return ;

Switch Coccinelle is not very good at...

> or the similar
> 
>   if (foo) {
>   if (qux)
>   return ;
>   } else {
>   return ;
>   }
> 
>   return ;

It seems improbable, but I could look for that.  Unfortunately, I don't 
see a way to deal with arbitrarily nested ifs.  Basically, the control 
flow from one return doesn't go to the other.  It goes from the return to 
the outside of the function.  I guess something could be done by renaming 
all of the returns to function calls, but that tends to make a mess.  It 
could be done just to see whether such cases are worth considering, though.

Another similar and popular construction is:

if (...) {
  ...
  goto l;
}
l:
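
In C terms that construction is (illustrative):

	if (ret) {
		pr_err("setup failed\n");
		goto out;
	}
out:
	return ret;	/* the goto just falls through to the next statement */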

julia


Re: [PATCH 1/3] ARM: cache-l2c: Add flag to skip cache unlocking

2015-05-11 Thread Sjoerd Simons
On Mon, 2015-05-11 at 23:29 +0100, Russell King - ARM Linux wrote:
> On Tue, May 12, 2015 at 12:17:29AM +0200, Sjoerd Simons wrote:
> >  extern struct outer_cache_fns outer_cache;
> > diff --git a/arch/arm/mm/cache-l2x0.c b/arch/arm/mm/cache-l2x0.c
> > index e309c8f..fff7888 100644
> > --- a/arch/arm/mm/cache-l2x0.c
> > +++ b/arch/arm/mm/cache-l2x0.c
> > @@ -136,7 +136,8 @@ static void l2c_enable(void __iomem *base, u32 aux, unsigned num_lock)
> > l2x0_saved_regs.aux_ctrl = aux;
> > l2c_configure(base);
> >  
> > -   l2c_unlock(base, num_lock);
> > +   if (!outer_cache.skip_unlock)
> > +   l2c_unlock(base, num_lock);
> 
> I think we can do better here.  If the non-secure lockdown access bit has
> been set, then proceed with the unlock:
> 
>   if (readl_relaxed(base + L2X0_AUX_CTRL) & L310_AUX_CTRL_NS_LOCKDOWN)
>   l2c_unlock(base, num_lock);
> 
> I don't see any need to add a flag for this.  This also eliminates your
> second patch.

The main reason I added the flag like this was to simplify the changes,
as l2c_enable has no real knowledge about which type of cache it's
running on.

But sure, I will have a look at re-jigging the code so that the
situation is automatically detected rather than requiring the
machine-specific code to flag it explicitly (see the sketch below).
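
A sketch of how Russell's check might land in l2c_enable() (illustrative
integration only; the register test is from his mail, the placement is an
assumption):

	static void l2c_enable(void __iomem *base, u32 aux, unsigned num_lock)
	{
		l2x0_saved_regs.aux_ctrl = aux;
		l2c_configure(base);

		/* Only unlock if non-secure lockdown writes are permitted. */
		if (readl_relaxed(base + L2X0_AUX_CTRL) & L310_AUX_CTRL_NS_LOCKDOWN)
			l2c_unlock(base, num_lock);

		/* remainder of the enable sequence unchanged */
	}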

-- 
Sjoerd Simons 
Collabora Ltd.


Re: [PATCH v5 0/6] arm64,hi6220: Enable Hisilicon Hi6220 SoC

2015-05-11 Thread Bintian

Hello Kevin,

On 2015/5/12 11:05, Leo Yan wrote:

hi Kevin,

On Mon, May 11, 2015 at 05:20:54PM -0700, Kevin Hilman wrote:

On Thu, May 7, 2015 at 4:11 PM, Brent Wang  wrote:

Hello Kevin,

2015-05-08 4:30 GMT+08:00 Kevin Hilman :

Bintian Wang  writes:


Hi6220 is one of Hisilicon's mobile solutions. This patchset contains
initial support for the Hi6220 SoC and the HiKey development board,
which has eight ARM Cortex-A53 cores. Initial support is minimal and
includes just the arch configuration, clock driver, and device tree
configuration.

PSCI is enabled in the device tree and there is no problem booting all
eight cores; CPU hotplug is also working now. You can download and
compile the latest firmware based on the following link to run this
patch set:
https://github.com/96boards/documentation/wiki/UEFI


Do you have any tips for booting this using the HiSi bootloader?  It
seems that I need to add the magic hisi,boardid property for dtbTool to
work.  Could you share what that magic value is?

Yes, you need it.
Hisilicon has many different development boards and those boards have
different hardware configurations, so we need different device tree
files for them. The original hisi,boardid is used to distinguish the
different boards and is used by the bootloader to decide which device
tree to use at boot-up.


and maybe add it to the wiki someplace?

Maybe adding it to the "Known Issues" section in
"https://github.com/96boards/documentation/wiki/UEFI"
is a good choice; I will update this section later.


You updated the wiki, but you didn't specify what the value should be
for this to work with the old bootloader.

Can you please give the value of that property?

hisi,boardid = <0 0 4 3>
It is needed by the old hisilicon bootloader.

I also updated the wiki page.


Also, have you tested this series with the old bootloader as well?


Below are my testing results with Bintian's patches and Hisilicon's old
bootloader:
- Need to add the property "hisi,boardid" into the dts;
- Need to change the cpu enable-method from "psci" to "spin-table";
- The bootloader has not initialized the register *cntfrq_el0*, which
   causes a failure during arch timer init.

For cntfrq_el0, we need to fix this issue in Hisilicon's old
bootloader, rather than directly adding "clock-frequency" to the arch
timer's node in the DTS. I will try to submit a patch to fix this
issue in Hisilicon's old bootloader.

So I think the above issues are mainly introduced by Hisilicon's old
bootloader and do not come from Bintian's patches. What do you think?

Below is my local diff which is used to be compatible with Hisilicon's
old bootloader; just for your reference.

Thanks,
Leo Yan

---8<---

diff --git a/arch/arm64/boot/dts/hisilicon/hi6220-hikey.dts b/arch/arm64/boot/dts/hisilicon/hi6220-hikey.dts
index e36a539..fd1f89e 100644
--- a/arch/arm64/boot/dts/hisilicon/hi6220-hikey.dts
+++ b/arch/arm64/boot/dts/hisilicon/hi6220-hikey.dts
@@ -14,6 +14,7 @@

  / {
model = "HiKey Development Board";
+   hisi,boardid = <0 0 4 3>;
compatible = "hisilicon,hi6220-hikey", "hisilicon,hi6220";

aliases {
diff --git a/arch/arm64/boot/dts/hisilicon/hi6220.dtsi b/arch/arm64/boot/dts/hisilicon/hi6220.dtsi
index 229937f..8ade3d9 100644
--- a/arch/arm64/boot/dts/hisilicon/hi6220.dtsi
+++ b/arch/arm64/boot/dts/hisilicon/hi6220.dtsi
@@ -13,11 +13,6 @@
#address-cells = <2>;
#size-cells = <2>;

-   psci {
-   compatible = "arm,psci-0.2";
-   method = "smc";
-   };
-
cpus {
#address-cells = <2>;
#size-cells = <0>;
@@ -57,56 +52,64 @@
compatible = "arm,cortex-a53", "arm,armv8";
device_type = "cpu";
reg = <0x0 0x0>;
-   enable-method = "psci";
+   enable-method = "spin-table";
+   cpu-release-addr = <0x0 0x740fff8>;
};

cpu1: cpu@1 {
compatible = "arm,cortex-a53", "arm,armv8";
device_type = "cpu";
reg = <0x0 0x1>;
-   enable-method = "psci";
+   enable-method = "spin-table";
+   cpu-release-addr = <0x0 0x740fff8>;
};

cpu2: cpu@2 {
compatible = "arm,cortex-a53", "arm,armv8";
device_type = "cpu";
reg = <0x0 0x2>;
-   enable-method = "psci";
+   enable-method = "spin-table";
+   cpu-release-addr = <0x0 0x740fff8>;
};

cpu3: cpu@3 {
compatible = "arm,cortex-a53", "arm,armv8";
device_type = "cpu";
reg = <0x0 0x3>;
-   enable-method = "psci";
+   enable-method = "spin-table";
+   cpu-release-addr = <0x0 0x740fff8>;

Re: [PATCH 3/5] workqueue: ensure attrs-changing be sequentially

2015-05-11 Thread Lai Jiangshan
On 05/11/2015 10:55 PM, Tejun Heo wrote:
> Hey,
> 
> Prolly a better subject is "ensure attrs changes are properly
> synchronized"
> 
> On Mon, May 11, 2015 at 05:35:50PM +0800, Lai Jiangshan wrote:
>> Current modification to attrs via sysfs is not atomically.
> 
>atomic.
> 
>>
>> Process A (change cpumask)   | Process B (change numa affinity)
>> wq_cpumask_store()   |
>>   wq_sysfs_prep_attrs()  |
>   ^
>   misaligned

It was aligned in my email, misaligned in the quoted email, misaligned
in `git log` and `git show`, and aligned in `git commit` when I wrote
the changelog. 

I will just remove all the |.

> 
>>  | apply_workqueue_attrs()
>>   apply_workqueue_attrs()|
>>
>> It results that the Process B's operation is totally reverted
>> without any notification.
> 
> Yeah, right.
> 
>> This behavior is acceptable but it is sometimes unexpected.
> 
> I don't think this is an acceptable behavior.
> 
>> Sequential model on non-performance-sensitive operations is more popular
>> and preferred. So this patch moves wq_sysfs_prep_attrs() into the protection
> 
> You can just say the previous behavior is buggy.

It depends on definitions. To me, it is just a nuisance.

> 
>> under wq_pool_mutex to ensure attrs-changing be sequentially.
>>
>> This patch is also a preparation patch for next patch which change
>> the API of apply_workqueue_attrs().
> ...
>> +static void apply_wqattrs_lock(void)
>> +{
>> +/*
>> + * CPUs should stay stable across pwq creations and installations.
>> + * Pin CPUs, determine the target cpumask for each node and create
>> + * pwqs accordingly.
>> + */
>> +get_online_cpus();
>> +mutex_lock(&wq_pool_mutex);
>> +}
>> +
>> +static void apply_wqattrs_unlock(void)
>> +{
>> +mutex_unlock(&wq_pool_mutex);
>> +put_online_cpus();
>> +}
> 
> Separate out refactoring and extending locking coverage?
> 
> Thanks.
> 
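
The intended call pattern after this series would be roughly (sketch;
the locked-apply variant is an assumption based on the note that the
next patch changes the apply_workqueue_attrs() API):

	apply_wqattrs_lock();
	attrs = wq_sysfs_prep_attrs(wq);	/* snapshot taken under wq_pool_mutex */
	if (attrs)
		ret = apply_workqueue_attrs_locked(wq, attrs);	/* assumed name */
	apply_wqattrs_unlock();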



Re: [PULL] LSM: Basic module stacking infrastructure for security-next - Acked

2015-05-11 Thread James Morris
On Fri, 8 May 2015, Casey Schaufler wrote:

> James, here's an updated pull request for LSM stacking.
> Acks have been applied.
> 
> The following changes since commit b787f68c36d49bb1d9236f403813641efa74a031:
> 
>   Linux 4.1-rc1 (2015-04-26 17:59:10 -0700)
> 
> are available in the git repository at:
> 
>   g...@github.com:cschaufler/smack-next.git stacking-v22-acked

fyi, this is not a public URL.

> 
> for you to fetch changes up to f17cd945a8761544ac9bfdaf55e952e558dbee3e:

Applied to
git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security.git next


-- 
James Morris




linux-next: build failure after merge of the target-updates tree

2015-05-11 Thread Stephen Rothwell
Hi Nicholas,

After merging the target-updates tree, today's linux-next build (x86_64
allmodconfig) failed like this:

drivers/target/sbp/sbp_target.c: In function 'sbp_get_lun_from_tpg':
drivers/target/sbp/sbp_target.c:186:17: error: 'struct se_portal_group' has no member named 'tpg_lun_list'
  se_lun = se_tpg->tpg_lun_list[lun];
                 ^
drivers/target/sbp/sbp_target.c: In function 'sbp_count_se_tpg_luns':
drivers/target/sbp/sbp_target.c:1833:30: error: 'struct se_portal_group' has no member named 'tpg_lun_list'
   struct se_lun *se_lun = tpg->tpg_lun_list[i];
                              ^
drivers/target/sbp/sbp_target.c: In function 'sbp_update_unit_directory':
drivers/target/sbp/sbp_target.c:1911:45: error: 'struct se_portal_group' has no member named 'tpg_lun_list'
   struct se_lun *se_lun = tport->tpg->se_tpg.tpg_lun_list[i];
                                             ^

Caused by commit 731bbd790f79 ("target: Convert se_tpg->tpg_lun_list to
->tpg_lun_hlist") which doesn't seem to be complete?

I have used the target-updates tree from next-20150511 for today.
-- 
Cheers,
Stephen Rothwell  s...@canb.auug.org.au




Re: [PATCH v12 00/21] Re-introduce h8300 architecture

2015-05-11 Thread Yoshinori Sato
At Mon, 11 May 2015 10:50:27 -0700,
Guenter Roeck wrote:
> 
> On Mon, May 11, 2015 at 03:26:19PM +0900, Yoshinori Sato wrote:
> > Changes for v12
> > - IRQ chip convert to OF
> > - dts cleanup
> > - some headers use generic
> > - rebase to v4.1-rc3
> > 
> Configurations in arch/h8300/configs still build ok.
> 
> make allmodconfig, after fixing the spi build error, results in
> 
> ERROR: "csum_partial_copy_nocheck" [net/ipv6/ipv6.ko] undefined!
> ERROR: "ip_compute_csum" [net/ipv6/ip6_gre.ko] undefined!
> ERROR: "ip_fast_csum" [net/ipv4/xfrm4_mode_beet.ko] undefined!
> ERROR: "ip_compute_csum" [net/bridge/bridge.ko] undefined!
> ERROR: "ip_fast_csum" [net/bridge/bridge.ko] undefined!
> ERROR: "ip_fast_csum" [net/bridge/br_netfilter.ko] undefined!
> ERROR: "ip_fast_csum" [net/atm/mpoa.ko] undefined!
> ERROR: "__ucmpdi2" [fs/btrfs/btrfs.ko] undefined!
> ERROR: "ip_compute_csum" [drivers/scsi/scsi_debug.ko] undefined!
> ERROR: "ip_fast_csum" [drivers/net/slip/slhc.ko] undefined!
> ERROR: "__ucmpdi2" [drivers/md/bcache/bcache.ko] undefined!
> ERROR: "__ucmpdi2" [drivers/iio/imu/inv_mpu6050/inv-mpu6050.ko] undefined!
> 
> csum_partial_copy_nocheck, ip_compute_csum, and ip_fast_csum need to be 
> exported
> from arch/h8300/lib/checksum.c. No idea what to do about the missing __ucmpdi2
> symbol, or what causes it.
> 
> Guenter

It looks like an EXPORT_SYMBOL was missing.
Fixed in git.
Please retry. (A sketch of the presumed fix is below.)
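
Presumably the fix amounts to something like this in
arch/h8300/lib/checksum.c (sketch based on Guenter's report; assumes
linux/module.h is included):

	EXPORT_SYMBOL(csum_partial_copy_nocheck);
	EXPORT_SYMBOL(ip_compute_csum);
	EXPORT_SYMBOL(ip_fast_csum);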

-- 
Yoshinori Sato



Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)

2015-05-11 Thread Daniel Phillips
Hi David,

On 05/11/2015 05:12 PM, David Lang wrote:
> On Mon, 11 May 2015, Daniel Phillips wrote:
> 
>> On 05/11/2015 03:12 PM, Pavel Machek wrote:
> It is a fact of life that when you change one aspect of an intimately 
> interconnected system,
> something else will change as well. You have naive/nonexistent free space 
> management now; when you
> design something workable there it is going to impact everything else 
> you've already done. It's an
> easy bet that the impact will be negative, the only question is to what 
> degree.

 You might lose that bet. For example, suppose we do strictly linear 
 allocation
 each delta, and just leave nice big gaps between the deltas for future
 expansion. Clearly, we run at similar or identical speed to the current 
 naive
 strategy until we must start filling in the gaps, and at that point our 
 layout
 is not any worse than XFS, which started bad and stayed that way.
>>>
>>> Umm, are you sure. If "some areas of disk are faster than others" is
>>> still true on todays harddrives, the gaps will decrease the
>>> performance (as you'll "use up" the fast areas more quickly).
>>
>> That's why I hedged my claim with "similar or identical". The
>> difference in media speed seems to be a relatively small effect
>> compared to extra seeks. It seems that XFS puts big spaces between
>> new directories, and suffers a lot of extra seeks because of it.
>> I propose to batch new directories together initially, then change
>> the allocation goal to a new, relatively empty area if a big batch
>> of files lands on a directory in a crowded region. The "big" gaps
>> would be on the order of delta size, so not really very big.
> 
> This is an interesting idea, but what happens if the files don't arrive as a 
> big batch, but rather
> trickle in over time (think a logserver that if putting files into a bunch of 
> directories at a
> fairly modest rate per directory)

If files are trickling in then we can afford to spend a lot more time
finding nice places to tuck them in. Log server files are an especially
irksome problem for a redirect-on-write filesystem because the final
block tends to be rewritten many times and we must move it to a new
location each time, so every extent ends up as one block. Oh well. If
we just make sure to have some free space at the end of the file that
only that file can use (until everywhere else is full) then the long
term result will be slightly ravelled blocks that nonetheless tend to
be on the same track or flash block as their logically contiguous
neighbours. There will be just zero or one empty data blocks mixed
into the file tail as we commit the tail block over and over with the
same allocation goal. Sometimes there will be a block or two of
metadata as well, which will eventually bake themselves into the
middle of contiguous data and stop moving around.

Putting this together, we have:

  * At delta flush, break out all the log type files
  * Dedicate some block groups to append type files
  * Leave lots of space between files in those block groups
  * Peek at the last block of the file to set the allocation goal

Something like that. What we don't want is to throw those files into
the middle of a lot of rewrite-all files, messing up both kinds of file.
We don't care much about keeping these files near the parent directory
because one big seek per log file in a grep is acceptable, we just need
to avoid thousands of big seeks within the file, and not dribble single
blocks all over the disk.
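
As a rough sketch of that goal-selection policy (every name below is
invented for illustration; none of this is actual Tux3 code):

	/* Sketch: pick an allocation goal based on the file's write pattern. */
	static block_t choose_goal(struct inode *inode)
	{
		if (is_append_only(inode)) {		/* log-style file */
			block_t tail = last_allocated_block(inode);

			/* stay adjacent to the tail so rewrites cluster */
			return tail ? tail + 1 : append_group_start(inode);
		}
		return near_parent_dir_goal(inode);	/* default: directory locality */
	}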

It would also be nice to merge together extents somehow as the final
block is rewritten. One idea is to retain the final block dirty until
the next delta, and write it again into a contiguous position, so the
final block is always flushed twice. We already have the opportunistic
merge logic, but the redirty behavior and making sure it only happens
to log files would be a bit fiddly.

We will also play the incremental defragmentation card at some point,
but first we should try hard to control fragmentation in the first
place. Tux3 is well suited to online defragmentation because the delta
commit model makes it easy to move things around efficiently and safely,
but it does generate extra IO, so as a basic mechanism it is not ideal.
When we get to piling on features, that will be high on the list,
because it is relatively easy, and having that fallback gives a certain
sense of security.

> And when you then decide that you have to move the directory/file info, 
> doesn't that create a
> potentially large amount of unexpected IO that could end up interfering with 
> what the user is trying
> to do?

Right, we don't like that and don't plan to rely on it. What we hope
for is behavior that, when you slowly stir the pot, tends to improve the
layout just as often as it degrades it. It may indeed become harder to
find ideal places to put things as time goes by, but we also gain more
information to base 

[PATCH v3 04/11] dma-mapping: allow archs to optionally specify a ->map_pfn() operation

2015-05-11 Thread Dan Williams
This is in support of enabling block device drivers to perform DMA
to/from persistent memory which may not have a backing struct page
entry.

Signed-off-by: Dan Williams 
---
 arch/Kconfig |3 +++
 include/asm-generic/dma-mapping-common.h |   30 ++
 include/linux/dma-debug.h|   23 +++
 include/linux/dma-mapping.h  |8 +++-
 lib/dma-debug.c  |   10 ++
 5 files changed, 65 insertions(+), 9 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index a65eafb24997..f7f800860c00 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -203,6 +203,9 @@ config HAVE_DMA_ATTRS
 config HAVE_DMA_CONTIGUOUS
bool
 
+config HAVE_DMA_PFN
+   bool
+
 config GENERIC_SMP_IDLE_THREAD
bool
 
diff --git a/include/asm-generic/dma-mapping-common.h b/include/asm-generic/dma-mapping-common.h
index 940d5ec122c9..7305efb1bac6 100644
--- a/include/asm-generic/dma-mapping-common.h
+++ b/include/asm-generic/dma-mapping-common.h
@@ -17,9 +17,15 @@ static inline dma_addr_t dma_map_single_attrs(struct device *dev, void *ptr,
 
kmemcheck_mark_initialized(ptr, size);
BUG_ON(!valid_dma_direction(dir));
+#ifdef CONFIG_HAVE_DMA_PFN
+   addr = ops->map_pfn(dev, page_to_pfn_typed(virt_to_page(ptr)),
+(unsigned long)ptr & ~PAGE_MASK, size,
+dir, attrs);
+#else
addr = ops->map_page(dev, virt_to_page(ptr),
 (unsigned long)ptr & ~PAGE_MASK, size,
 dir, attrs);
+#endif
debug_dma_map_page(dev, virt_to_page(ptr),
   (unsigned long)ptr & ~PAGE_MASK, size,
   dir, addr, true);
@@ -73,6 +79,29 @@ static inline void dma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg
ops->unmap_sg(dev, sg, nents, dir, attrs);
 }
 
+#ifdef CONFIG_HAVE_DMA_PFN
+static inline dma_addr_t dma_map_pfn(struct device *dev, __pfn_t pfn,
+ size_t offset, size_t size,
+ enum dma_data_direction dir)
+{
+   struct dma_map_ops *ops = get_dma_ops(dev);
+   dma_addr_t addr;
+
+   BUG_ON(!valid_dma_direction(dir));
+   addr = ops->map_pfn(dev, pfn, offset, size, dir, NULL);
+   debug_dma_map_pfn(dev, pfn, offset, size, dir, addr, false);
+
+   return addr;
+}
+
+static inline dma_addr_t dma_map_page(struct device *dev, struct page *page,
+ size_t offset, size_t size,
+ enum dma_data_direction dir)
+{
+   kmemcheck_mark_initialized(page_address(page) + offset, size);
+   return dma_map_pfn(dev, page_to_pfn_typed(page), offset, size, dir);
+}
+#else
 static inline dma_addr_t dma_map_page(struct device *dev, struct page *page,
  size_t offset, size_t size,
  enum dma_data_direction dir)
@@ -87,6 +116,7 @@ static inline dma_addr_t dma_map_page(struct device *dev, struct page *page,
 
return addr;
 }
+#endif /* CONFIG_HAVE_DMA_PFN */
 
 static inline void dma_unmap_page(struct device *dev, dma_addr_t addr,
  size_t size, enum dma_data_direction dir)
diff --git a/include/linux/dma-debug.h b/include/linux/dma-debug.h
index fe8cb610deac..a3b4c8c0cd68 100644
--- a/include/linux/dma-debug.h
+++ b/include/linux/dma-debug.h
@@ -34,10 +34,18 @@ extern void dma_debug_init(u32 num_entries);
 
 extern int dma_debug_resize_entries(u32 num_entries);
 
-extern void debug_dma_map_page(struct device *dev, struct page *page,
-  size_t offset, size_t size,
-  int direction, dma_addr_t dma_addr,
-  bool map_single);
+extern void debug_dma_map_pfn(struct device *dev, __pfn_t pfn, size_t offset,
+ size_t size, int direction, dma_addr_t dma_addr,
+ bool map_single);
+
+static inline void debug_dma_map_page(struct device *dev, struct page *page,
+ size_t offset, size_t size,
+ int direction, dma_addr_t dma_addr,
+ bool map_single)
+{
+   return debug_dma_map_pfn(dev, page_to_pfn_t(page), offset, size,
+   direction, dma_addr, map_single);
+}
 
 extern void debug_dma_mapping_error(struct device *dev, dma_addr_t dma_addr);
 
@@ -109,6 +117,13 @@ static inline void debug_dma_map_page(struct device *dev, struct page *page,
 {
 }
 
+static inline void debug_dma_map_pfn(struct device *dev, __pfn_t pfn,
+size_t offset, size_t size,
+int direction, dma_addr_t dma_addr,
+bool map_single)
+{
+}
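
A driver-side sketch of the new entry point (hypothetical caller; assumes
CONFIG_HAVE_DMA_PFN and the phys_to_pfn_t() helper introduced later in
this series):

	__pfn_t pfn = phys_to_pfn_t(phys);	/* no struct page required */
	dma_addr_t addr = dma_map_pfn(dev, pfn, 0, PAGE_SIZE, DMA_TO_DEVICE);

	if (dma_mapping_error(dev, addr))
		return -EIO;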

[PATCH v3 03/11] block: convert .bv_page to .bv_pfn bio_vec

2015-05-11 Thread Dan Williams
Carry an __pfn_t in a bio_vec rather than a 'struct page *' in support
of allowing a bio to reference unmapped (not struct page backed)
persistent memory.

This also fixes up the macros and static initializers that we were not
automatically converted by the Coccinelle script that introduced the
bvec_page() and bvec_set_page() helpers.

If CONFIG_DEV_PFN=n this is functionally equivalent to the status quo as
the __pfn_t helpers can assume that a __pfn_t always has a corresponding
struct page.

Cc: Jens Axboe 
Cc: Matthew Wilcox 
Cc: Dave Hansen 
Cc: Julia Lawall 
Signed-off-by: Dan Williams 
---
 Documentation/block/biodoc.txt |4 ++--
 block/blk-integrity.c  |4 ++--
 block/blk-merge.c  |6 +++---
 block/bounce.c |2 +-
 drivers/md/bcache/btree.c  |2 +-
 include/linux/bio.h|   23 ---
 include/linux/blk_types.h  |   24 +---
 lib/iov_iter.c |   22 +++---
 mm/page_io.c   |4 ++--
 9 files changed, 55 insertions(+), 36 deletions(-)

diff --git a/Documentation/block/biodoc.txt b/Documentation/block/biodoc.txt
index fd12c0d835fd..3a10fd91e890 100644
--- a/Documentation/block/biodoc.txt
+++ b/Documentation/block/biodoc.txt
@@ -412,7 +412,7 @@ mapped to bio structures.
 2.2 The bio struct
 
 The bio structure uses a vector representation pointing to an array of tuples
-of <page, offset, len> to describe the i/o buffer, and has various other
+of <pfn, offset, len> to describe the i/o buffer, and has various other
 fields describing i/o parameters and state that needs to be maintained for
 performing the i/o.
 
@@ -420,7 +420,7 @@ Notice that this representation means that a bio has no virtual address
 mapping at all (unlike buffer heads).
 
 struct bio_vec {
-   struct page *bv_page;
+   __pfn_t bv_pfn;
unsigned short  bv_len;
unsigned short  bv_offset;
 };
diff --git a/block/blk-integrity.c b/block/blk-integrity.c
index 0458f31f075a..351198fbda3c 100644
--- a/block/blk-integrity.c
+++ b/block/blk-integrity.c
@@ -43,7 +43,7 @@ static const char *bi_unsupported_name = "unsupported";
  */
 int blk_rq_count_integrity_sg(struct request_queue *q, struct bio *bio)
 {
-   struct bio_vec iv, ivprv = { NULL };
+   struct bio_vec iv, ivprv = BIO_VEC_INIT(ivprv);
unsigned int segments = 0;
unsigned int seg_size = 0;
struct bvec_iter iter;
@@ -89,7 +89,7 @@ EXPORT_SYMBOL(blk_rq_count_integrity_sg);
 int blk_rq_map_integrity_sg(struct request_queue *q, struct bio *bio,
struct scatterlist *sglist)
 {
-   struct bio_vec iv, ivprv = { NULL };
+   struct bio_vec iv, ivprv = BIO_VEC_INIT(ivprv);
struct scatterlist *sg = NULL;
unsigned int segments = 0;
struct bvec_iter iter;
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 47ceefacd320..218ad1e57a49 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -13,7 +13,7 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
 struct bio *bio,
 bool no_sg_merge)
 {
-   struct bio_vec bv, bvprv = { NULL };
+   struct bio_vec bv, bvprv = BIO_VEC_INIT(bvprv);
int cluster, high, highprv = 1;
unsigned int seg_size, nr_phys_segs;
struct bio *fbio, *bbio;
@@ -123,7 +123,7 @@ EXPORT_SYMBOL(blk_recount_segments);
 static int blk_phys_contig_segment(struct request_queue *q, struct bio *bio,
   struct bio *nxt)
 {
-   struct bio_vec end_bv = { NULL }, nxt_bv;
+   struct bio_vec end_bv = BIO_VEC_INIT(end_bv), nxt_bv;
struct bvec_iter iter;
 
if (!blk_queue_cluster(q))
@@ -202,7 +202,7 @@ static int __blk_bios_map_sg(struct request_queue *q, struct bio *bio,
 struct scatterlist *sglist,
 struct scatterlist **sg)
 {
-   struct bio_vec bvec, bvprv = { NULL };
+   struct bio_vec bvec, bvprv = BIO_VEC_INIT(bvprv);
struct bvec_iter iter;
int nsegs, cluster;
 
diff --git a/block/bounce.c b/block/bounce.c
index 0390e44d6e1b..4a3098067c81 100644
--- a/block/bounce.c
+++ b/block/bounce.c
@@ -64,7 +64,7 @@ static void bounce_copy_vec(struct bio_vec *to, unsigned char *vfrom)
 #else /* CONFIG_HIGHMEM */
 
 #define bounce_copy_vec(to, vfrom) \
-   memcpy(page_address((to)->bv_page) + (to)->bv_offset, vfrom, (to)->bv_len)
+   memcpy(page_address(bvec_page(to)) + (to)->bv_offset, vfrom, (to)->bv_len)
 
 #endif /* CONFIG_HIGHMEM */
 
diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 2e76e8b62902..36bbe29a806b 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -426,7 +426,7 @@ static void do_btree_node_write(struct btree *b)
void *base = (void *) ((unsigned long) i & ~(PAGE_SIZE - 1));
 

[PATCH v3 02/11] block: add helpers for accessing a bio_vec page

2015-05-11 Thread Dan Williams
In preparation for converting struct bio_vec to carry a __pfn_t instead
of struct page.

This change is prompted by the desire to add in-kernel DMA support
(O_DIRECT, hierarchical storage, RDMA, etc) for persistent memory which
lacks struct page coverage.

Alternatives:

1/ Provide struct page coverage for persistent memory in DRAM.  The
   expectation is that persistent memory capacities make this untenable
   in the long term.

2/ Provide struct page coverage for persistent memory with persistent
   memory.  While persistent memory may have near DRAM performance
   characteristics it may not have the same write-endurance of DRAM.
   Given the update frequency of struct page objects it may not be
   suitable for persistent memory.

3/ Dynamically allocate struct page.  This appears to be on the order
   of the complexity of converting code paths to use __pfn_t references
   instead of struct page, and the amount of setup required to establish
   a valid struct page reference is mostly wasted when the only usage in
   the block stack is to perform a page_to_pfn() conversion for
   dma-mapping.  Instances of kmap() / kmap_atomic() usage appear to be
   the only occasions in the block stack where struct page is
   non-trivially used.  A new kmap_atomic_pfn_t() is proposed to handle
   those cases.

Generated with the following semantic patch:

// bv_page.cocci: convert usage of ->bv_page to use set/get helpers
// usage: make coccicheck COCCI=bv_page.cocci MODE=patch

virtual patch
virtual report
virtual org

@@
struct bio_vec bvec;
expression E;
type T;
@@

- bvec.bv_page = (T)E
+ bvec_set_page(&bvec, E)

@@
struct bio_vec *bvec;
expression E;
type T;
@@

- bvec->bv_page = (T)E
+ bvec_set_page(bvec, E)

@@
struct bio_vec bvec;
type T;
@@

- (T)bvec.bv_page
+ bvec_page(&bvec)

@@
struct bio_vec *bvec;
type T;
@@

- (T)bvec->bv_page
+ bvec_page(bvec)

@@
struct bio *bio;
expression E;
expression F;
type T;
@@

- bio->bi_io_vec[F].bv_page = (T)E
+ bvec_set_page(&bio->bi_io_vec[F], E)

@@
struct bio *bio;
expression E;
type T;
@@

- bio->bi_io_vec->bv_page = (T)E
+ bvec_set_page(bio->bi_io_vec, E)

@@
struct cached_dev *dc;
expression E;
type T;
@@

- dc->sb_bio.bi_io_vec->bv_page = (T)E
+ bvec_set_page(dc->sb_bio.bi_io_vec, E)

@@
struct cache *ca;
expression E;
expression F;
type T;
@@

- ca->sb_bio.bi_io_vec[F].bv_page = (T)E
+ bvec_set_page(&ca->sb_bio.bi_io_vec[F], E)

@@
struct cache *ca;
expression F;
@@

- ca->sb_bio.bi_io_vec[F].bv_page
+ bvec_page(&ca->sb_bio.bi_io_vec[F])

@@
struct cache *ca;
expression E;
expression F;
type T;
@@

- ca->sb_bio.bi_inline_vecs[F].bv_page = (T)E
+ bvec_set_page(&ca->sb_bio.bi_inline_vecs[F], E)

@@
struct cache *ca;
expression F;
@@

- ca->sb_bio.bi_inline_vecs[F].bv_page
+ bvec_page(&ca->sb_bio.bi_inline_vecs[F])


@@
struct cache *ca;
expression E;
type T;
@@

- ca->sb_bio.bi_io_vec->bv_page = (T)E
+ bvec_set_page(ca->sb_bio.bi_io_vec, E)

@@
struct bio *bio;
expression F;
@@

- bio->bi_io_vec[F].bv_page
+ bvec_page(&bio->bi_io_vec[F])

@@
struct bio bio;
expression F;
@@

- bio.bi_io_vec[F].bv_page
+ bvec_page(&bio.bi_io_vec[F])

@@
struct bio *bio;
@@

- bio->bi_io_vec->bv_page
+ bvec_page(bio->bi_io_vec)

@@
struct cached_dev *dc;
@@

- dc->sb_bio.bi_io_vec->bv_page
+ bvec_page(dc->sb_bio.bi_io_vec)


@@
struct bio bio;
@@

- bio.bi_io_vec->bv_page
+ bvec_page(bio.bi_io_vec)

@@
struct bio_integrity_payload *bip;
expression E;
type T;
@@

- bip->bip_vec->bv_page = (T)E
+ bvec_set_page(bip->bip_vec, E)

@@
struct bio_integrity_payload *bip;
@@

- bip->bip_vec->bv_page
+ bvec_page(bip->bip_vec)

@@
struct bio_integrity_payload bip;
@@

- bip.bip_vec->bv_page
+ bvec_page(bip.bip_vec)
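
For reference, the helpers this script converts to presumably look like
this at this point in the series (the bv_page field itself is only
replaced by bv_pfn in the next patch; the bodies are assumed from the
changelog, not quoted from the patch):

	static inline struct page *bvec_page(const struct bio_vec *bvec)
	{
		return bvec->bv_page;
	}

	static inline void bvec_set_page(struct bio_vec *bvec, struct page *page)
	{
		bvec->bv_page = page;
	}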

Cc: Jens Axboe 
Cc: Matthew Wilcox 
Cc: Ross Zwisler 
Cc: Neil Brown 
Cc: Alasdair Kergon 
Cc: Mike Snitzer 
Cc: Chris Mason 
Cc: Boaz Harrosh 
Cc: Theodore Ts'o 
Cc: Jan Kara 
Cc: Julia Lawall 
Cc: Martin K. Petersen 
Signed-off-by: Dan Williams 
---
 arch/powerpc/sysdev/axonram.c   |2 +
 block/bio-integrity.c   |8 ++--
 block/bio.c |   40 +++---
 block/blk-core.c|4 +-
 block/blk-integrity.c   |3 +-
 block/blk-lib.c |2 +
 block/blk-merge.c   |7 ++--
 block/bounce.c  |   24 ++---
 drivers/block/aoe/aoecmd.c  |8 ++--
 drivers/block/brd.c |2 +
 drivers/block/drbd/drbd_bitmap.c|5 ++-
 drivers/block/drbd/drbd_main.c  |6 ++-
 drivers/block/drbd/drbd_receiver.c  |4 +-
 drivers/block/drbd/drbd_worker.c|3 +-
 drivers/block/floppy.c  |6 ++-
 drivers/block/loop.c|   13 ---
 drivers/block/nbd.c |8 ++--
 drivers/block/nvme-core.c   |2 +
 drivers/block/pktcdvd.c |   11 +++---
 

Re: [PATCH 0/6] support "dataplane" mode for nohz_full

2015-05-11 Thread Mike Galbraith
On Tue, 2015-05-12 at 03:47 +0200, Mike Galbraith wrote:
> On Mon, 2015-05-11 at 15:25 -0400, Chris Metcalf wrote:
> > On 05/11/2015 03:19 PM, Mike Galbraith wrote:
> > > I really shouldn't have acked nohz_full -> isolcpus.  Beside the fact
> > > that old static isolcpus was _supposed_ to crawl off and die, I know
> > > beyond doubt that having isolated a cpu as well as you can definitely
> > > does NOT imply that said cpu should become tickless.
> > 
> > True, at a high level, I agree that it would be better to have a
> > top-level concept like Frederic's proposed ISOLATION that includes
> > isolcpus and nohz_cpu (and other stuff as needed).
> > 
> > That said, what you wrote above is wrong; even with the patch you
> > acked, setting isolcpus does not automatically turn on nohz_full for
> > a given cpu.  The patch made it true the other way around: when
> > you say nohz_full, you automatically get isolcpus on that cpu too.
> > That does, at least, make sense for the semantics of nohz_full.
> 
> I didn't write that, I wrote nohz_full implies (spelled '->') isolcpus.
> Yes, with nohz_full currently being static, the old allegedly dying but
> also static isolcpus scheduler off switch is a convenient thing to wire
> the nohz_full CPU SET (<- hint;) property to.

BTW, another facet of this: Rik wants to make isolcpus immune to
cpusets, which makes some sense, user did say isolcpus=, but that also
makes isolcpus truly static.  If the user now says nohz_full=, they lose
the ability to deactivate CPU isolation, making the set fairly useless
for anything other than HPC.  Currently, the user can flip the isolation
switch as he sees fit.  He takes a size extra large performance hit for
having said nohz_full=, but he doesn't lose generic utility.

-Mike



[PATCH v3 10/11] dax: convert to __pfn_t

2015-05-11 Thread Dan Williams
The primary source for non-page-backed page-frames to enter the system
is via the pmem driver's ->direct_access() method.  The pfns returned by
the top-level bdev_direct_access() may be passed to any other subsystem
in the kernel and those sub-systems either need to assume that the pfn
is page backed (CONFIG_DEV_PFN=n) or be prepared to handle the
non-page-backed case (CONFIG_DEV_PFN=y).  Currently the pfns returned by
->direct_access() are only ever used by vm_insert_mixed() which does not
care if the pfn is mapped.  As we go to add more usages of these pfns
add the type-safety of __pfn_t.

This also simplifies the calling convention of ->direct_access() by not
returning the virtual address in the same call.  This annotates cases
where the kernel is directly accessing pmem outside the driver, and
makes the valid lifetime of the reference explicit.  This property may
be useful in the future for invalidating mappings to pmem, but for now
it provides some protection against the "pmem disable vs still-in-use"
race.
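
A sketch of a call site under the new convention (hypothetical caller;
kmap_atomic_pfn_t()/kunmap_atomic_pfn_t() are the accessors added
elsewhere in this series):

	__pfn_t pfn;
	void *addr;
	long avail = bdev_direct_access(bdev, sector, &pfn, size);

	if (avail < 0)
		return avail;
	addr = kmap_atomic_pfn_t(pfn);	/* explicit, short-lived mapping */
	/* ... access the persistent memory ... */
	kunmap_atomic_pfn_t(addr);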

Cc: Matthew Wilcox 
Cc: Ross Zwisler 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Jens Axboe 
Cc: Martin Schwidefsky 
Cc: Heiko Carstens 
Cc: Boaz Harrosh 
Signed-off-by: Dan Williams 
---
 arch/powerpc/sysdev/axonram.c |   11 +--
 drivers/block/brd.c   |5 +--
 drivers/block/pmem.c  |   11 +--
 drivers/s390/block/dcssblk.c  |   13 ++---
 fs/block_dev.c|4 +--
 fs/dax.c  |   62 -
 include/asm-generic/pfn.h |   11 +++
 include/linux/blkdev.h|7 ++---
 8 files changed, 91 insertions(+), 33 deletions(-)

diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index 9bb5da7f2c0c..91c40a300797 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -139,22 +139,27 @@ axon_ram_make_request(struct request_queue *queue, struct bio *bio)
  * axon_ram_direct_access - direct_access() method for block device
  * @device, @sector, @data: see block_device_operations method
  */
+#ifdef CONFIG_DEV_PFN
 static long
 axon_ram_direct_access(struct block_device *device, sector_t sector,
-  void **kaddr, unsigned long *pfn, long size)
+   __pfn_t *pfn, long size)
 {
struct axon_ram_bank *bank = device->bd_disk->private_data;
loff_t offset = (loff_t)sector << AXON_RAM_SECTOR_SHIFT;
+   void *kaddr;
 
-   *kaddr = (void *)(bank->ph_addr + offset);
-   *pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT;
+   kaddr = (void *)(bank->ph_addr + offset);
+   *pfn = phys_to_pfn_t(virt_to_phys(*kaddr));
 
return bank->size - offset;
 }
+#endif
 
 static const struct block_device_operations axon_ram_devops = {
.owner  = THIS_MODULE,
+#ifdef CONFIG_DEV_PFN
.direct_access  = axon_ram_direct_access
+#endif
 };
 
 /**
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 115c6cf9cb43..3be31a2aed20 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -371,7 +371,7 @@ static int brd_rw_page(struct block_device *bdev, sector_t sector,
 
 #ifdef CONFIG_BLK_DEV_RAM_DAX
 static long brd_direct_access(struct block_device *bdev, sector_t sector,
-   void **kaddr, unsigned long *pfn, long size)
+   __pfn_t *pfn, long size)
 {
struct brd_device *brd = bdev->bd_disk->private_data;
struct page *page;
@@ -381,8 +381,7 @@ static long brd_direct_access(struct block_device *bdev, sector_t sector,
page = brd_insert_page(brd, sector);
if (!page)
return -ENOSPC;
-   *kaddr = page_address(page);
-   *pfn = page_to_pfn(page);
+   *pfn = page_to_pfn_t(page);
 
/*
 * TODO: If size > PAGE_SIZE, we could look to see if the next page in
diff --git a/drivers/block/pmem.c b/drivers/block/pmem.c
index 2a847651f8de..0cf34fba308c 100644
--- a/drivers/block/pmem.c
+++ b/drivers/block/pmem.c
@@ -98,8 +98,9 @@ static int pmem_rw_page(struct block_device *bdev, sector_t sector,
return 0;
 }
 
-static long pmem_direct_access(struct block_device *bdev, sector_t sector,
- void **kaddr, unsigned long *pfn, long size)
+#ifdef CONFIG_DEV_PFN
+static long pmem_direct_access(struct block_device *bdev,
+   sector_t sector, __pfn_t *pfn, long size)
 {
struct pmem_device *pmem = bdev->bd_disk->private_data;
size_t offset = sector << 9;
@@ -107,16 +108,18 @@ static long pmem_direct_access(struct block_device *bdev, sector_t sector,
if (!pmem)
return -ENODEV;
 
-   *kaddr = pmem->virt_addr + offset;
-   *pfn = (pmem->phys_addr + offset) >> PAGE_SHIFT;
+   *pfn = phys_to_pfn_t(pmem->phys_addr + offset);
 
return pmem->size - offset;
 }
+#endif
 
 static const struct block_device_operations pmem_fops = {
.owner =THIS_MODULE,
.rw_page =  

[PATCH v3 09/11] block: convert kmap helpers to kmap_atomic_pfn_t()

2015-05-11 Thread Dan Williams
Convert the generic helpers to the __pfn_t version of kmap_atomic() in
support of generically enabling "page-less" block i/o.

Signed-off-by: Dan Williams 
---
 include/linux/bio.h |   11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/include/linux/bio.h b/include/linux/bio.h
index a569e6ea1cd2..6537d78e78b3 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -161,7 +161,7 @@ static inline void *bio_data(struct bio *bio)
  * I/O completely on that queue (see ide-dma for example)
  */
 #define __bio_kmap_atomic(bio, iter)   \
-   (kmap_atomic(bvec_page(bio_iter_iovec((bio), (iter)))) +   \
+   (kmap_atomic_pfn_t(bio_iter_iovec((bio), (iter)).pfn) +   \
bio_iter_iovec((bio), (iter)).bv_offset)
 
 #define __bio_kunmap_atomic(addr)  kunmap_atomic(addr)
@@ -491,7 +491,7 @@ static inline char *bvec_kmap_irq(struct bio_vec *bvec, 
unsigned long *flags)
 * balancing is a lot nicer this way
 */
local_irq_save(*flags);
-   addr = (unsigned long) kmap_atomic(bvec_page(bvec));
+   addr = (unsigned long) kmap_atomic_pfn_t(bvec->bv_pfn);
 
BUG_ON(addr & ~PAGE_MASK);
 
@@ -502,18 +502,21 @@ static inline void bvec_kunmap_irq(char *buffer, unsigned 
long *flags)
 {
unsigned long ptr = (unsigned long) buffer & PAGE_MASK;
 
-   kunmap_atomic((void *) ptr);
+   kunmap_atomic_pfn_t((void *) ptr);
local_irq_restore(*flags);
 }
 
 #else
 static inline char *bvec_kmap_irq(struct bio_vec *bvec, unsigned long *flags)
 {
-   return page_address(bvec_page(bvec)) + bvec->bv_offset;
+   return kmap_atomic_pfn_t(bvec->bv_pfn) + bvec->bv_offset;
 }
 
 static inline void bvec_kunmap_irq(char *buffer, unsigned long *flags)
 {
+   unsigned long ptr = (unsigned long) buffer & PAGE_MASK;
+
+   kunmap_atomic_pfn_t((void *) ptr);
*flags = 0;
 }
 #endif



[PATCH v3 07/11] x86: support dma_map_pfn()

2015-05-11 Thread Dan Williams
Fix up x86 dma_map_ops to allow pfn-only mappings.

As long as a dma_map_sg() implementation uses the generic sg_phys()
helpers it can support scatterlists that use __pfn_t instead of struct
page.
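
For orientation, the dispatch in the generic header ends up looking
roughly like this.  This is a simplified sketch, not the literal
dma-mapping-common.h hunk (which also keeps the kmemcheck and
debug_dma hooks):

/* sketch: pfn-capable dma_map_ops never touch the struct page */
static inline dma_addr_t dma_map_page(struct device *dev, struct page *page,
				      size_t offset, size_t size,
				      enum dma_data_direction dir)
{
	struct dma_map_ops *ops = get_dma_ops(dev);

#ifdef CONFIG_HAVE_DMA_PFN
	return ops->map_pfn(dev, page_to_pfn_t(page), offset, size, dir, NULL);
#else
	return ops->map_page(dev, page, offset, size, dir, NULL);
#endif
}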

Signed-off-by: Dan Williams 
---
 arch/x86/Kconfig |5 +
 arch/x86/kernel/amd_gart_64.c|   22 +-
 arch/x86/kernel/pci-nommu.c  |   22 +-
 arch/x86/kernel/pci-swiotlb.c|4 
 arch/x86/pci/sta2x11-fixup.c |4 
 arch/x86/xen/pci-swiotlb-xen.c   |4 
 drivers/iommu/amd_iommu.c|   21 -
 drivers/iommu/intel-iommu.c  |   22 +-
 drivers/xen/swiotlb-xen.c|   29 +++--
 include/asm-generic/dma-mapping-common.h |4 ++--
 include/asm-generic/scatterlist.h|1 +
 include/linux/swiotlb.h  |4 
 lib/swiotlb.c|   20 +++-
 13 files changed, 125 insertions(+), 37 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 226d5696e1d1..c626ffa5c01e 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -796,6 +796,7 @@ config CALGARY_IOMMU
bool "IBM Calgary IOMMU support"
select SWIOTLB
depends on X86_64 && PCI
+   depends on !HAVE_DMA_PFN
---help---
  Support for hardware IOMMUs in IBM's xSeries x366 and x460
  systems. Needed to run systems with more than 3GB of memory
@@ -1432,6 +1433,10 @@ config X86_PMEM_LEGACY
 
  Say Y if unsure.
 
+config X86_PMEM_DMA
+   def_bool DEV_PFN
+   select HAVE_DMA_PFN
+
 config HIGHPTE
bool "Allocate 3rd-level pagetables from highmem"
depends on HIGHMEM
diff --git a/arch/x86/kernel/amd_gart_64.c b/arch/x86/kernel/amd_gart_64.c
index 8e3842fc8bea..8fad83c8dfd2 100644
--- a/arch/x86/kernel/amd_gart_64.c
+++ b/arch/x86/kernel/amd_gart_64.c
@@ -239,13 +239,13 @@ static dma_addr_t dma_map_area(struct device *dev, 
dma_addr_t phys_mem,
 }
 
 /* Map a single area into the IOMMU */
-static dma_addr_t gart_map_page(struct device *dev, struct page *page,
-   unsigned long offset, size_t size,
-   enum dma_data_direction dir,
-   struct dma_attrs *attrs)
+static dma_addr_t gart_map_pfn(struct device *dev, __pfn_t pfn,
+  unsigned long offset, size_t size,
+  enum dma_data_direction dir,
+  struct dma_attrs *attrs)
 {
unsigned long bus;
-   phys_addr_t paddr = page_to_phys(page) + offset;
+   phys_addr_t paddr = __pfn_t_to_phys(pfn) + offset;
 
if (!dev)
dev = _dma_fallback_dev;
@@ -259,6 +259,14 @@ static dma_addr_t gart_map_page(struct device *dev, struct 
page *page,
return bus;
 }
 
+static __maybe_unused dma_addr_t gart_map_page(struct device *dev,
+   struct page *page, unsigned long offset, size_t size,
+   enum dma_data_direction dir, struct dma_attrs *attrs)
+{
+   return gart_map_pfn(dev, page_to_pfn_t(page), offset, size, dir,
+   attrs);
+}
+
 /*
  * Free a DMA mapping.
  */
@@ -699,7 +707,11 @@ static __init int init_amd_gatt(struct agp_kern_info *info)
 static struct dma_map_ops gart_dma_ops = {
.map_sg = gart_map_sg,
.unmap_sg   = gart_unmap_sg,
+#ifdef CONFIG_HAVE_DMA_PFN
+   .map_pfn= gart_map_pfn,
+#else
.map_page   = gart_map_page,
+#endif
.unmap_page = gart_unmap_page,
.alloc  = gart_alloc_coherent,
.free   = gart_free_coherent,
diff --git a/arch/x86/kernel/pci-nommu.c b/arch/x86/kernel/pci-nommu.c
index da15918d1c81..876dacfbabf6 100644
--- a/arch/x86/kernel/pci-nommu.c
+++ b/arch/x86/kernel/pci-nommu.c
@@ -25,12 +25,12 @@ check_addr(char *name, struct device *hwdev, dma_addr_t 
bus, size_t size)
return 1;
 }
 
-static dma_addr_t nommu_map_page(struct device *dev, struct page *page,
-unsigned long offset, size_t size,
-enum dma_data_direction dir,
-struct dma_attrs *attrs)
+static dma_addr_t nommu_map_pfn(struct device *dev, __pfn_t pfn,
+   unsigned long offset, size_t size,
+   enum dma_data_direction dir,
+   struct dma_attrs *attrs)
 {
-   dma_addr_t bus = page_to_phys(page) + offset;
+   dma_addr_t bus = __pfn_t_to_phys(pfn) + offset;
WARN_ON(size == 0);
if (!check_addr("map_single", dev, bus, size))
return DMA_ERROR_CODE;
@@ -38,6 +38,14 @@ static dma_addr_t 

[PATCH v3 08/11] x86: support kmap_atomic_pfn_t() for persistent memory

2015-05-11 Thread Dan Williams
It would be unfortunate if the kmap infrastructure escaped its current
32-bit/HIGHMEM bonds and leaked into 64-bit code.  Instead, if the user
has enabled CONFIG_DEV_PFN we direct the kmap_atomic_pfn_t()
implementation to scan a list of pre-mapped persistent memory address
ranges inserted by the pmem driver.

The __pfn_t-to-resource lookup is indeed an inefficient walk of a linked
list, but there are two mitigating factors:

1/ The number of persistent memory ranges is bounded by the number of
   DIMMs which is on the order of 10s of DIMMs, not hundreds.

2/ The lookup yields the entire range; if it becomes inefficient to do a
   kmap_atomic_pfn_t() one PAGE_SIZE at a time, the caller can take
   advantage of the fact that the lookup can be amortized over all kmap
   operations it needs to perform in a given range (see the sketch
   below).
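
For illustration only, a hypothetical caller that today pays the list
walk once per page.  csum_pmem_range() and the crc32() payload are made
up; the kmap calls are the ones this patch adds, and the pfn increment
assumes a PFN_DEV-encoded (non-page-backed) __pfn_t:

/* sketch: checksum nr_pages of persistent memory starting at pfn */
static u32 csum_pmem_range(__pfn_t pfn, unsigned int nr_pages)
{
	u32 csum = 0;
	unsigned int i;

	for (i = 0; i < nr_pages; i++) {
		/* each call re-walks the registered range list today */
		void *addr = kmap_atomic_pfn_t(pfn);

		csum = crc32(csum, addr, PAGE_SIZE);
		kunmap_atomic_pfn_t(addr);
		pfn.data += 1 << PFN_SHIFT;	/* next device pfn, flags intact */
	}
	return csum;
}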

Signed-off-by: Dan Williams 
---
 arch/Kconfig|3 +
 arch/x86/Kconfig|2 +
 drivers/block/pmem.c|6 +++
 include/linux/highmem.h |   23 +++
 mm/Makefile |1 
 mm/pfn.c|   98 +++
 6 files changed, 133 insertions(+)
 create mode 100644 mm/pfn.c

diff --git a/arch/Kconfig b/arch/Kconfig
index f7f800860c00..69d3a3fa21af 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -206,6 +206,9 @@ config HAVE_DMA_CONTIGUOUS
 config HAVE_DMA_PFN
bool
 
+config HAVE_KMAP_PFN
+   bool
+
 config GENERIC_SMP_IDLE_THREAD
bool
 
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index c626ffa5c01e..2fd7690ed0e2 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1434,7 +1434,9 @@ config X86_PMEM_LEGACY
  Say Y if unsure.
 
 config X86_PMEM_DMA
+   depends on !HIGHMEM
def_bool DEV_PFN
+   select HAVE_KMAP_PFN
select HAVE_DMA_PFN
 
 config HIGHPTE
diff --git a/drivers/block/pmem.c b/drivers/block/pmem.c
index 41bb424533e6..2a847651f8de 100644
--- a/drivers/block/pmem.c
+++ b/drivers/block/pmem.c
@@ -23,6 +23,7 @@
 #include 
 #include 
 #include 
+#include 
 
#define PMEM_MINORS	16
 
@@ -147,6 +148,11 @@ static struct pmem_device *pmem_alloc(struct device *dev, 
struct resource *res)
if (!pmem->virt_addr)
goto out_release_region;
 
+   err = devm_register_kmap_pfn_range(dev, res, pmem->virt_addr);
+   if (err)
+   goto out_unmap;
+
+   err = -ENOMEM;
pmem->pmem_queue = blk_alloc_queue(GFP_KERNEL);
if (!pmem->pmem_queue)
goto out_unmap;
diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index 9286a46b7d69..85fd52d43a9a 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -83,6 +83,29 @@ static inline void __kunmap_atomic(void *addr)
 
 #endif /* CONFIG_HIGHMEM */
 
+#ifdef CONFIG_HAVE_KMAP_PFN
+extern void *kmap_atomic_pfn_t(__pfn_t pfn);
+extern void kunmap_atomic_pfn_t(void *addr);
+extern int devm_register_kmap_pfn_range(struct device *dev,
+   struct resource *res, void *base);
+#else
+static inline void *kmap_atomic_pfn_t(__pfn_t pfn)
+{
+   return kmap_atomic(__pfn_t_to_page(pfn));
+}
+
+static inline void kunmap_atomic_pfn_t(void *addr)
+{
+   __kunmap_atomic(addr);
+}
+
+static inline int devm_register_kmap_pfn_range(struct device *dev,
+   struct resource *res, void *base)
+{
+   return 0;
+}
+#endif /* CONFIG_HAVE_KMAP_PFN */
+
 #if defined(CONFIG_HIGHMEM) || defined(CONFIG_X86_32)
 
 DECLARE_PER_CPU(int, __kmap_atomic_idx);
diff --git a/mm/Makefile b/mm/Makefile
index 98c4eaeabdcb..66e30c2addfe 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -78,3 +78,4 @@ obj-$(CONFIG_CMA) += cma.o
 obj-$(CONFIG_MEMORY_BALLOON) += balloon_compaction.o
 obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o
 obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o
+obj-$(CONFIG_HAVE_KMAP_PFN) += pfn.o
diff --git a/mm/pfn.c b/mm/pfn.c
new file mode 100644
index ..0e046b49aebf
--- /dev/null
+++ b/mm/pfn.c
@@ -0,0 +1,98 @@
+/*
+ * Copyright(c) 2015 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+static LIST_HEAD(ranges);
+
+struct kmap {
+   struct list_head list;
+   struct resource *res;
+   struct device *dev;
+   void *base;
+};
+
+static void teardown_kmap(void *data)
+{
+   struct kmap *kmap = data;
+
+   dev_dbg(kmap->dev, "kmap unregister %pr\n", kmap->res);
+   list_del_rcu(&kmap->list);
+   synchronize_rcu();
+   kfree(kmap);
+}
+
+int devm_register_kmap_pfn_range(struct device 

[PATCH v3 11/11] block: base support for pfn i/o

2015-05-11 Thread Dan Williams
Allow block device drivers to opt in to receiving bio(s) whose
bio_vec(s) point to memory that is not backed by struct page entries.
When a driver opts in, it asserts that it will use the __pfn_t versions
of the dma_map/kmap/scatterlist APIs in its bio submission path (see the
sketch below).
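
As a sketch of the opt-in from the driver side.  QUEUE_FLAG_PFN is
inferred from the blk_queue_pfn() test in the diff below, and
pmem->pmem_queue stands in for the driver's request queue:

	/* sketch: driver asserting it copes with page-less (BIO_PFN) bios */
	queue_flag_set_unlocked(QUEUE_FLAG_PFN, pmem->pmem_queue);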

Cc: Tejun Heo 
Cc: Jens Axboe 
Signed-off-by: Dan Williams 
---
 block/bio.c   |   46 ++---
 block/blk-core.c  |9 +
 include/linux/blk_types.h |1 +
 include/linux/blkdev.h|2 ++
 4 files changed, 51 insertions(+), 7 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 7100fd6d5898..58553dfd777e 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -567,6 +567,7 @@ void __bio_clone_fast(struct bio *bio, struct bio *bio_src)
bio->bi_rw = bio_src->bi_rw;
bio->bi_iter = bio_src->bi_iter;
bio->bi_io_vec = bio_src->bi_io_vec;
+   bio->bi_flags |= bio_src->bi_flags & (1 << BIO_PFN);
 }
 EXPORT_SYMBOL(__bio_clone_fast);
 
@@ -658,6 +659,8 @@ struct bio *bio_clone_bioset(struct bio *bio_src, gfp_t 
gfp_mask,
goto integrity_clone;
}
 
+   bio->bi_flags |= bio_src->bi_flags & (1 << BIO_PFN);
+
bio_for_each_segment(bv, bio_src, iter)
bio->bi_io_vec[bio->bi_vcnt++] = bv;
 
@@ -699,9 +702,9 @@ int bio_get_nr_vecs(struct block_device *bdev)
 }
 EXPORT_SYMBOL(bio_get_nr_vecs);
 
-static int __bio_add_page(struct request_queue *q, struct bio *bio, struct page
- *page, unsigned int len, unsigned int offset,
- unsigned int max_sectors)
+static int __bio_add_pfn(struct request_queue *q, struct bio *bio,
+   __pfn_t pfn, unsigned int len, unsigned int offset,
+   unsigned int max_sectors)
 {
int retried_segments = 0;
struct bio_vec *bvec;
@@ -723,7 +726,7 @@ static int __bio_add_page(struct request_queue *q, struct 
bio *bio, struct page
if (bio->bi_vcnt > 0) {
struct bio_vec *prev = &bio->bi_io_vec[bio->bi_vcnt - 1];
 
-   if (page == bvec_page(prev) &&
+   if (__pfn_t_to_pfn(pfn) == __pfn_t_to_pfn(prev->bv_pfn) &&
offset == prev->bv_offset + prev->bv_len) {
unsigned int prev_bv_len = prev->bv_len;
prev->bv_len += len;
@@ -768,7 +771,7 @@ static int __bio_add_page(struct request_queue *q, struct 
bio *bio, struct page
 * cannot add the page
 */
bvec = &bio->bi_io_vec[bio->bi_vcnt];
-   bvec_set_page(bvec, page);
+   bvec->bv_pfn = pfn;
bvec->bv_len = len;
bvec->bv_offset = offset;
bio->bi_vcnt++;
@@ -845,7 +848,7 @@ static int __bio_add_page(struct request_queue *q, struct 
bio *bio, struct page
 int bio_add_pc_page(struct request_queue *q, struct bio *bio, struct page 
*page,
unsigned int len, unsigned int offset)
 {
-   return __bio_add_page(q, bio, page, len, offset,
+   return __bio_add_pfn(q, bio, page_to_pfn_t(page), len, offset,
  queue_max_hw_sectors(q));
 }
 EXPORT_SYMBOL(bio_add_pc_page);
@@ -872,10 +875,39 @@ int bio_add_page(struct bio *bio, struct page *page, 
unsigned int len,
if ((max_sectors < (len >> 9)) && !bio->bi_iter.bi_size)
max_sectors = len >> 9;
 
-   return __bio_add_page(q, bio, page, len, offset, max_sectors);
+   return __bio_add_pfn(q, bio, page_to_pfn_t(page), len, offset,
+   max_sectors);
 }
 EXPORT_SYMBOL(bio_add_page);
 
+/**
+ * bio_add_pfn -   attempt to add pfn to bio
+ * @bio: destination bio
+ * @pfn: pfn to add
+ * @len: vec entry length
+ * @offset: vec entry offset
+ *
+ * Identical to bio_add_page() except this variant flags the bio as
+ * not having struct page backing.  A given request_queue must assert
+ * that it is prepared to handle this constraint before bio(s)
+ * flagged in this manner can be passed.
+ */
+int bio_add_pfn(struct bio *bio, __pfn_t pfn, unsigned int len,
+   unsigned int offset)
+{
+   struct request_queue *q = bdev_get_queue(bio->bi_bdev);
+   unsigned int max_sectors;
+
+   if (!blk_queue_pfn(q))
+   return 0;
+   set_bit(BIO_PFN, &bio->bi_flags);
+   max_sectors = blk_max_size_offset(q, bio->bi_iter.bi_sector);
+   if ((max_sectors < (len >> 9)) && !bio->bi_iter.bi_size)
+   max_sectors = len >> 9;
+
+   return __bio_add_pfn(q, bio, pfn, len, offset, max_sectors);
+}
+
 struct submit_bio_ret {
struct completion event;
int error;
diff --git a/block/blk-core.c b/block/blk-core.c
index 94d2c6ccf801..1275e2c08c16 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1856,6 +1856,15 @@ generic_make_request_checks(struct bio *bio)
goto end_io;
}
 
+   if (bio_flagged(bio, BIO_PFN)) {
+   if (IS_ENABLED(CONFIG_DEV_PFN) && 

[PATCH v3 06/11] scatterlist: support "page-less" (__pfn_t only) entries

2015-05-11 Thread Dan Williams
From: Matthew Wilcox 

Given that an offset will never be more than PAGE_SIZE, steal the unused
bits of the offset to implement a flags field.  Move the existing "this
is a sg_chain() entry" flag to the new flags field, and add a new flag
(SG_FLAGS_PAGE) to indicate that there is a struct page backing for the
entry.
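
For orientation, a rough sketch of what the accessor side could look
like under the new layout.  Only SG_FLAGS_PAGE is named by this
changelog; the chain flag name and the helper bodies below are guesses,
not the actual hunk:

/* sketch only: flags live in sg->sg_flags instead of page_link bits */
#define SG_FLAGS_CHAIN	0x01	/* guessed name: entry is a chain link */
#define SG_FLAGS_PAGE	0x02	/* entry's pfn has struct page backing */

static inline bool sg_is_chain(struct scatterlist *sg)
{
	return sg->sg_flags & SG_FLAGS_CHAIN;
}

static inline struct page *sg_page(struct scatterlist *sg)
{
	return (sg->sg_flags & SG_FLAGS_PAGE) ?
		__pfn_t_to_page(sg->pfn) : NULL;
}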

Signed-off-by: Dan Williams 
Signed-off-by: Matthew Wilcox 
---
 block/blk-merge.c |2 -
 drivers/dma/ste_dma40.c   |5 --
 drivers/mmc/card/queue.c  |4 +-
 include/asm-generic/scatterlist.h |9 
 include/crypto/scatterwalk.h  |   10 
 include/linux/scatterlist.h   |   91 +
 6 files changed, 105 insertions(+), 16 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 218ad1e57a49..82a688551b72 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -267,7 +267,7 @@ int blk_rq_map_sg(struct request_queue *q, struct request 
*rq,
if (rq->cmd_flags & REQ_WRITE)
memset(q->dma_drain_buffer, 0, q->dma_drain_size);
 
-   sg->page_link &= ~0x02;
+   sg_unmark_end(sg);
sg = sg_next(sg);
sg_set_page(sg, virt_to_page(q->dma_drain_buffer),
q->dma_drain_size,
diff --git a/drivers/dma/ste_dma40.c b/drivers/dma/ste_dma40.c
index 3c10f034d4b9..e8c00642cacb 100644
--- a/drivers/dma/ste_dma40.c
+++ b/drivers/dma/ste_dma40.c
@@ -2562,10 +2562,7 @@ dma40_prep_dma_cyclic(struct dma_chan *chan, dma_addr_t 
dma_addr,
dma_addr += period_len;
}
 
-   sg[periods].offset = 0;
-   sg_dma_len(&sg[periods]) = 0;
-   sg[periods].page_link =
-   ((unsigned long)sg | 0x01) & ~0x02;
+   sg_chain(sg, periods + 1, sg);
 
txd = d40_prep_sg(chan, sg, sg, periods, direction,
  DMA_PREP_INTERRUPT);
diff --git a/drivers/mmc/card/queue.c b/drivers/mmc/card/queue.c
index 236d194c2883..127f76294e71 100644
--- a/drivers/mmc/card/queue.c
+++ b/drivers/mmc/card/queue.c
@@ -469,7 +469,7 @@ static unsigned int mmc_queue_packed_map_sg(struct 
mmc_queue *mq,
sg_set_buf(__sg, buf + offset, len);
offset += len;
remain -= len;
-   (__sg++)->page_link &= ~0x02;
+   sg_unmark_end(__sg++);
sg_len++;
} while (remain);
}
@@ -477,7 +477,7 @@ static unsigned int mmc_queue_packed_map_sg(struct 
mmc_queue *mq,
list_for_each_entry(req, &packed->list, queuelist) {
sg_len += blk_rq_map_sg(mq->queue, req, __sg);
__sg = sg + (sg_len - 1);
-   (__sg++)->page_link &= ~0x02;
+   sg_unmark_end(__sg++);
}
sg_mark_end(sg + (sg_len - 1));
return sg_len;
diff --git a/include/asm-generic/scatterlist.h 
b/include/asm-generic/scatterlist.h
index 5de07355fad4..959f51572a8e 100644
--- a/include/asm-generic/scatterlist.h
+++ b/include/asm-generic/scatterlist.h
@@ -7,8 +7,17 @@ struct scatterlist {
 #ifdef CONFIG_DEBUG_SG
unsigned long   sg_magic;
 #endif
+#ifdef CONFIG_HAVE_DMA_PFN
+   union {
+   __pfn_t pfn;
+   struct scatterlist *next;
+   };
+   unsigned short  offset;
+   unsigned short  sg_flags;
+#else
unsigned long   page_link;
unsigned intoffset;
+#endif
unsigned intlength;
dma_addr_t  dma_address;
 #ifdef CONFIG_NEED_SG_DMA_LENGTH
diff --git a/include/crypto/scatterwalk.h b/include/crypto/scatterwalk.h
index 20e4226a2e14..7296d89a50b2 100644
--- a/include/crypto/scatterwalk.h
+++ b/include/crypto/scatterwalk.h
@@ -25,6 +25,15 @@
 #include 
 #include 
 
+#ifdef CONFIG_HAVE_DMA_PFN
+/*
+ * If we're using PFNs, the architecture must also have been converted to
+ * support SG_CHAIN.  So we can use the generic code instead of custom
+ * code.
+ */
+#define scatterwalk_sg_chain(prv, num, sgl)	sg_chain(prv, num, sgl)
+#define scatterwalk_sg_next(sgl)   sg_next(sgl)
+#else
 static inline void scatterwalk_sg_chain(struct scatterlist *sg1, int num,
struct scatterlist *sg2)
 {
@@ -32,6 +41,7 @@ static inline void scatterwalk_sg_chain(struct scatterlist 
*sg1, int num,
sg1[num - 1].page_link &= ~0x02;
sg1[num - 1].page_link |= 0x01;
 }
+#endif
 
 static inline void scatterwalk_crypto_chain(struct scatterlist *head,
struct scatterlist *sg,
diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index ed8f9e70df9b..9d423e559bdb 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -5,6 +5,7 @@
 #include 
 #include 
 
+#include 
 #include 
 #include 
 #include 
@@ -18,8 +19,14 @@ struct sg_table {
 /*
  * Notes on SG table design.
  *
- * Architectures must provide an unsigned long 

[PATCH v3 05/11] scatterlist: use sg_phys()

2015-05-11 Thread Dan Williams
Coccinelle cleanup to replace open-coded scatterlist-to-physical-address
translations.  This is in preparation for introducing scatterlists that
reference pfn(s) without a backing struct page.

// sg_phys.cocci: convert usage page_to_phys(sg_page(sg)) to sg_phys(sg)
// usage: make coccicheck COCCI=sg_phys.cocci MODE=patch

virtual patch
virtual report
virtual org

@@
struct scatterlist *sg;
@@

- page_to_phys(sg_page(sg)) + sg->offset
+ sg_phys(sg)

@@
struct scatterlist *sg;
@@

- page_to_phys(sg_page(sg))
+ sg_phys(sg) - sg->offset
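
Both rules lean on the definition of sg_phys() in
include/linux/scatterlist.h:

static inline dma_addr_t sg_phys(struct scatterlist *sg)
{
	return page_to_phys(sg_page(sg)) + sg->offset;
}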

Cc: Julia Lawall 
Signed-off-by: Dan Williams 
---
 arch/arm/mm/dma-mapping.c|2 +-
 arch/microblaze/kernel/dma.c |2 +-
 drivers/iommu/intel-iommu.c  |4 ++--
 drivers/iommu/iommu.c|2 +-
 drivers/staging/android/ion/ion_chunk_heap.c |4 ++--
 5 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
index 09c5fe3d30c2..43cc6a8fdacc 100644
--- a/arch/arm/mm/dma-mapping.c
+++ b/arch/arm/mm/dma-mapping.c
@@ -1502,7 +1502,7 @@ static int __map_sg_chunk(struct device *dev, struct 
scatterlist *sg,
return -ENOMEM;
 
for (count = 0, s = sg; count < (size >> PAGE_SHIFT); s = sg_next(s)) {
-   phys_addr_t phys = page_to_phys(sg_page(s));
+   phys_addr_t phys = sg_phys(s) - s->offset;
unsigned int len = PAGE_ALIGN(s->offset + s->length);
 
if (!is_coherent &&
diff --git a/arch/microblaze/kernel/dma.c b/arch/microblaze/kernel/dma.c
index ed7ba8a11822..dcb3c594d626 100644
--- a/arch/microblaze/kernel/dma.c
+++ b/arch/microblaze/kernel/dma.c
@@ -61,7 +61,7 @@ static int dma_direct_map_sg(struct device *dev, struct 
scatterlist *sgl,
/* FIXME this part of code is untested */
for_each_sg(sgl, sg, nents, i) {
sg->dma_address = sg_phys(sg);
-   __dma_sync(page_to_phys(sg_page(sg)) + sg->offset,
+   __dma_sync(sg_phys(sg),
sg->length, direction);
}
 
diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 68d43beccb7e..9b9ada71e0d3 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -1998,7 +1998,7 @@ static int __domain_mapping(struct dmar_domain *domain, 
unsigned long iov_pfn,
sg_res = aligned_nrpages(sg->offset, sg->length);
sg->dma_address = ((dma_addr_t)iov_pfn << VTD_PAGE_SHIFT) + sg->offset;
sg->dma_length = sg->length;
-   pteval = page_to_phys(sg_page(sg)) | prot;
+   pteval = (sg_phys(sg) - sg->offset) | prot;
phys_pfn = pteval >> VTD_PAGE_SHIFT;
}
 
@@ -3302,7 +3302,7 @@ static int intel_nontranslate_map_sg(struct device *hddev,
 
for_each_sg(sglist, sg, nelems, i) {
BUG_ON(!sg_page(sg));
-   sg->dma_address = page_to_phys(sg_page(sg)) + sg->offset;
+   sg->dma_address = sg_phys(sg);
sg->dma_length = sg->length;
}
return nelems;
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index d4f527e56679..59808fc9110d 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1147,7 +1147,7 @@ size_t default_iommu_map_sg(struct iommu_domain *domain, 
unsigned long iova,
min_pagesz = 1 << __ffs(domain->ops->pgsize_bitmap);
 
for_each_sg(sg, s, nents, i) {
-   phys_addr_t phys = page_to_phys(sg_page(s)) + s->offset;
+   phys_addr_t phys = sg_phys(s);
 
/*
 * We are mapping on IOMMU page boundaries, so offset within
diff --git a/drivers/staging/android/ion/ion_chunk_heap.c 
b/drivers/staging/android/ion/ion_chunk_heap.c
index 3e6ec2ee6802..b7da5d142aa9 100644
--- a/drivers/staging/android/ion/ion_chunk_heap.c
+++ b/drivers/staging/android/ion/ion_chunk_heap.c
@@ -81,7 +81,7 @@ static int ion_chunk_heap_allocate(struct ion_heap *heap,
 err:
sg = table->sgl;
for (i -= 1; i >= 0; i--) {
-   gen_pool_free(chunk_heap->pool, page_to_phys(sg_page(sg)),
+   gen_pool_free(chunk_heap->pool, sg_phys(sg) - sg->offset,
  sg->length);
sg = sg_next(sg);
}
@@ -109,7 +109,7 @@ static void ion_chunk_heap_free(struct ion_buffer *buffer)
DMA_BIDIRECTIONAL);
 
for_each_sg(table->sgl, sg, table->nents, i) {
-   gen_pool_free(chunk_heap->pool, page_to_phys(sg_page(sg)),
+   gen_pool_free(chunk_heap->pool, sg_phys(sg) - sg->offset,
  sg->length);
}
chunk_heap->allocated -= allocated_size;

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to 

[PATCH v3 01/11] arch: introduce __pfn_t for persistent/device memory

2015-05-11 Thread Dan Williams
Introduce a type that encapsulates a page-frame-number that is
optionally backed by memmap (struct page).  This type will be used in
place of 'struct page *' instances in contexts where device-backed
memory (usually persistent memory) is being referenced (scatterlists for
drivers, biovecs for the block layer, etc).  The operations in those i/o
paths that formerly required a 'struct page *' are to be converted to
use __pfn_t aware equivalent helpers.  Otherwise, in the absence of
persistent memory, there is no functional change and __pfn_t is an alias
for a normal memory page.

It turns out that while 'struct page' references are used broadly in the
kernel I/O stacks the usage of 'struct page' based capabilities is very
shallow for block-i/o.  It is only used for populating bio_vecs and
scatterlists for the retrieval of dma addresses, and for temporary
kernel mappings (kmap).  Aside from kmap, these usages can be trivially
converted to operate on a pfn.

Indeed, kmap_atomic() is more problematic as it uses mm infrastructure,
via struct page, to setup and track temporary kernel mappings.  It would
be unfortunate if the kmap infrastructure escaped its 32-bit/HIGHMEM
bonds and leaked into 64-bit code.  Thankfully, it seems all that is
needed here is to convert kmap_atomic() callers, that want to opt-in to
supporting persistent memory, to use a new kmap_atomic_pfn_t().  Where
kmap_atomic_pfn_t() is enabled to re-use the existing ioremap() mapping
established by the driver for persistent memory.

Note, that as far as conceptually understanding __pfn_t is concerned,
'persistent memory' is really any address range in host memory not
covered by memmap.  Contrast this with pure iomem that is on an mmio
mapped bus like PCI and cannot be converted to a dma_addr_t by "pfn <<
PAGE_SHIFT".

Cc: H. Peter Anvin 
Cc: Jens Axboe 
Cc: Tejun Heo 
Cc: Ingo Molnar 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Signed-off-by: Dan Williams 
---
 include/asm-generic/memory_model.h |1 
 include/asm-generic/pfn.h  |   84 
 include/linux/mm.h |1 
 init/Kconfig   |   13 ++
 4 files changed, 98 insertions(+), 1 deletion(-)
 create mode 100644 include/asm-generic/pfn.h

diff --git a/include/asm-generic/memory_model.h 
b/include/asm-generic/memory_model.h
index 14909b0b9cae..1b0ae21fd8ff 100644
--- a/include/asm-generic/memory_model.h
+++ b/include/asm-generic/memory_model.h
@@ -70,7 +70,6 @@
 #endif /* CONFIG_FLATMEM/DISCONTIGMEM/SPARSEMEM */
 
 #define page_to_pfn __page_to_pfn
-#define pfn_to_page __pfn_to_page
 
 #endif /* __ASSEMBLY__ */
 
diff --git a/include/asm-generic/pfn.h b/include/asm-generic/pfn.h
new file mode 100644
index ..ee1363e3c67c
--- /dev/null
+++ b/include/asm-generic/pfn.h
@@ -0,0 +1,84 @@
+#ifndef __ASM_PFN_H
+#define __ASM_PFN_H
+
+/*
+ * Default pfn to physical address conversion, like most arch
+ * page_to_phys() implementations this resolves to a dma_addr_t as it
+ * should be the size needed for a device to reference this address.
+ */
+#ifndef __pfn_to_phys
+#define __pfn_to_phys(pfn)  ((dma_addr_t)(pfn) << PAGE_SHIFT)
+#endif
+
+static inline struct page *pfn_to_page(unsigned long pfn)
+{
+   return __pfn_to_page(pfn);
+}
+
+/*
+ * __pfn_t: encapsulates a page-frame number that is optionally backed
+ * by memmap (struct page).  This type will be used in place of a
+ * 'struct page *' instance in contexts where unmapped memory (usually
+ * persistent memory) is being referenced (scatterlists for drivers,
+ * biovecs for the block layer, etc).  Whether a __pfn_t has a struct
+ * page backing is indicated by flags in the low bits of @data.
+ */
+typedef struct {
+   union {
+   unsigned long data;
+   struct page *page;
+   };
+} __pfn_t;
+
+enum {
+#if BITS_PER_LONG == 64
+   PFN_SHIFT = 3,
+#else
+   PFN_SHIFT = 2,
+#endif
+   PFN_MASK = (1 << PFN_SHIFT) - 1,
+   /* device-pfn not covered by memmap */
+   PFN_DEV = (1 << 0),
+};
+
+#ifdef CONFIG_DEV_PFN
+static inline bool __pfn_t_has_page(__pfn_t pfn)
+{
+   return (pfn.data & PFN_MASK) == 0;
+}
+
+#else
+static inline bool __pfn_t_has_page(__pfn_t pfn)
+{
+   return true;
+}
+#endif
+
+static inline struct page *__pfn_t_to_page(__pfn_t pfn)
+{
+   if (!__pfn_t_has_page(pfn))
+   return NULL;
+   return pfn.page;
+}
+
+static inline unsigned long __pfn_t_to_pfn(__pfn_t pfn)
+{
+   if (__pfn_t_has_page(pfn))
+   return page_to_pfn(pfn.page);
+   return pfn.data >> PFN_SHIFT;
+}
+
+static inline dma_addr_t __pfn_t_to_phys(__pfn_t pfn)
+{
+   if (!__pfn_t_has_page(pfn))
+   return __pfn_to_phys(__pfn_t_to_pfn(pfn));
+   return __pfn_to_phys(page_to_pfn(pfn.page));
+}
+
+static inline __pfn_t page_to_pfn_t(struct page *page)
+{
+   __pfn_t pfn = { .page = page };
+
+   return pfn;
+}
+#endif /* __ASM_PFN_H */
diff --git 

[PATCH v3 00/11] evacuate struct page from the block layer, introduce __pfn_t

2015-05-11 Thread Dan Williams
Changes since v2 [1]:
[1]: https://lwn.net/Articles/643437/

1/ Linus pointed out that comparing a __pfn_t value against PAGE_OFFSET
was both inefficient, when PAGE_OFFSET is a large constant, and
incorrect for archs that set PAGE_OFFSET to zero.  Instead, take
advantage of the standard alignment of a 'struct page *' to store a set
of flags.  In this patch set the only flag defined is PFN_DEV to
indicate "this pfn originated from device memory".  A potential future
flag is PFN_DEV_MAPPED if the device has arranged for an associated
struct page for the __pfn_t.

2/ Fix DAX against pmem device disable/removal using
kmap_atomic_pfn_t().  We can later exploit these annotations to protect
against the "stray pointer problem" whereby a kernel bug in an unrelated
part of the system causes inadvertent scribbling over pmem.

3/ Made the series easier to merge as it no longer causes compile errors
by default for new usages of bv_page arriving in the next merge window.

4/ arch/x86/kernel/kmap.c => mm/pfn.c since it is generic functionality.

5/ Updated the kmap_atomic() helpers in bio.h to use kmap_atomic_pfn_t()

Incremental diffstat:

 arch/powerpc/sysdev/axonram.c  |  9 +--
 arch/x86/Kconfig   |  2 +-
 arch/x86/kernel/Makefile   |  1 -
 block/bio.c|  4 +--
 block/blk-core.c   |  2 +-
 drivers/block/brd.c|  3 +--
 drivers/block/pmem.c   |  9 ---
 drivers/s390/block/dcssblk.c   | 11 +---
 fs/block_dev.c |  4 +--
 fs/dax.c   | 57 

 include/asm-generic/pfn.h  | 73 
++--
 include/linux/bio.h| 14 +-
 include/linux/blk_types.h  |  2 +-
 include/linux/blkdev.h |  7 +++--
 init/Kconfig   | 12 -
 mm/Makefile|  1 +
 arch/x86/kernel/kmap.c => mm/pfn.c |  0
 17 files changed, 140 insertions(+), 71 deletions(-)
 rename arch/x86/kernel/kmap.c => mm/pfn.c (100%)

While we wait for the debate [2] to settle about what to do about i/o
paths that ostensibly require struct page, these patches enable a
stacked/tiered storage driver to manage pmem fronting slower storage
media.

[2]: https://lists.01.org/pipermail/linux-nvdimm/2015-May/000727.html

---

Dan Williams (10):
  arch: introduce __pfn_t for persistent/device memory
  block: add helpers for accessing a bio_vec page
  block: convert .bv_page to .bv_pfn bio_vec
  dma-mapping: allow archs to optionally specify a ->map_pfn() operation
  scatterlist: use sg_phys()
  x86: support dma_map_pfn()
  x86: support kmap_atomic_pfn_t() for persistent memory
  block: convert kmap helpers to kmap_atomic_pfn_t()
  dax: convert to __pfn_t
  block: base support for pfn i/o

Matthew Wilcox (1):
  scatterlist: support "page-less" (__pfn_t only) entries


 Documentation/block/biodoc.txt   |4 +
 arch/Kconfig |6 ++
 arch/arm/mm/dma-mapping.c|2 -
 arch/microblaze/kernel/dma.c |2 -
 arch/powerpc/sysdev/axonram.c|   13 ++-
 arch/x86/Kconfig |7 ++
 arch/x86/kernel/amd_gart_64.c|   22 +-
 arch/x86/kernel/pci-nommu.c  |   22 +-
 arch/x86/kernel/pci-swiotlb.c|4 +
 arch/x86/pci/sta2x11-fixup.c |4 +
 arch/x86/xen/pci-swiotlb-xen.c   |4 +
 block/bio-integrity.c|8 +-
 block/bio.c  |   82 +++---
 block/blk-core.c |   13 +++
 block/blk-integrity.c|7 +-
 block/blk-lib.c  |2 -
 block/blk-merge.c|   15 ++--
 block/bounce.c   |   26 +++
 drivers/block/aoe/aoecmd.c   |8 +-
 drivers/block/brd.c  |7 +-
 drivers/block/drbd/drbd_bitmap.c |5 +
 drivers/block/drbd/drbd_main.c   |6 +-
 drivers/block/drbd/drbd_receiver.c   |4 +
 drivers/block/drbd/drbd_worker.c |3 +
 drivers/block/floppy.c   |6 +-
 drivers/block/loop.c |   13 ++-
 drivers/block/nbd.c  |8 +-
 drivers/block/nvme-core.c|2 -
 drivers/block/pktcdvd.c  |   11 ++-
 drivers/block/pmem.c |   19 -
 drivers/block/ps3disk.c  |2 -
 drivers/block/ps3vram.c  |2 -
 drivers/block/rbd.c  |2 -
 drivers/block/rsxx/dma.c |2 -
 drivers/block/umem.c |2 -
 

Re: [PATCH 1/2] clone: Support passing tls argument via C rather than pt_regs magic

2015-05-11 Thread Vineet Gupta
+CC Arnd, Al, linux-arch

On Monday 11 May 2015 08:17 PM, Josh Triplett wrote:
> On Mon, May 11, 2015 at 02:31:39PM +, Vineet Gupta wrote:
>> On Tuesday 21 April 2015 11:17 PM, Josh Triplett wrote:
>>> clone with CLONE_SETTLS accepts an argument to set the thread-local
>>> storage area for the new thread.  sys_clone declares an int argument
>>> tls_val in the appropriate point in the argument list (based on the
>>> various CLONE_BACKWARDS variants), but doesn't actually use or pass
>>> along that argument.  Instead, sys_clone calls do_fork, which calls
>>> copy_process, which calls the arch-specific copy_thread, and copy_thread
>>> pulls the corresponding syscall argument out of the pt_regs captured at
>>> kernel entry (knowing what argument of clone that architecture passes
>>> tls in).
>>>
>>> Apart from being awful and inscrutable, that also only works because
>>> only one code path into copy_thread can pass the CLONE_SETTLS flag, and
>>> that code path comes from sys_clone with its architecture-specific
>>> argument-passing order.  This prevents introducing a new version of the
>>> clone system call without propagating the same architecture-specific
>>> position of the tls argument.
>>>
>>> However, there's no reason to pull the argument out of pt_regs when
>>> sys_clone could just pass it down via C function call arguments.
>>>
>>> Introduce a new CONFIG_HAVE_COPY_THREAD_TLS for architectures to opt
>>> into, and a new copy_thread_tls that accepts the tls parameter as an
>>> additional unsigned long (syscall-argument-sized) argument.
>>> Change sys_clone's tls argument to an unsigned long (which does
>>> not change the ABI), and pass that down to copy_thread_tls.
>>>
>>> Architectures that don't opt into copy_thread_tls will continue to
>>> ignore the C argument to sys_clone in favor of the pt_regs captured at
>>> kernel entry, and thus will be unable to introduce new versions of the
>>> clone syscall.
>>>
>>> Signed-off-by: Josh Triplett 
>>> Signed-off-by: Thiago Macieira 
>>> Acked-by: Andy Lutomirski 
>>> ---
>>>  arch/Kconfig |  7 ++
>>>  include/linux/sched.h| 14 
>>>  include/linux/syscalls.h |  6 +++---
>>>  kernel/fork.c| 55 
>>> +++-
>>>  4 files changed, 60 insertions(+), 22 deletions(-)
>>>
>>> diff --git a/arch/Kconfig b/arch/Kconfig
>>> index 05d7a8a..4834a58 100644
>>> --- a/arch/Kconfig
>>> +++ b/arch/Kconfig
>>> @@ -484,6 +484,13 @@ config HAVE_IRQ_EXIT_ON_IRQ_STACK
>>>   This spares a stack switch and improves cache usage on softirq
>>>   processing.
>>>  
>>> +config HAVE_COPY_THREAD_TLS
>>> +   bool
>>> +   help
>>> + Architecture provides copy_thread_tls to accept tls argument via
>>> + normal C parameter passing, rather than extracting the syscall
>>> + argument from pt_regs.
>>> +
>>>  #
>>>  # ABI hall of shame
>>>  #
>>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>>> index a419b65..2cc88c6 100644
>>> --- a/include/linux/sched.h
>>> +++ b/include/linux/sched.h
>>> @@ -2480,8 +2480,22 @@ extern struct mm_struct *mm_access(struct 
>>> task_struct *task, unsigned int mode);
>>>  /* Remove the current tasks stale references to the old mm_struct */
>>>  extern void mm_release(struct task_struct *, struct mm_struct *);
>>>  
>>> +#ifdef CONFIG_HAVE_COPY_THREAD_TLS
>>> +extern int copy_thread_tls(unsigned long, unsigned long, unsigned long,
>>> +   struct task_struct *, unsigned long);
>>> +#else
>>>  extern int copy_thread(unsigned long, unsigned long, unsigned long,
>>> struct task_struct *);
>>> +
>>> +/* Architectures that haven't opted into copy_thread_tls get the tls 
>>> argument
>>> + * via pt_regs, so ignore the tls argument passed via C. */
>>> +static inline int copy_thread_tls(
>>> +   unsigned long clone_flags, unsigned long sp, unsigned long arg,
>>> +   struct task_struct *p, unsigned long tls)
>>> +{
>>> +   return copy_thread(clone_flags, sp, arg, p);
>>> +}
>>> +#endif
>>
>> Is this detour really needed. Can we not update copy_thread() of all arches 
>> in one
>> go and add the tls arg, w/o using it.
>>
>> And then arch maintainers can micro-optimize their code to use that arg vs.
>> pt_regs->rxx version at their own leisure. The only downside I see with that 
>> is
>> bigger churn (touches all arches), and a interim unused arg warning ?
> 
> In addition to the cleanup and simplification, the purpose of this patch
> is specifically to make sure that any architecture opting into
> HAVE_COPY_THREAD_TLS does *not* care how tls is passed in, and in
> particular doesn't depend on it arriving in a specific syscall argument.

Sorry for sounding dense, but as I see it, in the end even for non-opting
arches the copy_thread_tls() calling convention expects the tls arg to be
passed down from the sys_clone call stack, but simply drops it. So that
arg is always available, it has to be, otherwise even the pt_regs 

Re: [RFC][PATCH] x86/hpet: fix NULL pointer dereference in msi_domain_alloc_irqs()

2015-05-11 Thread Sergey Senozhatsky
> directly call __irq_domain_alloc_irqs() in hpet_assign_irq() and pass
> correct `arg' to fix the oops.
> 

oh, what I was thinking about... it should be as simple as this.

8<-8<-

>From 8be2eb548cefc788c87b05da22176b7360c6aca9 Mon Sep 17 00:00:00 2001
From: Sergey Senozhatsky 
Date: Mon, 11 May 2015 18:56:49 +0900
Subject: [PATCH] x86/hpet: fix NULL pointer dereference in
 msi_domain_alloc_irqs()

Fix the following oops:
 hpet_msi_get_hwirq+0x1f/0x27
 msi_domain_alloc+0x35/0xfe
 ? trace_hardirqs_on_caller+0x16c/0x188
 irq_domain_alloc_irqs_recursive+0x51/0x95
 __irq_domain_alloc_irqs+0x151/0x223
 hpet_assign_irq+0x5d/0x68
 hpet_msi_capability_lookup+0x121/0x1cb
 ? hpet_enable+0x2b4/0x2b4
 hpet_late_init+0x5f/0xf2
 ? hpet_enable+0x2b4/0x2b4
 do_one_initcall+0x184/0x199
 kernel_init_freeable+0x1af/0x237
 ? rest_init+0x13a/0x13a
 kernel_init+0xe/0xd4
 ret_from_fork+0x3f/0x70
 ? rest_init+0x13a/0x13a

Since 3cb96f0c9733 ('x86/hpet: Enhance HPET IRQ to support hierarchical
irqdomains') hpet_msi_capability_lookup() uses hpet_assign_irq(). The
latter discards its `irq_alloc_info info' param and instead passes NULL
to __irq_domain_alloc_irqs() as `arg'. __irq_domain_alloc_irqs() invokes
irq_domain_alloc_irqs_recursive(), which calls down into
msi_domain_alloc() and, eventually, accesses `arg->hpet_index' in
hpet_msi_get_hwirq().

Pass a correct `irq_alloc_info info' pointer to irq_domain_alloc_irqs()
in hpet_assign_irq() to fix the oops.

Signed-off-by: Sergey Senozhatsky 
---
 arch/x86/kernel/apic/msi.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/apic/msi.c b/arch/x86/kernel/apic/msi.c
index 58fde66..ef516af 100644
--- a/arch/x86/kernel/apic/msi.c
+++ b/arch/x86/kernel/apic/msi.c
@@ -351,6 +351,6 @@ int hpet_assign_irq(struct irq_domain *domain, struct 
hpet_dev *dev,
info.hpet_id = hpet_dev_id(domain);
info.hpet_index = dev_num;
 
-   return irq_domain_alloc_irqs(domain, 1, NUMA_NO_NODE, NULL);
+   return irq_domain_alloc_irqs(domain, 1, NUMA_NO_NODE, &info);
 }
 #endif
-- 
2.4.0.rc3.3.g6eb1401



Re: [PATCH v3 105/110] namei: make unlazy_walk and terminate_walk handle nd->stack, add unlazy_link

2015-05-11 Thread Al Viro
On Tue, May 12, 2015 at 12:43:33AM +0100, Al Viro wrote:
> On Mon, May 11, 2015 at 07:08:05PM +0100, Al Viro wrote:
> > +static bool legitimize_links(struct nameidata *nd)
> > +{
> > +   int i;
> > +   for (i = 0; i < nd->depth; i++) {
> > +   struct saved *last = nd->stack + i;
> > +   if (unlikely(!legitimize_path(nd, &last->link, last->seq))) {
> > +   drop_links(nd);
> > +   nd->depth = i;
> 
> Broken, actually - it should be i + 1.  What happens is that we attempt to
> grab references on nd->stack[...].link; if everything succeeds, we'd won.
> If legitimizing nd->stack[i].link fails (e.g. ->d_seq has changed on us),
> we
>   * put_link everything in stack and clear nd->stack[...].cookie, making
> sure that nobody will call ->put_link() on it later.
>   * leave the things for terminate_walk() so that it would do
> path_put() on everything we have grabbed and ignored everything we hadn't
> even got around to.
> 
> But this failed legitimize_path() requires path_put() - we *can't* block
> there (we wouldn't be able to do ->put_link() afterwards if we did), so
> we just zero what we didn't grab and leave what we had for subsequent
> path_put().  Which may be anything from "nothing" (mount_lock has been
> touched) to "both vfsmount and dentry" (->d_seq mismatch).
> 
> So we need to set nd->depth to i + 1 here, not i.  As it is, we are risking
> a vfsmount (and possibly dentry) leak.  Fixed and force-pushed...

FWIW, below is a better replacement; tested and force-pushed.  And seeing that
we just got nd->root_seq, I wonder if we really need messing with
current->fs there - something like
if (nd->root.mnt && !(nd->flags & LOOKUP_ROOT)) {
if (unlikely(!legitimize_path(&nd->root, nd->root_seq))) {
rcu_read_unlock();
dput(dentry);
return -ECHILD;
}
}
should do better than playing with fs->lock, etc. we do right now...

diff --git a/fs/namei.c b/fs/namei.c
index 92bf031..6db14f2 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -554,6 +554,68 @@ static inline int nd_alloc_stack(struct nameidata *nd)
return __nd_alloc_stack(nd);
 }
 
+static void drop_links(struct nameidata *nd)
+{
+   int i = nd->depth;
+   while (i--) {
+   struct saved *last = nd->stack + i;
+   struct inode *inode = last->inode;
+   if (last->cookie && inode->i_op->put_link) {
+   inode->i_op->put_link(inode, last->cookie);
+   last->cookie = NULL;
+   }
+   }
+}
+
+static void terminate_walk(struct nameidata *nd)
+{
+   drop_links(nd);
+   if (!(nd->flags & LOOKUP_RCU)) {
+   int i;
+   path_put(&nd->path);
+   for (i = 0; i < nd->depth; i++)
+   path_put(&nd->stack[i].link);
+   } else {
+   nd->flags &= ~LOOKUP_RCU;
+   if (!(nd->flags & LOOKUP_ROOT))
+   nd->root.mnt = NULL;
+   rcu_read_unlock();
+   }
+   nd->depth = 0;
+}
+
+/* path_put is needed afterwards regardless of success or failure */
+static bool legitimize_path(struct nameidata *nd,
+   struct path *path, unsigned seq)
+{
+   int res = __legitimize_mnt(path->mnt, nd->m_seq);
+   if (unlikely(res)) {
+   if (res > 0)
+   path->mnt = NULL;
+   path->dentry = NULL;
+   return false;
+   }
+   if (unlikely(!lockref_get_not_dead(&path->dentry->d_lockref))) {
+   path->dentry = NULL;
+   return false;
+   }
+   return !read_seqcount_retry(&path->dentry->d_seq, seq);
+}
+
+static bool legitimize_links(struct nameidata *nd)
+{
+   int i;
+   for (i = 0; i < nd->depth; i++) {
+   struct saved *last = nd->stack + i;
+   if (unlikely(!legitimize_path(nd, &last->link, last->seq))) {
+   drop_links(nd);
+   nd->depth = i + 1;
+   return false;
+   }
+   }
+   return true;
+}
+
 /*
  * Path walking has 2 modes, rcu-walk and ref-walk (see
  * Documentation/filesystems/path-lookup.txt).  In situations when we can't
@@ -575,6 +637,8 @@ static inline int nd_alloc_stack(struct nameidata *nd)
  * unlazy_walk attempts to legitimize the current nd->path, nd->root and dentry
  * for ref-walk mode.  @dentry must be a path found by a do_lookup call on
  * @nd or NULL.  Must be called from rcu-walk context.
+ * Nothing should touch nameidata between unlazy_walk() failure and
+ * terminate_walk().
  */
 static int unlazy_walk(struct nameidata *nd, struct dentry *dentry, unsigned seq)
 {
@@ -583,22 +647,13 @@ static int unlazy_walk(struct nameidata *nd, struct 
dentry *dentry, unsigned seq
 
BUG_ON(!(nd->flags & LOOKUP_RCU));
 
-   /*
-* After legitimizing the bastards, 

Re: [PATCH RFC] vfs: add a O_NOMTIME flag

2015-05-11 Thread Kevin Easton
On Mon, May 11, 2015 at 07:10:21PM -0400, Theodore Ts'o wrote:
> On Mon, May 11, 2015 at 09:24:09AM -0700, Sage Weil wrote:
> > > Let me re-ask the question that I asked last week (and was apparently
> > > ignored).  Why not trying to use the lazytime feature instead of
> > > pointing a head straight at the application's --- and system
> > > administrators' --- heads?
> > 
> > Sorry Ted, I thought I responded already.
> > 
> > The goal is to avoid inode writeout entirely when we can, and 
> > as I understand it lazytime will still force writeout before the inode 
> > is dropped from the cache.  In systems like Ceph in particular, the 
> > IOs can be spread across lots of files, so simply deferring writeout 
> > doesn't always help.
> 
> Sure, but it would reduce the writeout by orders of magnitude.  I can
> understand if you want to reduce it further, but it might be good
> enough for your purposes.
> 
> I considered doing the equivalent of O_NOMTIME for our purposes at
> $WORK, and our use case is actually not that different from Ceph's
> (i.e., using a local disk file system to support a cluster file
> system), and lazytime was (a) something I figured was something I
> could upstream in good conscience, and (b) was more than good enough
> for us.

A safer alternative might be a chattr file attribute that, when set,
stops mtime updates on writes and makes stat() on the file always
report the mtime as "right now".  At least that way, the file won't
accidentally get left out of backups that rely on the mtime.

(If the attribute is later cleared, you update the mtime immediately,
and from then on the file behaves normally.)
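
A minimal sketch of what the stat side could look like, assuming a
hypothetical S_NOMTIME inode flag with an IS_NOMTIME() helper (neither
exists today):

/* sketch only: report mtime as "right now" while the attribute is set */
static int example_getattr(struct vfsmount *mnt, struct dentry *dentry,
			   struct kstat *stat)
{
	struct inode *inode = dentry->d_inode;

	generic_fillattr(inode, stat);
	if (IS_NOMTIME(inode))	/* hypothetical flag test */
		stat->mtime = current_fs_time(inode->i_sb);
	return 0;
}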

- Kevin


Re: [PATCH 4.0 00/72] 4.0.3-stable review

2015-05-11 Thread Greg Kroah-Hartman
On Mon, May 11, 2015 at 05:40:28PM -0600, Shuah Khan wrote:
> On 05/11/2015 11:54 AM, Greg Kroah-Hartman wrote:
> > This is the start of the stable review cycle for the 4.0.3 release.
> > There are 72 patches in this series, all will be posted as a response
> > to this one.  If anyone has any issues with these being applied, please
> > let me know.
> > 
> > Responses should be made by Wed May 13 17:54:19 UTC 2015.
> > Anything received after that time might be too late.
> > 
> > The whole patch series can be found in one patch at:
> > kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.0.3-rc1.gz
> > and the diffstat can be found below.
> > 
> > thanks,
> > 
> > greg k-h
> > 
> 
> Compiled and booted on my test system. No dmesg regressions.

Thanks for testing all 3 of these and letting me know.

greg k-h


Re: [alsa-devel] [RFC PATCH 00/14] ASoC: qcom: add support to apq8016 audio

2015-05-11 Thread Kenneth Westfield
On Tue, May 05, 2015 at 11:54:16PM -0700, Srinivas Kandagatla wrote:
> Hi Kenneth,
> 
> On 06/05/15 06:47, Kenneth Westfield wrote:
> >>>
> >>>I will test the patches and let you know by Wednesday.  Also, I posted
> >>>some comments, but Patrick should be posting his comments separately
> >>>later next week.
> >Srinivas,
> >
> >After applying the patches, audio playback is no longer functional on
> >the storm board.  Give me some time to debug this and I will get back
> >to you.
> I found at least one issue with the rdma audif bits in the ipq806x lpass
> 
> Could you try this change? It will be fixed properly in the next version.
> 
> ---><-
> diff --git a/sound/soc/qcom/lpass-ipq806x.c
> b/sound/soc/qcom/lpass-ipq806x.c
> index 11a7053..2b00355 100644
> --- a/sound/soc/qcom/lpass-ipq806x.c
> +++ b/sound/soc/qcom/lpass-ipq806x.c
> @@ -69,6 +69,7 @@ struct lpass_variant ipq806x_data = {
 	.rdma_reg_base		= 0x6000,
 	.rdma_reg_stride	= 0x1000,
 	.rdma_channels		= 4,
+	.rdmactl_audif_start	= 4,
 	.dai_driver		= &ipq806x_cpu_dai_driver,
 	.num_dai		= 1,
> 
> ---><-

Srinivas,

I was able to get audio working on the Storm board.  There were several
issues.  First, the I2S control port was saved in the DAI driver id field
(which was 4), but the DAI id field was used by the macro (which was 0).
The patch below fixed it:

---><-
diff --git a/sound/soc/qcom/lpass-cpu.c b/sound/soc/qcom/lpass-cpu.c
index 17ad20d..58ae8af 100644
--- a/sound/soc/qcom/lpass-cpu.c
+++ b/sound/soc/qcom/lpass-cpu.c
@@ -146,7 +146,7 @@ static int lpass_cpu_daiops_hw_params(struct 
snd_pcm_substream *substream,
}
 
ret = regmap_write(drvdata->lpaif_map,
-  LPAIF_I2SCTL_REG(drvdata->variant, dai->id),
+  LPAIF_I2SCTL_REG(drvdata->variant, dai->driver->id),
   regval);
if (ret) {
dev_err(dai->dev, "%s() error writing to i2sctl reg: %d\n",
@@ -171,7 +171,7 @@ static int lpass_cpu_daiops_hw_free(struct 
snd_pcm_substream *substream,
int ret;
 
ret = regmap_write(drvdata->lpaif_map,
-  LPAIF_I2SCTL_REG(drvdata->variant, dai->id), 0);
+			   LPAIF_I2SCTL_REG(drvdata->variant, dai->driver->id), 0);
if (ret)
dev_err(dai->dev, "%s() error writing to i2sctl reg: %d\n",
__func__, ret);
@@ -186,7 +186,7 @@ static int lpass_cpu_daiops_prepare(struct 
snd_pcm_substream *substream,
int ret;
 
ret = regmap_update_bits(drvdata->lpaif_map,
-   LPAIF_I2SCTL_REG(drvdata->variant, dai->id),
+   LPAIF_I2SCTL_REG(drvdata->variant, dai->driver->id),
LPAIF_I2SCTL_SPKEN_MASK, LPAIF_I2SCTL_SPKEN_ENABLE);
if (ret)
dev_err(dai->dev, "%s() error writing to i2sctl reg: %d\n",
@@ -206,7 +206,7 @@ static int lpass_cpu_daiops_trigger(struct 
snd_pcm_substream *substream,
case SNDRV_PCM_TRIGGER_RESUME:
case SNDRV_PCM_TRIGGER_PAUSE_RELEASE:
ret = regmap_update_bits(drvdata->lpaif_map,
-   LPAIF_I2SCTL_REG(drvdata->variant, dai->id),
+			LPAIF_I2SCTL_REG(drvdata->variant, dai->driver->id),
LPAIF_I2SCTL_SPKEN_MASK,
LPAIF_I2SCTL_SPKEN_ENABLE);
if (ret)
@@ -217,7 +217,7 @@ static int lpass_cpu_daiops_trigger(struct 
snd_pcm_substream *substream,
case SNDRV_PCM_TRIGGER_SUSPEND:
case SNDRV_PCM_TRIGGER_PAUSE_PUSH:
ret = regmap_update_bits(drvdata->lpaif_map,
-   LPAIF_I2SCTL_REG(drvdata->variant, dai->id),
+			LPAIF_I2SCTL_REG(drvdata->variant, dai->driver->id),
LPAIF_I2SCTL_SPKEN_MASK,
LPAIF_I2SCTL_SPKEN_DISABLE);
if (ret)
@@ -247,7 +247,7 @@ int lpass_cpu_dai_probe(struct snd_soc_dai *dai)
 
/* ensure audio hardware is disabled */
ret = regmap_write(drvdata->lpaif_map,
-   LPAIF_I2SCTL_REG(drvdata->variant, dai->id), 0);
+   LPAIF_I2SCTL_REG(drvdata->variant, dai->driver->id), 0);
if (ret)
dev_err(dai->dev, "%s() error writing to i2sctl reg: %d\n",
__func__, ret);
---><-

In addition to your patch above, I also needed to correct the rdma_port
assignment by removing the i2s port reference:

---><-
diff --git 

Re: [PATCH 1/3 v6] dt/bindings: Add binding for the BCM2835 mailbox driver

2015-05-11 Thread Jassi Brar
On Wed, May 6, 2015 at 1:57 AM, Eric Anholt  wrote:
> From: Lubomir Rintel 
>
> This patch was split out of Lubomir's original mailbox patch by Eric
> Anholt, and the required properties documentation and examples have
> been filled out more completely and updated for the driver being
> changed to expose a single channel.
>
> Signed-off-by: Lubomir Rintel 
> Signed-off-by: Craig McGeachie 
> Signed-off-by: Eric Anholt 
> Acked-by: Lee Jones 
> Acked-by: Stephen Warren 
> ---
Thanks to the reviewers. Applied all 3 patches.

Thanks.


Re: [PATCH v8 0/9] Tegra xHCI support

2015-05-11 Thread Jassi Brar
On Mon, May 4, 2015 at 11:06 PM, Andrew Bresticker
 wrote:
> This series adds support for xHCI on NVIDIA Tegra SoCs.  This includes:
>  - patches 1, 2, and 3: minor cleanups for mailbox framework and xHCI,
>  - patches 4 and 5: adding an MFD driver for the XUSB complex,
>  - patches 6 and 7: adding a driver for the mailbox used to communicate
>with the xHCI controller's firmware, and
>  - patches 8 and 9: adding a xHCI host-controller driver.
>
> The addition of USB PHY support to the XUSB padctl driver has been dropped.
> Thierry will be posting those patches later.
>
> Given the many compile and run-time dependencies in this series, it is 
> probably
> best if the first 3 patches are picked up by the relevant maintainers in topic
> branches so that the remainder of the series can go through the Tegra tree.
>
> Tested on Jetson TK1 and Nyan-Big with a variety of USB2.0 and USB3.0 memory
> sticks and ethernet dongles.  This has also been tested, with additional
> out-of-tree patches, on Tegra132 and Tegra210 based boards.
>
> Based on v4.1-rc2.  A branch with the entire series is available at:
>   https://github.com/abrestic/linux/tree/tegra-xhci-v8
>
> Cc: Jon Hunter 
>
> Changes from v7:
>  - Move non-shared resources into child nodes of MFD.
>  - Fixed a couple of mailbox driver bugs.
>
> Changes from v6:
>  - Dropped PHY changes from series.  Will be posted later by Thierry.
>  - Added an MFD device with the mailbox and xHCI host as sub-devices.
>
> Changes from v5:
>  - Addressed review comments from Jassi and Felipe.
>
> Changes from v4:
>  - Made USB support optional in padctl driver.
>  - Made usb3-port a pinconfig property again.
>  - Cleaned up mbox_request_channel() error handling and allowed it to defer
>probing (patch 3).
>  - Minor xHCI (patch 1) and mailbox framework (patch 2) cleanups suggested
>by Thierry.
>  - Addressed Thierry's review comments.
>
> Changes from v3:
>  - Fixed USB2.0 flakiness on Jetson-TK1.
>  - Switched to 32-bit DMA mask for host.
>  - Addressed Stephen's review comments.
>
> Changes from v2:
>  - Dropped mailbox channel specifier.  The mailbox driver allocates virtual
>channels backed by the single physical channel.
>  - Added support for HS_CURR_LEVEL adjustment pinconfig property, which
>will be required for the Blaze board.
>  - Addressed Stephen's review comments.
>
> Changes from v1:
>  - Converted mailbox driver to use the common mailbox framework.
>  - Fixed up host driver so that it can now be built and used as a module.
>  - Addressed Stephen's review comments.
>  - Misc. cleanups.
>
> Andrew Bresticker (8):
>   xhci: Set shared HCD's hcd_priv in xhci_gen_setup
>   mailbox: Make mbox_chan_ops const
>   mfd: Add binding document for NVIDIA Tegra XUSB
>   mfd: Add driver for NVIDIA Tegra XUSB
>   mailbox: Add NVIDIA Tegra XUSB mailbox binding
>   mailbox: Add NVIDIA Tegra XUSB mailbox driver
>   usb: Add NVIDIA Tegra xHCI controller binding
>   usb: xhci: Add NVIDIA Tegra xHCI host-controller driver
>
> Benson Leung (1):
>   mailbox: Fix up error handling in mbox_request_channel()
>
>  .../bindings/mailbox/nvidia,tegra124-xusb-mbox.txt |  32 +
>  .../bindings/mfd/nvidia,tegra124-xusb.txt  |  37 +
>  .../bindings/usb/nvidia,tegra124-xhci.txt  |  96 +++
>  drivers/mailbox/Kconfig|   8 +
>  drivers/mailbox/Makefile   |   2 +
>  drivers/mailbox/arm_mhu.c  |   2 +-
>  drivers/mailbox/mailbox-altera.c   |   2 +-
>  drivers/mailbox/mailbox.c  |  11 +-
>  drivers/mailbox/omap-mailbox.c |   8 +-
>  drivers/mailbox/pcc.c  |   2 +-
>  drivers/mailbox/tegra-xusb-mailbox.c   | 290 +++
>  drivers/mfd/Kconfig|   7 +
>  drivers/mfd/Makefile   |   1 +
>  drivers/mfd/tegra-xusb.c   |  75 ++
>  drivers/usb/host/Kconfig   |  10 +
>  drivers/usb/host/Makefile  |   1 +
>  drivers/usb/host/xhci-pci.c|   5 -
>  drivers/usb/host/xhci-plat.c   |   5 -
>  drivers/usb/host/xhci-tegra.c  | 947 
> +
>  drivers/usb/host/xhci.c|   6 +-
>  include/linux/mailbox_controller.h |   2 +-
>  include/soc/tegra/xusb.h   |  43 +
>  22 files changed, 1568 insertions(+), 24 deletions(-)
>  create mode 100644 
> Documentation/devicetree/bindings/mailbox/nvidia,tegra124-xusb-mbox.txt
>  create mode 100644 
> Documentation/devicetree/bindings/mfd/nvidia,tegra124-xusb.txt
>  create mode 100644 
> Documentation/devicetree/bindings/usb/nvidia,tegra124-xhci.txt
>  create mode 100644 drivers/mailbox/tegra-xusb-mailbox.c
>  create mode 100644 drivers/mfd/tegra-xusb.c
>  create mode 100644 drivers/usb/host/xhci-tegra.c
>  

[PATCH] f2fs crypto: use inode number for xts_tweak

2015-05-11 Thread Jaegeuk Kim
Previously, page->index was used when making the xts_tweak.
But once fcollapse is supported, blocks can be moved, so we can lose
the original page->index, which causes decryption failure.

In order to avoid that, let's use inode->i_ino for the xts_tweak hint.
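
For reference, a minimal standalone sketch of the tweak construction
(illustrative C, not the f2fs code; the 16-byte tweak size is an
assumption here, matching F2FS_XTS_TWEAK_SIZE in the driver):

	#include <stdint.h>
	#include <string.h>

	#define XTS_TWEAK_SIZE 16

	/* Build an XTS tweak from the inode number: copy the 64-bit
	 * ino, then zero-pad the remainder, the same shape the patch
	 * below gives f2fs_page_crypto(). */
	static void build_xts_tweak(uint8_t tweak[XTS_TWEAK_SIZE],
				    uint64_t ino)
	{
		memcpy(tweak, &ino, sizeof(ino));
		memset(tweak + sizeof(ino), 0,
		       XTS_TWEAK_SIZE - sizeof(ino));
	}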

Signed-off-by: Jaegeuk Kim 
---
 fs/f2fs/crypto.c | 13 ++---
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/fs/f2fs/crypto.c b/fs/f2fs/crypto.c
index 1dd0835..35986a5 100644
--- a/fs/f2fs/crypto.c
+++ b/fs/f2fs/crypto.c
@@ -375,7 +375,6 @@ typedef enum {
 static int f2fs_page_crypto(struct f2fs_crypto_ctx *ctx,
struct inode *inode,
f2fs_direction_t rw,
-   pgoff_t index,
struct page *src_page,
struct page *dest_page)
 
@@ -420,10 +419,10 @@ static int f2fs_page_crypto(struct f2fs_crypto_ctx *ctx,
req, CRYPTO_TFM_REQ_MAY_BACKLOG | CRYPTO_TFM_REQ_MAY_SLEEP,
	f2fs_crypt_complete, &ecr);
 
-   BUILD_BUG_ON(F2FS_XTS_TWEAK_SIZE < sizeof(index));
-   memcpy(xts_tweak, &index, sizeof(index));
-   memset(&xts_tweak[sizeof(index)], 0,
-   F2FS_XTS_TWEAK_SIZE - sizeof(index));
+   BUILD_BUG_ON(F2FS_XTS_TWEAK_SIZE < sizeof(inode->i_ino));
+   memcpy(xts_tweak, &inode->i_ino, sizeof(inode->i_ino));
+   memset(&xts_tweak[sizeof(inode->i_ino)], 0,
+   F2FS_XTS_TWEAK_SIZE - sizeof(inode->i_ino));
 
	sg_init_table(&dst, 1);
	sg_set_page(&dst, dest_page, PAGE_CACHE_SIZE, 0);
@@ -496,7 +495,7 @@ struct page *f2fs_encrypt(struct inode *inode,
}
ctx->bounce_page = ciphertext_page;
ctx->control_page = plaintext_page;
-   err = f2fs_page_crypto(ctx, inode, F2FS_ENCRYPT, plaintext_page->index,
+   err = f2fs_page_crypto(ctx, inode, F2FS_ENCRYPT,
plaintext_page, ciphertext_page);
if (err) {
f2fs_release_crypto_ctx(ctx);
@@ -524,7 +523,7 @@ int f2fs_decrypt(struct f2fs_crypto_ctx *ctx, struct page 
*page)
BUG_ON(!PageLocked(page));
 
return f2fs_page_crypto(ctx, page->mapping->host,
-   F2FS_DECRYPT, page->index, page, page);
+   F2FS_DECRYPT, page, page);
 }
 
 /*
-- 
2.1.1



[PATCH 2/3] f2fs: do not issue next dnode discard redundantly

2015-05-11 Thread Jaegeuk Kim
We have a discard map, so we can avoid issuing redundant discards.
(A generic sketch of the bitmap-guard idiom follows.)
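
As a rough illustration of the idiom (generic C, not the f2fs code,
which only tests se->discard_map here and maintains it elsewhere):

	#include <stdbool.h>
	#include <stdint.h>

	/* Skip work the bitmap says was already done, and record new
	 * work so the next caller can skip it too. */
	static bool test_and_set_discarded(uint64_t *map, unsigned int nr)
	{
		uint64_t bit = 1ULL << (nr & 63);

		if (map[nr >> 6] & bit)
			return true;	/* already discarded, skip */
		map[nr >> 6] |= bit;
		return false;
	}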

Signed-off-by: Jaegeuk Kim 
---
 fs/f2fs/segment.c | 15 ++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
index 2c40ce1..342e0f7 100644
--- a/fs/f2fs/segment.c
+++ b/fs/f2fs/segment.c
@@ -486,7 +486,20 @@ static int f2fs_issue_discard(struct f2fs_sb_info *sbi,
 
 void discard_next_dnode(struct f2fs_sb_info *sbi, block_t blkaddr)
 {
-   if (f2fs_issue_discard(sbi, blkaddr, 1)) {
+   int err = -ENOTSUPP;
+
+   if (test_opt(sbi, DISCARD)) {
+   struct seg_entry *se = get_seg_entry(sbi,
+   GET_SEGNO(sbi, blkaddr));
+   unsigned int offset = GET_BLKOFF_FROM_SEG0(sbi, blkaddr);
+
+   if (f2fs_test_bit(offset, se->discard_map))
+   return;
+
+   err = f2fs_issue_discard(sbi, blkaddr, 1);
+   }
+
+   if (err) {
struct page *page = grab_meta_page(sbi, blkaddr);
/* zero-filled page */
set_page_dirty(page);
-- 
2.1.1



[PATCH 1/3] f2fs: disable the discard option when device does not support

2015-05-11 Thread Jaegeuk Kim
This patch disables the discard option when the device does not support it.

Signed-off-by: Jaegeuk Kim 
---
 fs/f2fs/super.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
index bd8a405..19438f2 100644
--- a/fs/f2fs/super.c
+++ b/fs/f2fs/super.c
@@ -1186,6 +1186,7 @@ try_onemore:
f2fs_msg(sb, KERN_WARNING,
"mounting with \"discard\" option, but "
"the device does not support discard");
+   clear_opt(sbi, DISCARD);
}
 
sbi->s_kobj.kset = f2fs_kset;
-- 
2.1.1



[PATCH 3/3] f2fs: get rid of buggy function

2015-05-11 Thread Jaegeuk Kim
This patch avoids using a buggy function for now.
It needs to be fixed properly later.

Signed-off-by: Jaegeuk Kim 
---
 fs/f2fs/segment.c | 15 ++-
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
index 342e0f7..17e89ba 100644
--- a/fs/f2fs/segment.c
+++ b/fs/f2fs/segment.c
@@ -64,6 +64,8 @@ static inline unsigned long __reverse_ffs(unsigned long word)
return num;
 }
 
+/* FIXME: Do not use this due to a subtle bug */
+#if 0
 /*
  * __find_rev_next(_zero)_bit is copied from lib/find_next_bit.c because
  * f2fs_set_bit makes MSB and LSB reversed in a byte.
@@ -122,6 +124,7 @@ found_first:
 found_middle:
return result + __reverse_ffs(tmp);
 }
+#endif
 
 static unsigned long __find_rev_next_zero_bit(const unsigned long *addr,
unsigned long size, unsigned long offset)
@@ -542,7 +545,7 @@ static void add_discard_addrs(struct f2fs_sb_info *sbi, 
struct cp_control *cpc)
unsigned long *ckpt_map = (unsigned long *)se->ckpt_valid_map;
unsigned long *discard_map = (unsigned long *)se->discard_map;
unsigned long *dmap = SIT_I(sbi)->tmp_map;
-   unsigned int start = 0, end = -1;
+   unsigned int start = -1, end = 0;
bool force = (cpc->reason == CP_DISCARD);
int i;
 
@@ -561,12 +564,14 @@ static void add_discard_addrs(struct f2fs_sb_info *sbi, 
struct cp_control *cpc)
(cur_map[i] ^ ckpt_map[i]) & ckpt_map[i];
 
while (force || SM_I(sbi)->nr_discards <= SM_I(sbi)->max_discards) {
-   start = __find_rev_next_bit(dmap, max_blocks, end + 1);
-   if (start >= max_blocks)
-   break;
 
end = __find_rev_next_zero_bit(dmap, max_blocks, start + 1);
-   __add_discard_entry(sbi, cpc, se, start, end);
+
+   __add_discard_entry(sbi, cpc, se, start + 1, end);
+
+   if (end >= max_blocks)
+   break;
+   start = end;
}
 }
 
-- 
2.1.1



Re: [PATCH 16/18 v2] f2fs crypto: add symlink encryption

2015-05-11 Thread Jaegeuk Kim
Change log from v1:
 o split inode_operations suggested by Al

This patch implements encryption support for symlink.

Signed-off-by: Uday Savagaonkar 
Signed-off-by: Theodore Ts'o 
Signed-off-by: Jaegeuk Kim 
---
 fs/f2fs/f2fs.h  |   1 +
 fs/f2fs/inode.c |   5 +-
 fs/f2fs/namei.c | 145 ++--
 3 files changed, 145 insertions(+), 6 deletions(-)

diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
index 686517d..5809590 100644
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -1920,6 +1920,7 @@ extern const struct address_space_operations 
f2fs_node_aops;
 extern const struct address_space_operations f2fs_meta_aops;
 extern const struct inode_operations f2fs_dir_inode_operations;
 extern const struct inode_operations f2fs_symlink_inode_operations;
+extern const struct inode_operations f2fs_encrypted_symlink_inode_operations;
 extern const struct inode_operations f2fs_special_inode_operations;
 extern struct kmem_cache *inode_entry_slab;
 
diff --git a/fs/f2fs/inode.c b/fs/f2fs/inode.c
index e622ec9..13936f9 100644
--- a/fs/f2fs/inode.c
+++ b/fs/f2fs/inode.c
@@ -198,7 +198,10 @@ make_now:
	inode->i_mapping->a_ops = &f2fs_dblock_aops;
mapping_set_gfp_mask(inode->i_mapping, GFP_F2FS_HIGH_ZERO);
} else if (S_ISLNK(inode->i_mode)) {
-   inode->i_op = &f2fs_symlink_inode_operations;
+   if (f2fs_encrypted_inode(inode))
+   inode->i_op = &f2fs_encrypted_symlink_inode_operations;
+   else
+   inode->i_op = &f2fs_symlink_inode_operations;
	inode->i_mapping->a_ops = &f2fs_dblock_aops;
} else if (S_ISCHR(inode->i_mode) || S_ISBLK(inode->i_mode) ||
S_ISFIFO(inode->i_mode) || S_ISSOCK(inode->i_mode)) {
diff --git a/fs/f2fs/namei.c b/fs/f2fs/namei.c
index c857f82..5287818 100644
--- a/fs/f2fs/namei.c
+++ b/fs/f2fs/namei.c
@@ -338,16 +338,26 @@ static int f2fs_symlink(struct inode *dir, struct dentry 
*dentry,
 {
struct f2fs_sb_info *sbi = F2FS_I_SB(dir);
struct inode *inode;
-   size_t symlen = strlen(symname) + 1;
+   size_t len = strlen(symname);
+   size_t p_len;
+   char *p_str;
+   struct f2fs_str disk_link = FSTR_INIT(NULL, 0);
+   struct f2fs_encrypted_symlink_data *sd = NULL;
int err;
 
+   if (len > dir->i_sb->s_blocksize)
+   return -ENAMETOOLONG;
+
f2fs_balance_fs(sbi);
 
inode = f2fs_new_inode(dir, S_IFLNK | S_IRWXUGO);
if (IS_ERR(inode))
return PTR_ERR(inode);
 
-   inode->i_op = &f2fs_symlink_inode_operations;
+   if (f2fs_encrypted_inode(inode))
+   inode->i_op = &f2fs_encrypted_symlink_inode_operations;
+   else
+   inode->i_op = &f2fs_symlink_inode_operations;
	inode->i_mapping->a_ops = &f2fs_dblock_aops;
 
f2fs_lock_op(sbi);
@@ -355,10 +365,50 @@ static int f2fs_symlink(struct inode *dir, struct dentry 
*dentry,
if (err)
goto out;
f2fs_unlock_op(sbi);
-
-   err = page_symlink(inode, symname, symlen);
alloc_nid_done(sbi, inode->i_ino);
 
+   if (f2fs_encrypted_inode(dir)) {
+   struct qstr istr = QSTR_INIT(symname, len);
+
+   err = f2fs_inherit_context(dir, inode, NULL);
+   if (err)
+   goto err_out;
+
+   err = f2fs_setup_fname_crypto(inode);
+   if (err)
+   goto err_out;
+
+   err = f2fs_fname_crypto_alloc_buffer(inode, len, &disk_link);
+   if (err)
+   goto err_out;
+
+   err = f2fs_fname_usr_to_disk(inode, &istr, &disk_link);
+   if (err < 0)
+   goto err_out;
+
+   p_len = encrypted_symlink_data_len(disk_link.len) + 1;
+
+   if (p_len > dir->i_sb->s_blocksize) {
+   err = -ENAMETOOLONG;
+   goto err_out;
+   }
+
+   sd = kzalloc(p_len, GFP_NOFS);
+   if (!sd) {
+   err = -ENOMEM;
+   goto err_out;
+   }
+   memcpy(sd->encrypted_path, disk_link.name, disk_link.len);
+   sd->len = cpu_to_le16(disk_link.len);
+   p_str = (char *)sd;
+   } else {
+   p_len = len + 1;
+   p_str = (char *)symname;
+   }
+
+   err = page_symlink(inode, p_str, p_len);
+
+err_out:
d_instantiate(dentry, inode);
unlock_new_inode(inode);
 
@@ -371,10 +421,14 @@ static int f2fs_symlink(struct inode *dir, struct dentry 
*dentry,
 * If the symlink path is stored into inline_data, there is no
 * performance regression.
 */
-   filemap_write_and_wait_range(inode->i_mapping, 0, symlen - 1);
+   if (!err)
+   filemap_write_and_wait_range(inode->i_mapping, 0, p_len - 1);
 
if (IS_DIRSYNC(dir))

[PATCH perf/core ] perf probe: Show the error reason comes from invalid DSO

2015-05-11 Thread Masami Hiramatsu
Show the reason for the error when dso__load*() fails. This tells
the user when they have given a wrong kernel image or a wrong path.

Without this, perf probe shows an obscure message.
  
  $ perf probe -k ~/kbin/linux-3.x86_64/vmlinux -L vfs_read
  Failed to find path of kernel module.
Error: Failed to show lines.
  

With this, perf shows an appropriate error message.
  
  $ perf probe -k ~/kbin/linux-3.x86_64/vmlinux -L vfs_read
  Failed to find the path for kernel: Mismatching build id
Error: Failed to show lines.
  
And
  
  $ perf probe -k /non-exist/kernel/vmlinux -L vfs_read
  Failed to find the path for kernel: No such file or directory
Error: Failed to show lines.
  

Signed-off-by: Masami Hiramatsu 
---
 tools/perf/util/probe-event.c |   47 +
 tools/perf/util/probe-event.h |3 ---
 2 files changed, 24 insertions(+), 26 deletions(-)

diff --git a/tools/perf/util/probe-event.c b/tools/perf/util/probe-event.c
index a2d8cef..a4830a6 100644
--- a/tools/perf/util/probe-event.c
+++ b/tools/perf/util/probe-event.c
@@ -201,11 +201,12 @@ static void put_target_map(struct map *map, bool user)
 }
 
 
-static struct dso *kernel_get_module_dso(const char *module)
+static int kernel_get_module_dso(const char *module, struct dso **pdso)
 {
struct dso *dso;
struct map *map;
const char *vmlinux_name;
+   int ret = 0;
 
if (module) {
	list_for_each_entry(dso, &host_machine->kernel_dsos.head,
@@ -215,30 +216,21 @@ static struct dso *kernel_get_module_dso(const char 
*module)
goto found;
}
pr_debug("Failed to find module %s.\n", module);
-   return NULL;
+   return -ENOENT;
}
 
map = host_machine->vmlinux_maps[MAP__FUNCTION];
dso = map->dso;
 
vmlinux_name = symbol_conf.vmlinux_name;
-   if (vmlinux_name) {
-   if (dso__load_vmlinux(dso, map, vmlinux_name, false, NULL) <= 0)
-   return NULL;
-   } else {
-   if (dso__load_vmlinux_path(dso, map, NULL) <= 0) {
-   pr_debug("Failed to load kernel map.\n");
-   return NULL;
-   }
-   }
+   dso->load_errno = 0;
+   if (vmlinux_name)
+   ret = dso__load_vmlinux(dso, map, vmlinux_name, false, NULL);
+   else
+   ret = dso__load_vmlinux_path(dso, map, NULL);
 found:
-   return dso;
-}
-
-const char *kernel_get_module_path(const char *module)
-{
-   struct dso *dso = kernel_get_module_dso(module);
-   return (dso) ? dso->long_name : NULL;
+   *pdso = dso;
+   return ret;
 }
 
 static int convert_exec_to_group(const char *exec, char **result)
@@ -390,16 +382,25 @@ static int get_alternative_line_range(struct debuginfo 
*dinfo,
 static struct debuginfo *open_debuginfo(const char *module, bool silent)
 {
const char *path = module;
-   struct debuginfo *ret;
+   char reason[STRERR_BUFSIZE];
+   struct debuginfo *ret = NULL;
+   struct dso *dso = NULL;
+   int err;
 
if (!module || !strchr(module, '/')) {
-   path = kernel_get_module_path(module);
-   if (!path) {
+   err = kernel_get_module_dso(module, &dso);
+   if (err < 0) {
+   if (!dso || dso->load_errno == 0) {
+   if (!strerror_r(-err, reason, STRERR_BUFSIZE))
+   strcpy(reason, "(unknown)");
+   } else
+   dso__strerror_load(dso, reason, STRERR_BUFSIZE);
if (!silent)
-   pr_err("Failed to find path of %s module.\n",
-  module ?: "kernel");
+   pr_err("Failed to find the path for %s: %s\n",
+   module ?: "kernel", reason);
return NULL;
}
+   path = dso->long_name;
}
ret = debuginfo__new(path);
if (!ret && !silent) {
diff --git a/tools/perf/util/probe-event.h b/tools/perf/util/probe-event.h
index 537eb32..31db6ee 100644
--- a/tools/perf/util/probe-event.h
+++ b/tools/perf/util/probe-event.h
@@ -131,9 +131,6 @@ extern void line_range__clear(struct line_range *lr);
 /* Initialize line range */
 extern int line_range__init(struct line_range *lr);
 
-/* Internal use: Return kernel/module path */
-extern const char *kernel_get_module_path(const char *module);
-
 extern int add_perf_probe_events(struct perf_probe_event *pevs, int npevs);
 extern int del_perf_probe_events(struct strfilter *filter);
 extern int show_perf_probe_events(struct strfilter *filter);


[PATCH] staging: slicoss: remove slic_spinlock wrapper

2015-05-11 Thread David Matlack
As per TODO. This commit introduces no functional changes.

Signed-off-by: David Matlack 
---
 drivers/staging/slicoss/TODO  |   1 -
 drivers/staging/slicoss/slic.h|  19 +++---
 drivers/staging/slicoss/slicoss.c | 125 ++
 3 files changed, 65 insertions(+), 80 deletions(-)

diff --git a/drivers/staging/slicoss/TODO b/drivers/staging/slicoss/TODO
index 20cc9ab..9019729 100644
--- a/drivers/staging/slicoss/TODO
+++ b/drivers/staging/slicoss/TODO
@@ -25,7 +25,6 @@ TODO:
- state variables for things that are
  easily available and shouldn't be kept in card structure, cardnum, ...
  slotnumber, events, ...
-   - get rid of slic_spinlock wrapper
- volatile == bad design => bad code
- locking too fine grained, not designed just throw more locks
  at problem
diff --git a/drivers/staging/slicoss/slic.h b/drivers/staging/slicoss/slic.h
index 3a5aa88..5b23254 100644
--- a/drivers/staging/slicoss/slic.h
+++ b/drivers/staging/slicoss/slic.h
@@ -56,11 +56,6 @@ static u32 OasisRcvUCodeLen = 512;
 static u32 GBRcvUCodeLen = 512;
 #define SECTION_SIZE 65536
 
-struct slic_spinlock {
-   spinlock_t  lock;
-   unsigned long   flags;
-};
-
 #define SLIC_RSPQ_PAGES_GB10
 #define SLIC_RSPQ_BUFSINPAGE  (PAGE_SIZE / SLIC_RSPBUF_SIZE)
 
@@ -165,7 +160,7 @@ struct slic_cmdqueue {
struct slic_hostcmd *head;
struct slic_hostcmd *tail;
int count;
-   struct slic_spinlock lock;
+   spinlock_t lock;
 };
 
 #define SLIC_MAX_CARDS  32
@@ -346,7 +341,7 @@ struct physcard {
 };
 
 struct base_driver {
-   struct slic_spinlock driver_lock;
+   spinlock_t   driver_lock;
u32  num_slic_cards;
u32  num_slic_ports;
u32  num_slic_ports_active;
@@ -401,8 +396,8 @@ struct adapter {
uintcard_size;
uintchipid;
struct net_device  *netdev;
-   struct slic_spinlock adapter_lock;
-   struct slic_spinlock reset_lock;
+   spinlock_t  adapter_lock;
+   spinlock_t  reset_lock;
struct pci_dev *pcidev;
uintbusnumber;
uintslotnumber;
@@ -441,8 +436,8 @@ struct adapter {
u32 pingtimerset;
struct timer_list   loadtimer;
u32 loadtimerset;
-   struct slic_spinlock upr_lock;
-   struct slic_spinlock bit64reglock;
+   spinlock_t   upr_lock;
+   spinlock_t   bit64reglock;
struct slic_rspqueue rspqueue;
struct slic_rcvqueue rcvqueue;
struct slic_cmdqueue cmdq_free;
@@ -457,7 +452,7 @@ struct adapter {
/* Free object handles*/
struct slic_handle *pfree_slic_handles;
/* Object handle list lock*/
-   struct slic_spinlock handle_lock;
+   spinlock_t  handle_lock;
ushort  slic_handle_ix;
 
u32 xmitq_full;
diff --git a/drivers/staging/slicoss/slicoss.c 
b/drivers/staging/slicoss/slicoss.c
index c2bda1d..39c140c 100644
--- a/drivers/staging/slicoss/slicoss.c
+++ b/drivers/staging/slicoss/slicoss.c
@@ -144,8 +144,9 @@ static inline void slic_reg64_write(struct adapter 
*adapter, void __iomem *reg,
u32 value, void __iomem *regh, u32 paddrh,
bool flush)
 {
-   spin_lock_irqsave(&adapter->bit64reglock.lock,
-   adapter->bit64reglock.flags);
+   unsigned long flags;
+
+   spin_lock_irqsave(&adapter->bit64reglock, flags);
if (paddrh != adapter->curaddrupper) {
adapter->curaddrupper = paddrh;
writel(paddrh, regh);
@@ -153,8 +154,7 @@ static inline void slic_reg64_write(struct adapter 
*adapter, void __iomem *reg,
writel(value, reg);
if (flush)
mb();
-   spin_unlock_irqrestore(&adapter->bit64reglock.lock,
-   adapter->bit64reglock.flags);
+   spin_unlock_irqrestore(&adapter->bit64reglock, flags);
 }
 
 static void slic_mcast_set_bit(struct adapter *adapter, char *address)
@@ -936,9 +936,10 @@ static int slic_upr_request(struct adapter *adapter,
 u32 upr_data_h,
 u32 upr_buffer, u32 upr_buffer_h)
 {
+   unsigned long flags;
int rc;
 
-   spin_lock_irqsave(&adapter->upr_lock.lock, adapter->upr_lock.flags);
+   spin_lock_irqsave(&adapter->upr_lock, flags);
rc = slic_upr_queue_request(adapter,
upr_request,
upr_data,
@@ -948,8 +949,7 @@ static int slic_upr_request(struct adapter *adapter,
 
slic_upr_start(adapter);
 err_unlock_irq:
-   spin_unlock_irqrestore(&adapter->upr_lock.lock,
-   adapter->upr_lock.flags);
+   spin_unlock_irqrestore(&adapter->upr_lock, 

[PATCH] staging: slicoss: fix occasionally writing out only half of a dma address

2015-05-11 Thread David Matlack
curaddrupper caches the last written upper 32-bits of a dma address
(the device has one register for the upper 32-bits of all dma
address registers). The problem is, not every dma address write
checks and sets curaddrupper. This causes the driver to occasionally
not write the upper 32-bits of a dma address to the device when it
really should.

I've seen this manifest particularly when the driver is trying to
read config data from the device (RCONFIG) in order to checksum the
device's eeprom. Since the device writes its config data to the
wrong DMA address the driver reads 0 as the eeprom size and the
eeprom checksum fails.

This patch fixes the issue by removing curaddrupper and always
writing the upper 32-bits of dma addresses.

Signed-off-by: David Matlack 
---
 drivers/staging/slicoss/slic.h| 1 -
 drivers/staging/slicoss/slicoss.c | 5 +
 2 files changed, 1 insertion(+), 5 deletions(-)

diff --git a/drivers/staging/slicoss/slic.h b/drivers/staging/slicoss/slic.h
index 5b23254..67a8c9e 100644
--- a/drivers/staging/slicoss/slic.h
+++ b/drivers/staging/slicoss/slic.h
@@ -414,7 +414,6 @@ struct adapter {
u32 intrregistered;
uintisp_initialized;
uintgennumber;
-   u32 curaddrupper;
struct slic_shmem  *pshmem;
dma_addr_t  phys_shmem;
u32 isrcopy;
diff --git a/drivers/staging/slicoss/slicoss.c 
b/drivers/staging/slicoss/slicoss.c
index 39c140c..5f34ebbf 100644
--- a/drivers/staging/slicoss/slicoss.c
+++ b/drivers/staging/slicoss/slicoss.c
@@ -147,10 +147,7 @@ static inline void slic_reg64_write(struct adapter 
*adapter, void __iomem *reg,
unsigned long flags;
 
	spin_lock_irqsave(&adapter->bit64reglock, flags);
-   if (paddrh != adapter->curaddrupper) {
-   adapter->curaddrupper = paddrh;
-   writel(paddrh, regh);
-   }
+   writel(paddrh, regh);
writel(value, reg);
if (flush)
mb();
-- 
2.4.0



linux-next: build warning after merge of the tip tree

2015-05-11 Thread Stephen Rothwell
Hi all,

After merging the tip tree, today's linux-next build (x86_64
allmodconfig) produced this warning:

sound/drivers/pcsp/pcsp.c: In function 'snd_pcsp_create':
sound/drivers/pcsp/pcsp.c:51:5: warning: format '%li' expects argument of type 
'long int', but argument 2 has type 'unsigned int' [-Wformat=]
 "(%linS)\n", resolution);
 ^

Introduced by commit 447fbbdc2cd5 ("sound: Use hrtimer_resolution
instead of hrtimer_get_res()").
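
One plausible one-line fix (an assumption on my part, not a posted
patch) is to match the format specifier to the new unsigned int type:

	-	 "(%linS)\n", resolution);
	+	 "(%unS)\n", resolution);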
-- 
Cheers,
Stephen Rothwell  s...@canb.auug.org.au




Re: [PATCH] check smap and !cr0.wp

2015-05-11 Thread Xiao Guangrong


Hi Paolo,

Could you please apply this patch to kvm-unit-tests if it looks good to you?

Thanks!

On 05/07/2015 04:44 PM, Xiao Guangrong wrote:

This test case is used to produce the bug that:

KVM may turn a user page to a kernel page when kernel writes a readonly
user page if CR0.WP = 1. This shadow page entry will be reused after
SMAP is enabled so that kernel is allowed to access this user page

Signed-off-by: Xiao Guangrong 
---
  x86/smap.c | 26 ++
  1 file changed, 26 insertions(+)

diff --git a/x86/smap.c b/x86/smap.c
index 042c5aa..66f97b8 100644
--- a/x86/smap.c
+++ b/x86/smap.c
@@ -48,6 +48,7 @@ asm ("pf_tss:\n"

  #define USER_BASE (1 << 24)
 #define USER_VAR(v)   (*((__typeof__(&(v))) (((unsigned long)&(v)) + 
USER_BASE)))
+#define USER_ADDR(v)   ((void *)((unsigned long)&(v) + USER_BASE))

  static void init_test(int i)
  {
@@ -58,6 +59,29 @@ static void init_test(int i)
}
  }

+static void check_smap_nowp(void)
+{
+   test = 0x99;
+
+   *get_pte(phys_to_virt(read_cr3()), USER_ADDR(test)) &= ~PTE_WRITE;
+
+   write_cr4(read_cr4() & ~X86_CR4_SMAP);
+   write_cr0(read_cr0() & ~X86_CR0_WP);
+   clac();
+   write_cr3(read_cr3());
+
+   init_test(0);
+   USER_VAR(test) = 0x99;
+   report("write from user page with SMAP=0, AC=0, WP=0, PTE.U=1 && 
PTE.W=0", pf_count == 0);
+
+   write_cr4(read_cr4() | X86_CR4_SMAP);
+   write_cr3(read_cr3());
+
+   init_test(0);
+   (void)USER_VAR(test);
+   report("read from user page with SMAP=1, AC=0, WP=0, PTE.U=1 && PTE.W=0", 
pf_count == 1 && save == 0x99);
+}
+
  int main(int ac, char **av)
  {
unsigned long i;
@@ -150,6 +174,8 @@ int main(int ac, char **av)
report("executing on user page with AC=0", pf_count == 0);
}

+   check_smap_nowp();
+
// TODO: implicit kernel access from ring 3 (e.g. int)

return report_summary();




linux-next: build failure after merge of the spi tree

2015-05-11 Thread Stephen Rothwell
Hi Mark,

After merging the spi tree, today's linux-next build (x86_64
allmodconfig) failed like this:

drivers/spi/spi-bcm2835.c: In function 'bcm2835_spi_can_dma':
drivers/spi/spi-bcm2835.c:381:7: warning: cast from pointer to integer of 
different size [-Wpointer-to-int-cast]
  if (((u32)tfr->tx_buf % 4 == 0) && ((u32)tfr->tx_buf % 4 == 0))
   ^
drivers/spi/spi-bcm2835.c:381:38: warning: cast from pointer to integer of 
different size [-Wpointer-to-int-cast]
  if (((u32)tfr->tx_buf % 4 == 0) && ((u32)tfr->tx_buf % 4 == 0))
  ^
drivers/spi/spi-bcm2835.c:387:7: warning: cast from pointer to integer of 
different size [-Wpointer-to-int-cast]
  if (((u32)tfr->tx_buf % SZ_4K) + tfr->len > SZ_4K) {
   ^
drivers/spi/spi-bcm2835.c:387:26: error: 'SZ_4K' undeclared (first use in this 
function)
  if (((u32)tfr->tx_buf % SZ_4K) + tfr->len > SZ_4K) {
  ^
drivers/spi/spi-bcm2835.c:387:26: note: each undeclared identifier is reported 
only once for each function it appears in
drivers/spi/spi-bcm2835.c:392:7: warning: cast from pointer to integer of 
different size [-Wpointer-to-int-cast]
  if (((u32)tfr->rx_buf % SZ_4K) + tfr->len > SZ_4K) {
   ^

Caused by commit 3ecd37edaa2a ("spi: bcm2835: enable dma modes for
transfers meeting certain conditions").
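
A likely fix (my assumption, not the applied patch) is to pull in the
header that defines SZ_4K and to cast pointers through unsigned long
so the file also compiles on 64-bit:

	#include <linux/sizes.h>	/* defines SZ_4K */

	/* and cast through unsigned long rather than u32: */
	if (((unsigned long)tfr->tx_buf % SZ_4K) + tfr->len > SZ_4K) {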

I have used the spi tree from next-20150511 for today.
-- 
Cheers,
Stephen Rothwell  s...@canb.auug.org.au




Re: [PATCH v5 0/6] arm64,hi6220: Enable Hisilicon Hi6220 SoC

2015-05-11 Thread Leo Yan
hi Kevin,

On Mon, May 11, 2015 at 05:20:54PM -0700, Kevin Hilman wrote:
> On Thu, May 7, 2015 at 4:11 PM, Brent Wang  wrote:
> > Hello Kevin,
> >
> > 2015-05-08 4:30 GMT+08:00 Kevin Hilman :
> >> Bintian Wang  writes:
> >>
> >>> Hi6220 is one mobile solution of Hisilicon, this patchset contains
> >>> initial support for Hi6220 SoC and HiKey development board, which
> >>> supports octal ARM Cortex A53 cores. Initial support is minimal and
> >>> includes just the arch configuration, clock driver, device tree
> >>> configuration.
> >>>
> >>> PSCI is enabled in device tree and there is no problem to boot all the
> >>> octal cores, and the CPU hotplug is also working now, you can download
> >>> and compile the latest firmware based on the following link to run this
> >>> patch set:
> >>> https://github.com/96boards/documentation/wiki/UEFI
> >>
> >> Do you have any tips for booting this using the HiSi bootloader?  It
> >> seems that I need to add the magic hisi,boardid property for dtbTool to
> >> work.  Could you share what that magic value is?
> > Yes, you need it.
> > Hisilicon has many different development boards and those boards have some
> > different hardware configuration, so we need different device tree
> > files for them.
> > the original hisi,boardid is used to distinguish different boards and
> > used by the
> > bootloader to judge which device tree to use at boot-up.
> >
> >> and maybe add it to the wiki someplace?
> > Maybe add to section "Known Issues" in
> > "https://github.com/96boards/documentation/wiki/UEFI;
> > is a good choice, I will update this section later.
> 
> You updated the wiki, but you didn't specify what the value should be
> for this to work with the old bootloader.
> 
> Can you please give the value of that property?
> 
> Also, have you tested this series with the old bootloader as well?

Below are my testing results with Bintian's patches and Hisilicon's old
bootloader:
- Need to add the property "hisi,boardid" to the dts;
- Need to change the cpu enable-method from "psci" to "spin-table";
- The bootloader has not initialized the register *cntfrq_el0*, which
  causes a failure during arch timer init.

For cntfrq_el0, we need to fix this issue in Hisilicon's old
bootloader rather than directly adding a "clock-frequency" property to
the arch timer's node in the DTS. I will try to submit a patch fixing
this in Hisilicon's old bootloader (a sketch of such a fix is shown
below).

So I think the above issues are mainly introduced by Hisilicon's old
bootloader and do not come from Bintian's patches. What do you think?
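
A sketch of such a firmware-side fix (my illustration, not the actual
patch; CNTFRQ_EL0 is only writable from the highest implemented
exception level, and the 19.2 MHz value is just an example):

	static inline void init_cntfrq(void)
	{
		unsigned long freq = 19200000;	/* example: 19.2 MHz */

		asm volatile("msr cntfrq_el0, %0" :: "r" (freq));
	}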

Below is my local diff which is used to compatible w/t Hisilicon's
old bootloader; Just for your reference.

Thanks,
Leo Yan

---8<---

diff --git a/arch/arm64/boot/dts/hisilicon/hi6220-hikey.dts 
b/arch/arm64/boot/dts/hisilicon/hi6220-hikey.dts
index e36a539..fd1f89e 100644
--- a/arch/arm64/boot/dts/hisilicon/hi6220-hikey.dts
+++ b/arch/arm64/boot/dts/hisilicon/hi6220-hikey.dts
@@ -14,6 +14,7 @@
 
 / {
model = "HiKey Development Board";
+   hisi,boardid = <0 0 4 3>;
compatible = "hisilicon,hi6220-hikey", "hisilicon,hi6220";
 
aliases {
diff --git a/arch/arm64/boot/dts/hisilicon/hi6220.dtsi 
b/arch/arm64/boot/dts/hisilicon/hi6220.dtsi
index 229937f..8ade3d9 100644
--- a/arch/arm64/boot/dts/hisilicon/hi6220.dtsi
+++ b/arch/arm64/boot/dts/hisilicon/hi6220.dtsi
@@ -13,11 +13,6 @@
#address-cells = <2>;
#size-cells = <2>;
 
-   psci {
-   compatible = "arm,psci-0.2";
-   method = "smc";
-   };
-
cpus {
#address-cells = <2>;
#size-cells = <0>;
@@ -57,56 +52,64 @@
compatible = "arm,cortex-a53", "arm,armv8";
device_type = "cpu";
reg = <0x0 0x0>;
-   enable-method = "psci";
+   enable-method = "spin-table";
+   cpu-release-addr = <0x0 0x740fff8>;
};
 
cpu1: cpu@1 {
compatible = "arm,cortex-a53", "arm,armv8";
device_type = "cpu";
reg = <0x0 0x1>;
-   enable-method = "psci";
+   enable-method = "spin-table";
+   cpu-release-addr = <0x0 0x740fff8>;
};
 
cpu2: cpu@2 {
compatible = "arm,cortex-a53", "arm,armv8";
device_type = "cpu";
reg = <0x0 0x2>;
-   enable-method = "psci";
+   enable-method = "spin-table";
+   cpu-release-addr = <0x0 0x740fff8>;
};
 
cpu3: cpu@3 {
compatible = "arm,cortex-a53", "arm,armv8";
device_type = "cpu";
reg = <0x0 0x3>;
-   enable-method = "psci";
+   enable-method = "spin-table";
+   

[PATCH] arm: Don't use memblock limit for the lowmem bound

2015-05-11 Thread Laura Abbott
From: Laura Abbott 

The memblock limit is currently used in find_limits
to find the bounds for ZONE_NORMAL. However, the memblock
limit may need to be rounded down a PMD size to ensure
allocations are fully mapped. This has the side
effect of reducing the amount of memory in ZONE_NORMAL.
Since we generally want to optimize for more lowmem, fix
this by using arm_lowmem_limit to calculate the bounds;
that is what is used for actually mapping lowmem anyway.

Before:
# cat /proc/zoneinfo | grep managed
managed  62920

After:
# cat /proc/zoneinfo | grep managed
managed  63336
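
For scale, the gain shown is 63336 - 62920 = 416 pages, i.e. about
1.6 MiB with 4 KiB pages, which is consistent with recovering most of
one 2 MiB PMD's worth of round-down.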



Signed-off-by: Laura Abbott 
---
 arch/arm/mm/init.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm/mm/init.c b/arch/arm/mm/init.c
index be92fa0..b4f9513 100644
--- a/arch/arm/mm/init.c
+++ b/arch/arm/mm/init.c
@@ -89,7 +89,7 @@ __tagtable(ATAG_INITRD2, parse_tag_initrd2);
 static void __init find_limits(unsigned long *min, unsigned long *max_low,
   unsigned long *max_high)
 {
-   *max_low = PFN_DOWN(memblock_get_current_limit());
+   *max_low = PFN_DOWN(arm_lowmem_limit);
*min = PFN_UP(memblock_start_of_DRAM());
*max_high = PFN_DOWN(memblock_end_of_DRAM());
 }
-- 
2.1.0



Re: [PATCH v2] Support for write stream IDs

2015-05-11 Thread Martin K. Petersen
> "Jens" == Jens Axboe  writes:

Jens,

Jens> There are actual technical challenges on the device side that
Jens> sometimes interferes. [...]

Right now we use the same protocol to speak to USB keys and million
dollar storage arrays. That's because the protocol was designed to be
abstract and completely device agnostic.

What's happening with flash devices and SMR is that all of a sudden
device implementation challenges are being addressed by putting them in
the protocol and punting them to the OS.

That's an obvious and cost-saving approach for a device vendor to take.
But the world would be a different place if we were still dealing with
MFM, RLL, C/H/S addressing and other implementation-specific horrors of
the past. And if that approach had continued we would explicitly have to
deal with erase blocks on USB sticks and manually drive RAID logic
inside disk arrays. But thankfully, with a few exceptions, we didn't go
there.

My beef with the current stream ID stuff and ZAC/ZBC is that those are
steps in the wrong direction in that they are both exclusively focused
on addressing implementation challenges specific to certain kinds of
devices.

The notion of letting the OS tag things as belonging together or being
independent is a useful concept that benefits *any* kind of device.
Those tags can easily be mapped to resource streams in a flash device or
a particular zone cache segment on an SMR drive or in an array.

I would just like the tag to be completely arbitrary so we can manage it
on behalf of all applications and devices. That puts the burden on the
device to manage the OS tag to internal resource mapping but I think
that's a small price to pay to have a concept that works for all classes
of devices, software RAID, etc.
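
To make the "arbitrary tag" point concrete, here is a toy sketch
(entirely hypothetical, not from any proposal): a device with a small
number of internal streams can map any 32-bit OS tag onto them and
recycle on collision.

	#include <stdint.h>

	#define NR_HW_STREAMS 8

	static uint32_t tag_of[NR_HW_STREAMS];

	/* Map an arbitrary 32-bit OS tag onto one of a few internal
	 * streams; on collision the old stream is simply recycled. */
	static unsigned int hw_stream_for_tag(uint32_t tag)
	{
		unsigned int slot = tag % NR_HW_STREAMS;

		if (tag_of[slot] != tag)
			tag_of[slot] = tag;	/* recycle the old stream */
		return slot;
	}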

This does not in any way preclude the device communicating "I'd prefer
if you only kept 8 streams going/8 zones open" like we do with all the
other device characteristics. My gripe is that the programming model is
being forcefully changed so we now have to get a permit before
submitting an I/O. And very aggressively clean up since the permits are
being envisioned as super scarce.

Jens> The reality is that we can't demand that devices support thousands
Jens> of streams.

Why not? It's just a number. Tracking a large number of independent
streams hasn't been a problem for storage arrays. Nobody says a 32-bit
ID requires you to concurrently track bazillions of streams. Pick a
reasonable number of live contexts given your device's actual resources.

Jens> The write streams proposal was already approved by t10...

Nope. It's still being discussed.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH 2/3 v8] mailbox: Enable BCM2835 mailbox support

2015-05-11 Thread Stephen Warren
On 05/08/2015 01:19 PM, Eric Anholt wrote:
> Alexander Stein  writes:
> 
>> On Thursday 07 May 2015 12:54:20, Eric Anholt wrote:
>>> Noralf Trønnes  writes:
>>> 
 Den 05.05.2015 22:27, skrev Eric Anholt:
> From: Lubomir Rintel 
> 
> This mailbox driver provides a single mailbox channel to
> write 32-bit values to the VPU and get a 32-bit response.
> The Raspberry Pi firmware uses this mailbox channel to
> implement firmware calls, while Roku 2 (despite being
> derived from the same firmware tree) doesn't.
> 
> The driver was originally submitted by Lubomir, based on
> the out-of-tree 2708 mailbox driver.  Eric Anholt fixed it
> up for upstreaming, with the major functional change being
> that it now has no notion of multiple channels (since that
> is a firmware-dependent concept) and instead the
> raspberrypi-firmware driver will do that bit-twiddling in
> its own messages.
 ...
> +static struct platform_driver bcm2835_mbox_driver = {
> +	.driver = {
> +		.name = "bcm2835-mbox",
> +		.owner = THIS_MODULE,
> +		.of_match_table = bcm2835_mbox_of_match,
> +	},
> +	.probe	= bcm2835_mbox_probe,
> +	.remove	= bcm2835_mbox_remove,
> +};
> +module_platform_driver(bcm2835_mbox_driver);
 
 I have tested this driver and the firmware driver booting
 directly from the VideoCore bootloader (no uboot). The
 mailbox driver loads too late to turn on USB power:
>>> 
>>> Yeah, I have a patch on my branches that returns -EPROBE_DEFER
>>> when trying to get a power domain and not finding the provider.
>>> It was rejected by the maintainers in favor of a proposed
>>> solution whose description I didn't quite follow.
>> 
>> Do you have a link for this thread?
> 
> https://lkml.org/lkml/2015/3/11/483

That's really odd; -EPROBE_DEFER was clearly invented exactly to
handle dependencies just like this. Playing with initcall levels
simply isn't scalable, and in the main people are actively working not
to use them for dependencies like this; they're far too implicit.
While the timeout mentioned earlier in the thread might work (I didn't
really look at the details), again it's far too indirect/accidental to
be a good solution.
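
For reference, the deferral idiom being argued for looks roughly like
this (a sketch of the pattern, not the rejected patch):

	#include <linux/err.h>
	#include <linux/mailbox_client.h>
	#include <linux/platform_device.h>

	static int foo_probe(struct platform_device *pdev)
	{
		struct mbox_client *cl;
		struct mbox_chan *chan;

		cl = devm_kzalloc(&pdev->dev, sizeof(*cl), GFP_KERNEL);
		if (!cl)
			return -ENOMEM;
		cl->dev = &pdev->dev;

		/* If the mailbox controller has not probed yet, this can
		 * return ERR_PTR(-EPROBE_DEFER); passing that back makes
		 * the driver core retry the probe later. */
		chan = mbox_request_channel(cl, 0);
		if (IS_ERR(chan))
			return PTR_ERR(chan);

		return 0;
	}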


[PATCH v2 4/9] KVM: MMU: introduce slot_handle_level_range() and its helpers

2015-05-11 Thread Xiao Guangrong
There are several places that walk all rmaps for a memslot, so
introduce common functions to clean up the code

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/mmu.c | 69 ++
 1 file changed, 69 insertions(+)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 98b7a6a..55ed4f6 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -4453,6 +4453,75 @@ void kvm_mmu_setup(struct kvm_vcpu *vcpu)
init_kvm_mmu(vcpu);
 }
 
+/* The return value indicates if tlb flush on all vcpus is needed. */
+typedef bool (*slot_level_handler) (struct kvm *kvm, unsigned long *rmap);
+
+/* The caller should hold mmu-lock before calling this function. */
+static bool
+slot_handle_level_range(struct kvm *kvm, struct kvm_memory_slot *memslot,
+   slot_level_handler fn, int start_level, int end_level,
+   gfn_t start_gfn, gfn_t end_gfn, bool lock_flush_tlb)
+{
+   struct slot_rmap_walk_iterator iterator;
+   bool flush = false;
+
+   for_each_slot_rmap_range(memslot, start_level, end_level, start_gfn,
+  end_gfn, &iterator) {
+   if (iterator.rmap)
+   flush |= fn(kvm, iterator.rmap);
+
+   if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
+   if (flush && lock_flush_tlb) {
+   kvm_flush_remote_tlbs(kvm);
+   flush = false;
+   }
+   cond_resched_lock(&kvm->mmu_lock);
+   }
+   }
+
+   if (flush && lock_flush_tlb) {
+   kvm_flush_remote_tlbs(kvm);
+   flush = false;
+   }
+
+   return flush;
+}
+
+static bool
+slot_handle_level(struct kvm *kvm, struct kvm_memory_slot *memslot,
+ slot_level_handler fn, int start_level, int end_level,
+ bool lock_flush_tlb)
+{
+   return slot_handle_level_range(kvm, memslot, fn, start_level,
+   end_level, memslot->base_gfn,
+   memslot->base_gfn + memslot->npages - 1,
+   lock_flush_tlb);
+}
+
+static bool
+slot_handle_all_level(struct kvm *kvm, struct kvm_memory_slot *memslot,
+ slot_level_handler fn, bool lock_flush_tlb)
+{
+   return slot_handle_level(kvm, memslot, fn, PT_PAGE_TABLE_LEVEL,
+   PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES - 1, lock_flush_tlb);
+}
+
+static bool
+slot_handle_large_level(struct kvm *kvm, struct kvm_memory_slot *memslot,
+   slot_level_handler fn, bool lock_flush_tlb)
+{
+   return slot_handle_level(kvm, memslot, fn, PT_PAGE_TABLE_LEVEL + 1,
+   PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES - 1, lock_flush_tlb);
+}
+
+static bool
+slot_handle_leaf(struct kvm *kvm, struct kvm_memory_slot *memslot,
+slot_level_handler fn, bool lock_flush_tlb)
+{
+   return slot_handle_level(kvm, memslot, fn, PT_PAGE_TABLE_LEVEL,
+PT_PAGE_TABLE_LEVEL, lock_flush_tlb);
+}
+
 void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
  struct kvm_memory_slot *memslot)
 {
-- 
2.1.0



[PATCH v2 1/9] KVM: MMU: fix decoding cache type from MTRR

2015-05-11 Thread Xiao Guangrong
There are some bugs in the current get_mtrr_type():
1: bit 1 of mtrr_state->enabled corresponds to bit 11 of the
   IA32_MTRR_DEF_TYPE MSR, which completely controls MTRR enablement;
   the other bits are ignored if it is cleared

2: the fixed MTRR ranges are controlled by bit 0 of
   mtrr_state->enabled (bit 10 of IA32_MTRR_DEF_TYPE)

3: if MTRRs are disabled, UC is applied to all of physical memory
   rather than mtrr_state->def_type

(A small decoder for these MSR fields is sketched below.)
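
Standalone decoder for the IA32_MTRR_DEF_TYPE fields described above
(a sketch; bit positions per the Intel SDM, not KVM code):

	#include <stdbool.h>
	#include <stdint.h>

	/* Bit 11: MTRR enable (E); nothing else matters when clear. */
	static bool mtrr_is_enabled(uint64_t def_type)
	{
		return def_type & (1ULL << 11);
	}

	/* Bit 10: fixed-range enable (FE); only meaningful with E set. */
	static bool fixed_mtrrs_enabled(uint64_t def_type)
	{
		return mtrr_is_enabled(def_type) &&
		       (def_type & (1ULL << 10));
	}

	/* Bits 7:0: the default memory type. */
	static uint8_t mtrr_default_type(uint64_t def_type)
	{
		return def_type & 0xff;
	}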

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/mmu.c | 14 ++
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 3711095..f5fcfc1 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2393,19 +2393,20 @@ EXPORT_SYMBOL_GPL(kvm_mmu_unprotect_page);
 static int get_mtrr_type(struct mtrr_state_type *mtrr_state,
 u64 start, u64 end)
 {
-   int i;
u64 base, mask;
u8 prev_match, curr_match;
-   int num_var_ranges = KVM_NR_VAR_MTRR;
+   int i, num_var_ranges = KVM_NR_VAR_MTRR;
 
-   if (!mtrr_state->enabled)
-   return 0xFF;
+   /* MTRR is completely disabled, use UC for all of physical memory. */
+   if (!(mtrr_state->enabled & 0x2))
+   return MTRR_TYPE_UNCACHABLE;
 
/* Make end inclusive end, instead of exclusive */
end--;
 
/* Look in fixed ranges. Just return the type as per start */
-   if (mtrr_state->have_fixed && (start < 0x100000)) {
+   if (mtrr_state->have_fixed && (mtrr_state->enabled & 0x1) &&
+ (start < 0x100000)) {
	int idx;
 
	if (start < 0x80000) {
@@ -2428,9 +2429,6 @@ static int get_mtrr_type(struct mtrr_state_type 
*mtrr_state,
 * Look of multiple ranges matching this address and pick type
 * as per MTRR precedence
 */
-   if (!(mtrr_state->enabled & 2))
-   return mtrr_state->def_type;
-
prev_match = 0xFF;
for (i = 0; i < num_var_ranges; ++i) {
unsigned short start_state, end_state;
-- 
2.1.0



[PATCH v2 5/9] KVM: MMU: use slot_handle_level and its helper to clean up the code

2015-05-11 Thread Xiao Guangrong
slot_handle_level and its helper functions are ready now, use them to
clean up the code

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/mmu.c | 130 +++--
 1 file changed, 16 insertions(+), 114 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 55ed4f6..fae349a 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -4522,35 +4522,19 @@ slot_handle_leaf(struct kvm *kvm, struct 
kvm_memory_slot *memslot,
 PT_PAGE_TABLE_LEVEL, lock_flush_tlb);
 }
 
+static bool slot_rmap_write_protect(struct kvm *kvm, unsigned long *rmapp)
+{
+   return __rmap_write_protect(kvm, rmapp, false);
+}
+
 void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
  struct kvm_memory_slot *memslot)
 {
-   gfn_t last_gfn;
-   int i;
-   bool flush = false;
-
-   last_gfn = memslot->base_gfn + memslot->npages - 1;
+   bool flush;
 
	spin_lock(&kvm->mmu_lock);
-
-   for (i = PT_PAGE_TABLE_LEVEL;
-i < PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES; ++i) {
-   unsigned long *rmapp;
-   unsigned long last_index, index;
-
-   rmapp = memslot->arch.rmap[i - PT_PAGE_TABLE_LEVEL];
-   last_index = gfn_to_index(last_gfn, memslot->base_gfn, i);
-
-   for (index = 0; index <= last_index; ++index, ++rmapp) {
-   if (*rmapp)
-   flush |= __rmap_write_protect(kvm, rmapp,
-   false);
-
-   if (need_resched() || spin_needbreak(&kvm->mmu_lock))
-   cond_resched_lock(&kvm->mmu_lock);
-   }
-   }
-
+   flush = slot_handle_all_level(kvm, memslot, slot_rmap_write_protect,
+ false);
	spin_unlock(&kvm->mmu_lock);
 
/*
@@ -4611,59 +4595,18 @@ restart:
 void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
struct kvm_memory_slot *memslot)
 {
-   bool flush = false;
-   unsigned long *rmapp;
-   unsigned long last_index, index;
-
	spin_lock(&kvm->mmu_lock);
-
-   rmapp = memslot->arch.rmap[0];
-   last_index = gfn_to_index(memslot->base_gfn + memslot->npages - 1,
-   memslot->base_gfn, PT_PAGE_TABLE_LEVEL);
-
-   for (index = 0; index <= last_index; ++index, ++rmapp) {
-   if (*rmapp)
-   flush |= kvm_mmu_zap_collapsible_spte(kvm, rmapp);
-
-   if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
-   if (flush) {
-   kvm_flush_remote_tlbs(kvm);
-   flush = false;
-   }
-   cond_resched_lock(&kvm->mmu_lock);
-   }
-   }
-
-   if (flush)
-   kvm_flush_remote_tlbs(kvm);
-
+   slot_handle_leaf(kvm, memslot, kvm_mmu_zap_collapsible_spte, true);
	spin_unlock(&kvm->mmu_lock);
 }
 
 void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
   struct kvm_memory_slot *memslot)
 {
-   gfn_t last_gfn;
-   unsigned long *rmapp;
-   unsigned long last_index, index;
-   bool flush = false;
-
-   last_gfn = memslot->base_gfn + memslot->npages - 1;
+   bool flush;
 
	spin_lock(&kvm->mmu_lock);
-
-   rmapp = memslot->arch.rmap[PT_PAGE_TABLE_LEVEL - 1];
-   last_index = gfn_to_index(last_gfn, memslot->base_gfn,
-   PT_PAGE_TABLE_LEVEL);
-
-   for (index = 0; index <= last_index; ++index, ++rmapp) {
-   if (*rmapp)
-   flush |= __rmap_clear_dirty(kvm, rmapp);
-
-   if (need_resched() || spin_needbreak(&kvm->mmu_lock))
-   cond_resched_lock(&kvm->mmu_lock);
-   }
-
+   flush = slot_handle_leaf(kvm, memslot, __rmap_clear_dirty, false);
	spin_unlock(&kvm->mmu_lock);
 
	lockdep_assert_held(&kvm->slots_lock);
@@ -4682,31 +4625,11 @@ EXPORT_SYMBOL_GPL(kvm_mmu_slot_leaf_clear_dirty);
 void kvm_mmu_slot_largepage_remove_write_access(struct kvm *kvm,
struct kvm_memory_slot *memslot)
 {
-   gfn_t last_gfn;
-   int i;
-   bool flush = false;
-
-   last_gfn = memslot->base_gfn + memslot->npages - 1;
+   bool flush;
 
	spin_lock(&kvm->mmu_lock);
-
-   for (i = PT_PAGE_TABLE_LEVEL + 1; /* skip rmap for 4K page */
-i < PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES; ++i) {
-   unsigned long *rmapp;
-   unsigned long last_index, index;
-
-   rmapp = memslot->arch.rmap[i - PT_PAGE_TABLE_LEVEL];
-   last_index = gfn_to_index(last_gfn, memslot->base_gfn, i);
-
-   for (index = 0; index <= last_index; ++index, ++rmapp) {
-   if (*rmapp)
-   flush |= __rmap_write_protect(kvm, rmapp,
- 

[PATCH v2 8/9] KVM: MMU: fix MTRR update

2015-05-11 Thread Xiao Guangrong
Currently, whenever guest MTRR registers are changed,
kvm_mmu_reset_context is called to switch to a new root shadow page
table. However, this is useless since:
1) the cache type is not cached in the shadow page's attributes, so
   the original root shadow page will be reused

2) the cache type is set on the last-level spte, which means we should
   sync the last-level sptes when the MTRR is changed

This patch fixes the issue by dropping all the sptes in the gfn range
being updated by the MTRR. (A worked example of the variable-range
math follows.)
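
As a worked example of the variable-range math used below: with
base = 0x80000000 and a mask selecting a 1 GiB region (so
~mask = 0x3fffffff within the physical address width),
end = ((start & mask) | ~mask) + 1
    = (0x80000000 | 0x3fffffff) + 1 = 0xc0000000.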

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/x86.c | 59 +-
 1 file changed, 58 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index cdccbe1..a527dd0 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1854,6 +1854,63 @@ bool kvm_mtrr_valid(struct kvm_vcpu *vcpu, u32 msr, u64 
data)
 }
 EXPORT_SYMBOL_GPL(kvm_mtrr_valid);
 
+static void update_mtrr(struct kvm_vcpu *vcpu, u32 msr)
+{
+   struct mtrr_state_type *mtrr_state = &vcpu->arch.mtrr_state;
+   unsigned char mtrr_enabled = mtrr_state->enabled;
+   gfn_t start, end, mask;
+   int index;
+   bool is_fixed = true;
+
+   if (msr == MSR_IA32_CR_PAT || !tdp_enabled ||
+ !kvm_arch_has_noncoherent_dma(vcpu->kvm))
+   return;
+
+   if (!(mtrr_enabled & 0x2) && msr != MSR_MTRRdefType)
+   return;
+
+   switch (msr) {
+   case MSR_MTRRfix64K_00000:
+   start = 0x0;
+   end = 0x80000;
+   break;
+   case MSR_MTRRfix16K_80000:
+   start = 0x80000;
+   end = 0xa0000;
+   break;
+   case MSR_MTRRfix16K_A0000:
+   start = 0xa0000;
+   end = 0xc0000;
+   break;
+   case MSR_MTRRfix4K_C0000 ... MSR_MTRRfix4K_F8000:
+   index = msr - MSR_MTRRfix4K_C0000;
+   start = 0xc0000 + index * (32 << 10);
+   end = start + (32 << 10);
+   break;
+   case MSR_MTRRdefType:
+   is_fixed = false;
+   start = 0x0;
+   end = ~0ULL;
+   break;
+   default:
+   /* variable range MTRRs. */
+   is_fixed = false;
+   index = (msr - 0x200) / 2;
+   start = (((u64)mtrr_state->var_ranges[index].base_hi) << 32) +
+  (mtrr_state->var_ranges[index].base_lo & PAGE_MASK);
+   mask = (((u64)mtrr_state->var_ranges[index].mask_hi) << 32) +
+  (mtrr_state->var_ranges[index].mask_lo & PAGE_MASK);
+   mask |= ~0ULL << cpuid_maxphyaddr(vcpu);
+
+   end = ((start & mask) | ~mask) + 1;
+   }
+
+   if (is_fixed && !(mtrr_enabled & 0x1))
+   return;
+
+   kvm_zap_gfn_range(vcpu->kvm, gpa_to_gfn(start), gpa_to_gfn(end));
+}
+
 static int set_msr_mtrr(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 {
u64 *p = (u64 *)>arch.mtrr_state.fixed_ranges;
@@ -1887,7 +1944,7 @@ static int set_msr_mtrr(struct kvm_vcpu *vcpu, u32 msr, 
u64 data)
*pt = data;
}
 
-   kvm_mmu_reset_context(vcpu);
+   update_mtrr(vcpu, msr);
return 0;
 }
 
-- 
2.1.0



[PATCH v2 6/9] KVM: MMU: introduce kvm_zap_rmapp

2015-05-11 Thread Xiao Guangrong
Split kvm_unmap_rmapp and introduce kvm_zap_rmapp, which will be used in a
later patch

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/mmu.c | 20 
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index fae349a..10d5e03 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1353,25 +1353,29 @@ static bool rmap_write_protect(struct kvm *kvm, u64 gfn)
return write_protected;
 }
 
-static int kvm_unmap_rmapp(struct kvm *kvm, unsigned long *rmapp,
-  struct kvm_memory_slot *slot, gfn_t gfn, int level,
-  unsigned long data)
+static bool kvm_zap_rmapp(struct kvm *kvm, unsigned long *rmapp)
 {
u64 *sptep;
struct rmap_iterator iter;
-   int need_tlb_flush = 0;
+   bool flush = false;
 
 restart:
	for_each_rmap_spte(rmapp, &iter, sptep) {
-   rmap_printk("kvm_rmap_unmap_hva: spte %p %llx gfn %llx (%d)\n",
-sptep, *sptep, gfn, level);
+   rmap_printk("%s: spte %p %llx.\n", __func__, sptep, *sptep);
 
drop_spte(kvm, sptep);
-   need_tlb_flush = 1;
+   flush = true;
goto restart;
}
 
-   return need_tlb_flush;
+   return flush;
+}
+
+static int kvm_unmap_rmapp(struct kvm *kvm, unsigned long *rmapp,
+  struct kvm_memory_slot *slot, gfn_t gfn, int level,
+  unsigned long data)
+{
+   return kvm_zap_rmapp(kvm, rmapp);
 }
 
 static int kvm_set_pte_rmapp(struct kvm *kvm, unsigned long *rmapp,
-- 
2.1.0



[PATCH v2 2/9] KVM: MMU: introduce for_each_rmap_spte()

2015-05-11 Thread Xiao Guangrong
Introduce a helper that walks all the sptes on an rmap, to clean up the
code

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/mmu.c   | 63 +++-
 arch/x86/kvm/mmu_audit.c |  4 +--
 2 files changed, 26 insertions(+), 41 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index f5fcfc1..0da9cf0 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1142,6 +1142,11 @@ static u64 *rmap_get_next(struct rmap_iterator *iter)
return NULL;
 }
 
+#define for_each_rmap_spte(_rmap_, _iter_, _spte_) \
+  for (_spte_ = rmap_get_first(*_rmap_, _iter_);   \
+   _spte_ && ({BUG_ON(!is_shadow_present_pte(*_spte_)); 1;});  \
+   _spte_ = rmap_get_next(_iter_))
+
 static void drop_spte(struct kvm *kvm, u64 *sptep)
 {
if (mmu_spte_clear_track_bits(sptep))
@@ -1205,12 +1210,8 @@ static bool __rmap_write_protect(struct kvm *kvm, 
unsigned long *rmapp,
struct rmap_iterator iter;
bool flush = false;
 
-   for (sptep = rmap_get_first(*rmapp, &iter); sptep;) {
-   BUG_ON(!(*sptep & PT_PRESENT_MASK));
-
+   for_each_rmap_spte(rmapp, &iter, sptep)
 		flush |= spte_write_protect(kvm, sptep, pt_protect);
-   sptep = rmap_get_next(&iter);
-   }
 
return flush;
 }
@@ -1232,12 +1233,8 @@ static bool __rmap_clear_dirty(struct kvm *kvm, unsigned 
long *rmapp)
struct rmap_iterator iter;
bool flush = false;
 
-   for (sptep = rmap_get_first(*rmapp, &iter); sptep;) {
-   BUG_ON(!(*sptep & PT_PRESENT_MASK));
-
+   for_each_rmap_spte(rmapp, &iter, sptep)
 		flush |= spte_clear_dirty(kvm, sptep);
-   sptep = rmap_get_next(&iter);
-   }
 
return flush;
 }
@@ -1259,12 +1256,8 @@ static bool __rmap_set_dirty(struct kvm *kvm, unsigned 
long *rmapp)
struct rmap_iterator iter;
bool flush = false;
 
-   for (sptep = rmap_get_first(*rmapp, &iter); sptep;) {
-   BUG_ON(!(*sptep & PT_PRESENT_MASK));
-
+   for_each_rmap_spte(rmapp, &iter, sptep)
 		flush |= spte_set_dirty(kvm, sptep);
-   sptep = rmap_get_next(&iter);
-   }
 
return flush;
 }
@@ -1368,13 +1361,14 @@ static int kvm_unmap_rmapp(struct kvm *kvm, unsigned 
long *rmapp,
struct rmap_iterator iter;
int need_tlb_flush = 0;
 
-   while ((sptep = rmap_get_first(*rmapp, &iter))) {
-   BUG_ON(!(*sptep & PT_PRESENT_MASK));
+restart:
+   for_each_rmap_spte(rmapp, &iter, sptep) {
rmap_printk("kvm_rmap_unmap_hva: spte %p %llx gfn %llx (%d)\n",
 sptep, *sptep, gfn, level);
 
drop_spte(kvm, sptep);
need_tlb_flush = 1;
+   goto restart;
}
 
return need_tlb_flush;
@@ -1394,8 +1388,8 @@ static int kvm_set_pte_rmapp(struct kvm *kvm, unsigned 
long *rmapp,
WARN_ON(pte_huge(*ptep));
new_pfn = pte_pfn(*ptep);
 
-   for (sptep = rmap_get_first(*rmapp, &iter); sptep;) {
-   BUG_ON(!is_shadow_present_pte(*sptep));
+restart:
+   for_each_rmap_spte(rmapp, &iter, sptep) {
rmap_printk("kvm_set_pte_rmapp: spte %p %llx gfn %llx (%d)\n",
 sptep, *sptep, gfn, level);
 
@@ -1403,7 +1397,7 @@ static int kvm_set_pte_rmapp(struct kvm *kvm, unsigned 
long *rmapp,
 
if (pte_write(*ptep)) {
drop_spte(kvm, sptep);
-   sptep = rmap_get_first(*rmapp, &iter);
+   goto restart;
} else {
new_spte = *sptep & ~PT64_BASE_ADDR_MASK;
new_spte |= (u64)new_pfn << PAGE_SHIFT;
@@ -1414,7 +1408,6 @@ static int kvm_set_pte_rmapp(struct kvm *kvm, unsigned 
long *rmapp,
 
mmu_spte_clear_track_bits(sptep);
mmu_spte_set(sptep, new_spte);
-   sptep = rmap_get_next(&iter);
}
}
 
@@ -1518,16 +1511,13 @@ static int kvm_age_rmapp(struct kvm *kvm, unsigned long 
*rmapp,
 
BUG_ON(!shadow_accessed_mask);
 
-   for (sptep = rmap_get_first(*rmapp, &iter); sptep;
-sptep = rmap_get_next(&iter)) {
-   BUG_ON(!is_shadow_present_pte(*sptep));
-
+   for_each_rmap_spte(rmapp, &iter, sptep)
if (*sptep & shadow_accessed_mask) {
young = 1;
clear_bit((ffs(shadow_accessed_mask) - 1),
 (unsigned long *)sptep);
}
-   }
+
trace_kvm_age_page(gfn, level, slot, young);
return young;
 }
@@ -1548,15 +1538,11 @@ static int kvm_test_age_rmapp(struct kvm *kvm, unsigned 
long *rmapp,
if (!shadow_accessed_mask)
goto out;
 
-   for (sptep = rmap_get_first(*rmapp, &iter); sptep;
-sptep = rmap_get_next(&iter)) {
-   BUG_ON(!is_shadow_present_pte(*sptep));
-

[PATCH v2 7/9] KVM: MMU: introduce kvm_zap_gfn_range

2015-05-11 Thread Xiao Guangrong
It is used to zap all the rmaps in the specified gfn range and will
be used by a later patch

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/mmu.c | 25 +
 arch/x86/kvm/mmu.h |  1 +
 2 files changed, 26 insertions(+)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 10d5e03..8c400dc 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -4526,6 +4526,31 @@ slot_handle_leaf(struct kvm *kvm, struct kvm_memory_slot 
*memslot,
 PT_PAGE_TABLE_LEVEL, lock_flush_tlb);
 }
 
+void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
+{
+   struct kvm_memslots *slots;
+   struct kvm_memory_slot *memslot;
+
+   slots = kvm_memslots(kvm);
+
+   spin_lock(&kvm->mmu_lock);
+   kvm_for_each_memslot(memslot, slots) {
+   gfn_t start, end;
+
+   start = max(gfn_start, memslot->base_gfn);
+   end = min(gfn_end, memslot->base_gfn + memslot->npages);
+   if (start >= end)
+   continue;
+
+   slot_handle_level_range(kvm, memslot, kvm_zap_rmapp,
+   PT_PAGE_TABLE_LEVEL,
+   PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES - 1,
+   start, end - 1, true);
+   }
+
+   spin_unlock(&kvm->mmu_lock);
+}
+
 static bool slot_rmap_write_protect(struct kvm *kvm, unsigned long *rmapp)
 {
return __rmap_write_protect(kvm, rmapp, false);
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 06eb2fc..deec5a8 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -172,4 +172,5 @@ static inline bool permission_fault(struct kvm_vcpu *vcpu, 
struct kvm_mmu *mmu,
 }
 
 void kvm_mmu_invalidate_zap_all_pages(struct kvm *kvm);
+void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);
 #endif
-- 
2.1.0



[PATCH v2 9/9] KVM: x86: do not reset mmu if CR0.CD and CR0.NW are changed

2015-05-11 Thread Xiao Guangrong
CR0.CD and CR0.NW are not used by the shadow page tables, so there is
no need to adjust the mmu when these two bits are changed

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/x86.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a527dd0..a82d26f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -572,8 +572,7 @@ out:
 int kvm_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
 {
unsigned long old_cr0 = kvm_read_cr0(vcpu);
-   unsigned long update_bits = X86_CR0_PG | X86_CR0_WP |
-   X86_CR0_CD | X86_CR0_NW;
+   unsigned long update_bits = X86_CR0_PG | X86_CR0_WP;
 
cr0 |= X86_CR0_ET;
 
-- 
2.1.0



[PATCH v2 3/9] KVM: MMU: introduce for_each_slot_rmap_range

2015-05-11 Thread Xiao Guangrong
It's used to abstract the code from kvm_handle_hva_range and it will be
used by a later patch

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/mmu.c | 97 +-
 1 file changed, 75 insertions(+), 22 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 0da9cf0..98b7a6a 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1417,6 +1417,74 @@ restart:
return 0;
 }
 
+struct slot_rmap_walk_iterator {
+   /* input fields. */
+   struct kvm_memory_slot *slot;
+   gfn_t start_gfn;
+   gfn_t end_gfn;
+   int start_level;
+   int end_level;
+
+   /* output fields. */
+   gfn_t gfn;
+   unsigned long *rmap;
+   int level;
+
+   /* private field. */
+   unsigned long *end_rmap;
+};
+
+static void
+rmap_walk_init_level(struct slot_rmap_walk_iterator *iterator, int level)
+{
+   iterator->level = level;
+   iterator->gfn = iterator->start_gfn;
+   iterator->rmap = __gfn_to_rmap(iterator->gfn, level, iterator->slot);
+   iterator->end_rmap = __gfn_to_rmap(iterator->end_gfn, level,
+  iterator->slot);
+}
+
+static void
+slot_rmap_walk_init(struct slot_rmap_walk_iterator *iterator,
+   struct kvm_memory_slot *slot, int start_level,
+   int end_level, gfn_t start_gfn, gfn_t end_gfn)
+{
+   iterator->slot = slot;
+   iterator->start_level = start_level;
+   iterator->end_level = end_level;
+   iterator->start_gfn = start_gfn;
+   iterator->end_gfn = end_gfn;
+
+   rmap_walk_init_level(iterator, iterator->start_level);
+}
+
+static bool slot_rmap_walk_okay(struct slot_rmap_walk_iterator *iterator)
+{
+   return !!iterator->rmap;
+}
+
+static void slot_rmap_walk_next(struct slot_rmap_walk_iterator *iterator)
+{
+   if (++iterator->rmap <= iterator->end_rmap) {
+   iterator->gfn += (1UL << KVM_HPAGE_GFN_SHIFT(iterator->level));
+   return;
+   }
+
+   if (++iterator->level > iterator->end_level) {
+   iterator->rmap = NULL;
+   return;
+   }
+
+   rmap_walk_init_level(iterator, iterator->level);
+}
+
+#define for_each_slot_rmap_range(_slot_, _start_level_, _end_level_,   \
+  _start_gfn, _end_gfn, _iter_)\
+   for (slot_rmap_walk_init(_iter_, _slot_, _start_level_, \
+_end_level_, _start_gfn, _end_gfn);\
+slot_rmap_walk_okay(_iter_);   \
+slot_rmap_walk_next(_iter_))
+
 static int kvm_handle_hva_range(struct kvm *kvm,
unsigned long start,
unsigned long end,
@@ -1428,10 +1496,10 @@ static int kvm_handle_hva_range(struct kvm *kvm,
   int level,
   unsigned long data))
 {
-   int j;
-   int ret = 0;
struct kvm_memslots *slots;
struct kvm_memory_slot *memslot;
+   struct slot_rmap_walk_iterator iterator;
+   int ret = 0;
 
slots = kvm_memslots(kvm);
 
@@ -1451,26 +1519,11 @@ static int kvm_handle_hva_range(struct kvm *kvm,
gfn_start = hva_to_gfn_memslot(hva_start, memslot);
gfn_end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, memslot);
 
-   for (j = PT_PAGE_TABLE_LEVEL;
-j < PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES; ++j) {
-   unsigned long idx, idx_end;
-   unsigned long *rmapp;
-   gfn_t gfn = gfn_start;
-
-   /*
-* {idx(page_j) | page_j intersects with
-*  [hva_start, hva_end)} = {idx, idx+1, ..., idx_end}.
-*/
-   idx = gfn_to_index(gfn_start, memslot->base_gfn, j);
-   idx_end = gfn_to_index(gfn_end - 1, memslot->base_gfn, 
j);
-
-   rmapp = __gfn_to_rmap(gfn_start, j, memslot);
-
-   for (; idx <= idx_end;
-  ++idx, gfn += (1UL << KVM_HPAGE_GFN_SHIFT(j)))
-   ret |= handler(kvm, rmapp++, memslot,
-  gfn, j, data);
-   }
+   for_each_slot_rmap_range(memslot, PT_PAGE_TABLE_LEVEL,
+  PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES - 1,
+  gfn_start, gfn_end - 1, &iterator)
+   ret |= handler(kvm, iterator.rmap, memslot,
+  iterator.gfn, iterator.level, data);
}
 
return ret;
-- 
2.1.0


[PATCH v2 0/9] KVM: MTRR fixes and some cleanups

2015-05-11 Thread Xiao Guangrong
Changelog:
- fix the bit description in changelog of the first patch, thanks
  David Matlack for pointing it out

All the following changes come from Paolo's comments; much appreciated:
- reorder the whole patchset to make it more readable
- redesign the iterator APIs
- make TLB clean if @lock_flush_tlb is true in slot_handle_level()
- make MTRR update be generic

These are some MTRR bugs that show up when a legacy IOMMU device is used
on Intel CPUs:
- In the current code, whenever guest MTRR registers are changed,
  kvm_mmu_reset_context is called to switch to the new root shadow page
  table; however, this is useless since:
  1) the cache type is not stored in the shadow page's attributes, so the
     original root shadow page will be reused

  2) the cache type is set on the last-level spte, which means we should
     sync the last-level sptes when MTRR is changed

  We can fix this by dropping all the sptes in the gfn range which is
  being updated by MTRR

- there are some bugs in get_mtrr_type():
  1: bit 1 of mtrr_state->enabled corresponds to bit 11 of the
     IA32_MTRR_DEF_TYPE MSR, which completely controls MTRR enablement;
     other bits are ignored if it is cleared

  2: the fixed MTRR ranges are controlled by bit 0 of mtrr_state->enabled
     (bit 10 of IA32_MTRR_DEF_TYPE)

  3: if MTRR is disabled, UC is applied to all of physical memory rather
     than mtrr_state->def_type (see the sketch after this list)
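
Put as code, the decode rules above amount to roughly the following
(a sketch only; the mask values mirror the bit layout described above,
and fixed_mtrr_type() is an illustrative helper, not the actual patch):

	if (!(mtrr_state->enabled & 0x2))	/* bit 11 of IA32_MTRR_DEF_TYPE */
		return MTRR_TYPE_UNCACHABLE;	/* MTRRs disabled: UC everywhere */

	if ((mtrr_state->enabled & 0x1) &&	/* bit 10: fixed ranges enabled */
	    addr < 0x100000)			/* fixed ranges cover the first 1MB */
		return fixed_mtrr_type(addr);

	/* otherwise check the variable ranges, falling back to def_type */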

- we need not reset the mmu once the cache policy is changed, since the
  shadow page table does not virtualize any cache policy

Also included are some cleanups that make the current MMU code cleaner
and the bugs easier to fix.

Xiao Guangrong (9):
  KVM: MMU: fix decoding cache type from MTRR
  KVM: MMU: introduce for_each_rmap_spte()
  KVM: MMU: introduce for_each_slot_rmap_range
  KVM: MMU: introduce slot_handle_level_range() and its helpers
  KVM: MMU: use slot_handle_level and its helper to clean up the code
  KVM: MMU: introduce kvm_zap_rmapp
  KVM: MMU: introduce kvm_zap_gfn_range
  KVM: MMU: fix MTRR update
  KVM: x86: do not reset mmu if CR0.CD and CR0.NW are changed

 arch/x86/kvm/mmu.c   | 408 ++-
 arch/x86/kvm/mmu.h   |   1 +
 arch/x86/kvm/mmu_audit.c |   4 +-
 arch/x86/kvm/x86.c   |  62 ++-
 4 files changed, 284 insertions(+), 191 deletions(-)

-- 
2.1.0



Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)

2015-05-11 Thread Daniel Phillips


On 05/11/2015 04:17 PM, Theodore Ts'o wrote:
> On Tue, May 12, 2015 at 12:12:23AM +0200, Pavel Machek wrote:
>> Umm, are you sure. If "some areas of disk are faster than others" is
>> still true on today's hard drives, the gaps will decrease the
>> performance (as you'll "use up" the fast areas more quickly).
> 
> It's still true.  The difference between O.D. and I.D. (outer diameter
> vs inner diameter) LBA's is typically a factor of 2.  This is why
> "short-stroking" works as a technique,

That is true, and the effect is not dominant compared to introducing
a lot of extra seeks.

> and another way that people
> doing competitive benchmarking can screw up and produce misleading
> numbers.

If you think we screwed up or produced misleading numbers, could you
please be up front about it instead of making insinuations and
continuing your tirade against benchmarking and those who do it.

> (If you use partitions instead of the whole disk, you have
> to use the same partition in order to make sure you aren't comparing
> apples with oranges.)

You can rest assured I did exactly that.

Somebody complained that things would look much different with seeks
factored out, so here are some new "competitive benchmarks" using
fs_mark on a ram disk:

   tasks        1     16     64

   ext4:      231   2154   5439
   btrfs:     152    962   2230
   xfs:       268   2729   6466
   tux3:      315   5529  20301

(Files per second, more is better)

The shell commands are:

   fs_mark -dtest -D5 -N100 -L1 -p5 -r5 -s1048576 -w4096 -n1000 -t1
   fs_mark -dtest -D5 -N100 -L1 -p5 -r5 -s65536 -w4096 -n1000 -t16
   fs_mark -dtest -D5 -N100 -L1 -p5 -r5 -s4096 -w4096 -n1000 -t64

The ram disk removes seek overhead and greatly reduces media transfer
overhead. This does not change things much: it confirms that Tux3 is
significantly faster than the others at synchronous loads. This is
apparently true independently of media type, though to be sure SSD
remains to be tested.

The really interesting result is how much difference there is between
filesystems, even on a ram disk. Is it just CPU or is it synchronization
strategy and lock contention? Does our asynchronous front/back design
actually help a lot, instead of being a disadvantage as you predicted?

It is too bad that fs_mark caps number of tasks at 64, because I am
sure that some embarrassing behavior would emerge at high task counts,
as with my tests on spinning disk.

Anyway, everybody but you loves competitive benchmarks, that is why I
post them. They are not only useful for tracking down performance bugs,
but as you point out, they help us advertise the reasons why Tux3 is
interesting and ought to be merged.

Regards,

Daniel


Re: [PATCH 3/6] dataplane nohz: run softirqs synchronously on user entry

2015-05-11 Thread Mike Galbraith
On Mon, 2015-05-11 at 16:13 -0400, Chris Metcalf wrote:
> On 05/09/2015 03:04 AM, Mike Galbraith wrote:
> > On Fri, 2015-05-08 at 13:58 -0400, Chris Metcalf wrote:
> >> For tasks which have elected dataplane functionality, we run
> >> any pending softirqs for the core before returning to userspace,
> >> rather than ever scheduling ksoftirqd to run.  The problem we
> >> fix is that by allowing another task to run on the core, we
> >> guarantee more interrupts in the future to the dataplane task,
> >> which is exactly what dataplane mode is required to prevent.
> > If ksoftirqd were rt class
> 
> I realize I actually don't know if this is true or not.  Is
> ksoftirqd rt class?  If not, it does seem pretty plausible that
> it should be...

It is in an rt kernel, not in a stock kernel, it's malleable in both ;-)

> > softirqs would be gone when the soloist gets
> > the CPU back and heads to userspace.  Being a soloist, it has no use for
> > a priority, so why can't it just let ksoftirqd run if it raises the
> > occasional softirq?  Meeting a contended lock while processing it will
> > wreck the soloist regardless of who does that processing.
> 
> The thing you want to avoid is having two processes both
> runnable at once, since then the "quiesce" mode can't make
> forward progress and basically spins in cpu_idle() until ksoftirqd
> can come in.

The only way ksoftirqd can appear is the soloist woke it.  If alleged
soloist is raising enough softirqs to matter, it ain't really an ultra
sensitive solo artist, it's part of a noise inducing (locks) chorus.

>   Alas, my recollection of the precise failure mode
> is somewhat dimmed; my commit notes from a year ago (for
> a variant of the patch I'm upstreaming now):
> 
>  - Trying to return to userspace with pending softirqs is not
>currently allowed.  Prior to this patch, when this happened
>we would just wait in cpu_idle.  Instead, what we now do is
>directly run any pending softirqs, then go back and retry the
>path where we return to userspace.
>  
>  - Raising softirqs (in this case for hrtimer support) could
>cause the ksoftirqd daemon to be woken on a core.  This is
>bad because on a dataplane core, a QUIESCE process will
>then block until the ksoftirqd runs, and the system sometimes
>seems to flag that soft irqs are available but not schedule
>the timer to arrange for a context switch to ksoftirqd.
>To handle this, we avoid bailing out in __do_softirq() when
>we've been working for a while, if we're on a dataplane core,
>and just keep working until done.  Similarly, on a dataplane
>core running a userspace task, we don't wake ksoftirqd when
>we are raising a softirq, even if we're not in an interrupt
>context where it will run promptly, since a non-interrupt
>context will also run promptly.

Thomas has nuked the hrtimer softirq.

> I'm happy to drop this patch entirely from the series for now, and
> if ksoftirqd shows up as a problem going forward, we can address it
> as necessary at that time.   What do you think?

Inlining softirqs may save a context switch, but adds cycles that we may
consume at higher frequency than the thing we're avoiding.

-Mike



[PATCH RFC v2 3/4] sched: expose capacity_of in sched.h

2015-05-11 Thread Michael Turquette
capacity_of is of use to a cpu frequency scaling policy based on cfs
load tracking and cpu capacity utilization metrics. Expose this call in
sched.h so it can be used in such a policy.

Signed-off-by: Michael Turquette 
---
Changes in v2:
Do not expose get_cpu_usage or capacity_orig_of in sched.h
Expose capacity_of instead

 kernel/sched/fair.c  | 5 -
 kernel/sched/sched.h | 5 +
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 75aec8d..d27ded9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4361,11 +4361,6 @@ static unsigned long target_load(int cpu, int type)
return max(rq->cpu_load[type-1], total);
 }
 
-static unsigned long capacity_of(int cpu)
-{
-   return cpu_rq(cpu)->cpu_capacity;
-}
-
 static unsigned long capacity_orig_of(int cpu)
 {
return cpu_rq(cpu)->cpu_capacity_orig;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e0e1299..4925bc4 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1396,6 +1396,11 @@ unsigned long arch_scale_freq_capacity(struct 
sched_domain *sd, int cpu)
 }
 #endif
 
+static inline unsigned long capacity_of(int cpu)
+{
+   return cpu_rq(cpu)->cpu_capacity;
+}
+
 static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
 {
rq->rt_avg += rt_delta * arch_scale_freq_capacity(NULL, cpu_of(rq));
-- 
1.9.1



[PATCH RFC v2 2/4] sched: sched feature for cpu frequency selection

2015-05-11 Thread Michael Turquette
This patch introduces the SCHED_ENERGY_FREQ sched feature, which is
implemented using jump labels when SCHED_DEBUG is defined. It is
statically set to false when SCHED_DEBUG is not defined and thus
disabled by default.

Signed-off-by: Michael Turquette 
---
Changes in v2:
none

 kernel/sched/fair.c | 5 +
 kernel/sched/features.h | 6 ++
 2 files changed, 11 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 46855d0..75aec8d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4207,6 +4207,11 @@ static inline void hrtick_update(struct rq *rq)
 }
 #endif
 
+static inline bool sched_energy_freq(void)
+{
+   return sched_feat(SCHED_ENERGY_FREQ);
+}
+
 /*
  * The enqueue_task method is called before nr_running is
  * increased. Here we update the fair scheduling stats and
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 91e33cd..77381cf 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -96,3 +96,9 @@ SCHED_FEAT(NUMA_FAVOUR_HIGHER, true)
  */
 SCHED_FEAT(NUMA_RESIST_LOWER, false)
 #endif
+
+/*
+ * Scheduler-driven CPU frequency selection aimed to save energy based on
+ * load tracking
+ */
+SCHED_FEAT(SCHED_ENERGY_FREQ, false)
-- 
1.9.1



[PATCH RFC v2 4/4] sched: cpufreq_cfs: pelt-based cpu frequency scaling

2015-05-11 Thread Michael Turquette
Scheduler-driven cpu frequency selection is desirable as part of the
on-going effort to make the scheduler better aware of energy
consumption.  No piece of the Linux kernel has a better view of the
factors that affect a cpu frequency selection policy than the
scheduler[0], and this patch is an attempt to converge on an initial
solution.

This patch implements a cpufreq governor that directly accesses
scheduler statistics, in particular per-runqueue capacity utilization
data from cfs via cfs.utilization_load_avg.

Put plainly, this governor selects the lowest cpu frequency that will
prevent a runqueue from being over-utilized (until we hit the highest
frequency of course). This is accomplished by requesting a frequency
that matches the current capacity utilization, plus a margin.
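
In pseudo-C, the selection amounts to roughly the following (the margin
value and the capacity_at() helper are illustrative; see cpufreq_cfs.c in
the patch for the real logic):

	/* pick the lowest frequency whose capacity covers util + margin */
	unsigned long req_cap = util + (util >> 2);	/* ~25% headroom */
	int i;

	for (i = 0; freq_table[i].frequency != CPUFREQ_TABLE_END; i++)
		if (capacity_at(freq_table[i].frequency) >= req_cap)
			break;	/* table assumed sorted in ascending order */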

Unlike the previous posting from 2014[1] this governor implements a
"follow the utilization" method, where utilization is defined as the
frequency-invariant product of cfs.utilization_load_avg and
cpu_capacity_orig.

This governor is event-driven. There is no polling loop to check cpu
idle time nor any other method which is unsynchronized with the
scheduler. The entry points for this policy are in fair.c:
enqueue_task_fair, dequeue_task_fair and task_tick_fair.

This policy is implemented using the cpufreq governor interface for two
main reasons:

1) re-using the cpufreq machine drivers without using the governor
interface is hard.

2) using the cpufreq interface allows us to switch between the
scheduler-driven policy and legacy cpufreq governors such as ondemand at
run-time. This is very useful for comparative testing and tuning.

Finally, it is worth mentioning that this approach neglects all
scheduling classes except for cfs. It is possible to add support for
deadline and other classes here, but I also wonder if a
multi-governor approach would be a more maintainable solution, where the
cpufreq core aggregates the constraints set by multiple governors.
Supporting such an approach in the cpufreq core would also allow for
peripheral devices to place constraint on cpu frequency without having
to hack such behavior in at the governor level.

Thanks to Juri Lelli  for contributing design ideas,
code and test results.

[0] http://article.gmane.org/gmane.linux.kernel/1499836
[1] https://lkml.org/lkml/2014/10/22/22

Signed-off-by: Juri Lelli 
Signed-off-by: Michael Turquette 
---
Changes in v2:
Folded in Abel's patch to fix builds for non-SMP. Thanks!
Dropped use of get_cpu_usage. Instead pass in
cfs.utilization_load_avg from fair.c
Added two additional conditions to quickly bail from _update_cpu
Return requested capacity from cpufreq_cfs_update_cpu
Handle frequency-table based systems more gooder
Internal data structures and the way data is shared with the
thread are changed considerably

Food for thought: in cpufreq_cfs_update_cpu we could break out
all of the code preceding the call to cpufreq_cpu_get into
fair.c. The interface would change from,
unsigned long cpufreq_cfs_update_cpu(int cpu, unsigned long util);
to,
unsigned long cpufreq_cfs_update_cpu(int cpu, unsigned long cap_target);
This would give fair.c more control over the capacity it wants
to target, and makes the governor interface a bit more flexible
and useful.

 drivers/cpufreq/Kconfig|  24 
 include/linux/cpufreq.h|   3 +
 kernel/sched/Makefile  |   1 +
 kernel/sched/cpufreq_cfs.c | 343 +
 kernel/sched/fair.c|  14 ++
 kernel/sched/sched.h   |   8 ++
 6 files changed, 393 insertions(+)
 create mode 100644 kernel/sched/cpufreq_cfs.c

diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig
index a171fef..83d51b4 100644
--- a/drivers/cpufreq/Kconfig
+++ b/drivers/cpufreq/Kconfig
@@ -102,6 +102,15 @@ config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE
  Be aware that not all cpufreq drivers support the conservative
  governor. If unsure have a look at the help section of the
  driver. Fallback governor will be the performance governor.
+
+config CPU_FREQ_DEFAULT_GOV_CFS
+   bool "cfs"
+   select CPU_FREQ_GOV_CFS
+   select CPU_FREQ_GOV_PERFORMANCE
+   help
+ Use the CPUfreq governor 'cfs' as default. This scales
+ cpu frequency from the scheduler as per-entity load tracking
+ statistics are updated.
 endchoice
 
 config CPU_FREQ_GOV_PERFORMANCE
@@ -183,6 +192,21 @@ config CPU_FREQ_GOV_CONSERVATIVE
 
  If in doubt, say N.
 
+config CPU_FREQ_GOV_CFS
+   tristate "'cfs' cpufreq governor"
+   depends on CPU_FREQ
+   select CPU_FREQ_GOV_COMMON
+   help
+ 'cfs' - this governor scales cpu frequency from the
+ scheduler as a function of cpu capacity utilization. It does
+ not evaluate utilization on a periodic basis (as ondemand
+ does) but instead is invoked from the 

[PATCH RFC v2 1/4] arm: Frequency invariant scheduler load-tracking support

2015-05-11 Thread Michael Turquette
From: Morten Rasmussen 

Implements arch-specific function to provide the scheduler with a
frequency scaling correction factor for more accurate load-tracking. The
factor is:

current_freq(cpu) << SCHED_CAPACITY_SHIFT / max_freq(cpu)

This implementation only provides frequency invariance. No
micro-architecture invariance yet.
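
As a worked example, a cpu running at 1 GHz with a 2 GHz maximum gets a
factor of (1000000 << 10) / 2000000 = 512, i.e. half of
SCHED_CAPACITY_SCALE (1024), so tracked load contributions are scaled
down by half.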

Cc: Russell King 
Signed-off-by: Morten Rasmussen 
---
Changes in v2:
none

 arch/arm/include/asm/topology.h |  7 ++
 arch/arm/kernel/smp.c   | 53 +++--
 arch/arm/kernel/topology.c  | 17 +
 3 files changed, 75 insertions(+), 2 deletions(-)

diff --git a/arch/arm/include/asm/topology.h b/arch/arm/include/asm/topology.h
index 2fe85ff..4b985dc 100644
--- a/arch/arm/include/asm/topology.h
+++ b/arch/arm/include/asm/topology.h
@@ -24,6 +24,13 @@ void init_cpu_topology(void);
 void store_cpu_topology(unsigned int cpuid);
 const struct cpumask *cpu_coregroup_mask(int cpu);
 
+#define arch_scale_freq_capacity arm_arch_scale_freq_capacity
+struct sched_domain;
+extern
+unsigned long arm_arch_scale_freq_capacity(struct sched_domain *sd, int cpu);
+
+DECLARE_PER_CPU(atomic_long_t, cpu_freq_capacity);
+
 #else
 
 static inline void init_cpu_topology(void) { }
diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c
index 86ef244..297ce1b 100644
--- a/arch/arm/kernel/smp.c
+++ b/arch/arm/kernel/smp.c
@@ -672,12 +672,34 @@ static DEFINE_PER_CPU(unsigned long, l_p_j_ref);
 static DEFINE_PER_CPU(unsigned long, l_p_j_ref_freq);
 static unsigned long global_l_p_j_ref;
 static unsigned long global_l_p_j_ref_freq;
+static DEFINE_PER_CPU(atomic_long_t, cpu_max_freq);
+DEFINE_PER_CPU(atomic_long_t, cpu_freq_capacity);
+
+/*
+ * Scheduler load-tracking scale-invariance
+ *
+ * Provides the scheduler with a scale-invariance correction factor that
+ * compensates for frequency scaling through arch_scale_freq_capacity()
+ * (implemented in topology.c).
+ */
+static inline
+void scale_freq_capacity(int cpu, unsigned long curr, unsigned long max)
+{
+   unsigned long capacity;
+
+   if (!max)
+   return;
+
+   capacity = (curr << SCHED_CAPACITY_SHIFT) / max;
+   atomic_long_set(&per_cpu(cpu_freq_capacity, cpu), capacity);
+}
 
 static int cpufreq_callback(struct notifier_block *nb,
unsigned long val, void *data)
 {
struct cpufreq_freqs *freq = data;
int cpu = freq->cpu;
+   unsigned long max = atomic_long_read(&per_cpu(cpu_max_freq, cpu));
 
if (freq->flags & CPUFREQ_CONST_LOOPS)
return NOTIFY_OK;
@@ -702,6 +724,9 @@ static int cpufreq_callback(struct notifier_block *nb,
per_cpu(l_p_j_ref_freq, cpu),
freq->new);
}
+
+   scale_freq_capacity(cpu, freq->new, max);
+
return NOTIFY_OK;
 }
 
@@ -709,11 +734,35 @@ static struct notifier_block cpufreq_notifier = {
.notifier_call  = cpufreq_callback,
 };
 
+static int cpufreq_policy_callback(struct notifier_block *nb,
+   unsigned long val, void *data)
+{
+   struct cpufreq_policy *policy = data;
+   int i;
+
+   for_each_cpu(i, policy->cpus) {
+   scale_freq_capacity(i, policy->cur, policy->max);
+   atomic_long_set(&per_cpu(cpu_max_freq, i), policy->max);
+   }
+
+   return NOTIFY_OK;
+}
+
+static struct notifier_block cpufreq_policy_notifier = {
+   .notifier_call  = cpufreq_policy_callback,
+};
+
 static int __init register_cpufreq_notifier(void)
 {
-   return cpufreq_register_notifier(&cpufreq_notifier,
+   int ret;
+
+   ret = cpufreq_register_notifier(&cpufreq_notifier,
CPUFREQ_TRANSITION_NOTIFIER);
+   if (ret)
+   return ret;
+
+   return cpufreq_register_notifier(&cpufreq_policy_notifier,
+   CPUFREQ_POLICY_NOTIFIER);
 }
 core_initcall(register_cpufreq_notifier);
-
 #endif
diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index 08b7847..9c09e6e 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -169,6 +169,23 @@ static void update_cpu_capacity(unsigned int cpu)
cpu, arch_scale_cpu_capacity(NULL, cpu));
 }
 
+/*
+ * Scheduler load-tracking scale-invariance
+ *
+ * Provides the scheduler with a scale-invariance correction factor that
+ * compensates for frequency scaling (arch_scale_freq_capacity()). The scaling
+ * factor is updated in smp.c
+ */
+unsigned long arm_arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
+{
+   unsigned long curr = atomic_long_read(&per_cpu(cpu_freq_capacity, cpu));
+
+   if (!curr)
+   return SCHED_CAPACITY_SCALE;
+
+   return curr;
+}
+
 #else
 static inline void parse_dt_topology(void) {}
 static inline void update_cpu_capacity(unsigned int cpuid) {}
-- 
1.9.1


[PATCH RFC v2 0/4] scheduler-driven cpu frequency selection

2015-05-11 Thread Michael Turquette
This series implements an event-driven cpufreq governor that scales cpu
frequency as a function of cfs runqueue utilization. The intent of this RFC is
to get some discussion going about how the scheduler can become the policy
engine for selecting cpu frequency, what limitations exist and what design do
we want to take to get to a solution.

V2 changes the interface exposed from the governor to cfs. Instead of being a
"pull" model where get_cpu_usage is used to fetch the utilization, that
information is pushed into the governor. After making this change it becomes
clear that selecting a new capacity target for a cpu can be done entirely
within fair.c without any knowledge of cpufreq or the hardware. I didn't go
that far in this version of the series, but it is something to consider. Such a
change would mean that we do not pass in a utilization value but instead a
capacity target.

RFC v1 from May 4, 2015:
http://lkml.kernel.org/r/<1430777441-15087-1-git-send-email-mturque...@linaro.org>

Old, original idea from October/November of 2014:
http://lkml.kernel.org/r/<1413958051-7103-1-git-send-email-mturque...@linaro.org>

This series depends on having frequency-invariant representations for load.
This requires Vincent's recently merged cpu capacity rework patches, as well as
a new patch from Morten included here. Morten's patch will likely make an
appearance in his energy aware scheduling v4 series.

Thanks to Juri Lelli  for contributing to the development
of the governor.

A git branch with these patches can be pulled from here:
https://git.linaro.org/people/mike.turquette/linux.git sched-freq

Smoke testing has been done on an OMAP4 Pandaboard and an Exynos 5800
Chromebook2. Extensive benchmarking and regression testing has not yet been
done.

Michael Turquette (3):
  sched: sched feature for cpu frequency selection
  sched: expose capacity_of in sched.h
  sched: cpufreq_cfs: pelt-based cpu frequency scaling

Morten Rasmussen (1):
  arm: Frequency invariant scheduler load-tracking support

 arch/arm/include/asm/topology.h |   7 +
 arch/arm/kernel/smp.c   |  53 ++-
 arch/arm/kernel/topology.c  |  17 ++
 drivers/cpufreq/Kconfig |  24 +++
 include/linux/cpufreq.h |   3 +
 kernel/sched/Makefile   |   1 +
 kernel/sched/cpufreq_cfs.c  | 343 
 kernel/sched/fair.c |  24 ++-
 kernel/sched/features.h |   6 +
 kernel/sched/sched.h|  13 ++
 10 files changed, 484 insertions(+), 7 deletions(-)
 create mode 100644 kernel/sched/cpufreq_cfs.c

-- 
1.9.1



Re: [PATCH 2/2] lib/vsprintf.c: Further simplify uuid_string()

2015-05-11 Thread Joe Perches
On Mon, 2015-05-11 at 15:55 -0400, George Spelvin wrote:
> Make the endianness permutation table do double duty by having it
> list not source offsets, but destination offsets.  Thus, it both puts
> the bytes in the right order and skips the hyphens.

Thanks George.  One minor nit maybe not worth updating.

> diff --git a/lib/vsprintf.c b/lib/vsprintf.c
[]
> @@ -1265,10 +1265,9 @@ char *uuid_string(char *buf, char *end, const u8 *addr,
> struct printf_spec spec, const char *fmt)
>  {
>   char uuid[sizeof("----")];
> - char *p = uuid;
>   int i;
> - static const u8 be[16] = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15};
> - static const u8 le[16] = {3,2,1,0,5,4,7,6,8,9,10,11,12,13,14,15};
> + static const u8 be[16] = {0,2,4,6,9,11,14,16,19,21,24,26,28,30,32,34};
> + static const u8 le[16] = {6,4,2,0,11,9,16,14,19,21,24,26,28,30,32,34};

These might be better with a little comment/explanation
of the values as output offsets for each index.
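
For example, a comment could spell out that the output buffer is laid out
as "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx", so input byte i lands at string
offset index[i]; the copy loop would then read roughly like this (a sketch
of the idea, not the actual patch):

	for (i = 0; i < 16; i++) {
		/* index is 'be' or 'le'; the offsets already skip the hyphens */
		uuid[index[i]]     = hex_asc_hi(addr[i]);
		uuid[index[i] + 1] = hex_asc_lo(addr[i]);
	}
	uuid[8] = uuid[13] = uuid[18] = uuid[23] = '-';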





Re: [PATCH 4/5] workqueue: don't expose workqueue_attrs to users

2015-05-11 Thread Lai Jiangshan
On 05/11/2015 10:59 PM, Tejun Heo wrote:
> On Mon, May 11, 2015 at 05:35:51PM +0800, Lai Jiangshan wrote:
>> workqueue_attrs is an internal-like structure and is exposed with
>> apply_workqueue_attrs() whose user has to investigate the structure
>> before use.
>>
>> And the apply_workqueue_attrs() API is inconvenient with the structure.
>> The user (although there is no user yet currently) has to assemble
>> several LoC to use:
>>  attrs = alloc_workqueue_attrs();
>>  if (!attrs)
>>  return;
>>  attrs->nice = ...;
>>  copy cpumask;
>>  attrs->no_numa = ...;
>>  apply_workqueue_attrs();
>>  free_workqueue_attrs();
>>
>> It is too elaborate. This patch changes apply_workqueue_attrs() API,
>> and one-line-code is enough to be called from user:
>>  apply_workqueue_attrs(wq, cpumask, nice, numa);
>>
>> This patch also reduces the code of workqueue.c, about -50 lines.
>> wq_sysfs_prep_attrs() is removed, wq_[nice|cpumask|numa]_store()
>> directly access to the ->unbound_attrs with the protection
>> of apply_wqattrs_lock();
>>
>> This patch is also a preparation patch of next patch which
>> remove no_numa out from the structure workqueue_attrs which
>> requires apply_workqueue_attrs() has an argument to pass numa affinity.
> 
> I'm not sure about this.  Yeah, sure, it's a bit more lines of code
> but at the same time this'd allow us to make the public interface
> atomic too.  What we prolly should do is changing the interface so
> that we do
> 
>   attrs = prepare_workqueue_attrs(gfp_mask);  /* allocate, lock & 
> copy */
>   /* modify attrs as desired */
>   commit_workqueue_attrs(attrs);  /* apply, unlock and 
> free */

I think workqueue.c has too many complicated and rarely used APIs
and exposes too much this way.  No one can set the nice value
and the cpus_allowed of a task atomically.

If the user wants atomicity, he/she can just disable WQ_SYSFS
on the workqueue and maintain a copy of the cpumask, nice and numa values
under his/her own lock; see the sketch below for how the two proposals
compare.

> 
> Thanks.
> 
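
For comparison, usage under the two proposals would look roughly like
this (a sketch only; neither interface is final, and the prepare/commit
names are hypothetical):

	/* one-shot API (this patch): */
	ret = apply_workqueue_attrs(wq, cpumask, -5, true);

	/* prepare/commit API (suggested above): */
	attrs = prepare_workqueue_attrs(GFP_KERNEL);
	if (!attrs)
		return -ENOMEM;
	attrs->nice = -5;
	cpumask_copy(attrs->cpumask, cpumask);
	commit_workqueue_attrs(attrs);	/* apply, unlock and free */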



Re: [PATCH] aoe: Use 64-bit timestamp in frame

2015-05-11 Thread Ed Cashin

First, thanks for the patch.  I do appreciate the attempt to simplify
this part of the driver, but I don't think that this patch is good to merge.

I'll make some comments inline below.

On 05/10/2015 10:35 PM, Tina Ruchandani wrote:

'struct frame' uses two variables to store the sent timestamp - 'struct
timeval' and jiffies. jiffies is used to avoid discrepancies caused by
updates to system time. 'struct timeval' uses 32-bit representation for
seconds which will overflow in year 2038.


The comment in the deleted lines below mentions the fact that the
overflow does not matter for calculating rough-grained deltas in time.
So there is no problem in 2038 or on systems with the clock set to 2038
accidentally.


This patch does the following:
- Replace the use of 'struct timeval' and jiffies with ktime_t, which
is a 64-bit timestamp and is year 2038 safe.
- ktime_t provides both long range (like jiffies) and high resolution
(like timeval). Using ktime_get (monotonic time) instead of wall-clock
time prevents any discrepancies caused by updates to system time.


But the patch only changes the struct frame data.  The aoe driver
only has the struct frame for an incoming AoE response when that
response is "expected".  If the response comes in a bit late, the frame
may have already been used for a new command.

You can see that in aoecmd_ata_rsp when getframe_deferred returns
NULL and tsince is called instead of tsince_hr.

In that case, there is still information about the timing embedded in
the AoE tag.  The send time in jiffies is a rough-grained record of the
send time, and it's extracted from the tag.  For these "unexpected"
responses, this timing information can improve performance significantly
without introducing extra overhead or risk.
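
The rough-grained path is cheap because the send time rides along in the
tag itself; the fallback looks roughly like this sketch (modeled on the
driver's tsince(); unsigned 16-bit arithmetic handles jiffies wrap):

	static int
	tsince(u32 tag)
	{
		int n;

		n = jiffies & 0xffff;	/* low 16 bits of the send time live in the tag */
		n -= tag & 0xffff;
		if (n < 0)
			n += 1 << 16;
		return jiffies_to_usecs(n + 1);
	}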

I don't think the patch considers this aspect of the way the round trip
time is calculated, and I don't think the primary motivation is justified
(if that's 2038 safety, which we have already).

Simplifying it would be nice, but it would be difficult to thoroughly test
all of the performance implications.  There are still people using 32-bit
systems, for example.



Signed-off-by: Tina Ruchandani 
---
  drivers/block/aoe/aoe.h|  3 +--
  drivers/block/aoe/aoecmd.c | 36 +++-
  2 files changed, 8 insertions(+), 31 deletions(-)

diff --git a/drivers/block/aoe/aoe.h b/drivers/block/aoe/aoe.h
index 9220f8e..4582b3c 100644
--- a/drivers/block/aoe/aoe.h
+++ b/drivers/block/aoe/aoe.h
@@ -112,8 +112,7 @@ enum frame_flags {
  struct frame {
struct list_head head;
u32 tag;
-   struct timeval sent;/* high-res time packet was sent */
-   u32 sent_jiffs; /* low-res jiffies-based sent time */
+   ktime_t sent;
ulong waited;
ulong waited_total;
struct aoetgt *t;   /* parent target I belong to */
diff --git a/drivers/block/aoe/aoecmd.c b/drivers/block/aoe/aoecmd.c
index 422b7d8..7f78780 100644
--- a/drivers/block/aoe/aoecmd.c
+++ b/drivers/block/aoe/aoecmd.c
@@ -398,8 +398,7 @@ aoecmd_ata_rw(struct aoedev *d)
  
  	skb = skb_clone(f->skb, GFP_ATOMIC);

if (skb) {
-   do_gettimeofday(&f->sent);
-   f->sent_jiffs = (u32) jiffies;
+   f->sent = ktime_get();
__skb_queue_head_init(&queue);
__skb_queue_tail(&queue, skb);
aoenet_xmit(&queue);
@@ -489,8 +488,7 @@ resend(struct aoedev *d, struct frame *f)
skb = skb_clone(skb, GFP_ATOMIC);
if (skb == NULL)
return;
-   do_gettimeofday(&f->sent);
-   f->sent_jiffs = (u32) jiffies;
+   f->sent = ktime_get();
__skb_queue_head_init(&queue);
__skb_queue_tail(&queue, skb);
aoenet_xmit(&queue);
@@ -499,32 +497,15 @@ resend(struct aoedev *d, struct frame *f)
  static int
  tsince_hr(struct frame *f)
  {
-   struct timeval now;
+   ktime_t now;
int n;
  
-	do_gettimeofday(&now);

-   n = now.tv_usec - f->sent.tv_usec;
-   n += (now.tv_sec - f->sent.tv_sec) * USEC_PER_SEC;
+   now = ktime_get();
+   n = ktime_to_us(ktime_sub(now, f->sent));
  
  	if (n < 0)

n = -n;
  
-	/* For relatively long periods, use jiffies to avoid

-* discrepancies caused by updates to the system time.
-*
-* On system with HZ of 1000, 32-bits is over 49 days
-* worth of jiffies, or over 71 minutes worth of usecs.
-*
-* Jiffies overflow is handled by subtraction of unsigned ints:
-* (gdb) print (unsigned) 2 - (unsigned) 0xfffffffe
-* $3 = 4
-* (gdb)
-*/
-   if (n > USEC_PER_SEC / 4) {
-   n = ((u32) jiffies) - f->sent_jiffs;
-   n *= USEC_PER_SEC / HZ;
-   }
-
return n;
  }
  
@@ -589,7 +570,6 @@ reassign_frame(struct frame *f)

nf->waited = 0;
nf->waited_total = f->waited_total;
nf->sent = f->sent;
-   nf->sent_jiffs = f->sent_jiffs;
f->skb = skb;
  
  	return 

Re: [PATCH v2 1/3] usb: notify hcd when USB device suspend or resume

2015-05-11 Thread Lu, Baolu



On 05/11/2015 10:25 PM, Alan Stern wrote:

On Sat, 9 May 2015, Lu, Baolu wrote:


If FSC is supported,  the cached Slot, Endpoint, Stream, or other
Context information are also saved.

Hence, when FSC is supported, software does not have to issue Stop
Endpoint Command to push public and private endpoint state into
memory as part of system suspend process.

Why do you have to push this state into memory at all?  Does the
controller hardware lose the cached state information when it is in low
power?

I don't think the controller hardware will lose the cached state information
when it is in low power. But since the cache in the controller consumes power
and resources, pushing the state into memory lets the hardware power
off the cache logic during suspend, and hence gain more power savings.


The logic in xhci_device_suspend() will look like:

if the xhci_device_suspend() callback was called due to system suspend
(mesg.event & PM_EVENT_SUSPEND is true) and FSC is supported by
the xHC implementation, xhci_device_suspend() could skip the stop
endpoint command, since CSS will be done later in xhci_suspend(),
where all the endpoint caches will be pushed to memory.

I still don't understand this.  You said earlier that according
to section 4.15.1.1 of the xHCI spec, the endpoint rings should
_always_ be stopped with SP set when a device is suspended.  Now you're

The intention of stop endpoint with SP set is to tell the hardware that
"a device is going to suspend, so the hardware doesn't need to keep the
endpoint state in its internal cache anymore". Hardware _could_ use
this hint to push endpoint state into memory to reduce power
consumption.


saying that they don't need to be stopped during a system suspend if
the controller supports FSC.  (Or maybe you're saying they need to be
stopped but SP doesn't need to be set -- it's hard to tell.)

Even if FSC is supported, the controller hardware still needs to push
cached endpoint state into memory when a USB device is suspended. The
difference is that when FSC is enforced, the CSS command will push any
cached endpoint state into memory unconditionally.

You said above that the hardware _could_ push endpoint state into
memory.  Now you're saying it _needs_ to do this!  Make up your mind.


I'm sorry that I confused you.

FSC is a different thing from what this patch series does.

I should say "software could ask the hardware to push endpoint
state into memory even if FSC is supported". But in some cases
it can be optimized, as I will describe below.





So, when xhci_device_suspend() knows that the CSS command will be
executed later and will push cached endpoint state into memory
(i.e. FSC is enforced), it can skip issuing the stop endpoint
command with the SP bit set, avoiding the duplication and reducing
the suspend time.
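
In other words, the suspend path would look roughly like this sketch
(xhci_fsc_supported() is a made-up predicate, and queue_stop_endpoint()
is assumed to take an SP flag as its last argument):

	/* sketch: skip the per-endpoint flush when CSS will do it anyway */
	if ((msg.event & PM_EVENT_SUSPEND) && xhci_fsc_supported(xhci))
		return 0;	/* CSS in xhci_suspend() flushes the cache */

	/* otherwise hint the hardware: Stop Endpoint with SP set */
	ret = queue_stop_endpoint(xhci, cmd, slot_id, ep_index, 1);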

This is the case for system suspend, since the CSS command is part of
xhci_suspend() and xhci_suspend() will be executed after all USB
devices have been suspended. But it's not the case for run-time suspend
(auto-pm), since USB device suspend and host controller suspend
are independent in the run-time case.

That's the reason why I wanted to keep the 'msg' parameter. But just as
Greg said, we don't need to keep a parameter when it's not used
and can add it later when it is required.


So which is it?  Do you need to stop the endpoint rings?  Is it okay
not to set SP?

"stop endpoint" and "stop endpoint with SP set" serve different purposes
in the Linux xhci driver, as I understand it. The "stop endpoint" command
is used to stop an active ring when the upper layer wants to cancel an URB.
"stop endpoint with SP set" is used to hint to the hardware that a USB
device is going to suspend. Hence "stop endpoint with SP set" must only be
executed when the transfer ring is empty.

(How does the contents of the transfer ring affect anything?  Besides,
there are never any active URBs when a device gets suspended, so the
transfer ring will _always_ be empty at such times.)

This is still extremely confusing.  You're not doing a good job of
explaining the situation clearly and logically.


I'm sorry for that.



Let's see if I understand it correctly:

When the controller goes into suspend, you want the cache to
be pushed into memory so that the cache can be powered down,
thereby saving additional energy.


It's not the controller that goes into suspend, but the USB devices.

In order to talk to USB devices, the xHC keeps an endpoint state for each
endpoint of a device. When a USB device goes into suspend, the xHC driver
can ask the hardware to push that state from cache to memory, thereby
saving additional energy. This is the intention of this patch series.




If the hardware supports FSC, this will happen automatically.

If the hardware doesn't support FSC, the cached data won't get
pushed to memory unless the driver tells the controller to do
so at the time the device is suspended.  But this will slow
things down, so the driver should avoid doing it when it's not
needed.

During system 

Question: use of FS, GS segment registers in Linux

2015-05-11 Thread Pengfei Yuan
Hi,

What is the use of FS, GS segment registers in kernel and user mode,
on x86 and x86-64?
Where can I find the related kernel (definition/setup) code?

I have read the code in arch/x86/include/asm/segment.h, but still a
little confused.

Regards,

Yuan, Pengfei


Re: [PATCH 2/5] workqueue: merge the similar code

2015-05-11 Thread Lai Jiangshan
On 05/11/2015 10:31 PM, Tejun Heo wrote:
> Hello, Lai.

Hello, TJ

> 
>>   * @node: the target NUMA node
>> - * @cpu_going_down: if >= 0, the CPU to consider as offline
>> - * @cpumask: outarg, the resulting cpumask
>> + * @cpu_off: if >= 0, the CPU to consider as offline
> 
> @cpu_off sounds like offset into cpu array or sth.  Is there a reason
> to change the name?

@cpu_off is a local variable in wq_update_unbound_numa() and is a shorter
name.

> 
>> + *
>> + * Allocate or reuse a pwq with the cpumask that @wq should use on @node.
> 
> I wonder whether a better name for the function would be sth like
> get_alloc_node_unbound_pwq().
> 

The length of the name alloc_node_unbound_pwq() already caused me trouble
with code indentation at the call site.  I can add a variable to ease the
indentation problem later, but IMHO, get_alloc_node_unbound_pwq() is not
strictly a better name than alloc_node_unbound_pwq().  Maybe we can
consider get_node_unbound_pwq()?

>>   *
>> - * Calculate the cpumask a workqueue with @attrs should use on @node.  If
>> - * @cpu_going_down is >= 0, that cpu is considered offline during
>> - * calculation.  The result is stored in @cpumask.
>> + * If NUMA affinity is not enabled, @dfl_pwq is always used.  @dfl_pwq
>> + * was allocated with the effetive attrs saved in @dfl_pwq->pool->attrs.
> 
> I'm not sure we need the second sentence.

effetive -> effective

I used "the effective attrs" twice below.  I need help rephrasing it;
might you do me a favor? Or should I just use it without introducing it
first?

+ * If enabled and @node has online CPUs requested by the effetive attrs,
+ * the cpumask is the intersection of the possible CPUs of @node and
+ * the cpumask of the effetive attrs.

>> +if (cpumask_equal(cpumask, attrs->cpumask))
>> +goto use_dfl;
>> +if (pwq && wqattrs_equal(tmp_attrs, pwq->pool->attrs))
>> +goto use_existed;
> 
>   goto use_current;

The label use_existed is shared with use_dfl:

use_dfl:
pwq = dfl_pwq;
use_existed:
spin_lock_irq(&pwq->pool->lock);
get_pwq(pwq);
spin_unlock_irq(&pwq->pool->lock);
return pwq;

But I don't think the dfl_pwq is current.

> 
> Also, would it be difficult to put this in a separate patch?  This is
> mixing code refactoring with behavior change.  Make both code paths
> behave the same way first and then refactor?
> 
>> +
>> +/* create a new pwq */
>> +pwq = alloc_unbound_pwq(wq, tmp_attrs);
>> +if (!pwq && use_dfl_when_fail) {
>> +pr_warn("workqueue: allocation failed while updating NUMA 
>> affinity of \"%s\"\n",
>> +wq->name);
>> +goto use_dfl;
> 
> Does this need to be in this function?  Can't we let the caller handle
> the fallback instead?

Will it leave the duplicated code that this patch tries to remove?

I will try it with introducing a get_pwq_unlocked().

Thanks,
Lai


Re: [PATCH kernel v10 01/34] powerpc/eeh/ioda2: Use device::iommu_group to check IOMMU group

2015-05-11 Thread Gavin Shan
On Tue, May 12, 2015 at 01:38:50AM +1000, Alexey Kardashevskiy wrote:
>This relies on the fact that a PCI device always has an IOMMU table
>which may not be the case when we get dynamic DMA windows so
>let's use more reliable check for IOMMU group here.
>
>As we do not rely on the table presence here, remove the workaround
>from pnv_pci_ioda2_set_bypass(); also remove the @add_to_iommu_group
>parameter from pnv_ioda_setup_bus_dma().
>
>Signed-off-by: Alexey Kardashevskiy 

Acked-by: Gavin Shan 

Thanks,
Gavin

>---
> arch/powerpc/kernel/eeh.c |  4 +---
> arch/powerpc/platforms/powernv/pci-ioda.c | 27 +--
> 2 files changed, 6 insertions(+), 25 deletions(-)
>
>diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
>index 9ee61d1..defd874 100644
>--- a/arch/powerpc/kernel/eeh.c
>+++ b/arch/powerpc/kernel/eeh.c
>@@ -1412,13 +1412,11 @@ static int dev_has_iommu_table(struct device *dev, 
>void *data)
> {
>   struct pci_dev *pdev = to_pci_dev(dev);
>   struct pci_dev **ppdev = data;
>-  struct iommu_table *tbl;
>
>   if (!dev)
>   return 0;
>
>-  tbl = get_iommu_table_base(dev);
>-  if (tbl && tbl->it_group) {
>+  if (dev->iommu_group) {
>   *ppdev = pdev;
>   return 1;
>   }
>diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
>b/arch/powerpc/platforms/powernv/pci-ioda.c
>index f8bc950..2f092bb 100644
>--- a/arch/powerpc/platforms/powernv/pci-ioda.c
>+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
>@@ -1654,21 +1654,15 @@ static u64 pnv_pci_ioda_dma_get_required_mask(struct 
>pnv_phb *phb,
> }
>
> static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe,
>- struct pci_bus *bus,
>- bool add_to_iommu_group)
>+ struct pci_bus *bus)
> {
>   struct pci_dev *dev;
>
>   list_for_each_entry(dev, &bus->devices, bus_list) {
>-  if (add_to_iommu_group)
>-  set_iommu_table_base_and_group(&dev->dev,
>- pe->tce32_table);
>-  else
>-  set_iommu_table_base(&dev->dev, pe->tce32_table);
>+  set_iommu_table_base_and_group(&dev->dev, pe->tce32_table);
>
>   if (dev->subordinate)
>-  pnv_ioda_setup_bus_dma(pe, dev->subordinate,
>- add_to_iommu_group);
>+  pnv_ioda_setup_bus_dma(pe, dev->subordinate);
>   }
> }
>
>@@ -1845,7 +1839,7 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb 
>*phb,
>   } else if (pe->flags & (PNV_IODA_PE_BUS | PNV_IODA_PE_BUS_ALL)) {
>   iommu_register_group(tbl, phb->hose->global_number,
>pe->pe_number);
>-  pnv_ioda_setup_bus_dma(pe, pe->pbus, true);
>+  pnv_ioda_setup_bus_dma(pe, pe->pbus);
>   } else if (pe->flags & PNV_IODA_PE_VF) {
>   iommu_register_group(tbl, phb->hose->global_number,
>pe->pe_number);
>@@ -1882,17 +1876,6 @@ static void pnv_pci_ioda2_set_bypass(struct iommu_table 
>*tbl, bool enable)
>window_id,
>pe->tce_bypass_base,
>0);
>-
>-  /*
>-   * EEH needs the mapping between IOMMU table and group
>-   * of those VFIO/KVM pass-through devices. We can postpone
>-   * resetting DMA ops until the DMA mask is configured in
>-   * host side.
>-   */
>-  if (pe->pdev)
>-  set_iommu_table_base(&pe->pdev->dev, tbl);
>-  else
>-  pnv_ioda_setup_bus_dma(pe, pe->pbus, false);
>   }
>   if (rc)
>   pe_err(pe, "OPAL error %lld configuring bypass window\n", rc);
>@@ -1984,7 +1967,7 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb 
>*phb,
>   } else if (pe->flags & (PNV_IODA_PE_BUS | PNV_IODA_PE_BUS_ALL)) {
>   iommu_register_group(tbl, phb->hose->global_number,
>pe->pe_number);
>-  pnv_ioda_setup_bus_dma(pe, pe->pbus, true);
>+  pnv_ioda_setup_bus_dma(pe, pe->pbus);
>   } else if (pe->flags & PNV_IODA_PE_VF) {
>   iommu_register_group(tbl, phb->hose->global_number,
>pe->pe_number);
>-- 
>2.4.0.rc3.8.gfb3e7d5
>



linux-next: manual merge of the net-next tree with the net tree

2015-05-11 Thread Stephen Rothwell
Hi all,

Today's linux-next merge of the net-next tree got a conflict in
drivers/net/ethernet/qualcomm/qca_spi.c between commit 268be0f7a7d9
("net: qca_spi: Fix possible race during probe") from the net tree and
commit cf9d0dcc5a46 ("ethernet: qualcomm: use spi instead of
spi_device") from the net-next tree.

I fixed it up (see below) and can carry the fix as necessary (no action
is required).

-- 
Cheers,
Stephen Rothwell  s...@canb.auug.org.au

diff --cc drivers/net/ethernet/qualcomm/qca_spi.c
index 6af028d5f9bc,c6b749880e46..
--- a/drivers/net/ethernet/qualcomm/qca_spi.c
+++ b/drivers/net/ethernet/qualcomm/qca_spi.c
@@@ -909,12 -909,10 +909,12 @@@ qca_spi_probe(struct spi_device *spi
return -ENOMEM;
}
qca->net_dev = qcaspi_devs;
-   qca->spi_dev = spi_device;
+   qca->spi_dev = spi;
qca->legacy_mode = legacy_mode;
  
-   spi_set_drvdata(spi_device, qcaspi_devs);
++  spi_set_drvdata(spi, qcaspi_devs);
 +
-   mac = of_get_mac_address(spi_device->dev.of_node);
+   mac = of_get_mac_address(spi->dev.of_node);
  
if (mac)
ether_addr_copy(qca->net_dev->dev_addr, mac);




linux-next: manual merge of the net-next tree with the net tree

2015-05-11 Thread Stephen Rothwell
Hi all,

Today's linux-next merge of the net-next tree got a conflict in
include/net/codel.h between commit a5d280904050 ("codel: fix
maxpacket/mtu confusion") from the net tree and commit 80ba92fa1a92
("codel: add ce_threshold attribute") from the net-next tree.

I fixed it up (see below) and can carry the fix as necessary (no action
is required).

-- 
Cheers,
Stephen Rothwell  s...@canb.auug.org.au

diff --cc include/net/codel.h
index 1e18005f7f65,8c0f78f209e8..
--- a/include/net/codel.h
+++ b/include/net/codel.h
@@@ -119,14 -119,14 +119,16 @@@ static inline u32 codel_time_to_us(code
  /**
   * struct codel_params - contains codel parameters
   * @target:   target queue size (in time units)
+  * @ce_threshold:  threshold for marking packets with ECN CE
   * @interval: width of moving time window
 + * @mtu:  device mtu, or minimal queue backlog in bytes.
   * @ecn:  is Explicit Congestion Notification enabled
   */
  struct codel_params {
codel_time_t    target;
+   codel_time_t    ce_threshold;
codel_time_t    interval;
 +  u32             mtu;
bool            ecn;
  };
  
@@@ -166,14 -167,16 +169,18 @@@ struct codel_stats 
u32 maxpacket;
u32 drop_count;
u32 ecn_mark;
+   u32 ce_mark;
  };
  
+ #define CODEL_DISABLED_THRESHOLD INT_MAX
+ 
 -static void codel_params_init(struct codel_params *params)
 +static void codel_params_init(struct codel_params *params,
 +const struct Qdisc *sch)
  {
params->interval = MS2TIME(100);
params->target = MS2TIME(5);
+   params->ce_threshold = CODEL_DISABLED_THRESHOLD;
 +  params->mtu = psched_mtu(qdisc_dev(sch));
params->ecn = false;
  }
  




linux-next: manual merge of the net-next tree with the net tree

2015-05-11 Thread Stephen Rothwell
Hi all,

Today's linux-next merge of the net-next tree got a conflict in
net/core/sock.c between commit 2e70aedd3d52 ("Revert "net: kernel
socket should be released in init_net namespace"") from the net tree
and commit affb9792f1d9 ("net: kill sk_change_net and
sk_release_kernel") from the net-next tree.

I fixed it up (the latter removed a function updated by the former) and
can carry the fix as necessary (no action is required).

-- 
Cheers,
Stephen Rothwell  s...@canb.auug.org.au




Re: [PATCH 0/6] support "dataplane" mode for nohz_full

2015-05-11 Thread Mike Galbraith
On Mon, 2015-05-11 at 15:25 -0400, Chris Metcalf wrote:
> On 05/11/2015 03:19 PM, Mike Galbraith wrote:
> > I really shouldn't have acked nohz_full -> isolcpus.  Beside the fact
> > that old static isolcpus was_supposed_  to crawl off and die, I know
> > beyond doubt that having isolated a cpu as well as you can definitely
> > does NOT imply that said cpu should become tickless.
> 
> True, at a high level, I agree that it would be better to have a
> top-level concept like Frederic's proposed ISOLATION that includes
> isolcpus and nohz_cpu (and other stuff as needed).
> 
> That said, what you wrote above is wrong; even with the patch you
> acked, setting isolcpus does not automatically turn on nohz_full for
> a given cpu.  The patch made it true the other way around: when
> you say nohz_full, you automatically get isolcpus on that cpu too.
> That does, at least, make sense for the semantics of nohz_full.

I didn't write that, I wrote nohz_full implies (spelled '->') isolcpus.
Yes, with nohz_full currently being static, the old allegedly dying but
also static isolcpus scheduler off switch is a convenient thing to wire
the nohz_full CPU SET (<- hint;) property to.

-Mike




Re: [RESEND][PATCH] Bluetooth: Make request workqueue freezable

2015-05-11 Thread Laura Abbott

On 05/11/2015 06:07 PM, Marcel Holtmann wrote:

Hi Laura,


>> We've received a number of reports of warnings when coming
>> out of suspend with certain bluetooth firmware configurations:
>>
>> WARNING: CPU: 3 PID: 3280 at drivers/base/firmware_class.c:1126
>> _request_firmware+0x558/0x810()
>> Modules linked in: ccm ip6t_rpfilter ip6t_REJECT nf_reject_ipv6
>> xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter
>> ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6
>> ip6table_mangle ip6table_security ip6table_raw ip6table_filter
>> ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4
>> nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw
>> binfmt_misc bnep intel_rapl iosf_mbi arc4 x86_pkg_temp_thermal
>> snd_hda_codec_hdmi coretemp kvm_intel joydev snd_hda_codec_realtek
>> iwldvm snd_hda_codec_generic kvm iTCO_wdt mac80211 iTCO_vendor_support
>> snd_hda_intel snd_hda_controller snd_hda_codec crct10dif_pclmul
>> snd_hwdep crc32_pclmul snd_seq crc32c_intel ghash_clmulni_intel uvcvideo
>> snd_seq_device iwlwifi btusb videobuf2_vmalloc snd_pcm videobuf2_core
>> serio_raw bluetooth cfg80211 videobuf2_memops sdhci_pci v4l2_common
>> videodev thinkpad_acpi sdhci i2c_i801 lpc_ich mfd_core wacom mmc_core
>> media snd_timer tpm_tis hid_logitech_hidpp wmi tpm rfkill snd mei_me mei
>> shpchp soundcore nfsd auth_rpcgss nfs_acl lockd grace sunrpc i915
>> i2c_algo_bit drm_kms_helper e1000e drm hid_logitech_dj ptp pps_core
>> video
>> CPU: 3 PID: 3280 Comm: kworker/u17:0 Not tainted 3.19.3-200.fc21.x86_64
>> Hardware name: LENOVO 343522U/343522U, BIOS GCET96WW (2.56 ) 10/22/2013
>> Workqueue: hci0 hci_power_on [bluetooth]
>>  89944328 88040acffb78 8176e215
>>   88040acffbb8 8109bc1a
>>  88040acffcd0 fff5 8804076bac40
>> Call Trace:
>> [] dump_stack+0x45/0x57
>> [] warn_slowpath_common+0x8a/0xc0
>> [] warn_slowpath_null+0x1a/0x20
>> [] _request_firmware+0x558/0x810
>> [] request_firmware+0x35/0x50
>> [] btusb_setup_bcm_patchram+0x86/0x590 [btusb]
>> [] ? rpm_idle+0xd6/0x230
>> [] hci_dev_do_open+0xe1/0xa90 [bluetooth]
>> [] ? ttwu_do_activate.constprop.90+0x5d/0x70
>> [] hci_power_on+0x40/0x200 [bluetooth]
>> [] process_one_work+0x14c/0x3f0
>> [] worker_thread+0x53/0x470
>> [] ? rescuer_thread+0x300/0x300
>> [] kthread+0xd8/0xf0
>> [] ? kthread_create_on_node+0x1b0/0x1b0
>> [] ret_from_fork+0x58/0x90
>> [] ? kthread_create_on_node+0x1b0/0x1b0
>>
>> This occurs after every resume.
>>
>> When resuming, the bluetooth stack calls hci_register_dev,
>> allocates a new workqueue, and immediately schedules the
>> power_on on the newly created workqueue. Since the new
>> workqueue is not freezable, the work runs immediately and
>> triggers the warning since resume is still happening and
>> usermodehelper has not yet been re-enabled. Fix this by
>> making the request workqueue freezable. This ensures
>> the work will not run until unfreezing occurs and usermodehelper
>> is re-enabled.
>>
>> Signed-off-by: Laura Abbott 
>> ---
>> Resend because I think this got lost in the thread.
>> This should be fixing the actual root cause of the warnings.


> so I am not convinced that it actually fixes the root cause. This is just
> papering over it.
> 
> The problem is pretty clear, the firmware for some of the Bluetooth controllers
> is optional and that means during the original hotplug event it will not be
> found and the controller keeps operating. However for some reason instead of
> actually suspending and resuming the Bluetooth controller, we see an unplug +
> replug (since we are going through probe) and that is causing this funny
> behaviour.



Fundamentally the issue is the request_firmware is being called at the
wrong time. From Documentation/workqueue.txt:

  WQ_FREEZABLE

A freezable wq participates in the freeze phase of the system
suspend operations.  Work items on the wq are drained and no
new work item starts execution until thawed.


By making the request workqueue freezable, any work that gets scheduled
will not run until tasks are unthawed. Commit
4320f6b1d9db4ca912c5eb6ecb328b2e090e1586
("PM / sleep: Fix request_firmware() error at resume") fixed the resume
path such that before all tasks are unthawed, calls to
usermodehelper_read_trylock will block until usermodehelper is fully
resumed. This means that any task which is frozen and then woken up
again should have the right sequencing for usermodehelper. The workqueue
which handled the bluetooth power on was never being frozen properly so
there was never any guarantee of when it would run. This patch gives
it the necessary sequence.
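
For illustration, a minimal sketch of the behavior WQ_FREEZABLE buys (not
the patch itself; demo_fn and my_work are made-up names):

	static void demo_fn(struct work_struct *work) { }
	static DECLARE_WORK(my_work, demo_fn);

	struct workqueue_struct *wq =
		alloc_workqueue("demo", WQ_UNBOUND | WQ_FREEZABLE, 1);
	queue_work(wq, &my_work);	/* while tasks are frozen during
					 * suspend/resume this stays queued;
					 * it only runs after the thaw */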


> So how does making one of the core workqueues freezable fix this the right
> way? I do not even know how many other side effects that might have. That
> hdev->req_workqueue is a Bluetooth core internal workqueue that we are using
> for multiple tasks.
> 
> Rather tell me why we are probing the USB devices that might need firmware
> without having userspace ready. It sounds to me that the USB driver probe
> callback should be delayed if we can not guarantee that it can request
> firmware.

[GIT PULL] power supply changes for 4.1-rc

2015-05-11 Thread Sebastian Reichel
Hi,

The following changes since commit b787f68c36d49bb1d9236f403813641efa74a031:

  Linux 4.1-rc1 (2015-04-26 17:59:10 -0700)

are available in the git repository at:

  git://git.infradead.org/battery-2.6.git tags/for-v4.1-rc

for you to fetch changes up to 8ebb7e9c1a502cfc300618c19c3c6f06fc76d237:

  power: bq27x00_battery: Add missing MODULE_ALIAS (2015-05-01 23:01:48 +0200)


power supply and reset fixes for the v4.1 series

 * misc. fixes


Dmitry Eremin-Solenikov (1):
  power_supply: fix oops in collie_battery driver

Florian Fainelli (1):
  power: reset: Add MFD_SYSCON depends for brcmstb

Marek Belisko (1):
  power: bq27x00_battery: Add missing MODULE_ALIAS

Pali Rohár (1):
  MAINTAINERS: Add me as maintainer of Nokia N900 power supply drivers

Ramakrishna Pallala (1):
  axp288_fuel_gauge: Add original author details

Thomas Gleixner (1):
  power: reset: ltc2952: Remove bogus hrtimer_start() return value checks

Wei Yongjun (1):
  power/reset: at91: fix return value check in at91_reset_platform_probe()

 MAINTAINERS| 11 +++
 drivers/power/axp288_fuel_gauge.c  |  1 +
 drivers/power/bq27x00_battery.c|  8 
 drivers/power/collie_battery.c |  2 +-
 drivers/power/reset/Kconfig|  1 +
 drivers/power/reset/at91-reset.c   |  4 ++--
 drivers/power/reset/ltc2952-poweroff.c | 18 +++---
 7 files changed, 27 insertions(+), 18 deletions(-)

-- Sebastian




Re: [PATCH RFC v2 1/2] crypto: add PKE API

2015-05-11 Thread Herbert Xu
On Mon, May 11, 2015 at 02:45:27PM +0100, David Howells wrote:
>
> What if the fallback doesn't exist?  For instance, a H/W contained key is
> specifically limited to, say, just sign/verify and the not permitted to be
> used for encrypt/decrypt.  How do you provide a fallback given you can't get
> at the key?

That's a transform with a specific key.  I don't see why such a
piece of hardware would even need to be exposed through the crypto
API which is about generic implementations that can take any
arbitrary key.

Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC] vfs: add a O_NOMTIME flag

2015-05-11 Thread Dave Chinner
On Mon, May 11, 2015 at 10:30:58AM -0700, Sage Weil wrote:
> On Mon, 11 May 2015, Trond Myklebust wrote:
> > On Mon, May 11, 2015 at 12:39 PM, Sage Weil  wrote:
> > > On Mon, 11 May 2015, Dave Chinner wrote:
> > >> On Sun, May 10, 2015 at 07:13:24PM -0400, Trond Myklebust wrote:
> > >> > On Fri, May 8, 2015 at 6:24 PM, Sage Weil  wrote:
> > >> > > I'm sure you realize what we're try to achieve is the same 
> > >> > > "invisible IO"
> > >> > > that the XFS open by handle ioctls do by default.  Would you be more
> > >> > > comfortable if this option where only available to the generic
> > >> > > open_by_handle syscall, and not to open(2)?
> > >> >
> > >> > It should be an ioctl(). It has no business being part of
> > >> > open_by_handle either, since that is another generic interface.
> > >
> > > Our use-case doesn't make sense on network file systems, but it does on
> > > any reasonably featureful local filesystem, and the goal is to be generic
> > > there.  If mtime is critical to a network file system's consistency it
> > > seems pretty reasonable to disallow/ignore it for just that file system
> > > (e.g., by masking off the flag at open time), as others won't have that
> > > same problem (cephfs doesn't, for example).
> > >
> > > Perhaps making each fs opt-in instead of handling it in a generic path
> > > would alleviate this concern?
> > 
> > The issue isn't whether or not you have a network file system, it's
> > whether or not you want users to be able to manage data. mtime isn't
> > useful for the application (which knows whether or not it has changed
> > the file) or for the filesystem (ditto). It exists, rather, in order
> > to enable data management by users and other applications, letting
> > them know whether or not the data contents of the file have changed,
> > and when that change occurred.
> 
> Agreed.
>  
> > If you are able to guarantee that your users don't care about that,
> > then fine, but that would be a very special case that doesn't fit the
> > way that most data centres are run. Backups are one case where mtime
> > matters, tiering and archiving is another.
> 
> This is true, although I argue it is becoming increasingly common for the 
> data management (including backups and so forth) to be layered not on top 
> of the POSIX file system but on something higher up in the stack. This is 

In the cloud storage world, yes. In the rest of the world, no.
It's the rest of the world we are worried about here. :/

> > Neither of these examples
> > cases are under the control of the application that calls
> > open(O_NOMTIME).
> 
> Wouldn't a mount option (e.g., allow_nomtime) address this concern?  Only 
> nodes provisioned explicitly to run these systems would be enable this 
> option.

Back to my Joe Speedracer comments.

I'm not sure what the right answer is - mount options are simply too
easy to add without understanding the full implications of them.
e.g. we didn't merge FALLOC_FL_NO_HIDE_STALE simply because it was
too dangerous for unsuspecting users. This isn't at that same level
or concern, but it's still a landmine we want to avoid users from
arming without realising it...
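
For concreteness, what the proposal on the table looks like from userspace
(a sketch; O_NOMTIME is the RFC flag under discussion, not an existing one,
and the value here is made up):

	#ifndef O_NOMTIME
	#define O_NOMTIME 040000000	/* hypothetical value, illustration only */
	#endif

	int fd = open("/srv/osd/object", O_RDWR | O_NOMTIME);
	/* writes through fd would leave the file's mtime untouched */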

> > >> I'm happy for it to be an ioctl interface - even an XFS specific
> > >> interface if you want to go that route, Sage - and it probably
> > >> should emit a warning to syslog first time it is used so there is
> > >> trace for bug triage purposes. i.e. we know the app is not using
> > >> mtime updates, so bug reports that are the result of mtime
> > >> mishandling don't result in large amounts of wasted developer time
> > >> trying to understand them...
> > >
> > > A warning on using the interface (or when mounting with user_nomtime)
> > > sounds reasonable.
> > >
> > > I'd rather not make this XFS specific as other local filesystmes (ext4,
> > > f2fs, possibly btrfs) would similarly benefit.  (And if we want to target
> > > XFS specifically the existing XFS open-by-handle ioctl is sufficient as it
> > > already does O_NOMTIME unconditionally.)
> > 
> > Lack of a namespace, doesn't imply that you don't want to manage the
> > data. The whole point of using object storage instead of plain old
> > block storage is to be able to provide whatever metadata you still
> > need in order to manage the object.
> 
> Yeah, agreed--this is presumably why open_by_handle(2) (which is what we'd 
> like to use) doesn't assume O_NOMTIME.

Right - the XFS ioctls were designed specifically for applications
that interacted directly with the structure of XFS filesystems and
so needed invisible IO (e.g. online defragmenter). IOWs, they are
not interfaces intended for general usage. They are also only
available to root, so a typical user application won't be making use
of them, either.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RESEND][PATCH] Bluetooth: Make request workqueue freezable

2015-05-11 Thread Marcel Holtmann
Hi Laura,

> We've received a number of reports of warnings when coming
> out of suspend with certain bluetooth firmware configurations:
> 
> WARNING: CPU: 3 PID: 3280 at drivers/base/firmware_class.c:1126
> _request_firmware+0x558/0x810()
> Modules linked in: ccm ip6t_rpfilter ip6t_REJECT nf_reject_ipv6
> xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter
> ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6
> ip6table_mangle ip6table_security ip6table_raw ip6table_filter
> ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4
> nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw
> binfmt_misc bnep intel_rapl iosf_mbi arc4 x86_pkg_temp_thermal
> snd_hda_codec_hdmi coretemp kvm_intel joydev snd_hda_codec_realtek
> iwldvm snd_hda_codec_generic kvm iTCO_wdt mac80211 iTCO_vendor_support
> snd_hda_intel snd_hda_controller snd_hda_codec crct10dif_pclmul
> snd_hwdep crc32_pclmul snd_seq crc32c_intel ghash_clmulni_intel uvcvideo
> snd_seq_device iwlwifi btusb videobuf2_vmalloc snd_pcm videobuf2_core
> serio_raw bluetooth cfg80211 videobuf2_memops sdhci_pci v4l2_common
> videodev thinkpad_acpi sdhci i2c_i801 lpc_ich mfd_core wacom mmc_core
> media snd_timer tpm_tis hid_logitech_hidpp wmi tpm rfkill snd mei_me mei
> shpchp soundcore nfsd auth_rpcgss nfs_acl lockd grace sunrpc i915
> i2c_algo_bit drm_kms_helper e1000e drm hid_logitech_dj ptp pps_core
> video
> CPU: 3 PID: 3280 Comm: kworker/u17:0 Not tainted 3.19.3-200.fc21.x86_64
> Hardware name: LENOVO 343522U/343522U, BIOS GCET96WW (2.56 ) 10/22/2013
> Workqueue: hci0 hci_power_on [bluetooth]
>  89944328 88040acffb78 8176e215
>   88040acffbb8 8109bc1a
>  88040acffcd0 fff5 8804076bac40
> Call Trace:
> [] dump_stack+0x45/0x57
> [] warn_slowpath_common+0x8a/0xc0
> [] warn_slowpath_null+0x1a/0x20
> [] _request_firmware+0x558/0x810
> [] request_firmware+0x35/0x50
> [] btusb_setup_bcm_patchram+0x86/0x590 [btusb]
> [] ? rpm_idle+0xd6/0x230
> [] hci_dev_do_open+0xe1/0xa90 [bluetooth]
> [] ? ttwu_do_activate.constprop.90+0x5d/0x70
> [] hci_power_on+0x40/0x200 [bluetooth]
> [] process_one_work+0x14c/0x3f0
> [] worker_thread+0x53/0x470
> [] ? rescuer_thread+0x300/0x300
> [] kthread+0xd8/0xf0
> [] ? kthread_create_on_node+0x1b0/0x1b0
> [] ret_from_fork+0x58/0x90
> [] ? kthread_create_on_node+0x1b0/0x1b0
> 
> This occurs after every resume.
> 
> When resuming, the bluetooth stack calls hci_register_dev,
> allocates a new workqueue, and immediately schedules the
> power_on on the newly created workqueue. Since the new
> workqueue is not freezable, the work runs immediately and
> triggers the warning since resume is still happening and
> usermodehelper has not yet been re-enabled. Fix this by
> making the request workqueue freezable. This ensures
> the work will not run until unfreezing occurs and usermodehelper
> is re-enabled.
> 
> Signed-off-by: Laura Abbott 
> ---
> Resend because I think this got lost in the thread.
> This should be fixing the actual root cause of the warnings.

so I am not convinced that it actually fixes the root cause. This is just 
papering over it.

The problem is pretty clear, the firmware for some of the Bluetooth controllers 
is optional and that means during the original hotplug event it will not be 
found and the controller keeps operating. However for some reason instead of 
actually suspending and resuming the Bluetooth controller, we see an unplug + 
replug (since we are going through probe) and that is causing this funny 
behaviour.

So how does making one of the core workqueues freezable fix this the right
way? I do not even know how many other side effects that might have. That 
hdev->req_workqueue is a Bluetooth core internal workqueue that we are using 
for multiple tasks.

Rather tell me why we are probing the USB devices that might need firmware 
without having userspace ready. It sounds to me that the USB driver probe 
callback should be delayed if we can not guarantee that it can request 
firmware. As I explained many times, the call path that causes this is going 
through probe callback of the driver itself.

Regards

Marcel

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [V3 PATCH 1/5] ACPI / scan: Parse _CCA and setup device coherency

2015-05-11 Thread Rafael J. Wysocki
On Monday, May 11, 2015 05:16:27 PM Catalin Marinas wrote:
> On Fri, May 08, 2015 at 10:53:59PM +0200, Rafael J. Wysocki wrote:
> > On Thursday, May 07, 2015 07:37:12 PM Suravee Suthikulpanit wrote:
> > > diff --git a/drivers/acpi/Kconfig b/drivers/acpi/Kconfig
> > > index ab2cbb5..7822149 100644
> > > --- a/drivers/acpi/Kconfig
> > > +++ b/drivers/acpi/Kconfig
> > > @@ -54,6 +54,12 @@ config ACPI_GENERIC_GSI
> > >  config ACPI_SYSTEM_POWER_STATES_SUPPORT
> > >   bool
> > >  
> > > +config ACPI_CCA_REQUIRED
> > > + bool
> > > +
> > > +config ARM64_SUPPORT_ACPI_CCA_ZERO
> > 
> > Hmm.  I guess the Arnd's idea what to simply use CONFIG_ARM64 directly 
> > instead
> > of adding this new option.
> 
> I agree.
> 
> > > +static inline bool acpi_dma_is_supported(struct acpi_device *adev)
> > > +{
> > > + /**
> > > +  * Currently, we mainly support _CCA=1 (i.e. is_coherent=1)
> > > +  * This should be equivalent to specifying dma-coherent for
> > > +  * a device in OF.
> > > +  *
> > > +  * For the case when _CCA=0 (i.e. is_coherent=0 && cca_seen=1),
> > > +  * we would rely on arch-specific cache maintenance for
> > > +  * non-coherent DMA operations if the architecture specifies
> > > +  * _XXX_SUPPORT_CCA_ZERO. Otherwise, we do not support
> > > +  * DMA on this device and fallback to arch-specific default
> > > +  * handling.
> > > +  *
> > > +  * For the case when _CCA is missing (i.e. cca_seen=0) but
> > > +  * platform specifies ACPI_CCA_REQUIRED, we do not support DMA,
> > > +  * and fallback to arch-specific default handling.
> > > +  */
> > > + return adev && (adev->flags.is_coherent ||
> > > + (adev->flags.cca_seen &&
> > > +  IS_ENABLED(CONFIG_ARM64_SUPPORT_ACPI_CCA_ZERO)));
> > 
> > So what exactly would be wrong with using IS_ENABLED(CONFIG_ARM64) here?
> 
> I'm not sure I follow why we need to check for ARM64 here at all. Can we
> not just have something like:
> 
>   return adev && (!IS_ENABLED(CONFIG_ACPI_CCA_REQUIRED) ||
>   adev->flags.cca_seen)

If _CCA returns 0 on non-ARM64, DMA is not supported for this device, so
in that case the function should return 'false' while the above check will
make it return 'true'.
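
To spell out the cases under discussion (a sketch following the flags in the
quoted patch, not code from the series):

	/*
	 *  _CCA     is_coherent  cca_seen   desired result
	 *  1        1            1          true  (coherent DMA)
	 *  0        0            1          true on ARM64 only (arch does
	 *                                   the cache maintenance itself)
	 *  absent   0            0          false when ACPI_CCA_REQUIRED
	 *
	 * The simplified check above returns true whenever cca_seen is
	 * set, so it cannot express the _CCA=0, non-ARM64 case.
	 */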


-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: "Directly mapped persistent memory page cache"

2015-05-11 Thread Dave Chinner
On Mon, May 11, 2015 at 11:18:36AM +0200, Ingo Molnar wrote:
> 
> * Dave Chinner  wrote:
> 
> > On Sat, May 09, 2015 at 10:45:10AM +0200, Ingo Molnar wrote:
> > > 
> > > * Rik van Riel  wrote:
> > > 
> > > > On 05/08/2015 11:54 AM, Linus Torvalds wrote:
> > > > > On Fri, May 8, 2015 at 7:40 AM, John Stoffel  wrote:
> > > > >>
> > > > >> Now go and look at your /home or /data/ or /work areas, where the
> > > > >> endusers are actually keeping their day to day work.  Photos, mp3,
> > > > >> design files, source code, object code littered around, etc.
> > > > > 
> > > > > However, the big files in that list are almost immaterial from a
> > > > > caching standpoint.
> > > > 
> > > > > The big files in your home directory? Let me make an educated guess.
> > > > > Very few to *none* of them are actually in your page cache right now.
> > > > > And you'd never even care if they ever made it into your page cache
> > > > > *at*all*. Much less whether you could ever cache them using large
> > > > > pages using some very fancy cache.
> > > > 
> > > > However, for persistent memory, all of the files will be "in memory".
> > > > 
> > > > Not instantiating the 4kB struct pages for 2MB areas that are not 
> > > > currently being accessed with small files may make a difference.
> > > >
> > > > For dynamically allocated 4kB page structs, we need some way to 
> > > > discover where they are. It may make sense, from a simplicity point 
> > > > of view, to have one mechanism that works both for pmem and for 
> > > > normal system memory.
> > > 
> > > I don't think we need to or want to allocate page structs dynamically, 
> > > which makes the model really simple and robust.
> > > 
> > > If we 'think big', we can create something very exciting IMHO, that 
> > > also gets rid of most of the complications with DIO, DAX, etc:
> > > 
> > > "Directly mapped pmem integrated into the page cache":
> > > --
> > > 
> > >   - The pmem filesystem is mapped directly in all cases, it has device 
> > > side struct page arrays, and its struct pages are directly in the 
> > > page cache, write-through cached. (See further below about how we 
> > > can do this.)
> > > 
> > > Note that this is radically different from the current approach 
> > > that tries to use DIO and DAX to provide specialized "direct
> > > access" APIs.
> > > 
> > > With the 'directly mapped' approach we have numerous advantages:
> > > 
> > >- no double buffering to main RAM: the device pages represent 
> > >  file content.
> > > 
> > >- no bdflush, no VM pressure, no writeback pressure, no
> > >  swapping: this is a very simple VM model where the device is
> > 
> > But, OTOH, no encryption, no compression, no
> > mirroring/redundancy/repair, etc. [...]
> 
> mirroring/redundancy/repair should be relatively easy to add without 
> hurting the the simplicity of the scheme - but it can also be part of 
> the filesystem.

We already have it in the filesystems and block layer, but the
persistent page cache infrastructure you are proposing makes it
impossible for the existing infrastructure to be used for this
purpose.

> Compression and encryption is not able to directly represent content 
> in pram anyway. You could still do per file encryption and 
> compression, if the filesystem supports it. Any block based filesystem 
> can be used.

Right, but they require a buffered IO path through volatile RAM,
which means treating it just like a normal storage device. IOWs,
if we add persistent page cache paths, the filesystem now will have
to support 3 different IO paths for persistent memory - a) direct
map page cache, b) buffered page cache with readahead and writeback,
and c) direct IO bypassing the page cache.

IOWs, it's not anywhere near as simple as you are implying it will
be. One of the main reasons we chose to use direct IO for DAX was so
we didn't need to add a third IO path to filesystems that wanted to
make use of DAX.

> But you are wrong about mirroring/redundancy/repair: these concepts do 
> not require destructive data (content) transformation: they mostly 
> work by transforming addresses (or at most adding extra metadata), 
> they don't destroy the original content.

You're missing the fact that such data transformations all require
synchronisation of some kind at the IO level - it's way more complex
than just writing to RAM.  e.g. parity/erasure codes need to be
calculated before any update hits the persistent storage, otherwise
the existing codes on disk are invalidated and incorrect. Hence you
cannot use direct mapped page cache (or DAX, for that matter) if the
storage path requires synchronised data updates to multiple locations
to be done.
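
A toy illustration of that ordering constraint (a sketch; RAID5-style
parity, variables assumed declared elsewhere):

	/* Parity stripe: P = D0 ^ D1 ^ D2.  Updating D1 in place: */
	new_p = old_p ^ old_d1 ^ new_d1;
	/* new_d1 and new_p must reach stable storage together (or go
	 * through a log); a bare store into direct-mapped pmem updates
	 * D1 alone, leaving P stale across a crash. */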

> > >- every read() would be equivalent a DIO read, without the
> > >  complexity of DIO.
> > 
> > Sure, it is replaced with the complexity of the buffered read path. 
> > Swings and 

[RESEND][PATCH] Bluetooth: Make request workqueue freezable

2015-05-11 Thread Laura Abbott
We've received a number of reports of warnings when coming
out of suspend with certain bluetooth firmware configurations:

WARNING: CPU: 3 PID: 3280 at drivers/base/firmware_class.c:1126
_request_firmware+0x558/0x810()
Modules linked in: ccm ip6t_rpfilter ip6t_REJECT nf_reject_ipv6
xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter
ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6
ip6table_mangle ip6table_security ip6table_raw ip6table_filter
ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4
nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw
binfmt_misc bnep intel_rapl iosf_mbi arc4 x86_pkg_temp_thermal
snd_hda_codec_hdmi coretemp kvm_intel joydev snd_hda_codec_realtek
iwldvm snd_hda_codec_generic kvm iTCO_wdt mac80211 iTCO_vendor_support
snd_hda_intel snd_hda_controller snd_hda_codec crct10dif_pclmul
snd_hwdep crc32_pclmul snd_seq crc32c_intel ghash_clmulni_intel uvcvideo
snd_seq_device iwlwifi btusb videobuf2_vmalloc snd_pcm videobuf2_core
 serio_raw bluetooth cfg80211 videobuf2_memops sdhci_pci v4l2_common
videodev thinkpad_acpi sdhci i2c_i801 lpc_ich mfd_core wacom mmc_core
media snd_timer tpm_tis hid_logitech_hidpp wmi tpm rfkill snd mei_me mei
shpchp soundcore nfsd auth_rpcgss nfs_acl lockd grace sunrpc i915
i2c_algo_bit drm_kms_helper e1000e drm hid_logitech_dj ptp pps_core
video
CPU: 3 PID: 3280 Comm: kworker/u17:0 Not tainted 3.19.3-200.fc21.x86_64
Hardware name: LENOVO 343522U/343522U, BIOS GCET96WW (2.56 ) 10/22/2013
Workqueue: hci0 hci_power_on [bluetooth]
  89944328 88040acffb78 8176e215
   88040acffbb8 8109bc1a
  88040acffcd0 fff5 8804076bac40
Call Trace:
 [] dump_stack+0x45/0x57
 [] warn_slowpath_common+0x8a/0xc0
 [] warn_slowpath_null+0x1a/0x20
 [] _request_firmware+0x558/0x810
 [] request_firmware+0x35/0x50
 [] btusb_setup_bcm_patchram+0x86/0x590 [btusb]
 [] ? rpm_idle+0xd6/0x230
 [] hci_dev_do_open+0xe1/0xa90 [bluetooth]
 [] ? ttwu_do_activate.constprop.90+0x5d/0x70
 [] hci_power_on+0x40/0x200 [bluetooth]
 [] process_one_work+0x14c/0x3f0
 [] worker_thread+0x53/0x470
 [] ? rescuer_thread+0x300/0x300
 [] kthread+0xd8/0xf0
 [] ? kthread_create_on_node+0x1b0/0x1b0
 [] ret_from_fork+0x58/0x90
 [] ? kthread_create_on_node+0x1b0/0x1b0

This occurs after every resume.

When resuming, the bluetooth stack calls hci_register_dev,
allocates a new workqueue, and immediately schedules the
power_on on the newly created workqueue. Since the new
workqueue is not freezable, the work runs immediately and
triggers the warning since resume is still happening and
usermodehelper has not yet been re-enabled. Fix this by
making the request workqueue freezable. This ensures
the work will not run until unfreezing occurs and usermodehelper
is re-enabled.

Signed-off-by: Laura Abbott 
---
Resend because I think this got lost in the thread.
This should be fixing the actual root cause of the warnings.
---
 net/bluetooth/hci_core.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/bluetooth/hci_core.c b/net/bluetooth/hci_core.c
index 476709b..87f2e48 100644
--- a/net/bluetooth/hci_core.c
+++ b/net/bluetooth/hci_core.c
@@ -3131,7 +3131,8 @@ int hci_register_dev(struct hci_dev *hdev)
}
 
hdev->req_workqueue = alloc_workqueue("%s", WQ_HIGHPRI | WQ_UNBOUND |
- WQ_MEM_RECLAIM, 1, hdev->name);
+ WQ_MEM_RECLAIM | WQ_FREEZABLE,
+ 1, hdev->name);
if (!hdev->req_workqueue) {
destroy_workqueue(hdev->workqueue);
error = -ENOMEM;
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 0/6] Btrfs: show subvolume name and ID in /proc/mounts

2015-05-11 Thread Qu Wenruo



-------- Original Message --------
Subject: Re: [PATCH v2 0/6] Btrfs: show subvolume name and ID in /proc/mounts

From: Omar Sandoval 
To: David Sterba , Qu Wenruo , 


Date: 2015-05-11 17:42


On Thu, Apr 09, 2015 at 02:34:50PM -0700, Omar Sandoval wrote:

> Here's version 2 of providing the subvolume name and ID in /proc/mounts.
>
> It turns out that getting the name of a subvolume reliably is a bit
> trickier than it would seem because of how mounting subvolumes by ID is
> implemented. In particular, in that case, the dentry we get for the root
> of the mount is not necessarily attached to the dentry tree, which means
> that the obvious solution of just dumping the dentry does not work. The
> solution I put together makes the tradeoff of churning a bit more code
> in order to avoid implementing this with weird hacks.
>
> Changes from v1 (https://lkml.org/lkml/2015/4/8/16):
>
> - Put subvol= last in show_options
> - Change commit log to remove comment about userspace having no way to
>    know which subvolume is mounted, as David pointed out you can use
>    btrfs inspect-internal rootid 
> - Split up patch 2
> - Minor coding style fixes
>
> This still applies to v4.0-rc7. Tested manually and with the script
> below (updated from v1).
>
> Thanks!
>
> Omar Sandoval (6):
>    Btrfs: lock superblock before remounting for rw subvol
>    Btrfs: remove all subvol options before mounting top-level
>    Btrfs: clean up error handling in mount_subvol()
>    Btrfs: fail on mismatched subvol and subvolid mount options
>    Btrfs: unify subvol= and subvolid= mounting
>    Btrfs: show subvol= and subvolid= in /proc/mounts
>
>   fs/btrfs/super.c | 376 ---
>   fs/seq_file.c|   1 +
>   2 files changed, 251 insertions(+), 126 deletions(-)



Hi, everyone,

Just wanted to revive this so we can hopefully come up with a solution
we agree on in time for 4.2.

Just to recap, my approach (and also Qu Wenruo's original approach) is
to convert subvolid= mounts to subvol= mounts at mount time, which makes
showing the subvolume in /proc/mounts easy. The benefit of this approach
is that looking at mount information, which is supposed to be a
lightweight operation, is simple and always works. Additionally, we'll
have the info in a convenient format in /proc/mounts in addition to
/proc/$PID/mountinfo. The only caveat is that a mount by subvolid can
fail if the mount races with a rename of the subvolume.
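
For concreteness (a sketch; the device, mountpoint, and IDs are made up):

	/* userspace mounts by ID... */
	mount("/dev/sda1", "/mnt", "btrfs", 0, "subvolid=257");

	/* ...and because the ID is resolved to a path at mount time,
	 * /proc/mounts can later show something like:
	 *   /dev/sda1 /mnt btrfs rw,relatime,subvolid=257,subvol=/home 0 0
	 */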

Qu Wenruo's second approach was to instead convert the subvolid to a
subvolume path when reading /proc/$PID/mountinfo. The benefit of this
approach is that mounts by subvolid will always succeed in the face of
concurrent renames. However, instead, getting the subvolume path in
mountinfo can now fail, and it makes what should probably be a
lightweight operation somewhat complex.

In terms of the code, I think the original approach is cleaner: the
heavy lifting is done when mounting instead of when reading a proc file.
Additionally, I don't think that the concurrent rename race will be much
of a problem in practice. I can't imagine that too many people are
actively renaming subvolumes at the same time as they are mounting them,
and even if they are, I don't think it's so surprising that it would
fail. On the other hand, reading mount info while renaming subvolumes
might be marginally more common, and personally, if that failed, I'd be
unpleasantly surprised.

Orthogonal to that decision is the precedence of subvolid= and subvol=.
Although it's true that mount options usually have last-one-wins
behavior, I think David's argument regarding the principle of least
surprise is solid. Namely, someone's going to be unhappy with a
seemingly arbitrary decision when they don't match.

Sorry for the long-winded email! Thoughts, David, Qu?

Thanks,

I'm OK with your patchset; just as you mentioned, mounting concurrently
with a rename is not such a common thing.

And I'm also happy with the cleaner unified mount codes.

Thanks,
Qu
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression due to "device property: Make it possible to use secondary firmware nodes" Re: Xen-unstable + linux 4.1-mergewindow: problems with PV guest pci passthrough: pcifront pci-0: pciback no

2015-05-11 Thread Rafael J. Wysocki
On Monday, May 11, 2015 11:20:29 AM Konrad Rzeszutek Wilk wrote:
> On Tue, May 05, 2015 at 12:18:49AM +0200, Sander Eikelenboom wrote:
> > Hello Sander,
> > 
> > Monday, April 27, 2015, 5:48:00 PM, you wrote:
> > 
> > > Hi David / Konrad,
> > 
> > > Here the other problem i found, which is introduced somewhere in the 
> > > 4.1 mergewindow:
> > 
> > > on 4.1.0-rc1 (with the one revert to get things booting) i get this in
> > > the PV Guest console:
> > 
> > > [0.517392] crc32c_combine: 8373 self tests passed
> > > [0.517608] pci_hotplug: PCI Hot Plug PCI Core version: 0.5
> > > [0.517655] pciehp: PCI Express Hot Plug Controller Driver version: 0.4
> > > [0.517677] cpcihp_generic: Generic port I/O CompactPCI Hot Plug 
> > > Driver version: 0.1
> > > [0.517684] cpcihp_generic: not configured, disabling.
> > > [0.517700] shpchp: Standard Hot Plug PCI Controller Driver version: 
> > > 0.4
> > > [0.517713] acpiphp_ibm: ibm_acpiphp_init: acpi_walk_namespace failed
> > > [0.519849] usbcore: registered new interface driver udlfb
> > > [0.613289] xen:xen_evtchn: Event-channel device installed
> > > [0.613436] pcifront pci-0: Installing PCI frontend
> > > [0.613578] pcifront pci-0: Creating PCI Frontend Bus :00
> > > [0.613616] pcifront pci-0: PCI host bridge to bus :00
> > > [0.613624] pci_bus :00: root bus resource [io  0x-0x]
> > > [0.613631] pci_bus :00: root bus resource [mem 
> > > 0x-0x]
> > > [0.613638] pci_bus :00: root bus resource [bus 00-ff]
> > > [0.616672] pcifront pci-0: pciback not responding!!!
> > > [2.613762] clocksource tsc: mask: 0x max_cycles: 
> > > 0x2e20fd6f2ba, max_idle_ns: 440795302556 ns
> > > [2.614275] Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled
> > > [2.614682] Linux agpgart interface v0.103
> > > [2.614731] Hangcheck: starting hangcheck timer 0.9.1 (tick is 180 
> > > seconds, margin is 60 seconds).
> > > [2.614762] [drm] Initialized drm 1.1.0 20060810
> > > [2.614789] [drm] radeon kernel modesetting enabled.
> > > [2.616529] brd: module loaded
> > > [2.617844] loop: module loaded
> > > [2.620008] pcifront pci-0: pciback not responding!!!
> > > [4.621490] pcifront pci-0: pciback not responding!!!
> > > [6.621866] pcifront pci-0: pciback not responding!!!
> > > [8.622421] pcifront pci-0: pciback not responding!!!
> > > etc. etc. etc.
> > 
> > 
> > > Where on 4.0.0 it get:
> > 
> > > [0.442554] shpchp: Standard Hot Plug PCI Controller Driver version: 
> > > 0.4
> > > [0.442583] acpiphp_ibm: ibm_acpiphp_init: acpi_walk_namespace failed
> > > [0.443293] pcifront pci-0: Allocated pdev @ 0x88001ab23c00 
> > > pdev->sh_info @ 0x88001937f000
> > > [0.444885] pcifront pci-0: publishing successful!
> > > [0.445302] usbcore: registered new interface driver udlfb
> > > [0.445829] xen:xen_evtchn: Event-channel device installed
> > > [0.446499] pcifront pci-0: Installing PCI frontend
> > > [0.446715] pcifront pci-0: Creating PCI Frontend Bus :00
> > > [0.446951] pcifront pci-0: PCI host bridge to bus :00
> > > [0.446960] pci_bus :00: root bus resource [io  0x-0x]
> > > [0.446968] pci_bus :00: root bus resource [mem 
> > > 0x-0x]
> > > [0.446988] pci_bus :00: root bus resource [bus 00-ff]
> > > [0.447002] pci_bus :00: scanning bus
> > > [0.447140] pci :00:00.0: [13f6:0111] type 00 class 0x040100
> > > [0.447520] pci :00:00.0: reg 0x10: [io  0x7800-0x78ff]
> > > [0.449148] pci :00:00.0: supports D1 D2
> > > [0.449791] pci_bus :00: fixups for bus
> > > [0.449794] pci_bus :00: bus scan returning with max=00
> > > [0.450604] Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled
> > > [0.451991] Linux agpgart interface v0.103
> > > [0.452160] Hangcheck: starting hangcheck timer 0.9.1 (tick is 180 
> > > seconds, margin is 60 seconds).
> > > [0.45] [drm] Initialized drm 1.1.0 20060810
> > > [0.452300] [drm] radeon kernel modesetting enabled.
> > > [0.462384] pcifront pci-0: claiming resource :00:00.0/0
> > 
> > > But i thought the patches that would change pci bus scanning were 
> > > destined for 
> > > 4.2 though ...
> > 
> > > --
> > > Sander
> > 
> > Hi David / Konrad,
> > 
> > I have bisected this one .. it leads to:
> > 
> > commit 97badf873ab60e841243b66133ff9eff2a46ef29
> > Author: Rafael J. Wysocki 
> > Date:   Fri Apr 3 23:23:37 2015 +0200
> > 
> > device property: Make it possible to use secondary firmware nodes
> > 
> > Since I didn't see it directly related to pci-front, I double checked by
> > reverting this commit (and 9b73262ccbf2fb0060303f047863214269e64f9a, since it
> > build-depends on the other) on 4.1-rc2.
> > 
> > Reverting in the guest kernel indeed makes pci-front work correct again.
> 
> That is quite odd.

Yes, 

Re: [PATCH v8 00/23] IB/Verbs: IB Management Helpers

2015-05-11 Thread Doug Ledford
On Mon, 2015-05-11 at 19:49 -0400, ira.weiny wrote:
> I have run with this series and the only issue I have found is not with this
> patch set directly.
> 
> This patch:
> 
> >   IB/Verbs: Use management helper rdma_cap_ib_mad()
> 
> causes an error when you actually use the port passed from the ib_umad module.
> I have a patch to fix that which I found while trying to build on this series
> for the use of a bit mask.
> 
> Doug, I don't know what you would like to do for this fix.  I am submitting it
> shortly with a new version of the core capability bit patches.  If you want to
> just add it after this series or force Michael to respin with the fix?

As I recall, there was a comment from Or requesting to squash some of
the individual patches down, but I no longer have that email in my Inbox
to double check.  And it seemed like there was one other review comment
not yet addressed.  Do I have that right Michael?  And if so, are you
working on a v9?

>   Frankly
> I vote for the former because as it stands this series does not break 
> directly.
> It was only after I changed the implementation of rdma_cap_ib_mad that it
> broke.
> 
> 
> For the rest of the series.
> 
> Reviewed-by: Ira Weiny 
> Tested-by: Ira Weiny 
>   -- Limited to mlx4, qib, and OPA (with additional patches.)
> 
> 
> On Tue, May 05, 2015 at 02:50:17PM +0200, Michael Wang wrote:
> > Since v7:
> >   * Thanks to Doug, Ira, Devesh for the testing :-)
> >   * Thanks for the comments from or, Doug, Ira, Jason :-)
> > Please remind me if anything missed :-P
> >   * Use rdma_cap_XX() instead of cap_XX() for readability
> >   * Remove CC list in git log for maintainability
> >   * Use bool as return value
> >   * Updated github repository to v8
> > 
> > There is plenty of lengthy code to check the transport type of an IB device,
> > or the link layer type of its port, but actually we are just speculating
> > whether a particular management/feature is supported by the device/port.
> > 
> > Thus instead of inferring, we should have our own mechanism for IB 
> > management
> > capability/protocol/feature checking, several proposals below.
> > 
> > This patch set will introduce query_protocol() to check management 
> > requirement
> > instead of inferring from transport and link layer respectively, along with
> > the new enum on protocol type.
> > 
> > Mapping List:
> > node-type   link-layer  transport   protocol
> > nes RNICETH IWARP   IWARP
> > amso1100RNICETH IWARP   IWARP
> > cxgb3   RNICETH IWARP   IWARP
> > cxgb4   RNICETH IWARP   IWARP
> > usnic   USNIC_UDP   ETH USNIC_UDP   USNIC_UDP
> > ocrdma  IB_CA   ETH IB  IBOE
> > mlx4IB_CA   IB/ETH  IB  IB/IBOE
> > mlx5IB_CA   IB  IB  IB
> > ehcaIB_CA   IB  IB  IB
> > ipath   IB_CA   IB  IB  IB
> > mthca   IB_CA   IB  IB  IB
> > qib IB_CA   IB  IB  IB
> > 
> > For example:
> > if (transport == IB) && (link-layer == ETH)
> > will now become:
> > if (query_protocol() == IBOE)
> > 
> > Thus we will be able to get rid of the respective transport and link-layer
> > checking, and it will help us to add new protocol/Technology (like OPA) more
> > easier, also with the introduced management helpers, IB management logical
> > will be more clear and easier for extending.
> > 
> > Highlights:
> > The long CC list in each patches was complained consider about the
> > maintainability, it was suggested folks to provide their reviewed-by or
> > Acked-by instead, so for those who used to be on the CC list, please
> > provide your signature voluntarily :-)
> > 
> > The 'mgmt-helpers' branch of 
> > 'g...@github.com:ywang-pb/infiniband-wy.git'
> > contain this series based on the latest 'infiniband/for-next'
> > 
> > Patch 1#~14# included all the logical reform, 15#~23# introduced the
> > management helpers.
> > 
> > Doug suggested the bitmask mechanism:
> > https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg23765.html
> > which could be the plan for future reforming, we prefer that to be 
> > another
> > series which focus on semantic and performance.
> > 
> > This patch-set is somewhat 'bloated' now and it may be a good timing for
> > staging, I'd like to suggest we focus on improving existed helpers and 
> > push
> > all the further reforms into next series ;-)
> > 
> > Proposals:
> > Sean:
> > https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg23339.html
> > Doug:
> > https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg23418.html
> > 
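
For illustration, the shape of the transformation the series performs (a
sketch; handle_ib_mad() is a stand-in for the caller's real work):

	/* before: infer MAD support from transport and link layer */
	if (rdma_node_get_transport(dev->node_type) == RDMA_TRANSPORT_IB &&
	    rdma_port_get_link_layer(dev, port) == IB_LINK_LAYER_INFINIBAND)
		handle_ib_mad();

	/* after: ask the question directly */
	if (rdma_cap_ib_mad(dev, port))
		handle_ib_mad();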

Re: [PATCH v5 0/6] arm64,hi6220: Enable Hisilicon Hi6220 SoC

2015-05-11 Thread Kevin Hilman
On Thu, May 7, 2015 at 4:11 PM, Brent Wang  wrote:
> Hello Kevin,
>
> 2015-05-08 4:30 GMT+08:00 Kevin Hilman :
>> Bintian Wang  writes:
>>
>>> Hi6220 is one mobile solution of Hisilicon, this patchset contains
>>> initial support for Hi6220 SoC and HiKey development board, which
>>> supports octal ARM Cortex A53 cores. Initial support is minimal and
>>> includes just the arch configuration, clock driver, device tree
>>> configuration.
>>>
>>> PSCI is enabled in device tree and there is no problem to boot all the
>>> octal cores, and the CPU hotplug is also working now, you can download
>>> and compile the latest firmware based on the following link to run this
>>> patch set:
>>> https://github.com/96boards/documentation/wiki/UEFI
>>
>> Do you have any tips for booting this using the HiSi bootloader?  It
>> seems that I need to add the magic hisi,boardid property for dtbTool to
>> work.  Could you share what that magic value is?
> Yes, you need it.
> Hisilicon has many different development boards and those boards have some
> different hardware configurations, so we need different device tree
> files for them.
> The original hisi,boardid is used to distinguish different boards and
> is used by the
> bootloader to decide which device tree to use at boot-up.
>
>> and maybe add it to the wiki someplace?
> Maybe add to section "Known Issues" in
> "https://github.com/96boards/documentation/wiki/UEFI;
> is a good choice, I will update this section later.

You updated the wiki, but you didn't specify what the value should be
for this to work with the old bootloader.

Can you please give the value of that property?

Also, have you tested this series with the old bootloader as well?

Kevin
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: A desktop environment[1] kernel wishlist

2015-05-11 Thread Rafael J. Wysocki
On Tuesday, May 12, 2015 12:12:30 AM Pavel Machek wrote:
> Hi!
> 
> > > If the event was user-triggered it sends
> > > out a DBus signal announcing the end of the suspend, Chrome thaws its
> > > renderer processes, the full UI comes back up, and the user can start
> > > working.  If the event was _not_ user-triggred (if it was the RTC or
> > > NIC), the power manager sends out a different DBus signal announcing
> > > that the system is in lucid sleep and will re-suspend soon.  It will
> > > then wait for all registered applications to report readiness to
> > > suspend or for the max timeout to expire.
> > 
> > First let me say that the "user-triggered" vs "non-user-triggered" 
> > distinction
> > seems somewhat artificial to me.  All boils down to having a special class
> > of wakeup events that are supposed to make the power manager behave 
> > differently
> > after resuming.  Whether or not they are actually triggered by the user
> > doesn't really matter technically.
> ...
> > > So that was a little long-winded but hopefully I've addressed all your
> > > concerns about potential race conditions in this code.  I simplified a
> > > few bits because would just complicate the discussion but for the most
> > > part this is how the feature works now.  Having the kernel emit a
> > > uevent with the wakeup event type would take the place of the power
> > > manager reading from /sys/power/wakeup_type in this system but
> > > wouldn't really affect anything else.
> > 
> > Which loops back to my previous remark: Things may get ugly if 
> > /sys/power/wakeup_type
> > doesn't do the right thing (the uevent mechanics you'd like to replace it 
> > with
> > will really need to do the same, so I'm not quite sure it's worth the 
> > effort).
> > 
> > Namely, it really has to cover all events that might have woken you up and
> > happened before stuff has started to be added to the input buffers that 
> > Chrome
> > cares about.  It is difficult to identify the exact point where that takes 
> > place
> > in the resume sequence, but it should be somewhere in dpm_resume_end().  
> > Why so?
> > Because it really doesn't matter why exactly the system is waking up.  What
> > matters is whether or not an event that you should react to by bringing up 
> > the
> > UI happens *at* *any* *time* between (and including) the actual wakeup and 
> > the
> > point when you can rely on the input buffers to contain any useful 
> > information
> > consumable by Chrome.
> > 
> > This pretty much means that /sys/power/wakeup_type needs to behave almost 
> > like
> > /sys/power/wakeup_count, but is limited to a subset of wakeup sources.  
> > That's
> > why I was talking about splitting the wakeup count.
> > 
> > So instead of adding an entirely new mechanics for that, why don't you add
> > something like "priority" or "weight" to struct wakeup_source and assign
> > higher values of that to the wakeup sources associated with the events
> > you want to bring up the UI after resume?  And make those "higher-priority"
> > wakeup sources use a separate wakeup counter, so you can easily verify if
> > any of them has triggered by reading that or making it trigger a uevent if
> > you want to?
> 
> Does it do all we want?

I believe so.

> What if one device wants to generate both "normal" and "higher-priority"
> wakeup events? (*)
>
> Should not we have normal interface for keyboard (and similar devices)
> where we could ask "did something interesting happen while we were
> sleeping"? Actually.. maybe the device can queue the events
> that happened during sleep, and deliver them after wakeup? If user
> pressed key during sleep, you should have key event waiting on
> /dev/input/event3...

If it can queue up all of them, it can be "normal" priority just fine
and user space can read all the queue and decide what to do then.

The "high-priority" idea is for devices that can't do that at least at one
point during the suspend-resume cycle.  In those cases we can't simply go
back and check what the event was, so we need to rely on the device's
"importance" or "class" (with respect to wakeup).


-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCHv2] mmc:Add pointer cast to uintptr_t for slave_id_rx and tx in the function, sh_mmcif_request_dma_one

2015-05-11 Thread Kuninori Morimoto

Hi

Nicholas Krause wrote:
> 
> This adds a cast of the variables slave_id_tx and slave_id_rx
> to uintptr_t before casting to void * in order to avoid a build
> warning on 64-bit platforms for the function sh_mmcif_request_dma_one.
> Signed-off-by: Nicholas Krause 
> ---

Acked-by: Kuninori Morimoto 

# 1 open line is needed between log and Signed-off-by ?

>  drivers/mmc/host/sh_mmcif.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/mmc/host/sh_mmcif.c b/drivers/mmc/host/sh_mmcif.c
> index 7eff087..e276558 100644
> --- a/drivers/mmc/host/sh_mmcif.c
> +++ b/drivers/mmc/host/sh_mmcif.c
> @@ -398,8 +398,8 @@ sh_mmcif_request_dma_one(struct sh_mmcif_host *host,
>  
>   if (pdata)
>   slave_data = direction == DMA_MEM_TO_DEV ?
> - (void *)pdata->slave_id_tx :
> - (void *)pdata->slave_id_rx;
> + (void *)(uintptr_t)pdata->slave_id_tx :
> + (void *)(uintptr_t)pdata->slave_id_rx;
>  
>   chan = dma_request_slave_channel_compat(mask, shdma_chan_filter,
>   slave_data, >pd->dev,
> -- 
> 2.1.4
> 
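
For anyone wondering what the warning is, a standalone illustration (not
driver code):

	#include <stdint.h>

	static void *pack_id(int slave_id)
	{
		/* (void *)slave_id alone triggers -Wint-to-pointer-cast on
		 * 64-bit targets (int is 32 bits, pointers are 64); widening
		 * through the pointer-sized uintptr_t first silences it. */
		return (void *)(uintptr_t)slave_id;
	}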
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: boot loader

2015-05-11 Thread Thiago Farina
Hi Pavel,

On Mon, May 11, 2015 at 7:11 PM, Pavel Machek  wrote:
> On Tue 2015-04-28 12:12:26, Thiago Farina wrote:
>> Hi,
>>
>> Does the kernel include a simple boot loader like FreeBSD does?
>
> Long time ago, Linux included ability to boot from floppy on PC. Not
> any more, IIRC.
Yeah. Maybe it was the right decision from Linus to isolate, focus, and work
solely on the kernel rather than including everything else (as FreeBSD does)
that makes up a usable system.

Regards,

-- 
Thiago Farina
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)

2015-05-11 Thread David Lang

On Mon, 11 May 2015, Daniel Phillips wrote:

> On 05/11/2015 03:12 PM, Pavel Machek wrote:
> 
>>>> It is a fact of life that when you change one aspect of an intimately
>>>> interconnected system, something else will change as well. You have
>>>> naive/nonexistent free space management now; when you design something
>>>> workable there it is going to impact everything else you've already done.
>>>> It's an easy bet that the impact will be negative, the only question is
>>>> to what degree.
>>>
>>> You might lose that bet. For example, suppose we do strictly linear allocation
>>> each delta, and just leave nice big gaps between the deltas for future
>>> expansion. Clearly, we run at similar or identical speed to the current naive
>>> strategy until we must start filling in the gaps, and at that point our layout
>>> is not any worse than XFS, which started bad and stayed that way.
>>
>> Umm, are you sure. If "some areas of disk are faster than others" is
>> still true on todays harddrives, the gaps will decrease the
>> performance (as you'll "use up" the fast areas more quickly).
> 
> That's why I hedged my claim with "similar or identical". The
> difference in media speed seems to be a relatively small effect
> compared to extra seeks. It seems that XFS puts big spaces between
> new directories, and suffers a lot of extra seeks because of it.
> I propose to batch new directories together initially, then change
> the allocation goal to a new, relatively empty area if a big batch
> of files lands on a directory in a crowded region. The "big" gaps
> would be on the order of delta size, so not really very big.

This is an interesting idea, but what happens if the files don't arrive as a big
batch, but rather trickle in over time (think a logserver that is putting files
into a bunch of directories at a fairly modest rate per directory)?

And when you then decide that you have to move the directory/file info, doesn't
that create a potentially large amount of unexpected IO that could end up
interfering with what the user is trying to do?


David Lang
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] iscsi_ibft: filter null v4-mapped v6 addresses

2015-05-11 Thread Mike Christie
On 05/08/2015 12:14 AM, Chris Leech wrote:
> I've had reports of UEFI platforms failing iSCSI boot in various
> configurations, that ended up being caused by network initialization
> scripts getting tripped up by unexpected null addresses (0.0.0.0) being
> reported for gateways, dhcp servers, and dns servers.
> 
> The tianocore EDK2 iSCSI driver generates an iBFT table that always uses
> IPv4-mapped IPv6 addresses for the NIC structure fields.  This results
> in values that are "not present or not specified" being reported as
> ::ffff:0.0.0.0 rather than all zeros as specified.
> 
> The iscsi_ibft module filters unspecified fields from the iBFT from
> sysfs, preventing userspace from using invalid values and making it easy
> to check for the presence of a value.  This currently fails in regard to
> these mapped null addresses.
> 
> In order to remain consistent with how the iBFT information is exposed,
> we should accommodate the behavior of the tianocore iSCSI driver as it's
> already in the wild in a large number of servers.
> 
> Tested under qemu using an OVMF build of tianocore EDK2.
> 
> Signed-off-by: Chris Leech 

Looks ok to me.

Reviewed-by: Mike Christie 
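
For reference, the shape of the check involved (a sketch, since the hunk is
not quoted here; ipv6_addr_any() and ipv6_addr_v4mapped() are the existing
helpers from include/net/ipv6.h):

	static bool ibft_addr_present(const struct in6_addr *a)
	{
		if (ipv6_addr_any(a))
			return false;		/* all zeros: not specified */
		if (ipv6_addr_v4mapped(a) && a->s6_addr32[3] == 0)
			return false;		/* ::ffff:0.0.0.0 */
		return true;
	}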


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)

2015-05-11 Thread Daniel Phillips
Hi Pavel,

On 05/11/2015 03:12 PM, Pavel Machek wrote:
>>> It is a fact of life that when you change one aspect of an intimately
>>> interconnected system, something else will change as well. You have
>>> naive/nonexistent free space management now; when you design something
>>> workable there it is going to impact everything else you've already done.
>>> It's an easy bet that the impact will be negative, the only question is
>>> to what degree.
>>
>> You might lose that bet. For example, suppose we do strictly linear allocation
>> each delta, and just leave nice big gaps between the deltas for future
>> expansion. Clearly, we run at similar or identical speed to the current naive
>> strategy until we must start filling in the gaps, and at that point our layout
>> is not any worse than XFS, which started bad and stayed that way.
> 
> Umm, are you sure. If "some areas of disk are faster than others" is
> still true on todays harddrives, the gaps will decrease the
> performance (as you'll "use up" the fast areas more quickly).

That's why I hedged my claim with "similar or identical". The
difference in media speed seems to be a relatively small effect
compared to extra seeks. It seems that XFS puts big spaces between
new directories, and suffers a lot of extra seeks because of it.
I propose to batch new directories together initially, then change
the allocation goal to a new, relatively empty area if a big batch
of files lands on a directory in a crowded region. The "big" gaps
would be on the order of delta size, so not really very big.
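
Roughly, as a toy model in kernel-style pseudo-C (not Tux3 code; names
made up):

	static u64 goal;			/* next block to hand out */

	static u64 delta_alloc(u64 count)
	{
		u64 block = goal;
		goal += count;			/* strictly linear per delta */
		return block;
	}

	static void delta_commit(u64 delta_blocks)
	{
		goal += delta_blocks;		/* leave a gap on the order of
						 * the delta for future growth */
	}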

Anyway, some people seem to have pounced on the words "naive" and
"linear allocation" and jumped to the conclusion that our whole
strategy is naive. Far from it. We don't just throw files randomly
at the disk. We sort and partition files and metadata, and we
carefully arrange the order of our allocation operations so that
linear allocation produces a nice layout for both read and write.

This turned out to be so much better than fiddling with the goal
of individual allocations that we concluded we would get best
results by sticking with linear allocation, but improve our sort
step. The new plan is to partition updates into batches according
to some affinity metrics, and set the linear allocation goal per
batch. So for example, big files and append-type files can get
special treatment in separate batches, while files that seem to
be related because of having the same directory parent and being
written in the same delta will continue to be streamed out using
"naive" linear allocation, which is not necessarily as naive as
one might think.

It will take time and a lot of performance testing to get this
right, but nobody should get the idea that it is any inherent
design limitation. The opposite is true: we have no restrictions
at all in media layout.

Compared to Ext4, we do need to address the issue that data moves
around when updated. This can cause rapid fragmentation. Btrfs has
shown issues with that for big, randomly updated files. We want to
fix it without falling back on update-in-place as Btrfs does.

Actually, Tux3 already has update-in-place, and unlike Btrfs, we
can switch to it for non-empty files. But we think that perfect data
isolation per delta is something worth fighting for, and we would
rather not force users to fiddle around with mode settings just to
make something work as well as it already does on Ext4. We will
tackle this issue by partitioning as above, and use a dedicated
allocation strategy for such files, which are easy to detect.

Metadata moving around per update does not seem to be a problem
because it is all single blocks that need very little slack space
to stay close to home.

> Anyway... you have a brand new filesystem. Of course it should be
> faster/better/nicer than the existing filesystems. So don't be too
> harsh with XFS people.

They have done a lot of good work, but they still have a long way
to go. I don't see any shame in that.

Regards,

Daniel


Re: [PATCH v8 00/23] IB/Verbs: IB Management Helpers

2015-05-11 Thread ira.weiny
I have run with this series and the only issue I have found is not with this
patch set directly.

This patch:

>   IB/Verbs: Use management helper rdma_cap_ib_mad()

causes an error when you actually use the port passed from the ib_umad module.
I have a patch to fix that which I found while trying to build on this series
for the use of a bit mask.

Doug, I don't know what you would like to do for this fix.  I am submitting it
shortly with a new version of the core capability bit patches.  Do you want to
just add it after this series, or force Michael to respin with the fix?  Frankly
I vote for the former because, as it stands, this series does not break anything
directly.  It was only after I changed the implementation of rdma_cap_ib_mad
that it broke.
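
For readers following along, here is a rough userspace mockup of the
shape such a helper takes -- a bool answered from the port's protocol
instead of ad hoc transport and link-layer tests at every call site.
The enum, the two-port device struct, and the MAD rule below are all
stand-ins, not the code from this series.

/*
 * Rough mockup with assumed types; not kernel code.
 */
#include <stdbool.h>
#include <stdio.h>

enum rdma_protocol { PROTO_IB, PROTO_IBOE, PROTO_IWARP, PROTO_USNIC_UDP };

struct mock_ib_device {
        enum rdma_protocol proto[2];    /* one protocol per port */
};

static enum rdma_protocol query_protocol(struct mock_ib_device *dev,
                                         unsigned port)
{
        return dev->proto[port - 1];    /* IB port numbers are 1-based */
}

/* Assume MAD handling applies to the IB protocol family (IB and IBoE). */
static bool mock_cap_ib_mad(struct mock_ib_device *dev, unsigned port)
{
        enum rdma_protocol p = query_protocol(dev, port);

        return p == PROTO_IB || p == PROTO_IBOE;
}

int main(void)
{
        struct mock_ib_device mlx4  = { { PROTO_IB,    PROTO_IBOE  } };
        struct mock_ib_device cxgb4 = { { PROTO_IWARP, PROTO_IWARP } };

        printf("mlx4 port 1: %d, cxgb4 port 1: %d\n",
               mock_cap_ib_mad(&mlx4, 1), mock_cap_ib_mad(&cxgb4, 1));
        return 0;
}

Note the 1-based port numbering: off-by-one between a caller's port
index and the helper's expectation is exactly the kind of mismatch
that only shows up once the port argument is actually used.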


For the rest of the series.

Reviewed-by: Ira Weiny 
Tested-by: Ira Weiny 
-- Limited to mlx4, qib, and OPA (with additional patches).


On Tue, May 05, 2015 at 02:50:17PM +0200, Michael Wang wrote:
> Since v7:
>   * Thanks to Doug, Ira, Devesh for the testing :-)
>   * Thanks for the comments from Or, Doug, Ira, Jason :-)
> Please remind me if anything was missed :-P
>   * Use rdma_cap_XX() instead of cap_XX() for readability
>   * Remove CC list in git log for maintainability
>   * Use bool as return value
>   * Updated github repository to v8
> 
> There is plenty of lengthy code checking the transport type of an IB device,
> or the link-layer type of its port, but actually we are just speculating
> whether a particular management/feature is supported by the device/port.
> 
> Thus instead of inferring, we should have our own mechanism for IB management
> capability/protocol/feature checking; several proposals are listed below.
> 
> This patch set will introduce query_protocol() to check management requirement
> instead of inferring from transport and link layer respectively, along with
> the new enum on protocol type.
> 
> Mapping List:
>   driver      node-type   link-layer  transport   protocol
>   nes         RNIC        ETH         IWARP       IWARP
>   amso1100    RNIC        ETH         IWARP       IWARP
>   cxgb3       RNIC        ETH         IWARP       IWARP
>   cxgb4       RNIC        ETH         IWARP       IWARP
>   usnic       USNIC_UDP   ETH         USNIC_UDP   USNIC_UDP
>   ocrdma      IB_CA       ETH         IB          IBOE
>   mlx4        IB_CA       IB/ETH      IB          IB/IBOE
>   mlx5        IB_CA       IB          IB          IB
>   ehca        IB_CA       IB          IB          IB
>   ipath       IB_CA       IB          IB          IB
>   mthca       IB_CA       IB          IB          IB
>   qib         IB_CA       IB          IB          IB
> 
> For example:
>   if (transport == IB) && (link-layer == ETH)
> will now become:
>   if (query_protocol() == IBOE)
> 
> Thus we will be able to get rid of the respective transport and link-layer
> checks, which will make it easier to add a new protocol/technology (like
> OPA); with the introduced management helpers, the IB management logic will
> also be clearer and easier to extend.
> 
> Highlights:
> The long CC list in each patch drew complaints about maintainability;
> it was suggested that folks provide their Reviewed-by or Acked-by
> instead, so for those who used to be on the CC list, please provide
> your signature voluntarily :-)
> 
> The 'mgmt-helpers' branch of 'g...@github.com:ywang-pb/infiniband-wy.git'
> contains this series based on the latest 'infiniband/for-next'
> 
> Patches 1#~14# include all the logical reform; 15#~23# introduce the
> management helpers.
> 
> Doug suggested the bitmask mechanism:
>   https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg23765.html
> which could be the plan for future reforming; we prefer that to be another
> series focused on semantics and performance.
> 
> This patch set is somewhat 'bloated' now and it may be a good time for
> staging; I'd like to suggest we focus on improving the existing helpers
> and push all further reforms into the next series ;-)
> 
> Proposals:
> Sean:
>   https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg23339.html
> Doug:
>   https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg23418.html
>   https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg23765.html
> Jason:
>   https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg23425.html
> 
> Michael Wang (23):
>   IB/Verbs: Implement new callback query_protocol()
>   IB/Verbs: Implement raw management helpers
>   IB/Verbs: Reform IB-core mad/agent/user_mad
>   IB/Verbs: Reform IB-core cm
>   IB/Verbs: Reform IB-core sa_query
>   IB/Verbs: Reform IB-core multicast
>   IB/Verbs: Reform IB-ulp ipoib
>   IB/Verbs: Reform IB-ulp xprtrdma
>   IB/Verbs: Reform IB-core verbs
>   
