Re: [patch 1/2] [PATCH] mm: Save soft-dirty bits on swapped pages

2013-07-31 Thread Cyrill Gorcunov
On Thu, Aug 01, 2013 at 09:51:32AM +0900, Minchan Kim wrote:
> > Index: linux-2.6.git/include/linux/swapops.h
> > ===
> > --- linux-2.6.git.orig/include/linux/swapops.h
> > +++ linux-2.6.git/include/linux/swapops.h
> > @@ -67,6 +67,8 @@ static inline swp_entry_t pte_to_swp_ent
> > swp_entry_t arch_entry;
> >  
> > BUG_ON(pte_file(pte));
> > +   if (pte_swp_soft_dirty(pte))
> > +   pte = pte_swp_clear_soft_dirty(pte);
> 
> Why do you remove the soft-dirty flag whenever pte_to_swp_entry is called?
> Isn't there any problem if we use mincore?

No, there is no problem. pte_to_swp_entry is called when we know that the
pte we're decoding has the swap format (except for the case in the swap
code which figures out the number of bits allowed for the offset). Still,
since this bit is set at a "higher" level than the __swp_type/__swp_offset
helpers, it should be cleared before the value from the pte is passed down
to those "one level down" helper functions.

> > +static inline int maybe_same_pte(pte_t pte, pte_t swp_pte)
> 
> Nitpick.
> If maybe_same_pte is used widely, it looks good to me
> but it's used for only swapoff at the moment so I think pte_swap_same
> would be better name.

I don't see much difference, but sure, let's rename it on top once the
series is in the -mm tree, sounds good?

Cyrill
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [26/84] ARM: OMAP: RX-51: change probe order of touchscreen and panel SPI devices

2013-07-31 Thread Tomi Valkeinen
Hi,

On 31/07/13 16:23, Ben Hutchings wrote:
> 3.2.50-rc1 review patch.  If anyone has any objections, please let me know.
> 
> --
> 
> From: Aaro Koskinen 
> 
> commit e65f131a14726e5f1b880a528271a52428e5b3a5 upstream.
> 
> Commit 9fdca9df (spi: omap2-mcspi: convert to module_platform_driver)
> broke the SPI display/panel driver probe on RX-51/N900. The exact cause is
> not fully understood, but it seems to be related to the probe order: SPI
> communication to the panel driver (spi1.2) fails unless the touchscreen
> (spi1.0) has been probed/initialized first. When the omap2-mcspi driver
> was converted to a platform driver, the devices ended up being probed
> immediately after the board registers them, in the order they are
> listed in the board file.

I'm not really familiar with this issue, but:

Commit 9fdca9df was merged in v3.5. Has it also been backported to
3.2? If not, I don't think this patch is needed for 3.2, as the patch
causing the problem is not in 3.2.

 Tomi






Re: [PATCH 01/11] writeback: plug writeback at a high level

2013-07-31 Thread Dave Chinner
On Wed, Jul 31, 2013 at 04:40:19PM +0200, Jan Kara wrote:
> On Wed 31-07-13 14:15:40, Dave Chinner wrote:
> > From: Dave Chinner 
> > 
> > Doing writeback on lots of little files causes terrible IOPS storms
> > because of the per-mapping writeback plugging we do. This
> > essentially causes immediate dispatch of IO for each mapping,
> > regardless of the context in which writeback is occurring.
> > 
> > IOWs, writing lots of small 4k files concurrently using fsmark
> > on XFS results in a huge number of IOPS being issued for data
> > writes.  Metadata writes are sorted and plugged at a high level by
> > XFS, so aggregate nicely into large IOs. However, data writeback IOs
> > are dispatched in individual 4k IOs, even when the blocks of two
> > consecutively written files are adjacent.
> > 
> > Test VM: 8p, 8GB RAM, 4xSSD in RAID0, 100TB sparse XFS filesystem,
> > metadata CRCs enabled.
> > 
> > Kernel: 3.10-rc5 + xfsdev + my 3.11 xfs queue (~70 patches)
> > 
> > Test:
> > 
> > $ ./fs_mark  -D  1  -S0  -n  1  -s  4096  -L  120  -d
> > /mnt/scratch/0  -d  /mnt/scratch/1  -d  /mnt/scratch/2  -d
> > /mnt/scratch/3  -d  /mnt/scratch/4  -d  /mnt/scratch/5  -d
> > /mnt/scratch/6  -d  /mnt/scratch/7
> > 
> > Result:
> > 
> >             wall    sys     create rate     Physical write IO
> >             time    CPU     (avg files/s)   IOPS    Bandwidth
> >             -----   -----   -------------   ------  ---------
> > unpatched   6m56s   15m47s  24,000+/-500    26,000  130MB/s
> > patched     5m06s   13m28s  32,800+/-600     1,500  180MB/s
> > improvement -26.44% -14.68% +36.67%         -94.23% +38.46%
> > 
> > If I use zero length files, this workload runs at about 500 IOPS, so
> > plugging drops the data IOs from roughly 25,500/s to 1,000/s.
> > 3 lines of code, 35% better throughput for 15% less CPU.
> > 
> > The benefits of plugging at this layer are likely to be higher for
> > spinning media as the IO patterns for this workload are going make a
> > much bigger difference on high IO latency devices.
> > 
> > Signed-off-by: Dave Chinner 
>   Just one question: Won't this cause a regression when files are say 2 MB
> large? Then we generate maximum sized requests for these files with
> per-inode plugging anyway and they will unnecessarily sit in the plug list
> until the plug list gets full (that is after 16 requests). Granted it
> shouldn't be too long but with fast storage it may be measurable...

Latency of IO dispatch only matters for the initial IOs being
queued. This, however, is not a latency sensitive IO path -
writeback is our bulk throughput IO engine, and in those cases low
latency dispatch is precisely what we don't want. We want to
optimise IO patterns for maximum *bandwidth*, not minimal latency.

The problem is that fast storage with immediate dispatch and deep
queues can keep ahead of IO dispatch, preventing throughput
optimisations like IO aggregation from being made because there is
never any IO queued to aggregate. That's why I'm seeing a couple of
orders of magnitude higher IOPS than I should. Sure, the hardware
can do that, but it's not the *most efficient* method of dispatching
background IO.

Allowing IOs a chance to aggregate in the scheduler for a short
while before dispatch allows existing bulk throughput optimisations
to be made to the IO stream. And, as we can see, where a delayed
allocation filesystem is optimised for adjacent allocation
across sequentially written inodes, such opportunities for IO
aggregation make a big difference to performance.

So, to test your 2MB IO case, I ran an fsmark test using 40,000
2MB files instead of 10 million 4k files.

        wall time   IOPS    BW
mmotm   170s        1000    350MB/s
patched 167s        1000    350MB/s

The IO profiles are near enough to be identical, and the wall time
is basically the same.


I just don't see any particular concern about larger IOs and initial
dispatch latency here from either a theoretical or an observed POV.
Indeed, I haven't seen a performance degradation as a result of this
patch in any of the testing I've done since I first posted it...

> Now if we have maximum sized request in the plug list, maybe we could just
> dispatch it right away but that's another story.

That, in itself is potentially an issue, too, as it prevents seek
minimisation optimisations from being made when we batch up multiple
IOs on the plug list...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 03/18] ARM: davinci: Switch to sched_clock_register()

2013-07-31 Thread Sekhar Nori
On Thursday 01 August 2013 04:01 AM, Stephen Boyd wrote:
> The 32 bit sched_clock interface now supports 64 bits. Upgrade to
> the 64 bit function to allow us to remove the 32 bit registration
> interface.
> 
> Cc: Sekhar Nori 
> Cc: Kevin Hilman 
> Signed-off-by: Stephen Boyd 
> ---
>  arch/arm/mach-davinci/time.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm/mach-davinci/time.c b/arch/arm/mach-davinci/time.c
> index 7a55b5c..6d5440a 100644
> --- a/arch/arm/mach-davinci/time.c
> +++ b/arch/arm/mach-davinci/time.c
> @@ -285,7 +285,7 @@ static struct clocksource clocksource_davinci = {
>  /*
>   * Overwrite weak default sched_clock with something more precise
>   */
> -static u32 notrace davinci_read_sched_clock(void)
> +static u64 notrace davinci_read_sched_clock(void)
>  {
> >   return timer32_read(&timers[TID_CLOCKSOURCE]);
>  }
> @@ -392,7 +392,7 @@ void __init davinci_timer_init(void)
>   davinci_clock_tick_rate))
>   printk(err, clocksource_davinci.name);
>  
> - setup_sched_clock(davinci_read_sched_clock, 32,
> + sched_clock_register(davinci_read_sched_clock, 32,
> davinci_clock_tick_rate);

Please fix the line break to align with the open parens. Otherwise looks
good. I tested it on a DA850 EVM as well. With the above fixed:

Acked-by: Sekhar Nori 

Thanks,
Sekhar



[PATCH] scripts/mod/modpost.c: permit '.cranges' section for sh64 architecture.

2013-07-31 Thread Chen Gang
We need to permit the '.cranges' section for the sh64 architecture, or
modpost will report this warning:

LD  init/built-in.o
  WARNING: init/built-in.o (.cranges): unexpected non-allocatable section.
  Did you forget to use "ax"/"aw" in a .S file?
  Note that for example  contains
  section definitions for use in .S files.


Signed-off-by: Chen Gang 
---
 scripts/mod/modpost.c |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/scripts/mod/modpost.c b/scripts/mod/modpost.c
index 6216434..8247979 100644
--- a/scripts/mod/modpost.c
+++ b/scripts/mod/modpost.c
@@ -821,6 +821,7 @@ static const char *section_white_list[] =
 {
".comment*",
".debug*",
+   ".cranges", /* sh64 */
".zdebug*", /* Compressed debug sections. */
".GCC-command-line",/* mn10300 */
".GCC.command.line",/* record-gcc-switches, non mn10300 */
-- 
1.7.7.6



Re: [PATCH 3/5] usb: phy: Add AM335x PHY driver

2013-07-31 Thread George Cherian

On 7/31/2013 1:46 AM, Sebastian Andrzej Siewior wrote:

This driver is a redo of my earlier attempt. It uses parts of the
generic PHY driver and uses the new control driver for the register
the PHY needs in order to power the PHY on/off. It also enables easy
access to the wakeup register, which is not yet implemented.
The differences from the omap attempt are:
- no static holding variable
- one globally visible function which exports a struct with callbacks to
   access the "control" registers.

Signed-off-by: Sebastian Andrzej Siewior 
---
  





+static const struct of_device_id omap_control_usb_id_table[] = {
+   { .compatible = "ti,am335x-ctrl-module", .data = _am335x },
+   {}
+};
+MODULE_DEVICE_TABLE(of, omap_control_usb_id_table);
+
+static struct platform_driver am335x_control_driver;
+static int match(struct device *dev, void *data)
+{
+   struct device_node *node = (struct device_node *)data;
+   return dev->of_node == node &&
+   dev->driver == &am335x_control_driver.driver;
+}
+
+struct phy_control *am335x_get_phy_control(struct device *dev)
+{
+   struct device_node *node;
+   struct am335x_control_usb *ctrl_usb;
+
+   node = of_parse_phandle(dev->of_node, "ti,ctrl_mod", 0);
+   if (!node)
+   return NULL;
+
+   dev = bus_find_device(&platform_bus_type, NULL, node, match);


of_find_device_by_node doesn't work???


+   ctrl_usb = dev_get_drvdata(dev);
+   if (!ctrl_usb)
+   return NULL;
+   return &ctrl_usb->phy_ctrl;
+}
+EXPORT_SYMBOL_GPL(am335x_get_phy_control);
+
+static int am335x_control_usb_probe(struct platform_device *pdev)
+{
+   struct resource *res;
+   struct am335x_control_usb *ctrl_usb;
+   const struct of_device_id *of_id;
+   const struct phy_control *phy_ctrl;
+
+   of_id = of_match_node(omap_control_usb_id_table, pdev->dev.of_node);
+   if (!of_id)
+   return -EINVAL;
+
+   phy_ctrl = of_id->data;
+
+   ctrl_usb = devm_kzalloc(&pdev->dev, sizeof(*ctrl_usb), GFP_KERNEL);
+   if (!ctrl_usb) {
+   dev_err(&pdev->dev, "unable to alloc memory for control usb\n");
+   return -ENOMEM;
+   }
+
+   ctrl_usb->dev = &pdev->dev;
+
+   res = platform_get_resource_byname(pdev, IORESOURCE_MEM, "control_dev");
+   ctrl_usb->reset_reg = devm_ioremap_resource(&pdev->dev, res);


As per the current DTSI in patch 5 you are mapping from the control module
base up to 0x650.
Can't it be split into 2 registers: one mapping for phy on/off and one for
phy wakeup?


+   if (IS_ERR(ctrl_usb->reset_reg))
+   return PTR_ERR(ctrl_usb->reset_reg);
+   spin_lock_init(&ctrl_usb->lock);
+   ctrl_usb->phy_ctrl = *phy_ctrl;
+
+   dev_set_drvdata(ctrl_usb->dev, ctrl_usb);
+   return 0;
+}
+
+static struct platform_driver am335x_control_driver = {
+   .probe  = am335x_control_usb_probe,
+   .driver = {
+   .name   = "am335x-control-usb",
+   .owner  = THIS_MODULE,
+   .of_match_table = of_match_ptr(omap_control_usb_id_table),
+   },
+};
+
+module_platform_driver(am335x_control_driver);
+MODULE_LICENSE("GPL v2");



--
-George



Re: [PATCH 4/9] syslog_ns: make syslog handling per namespace

2013-07-31 Thread Gao feng
On 08/01/2013 11:10 AM, Rui Xiang wrote:
> On 2013/8/1 9:36, Gao feng wrote:
>> On 07/29/2013 10:31 AM, Rui Xiang wrote:
>>> This patch makes syslog buf and other fields per
>>> namespace.
>>>
>>> Here use ns->log_buf(log_buf_len, logbuf_lock,
>>> log_first_seq, logbuf_lock, and so on) fields
>>> instead of global ones to handle syslog.
>>>
>>> Syslog interfaces such as /dev/kmsg, /proc/kmsg,
>>> and syslog syscall are all containerized for
>>> container users.
>>>
>>
>> /dev/kmsg is used by the syslog api closelog, openlog, syslog, vsyslog,
>> this should be per user namespace, but seems in your patch,
> 
> Yes, /dev/kmsg is per user namespace, and per syslog ns, too.
> 
>> the syslog message generated through these APIs on host can be exported
>> to the /dev/kmsg of container, is this want we want?
>>
> Ah.. I think your question targets the devkmsg_writev function, right?

yep, another small problem, you forgot to remove the global logbuf_lock.

> You remind me that it's really an issue. Printk_emit in devkmsg_writev 
> should not use init_syslog_ns as its syslog_ns but current_user_ns->syslog_ns.
> 
> In 1st version, current_syslog_ns was used in vprintk_emit. In this version,
> the interface vprintk_emit has changed, but this patch misses that.
> I will fix it.
> 

Ok, thanks!


[GIT PULL] more ext4 bug fixes

2013-07-31 Thread Theodore Ts'o

(Sorry for the repeated resends to the kernel.org mailing lists; I'm on
the road, and my attempts to send this have been screwing up the e-mail
sent to Linus in various different ways, most of it caused by the fact
that emacs message mode is helpfully trying to colorize the e-mail
addresses in a way that makes them impossible to read when using ssh
from a ChromeOS laptop...)

The following changes since commit 3b2f64d00c46e1e4e9bd0bb9bb12619adac27a4b:

  Linux 3.11-rc2 (2013-07-21 12:05:29 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git 
tags/ext4_for_linus

for you to fetch changes up to 44fb851dfb2f8e3462617e19a7b3b9025db9d919:

  ext4: add WARN_ON to check the length of allocated blocks (2013-07-29 
12:51:42 -0400)


ext4 bugfixes for v3.11-rc4


Eric Sandeen (1):
  ext4: destroy ext4_es_cachep on module unload

Theodore Ts'o (2):
  ext4: make sure group number is bumped after a inode allocation race
  ext4: fix retry handling in ext4_ext_truncate()

Zheng Liu (1):
  ext4: add WARN_ON to check the length of allocated blocks

 fs/ext4/extents.c |  2 +-
 fs/ext4/ialloc.c  | 10 +-
 fs/ext4/inode.c   | 39 ++-
 fs/ext4/super.c   |  1 +
 4 files changed, 25 insertions(+), 27 deletions(-)


Re: [PATCH 2/3] xfs: Implement FALLOC_FL_COLLAPSE_RANGE

2013-07-31 Thread Namjae Jeon
2013/8/1, Dave Chinner :
> On Wed, Jul 31, 2013 at 11:42:14PM +0900, Namjae Jeon wrote:
>> From: Namjae Jeon 
>>
>> New fallocate flag FALLOC_FL_COLLAPSE_RANGE implementation for XFS.
>
Hi Dave.
> A good start, but there's lots of work needed to make this safe for
> general use.
First, Thanks for your advice and help.

>
>> Signed-off-by: Namjae Jeon 
>> Signed-off-by: Ashish Sangwan 
>> ---
>>  fs/xfs/xfs_bmap.c |   92
>> +
>>  fs/xfs/xfs_bmap.h |3 ++
>>  fs/xfs/xfs_file.c |   26 --
>>  fs/xfs/xfs_iops.c |   35 +++
>>  fs/xfs/xfs_vnodeops.c |   62 +
>>  fs/xfs/xfs_vnodeops.h |2 ++
>>  6 files changed, 217 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/xfs/xfs_bmap.c b/fs/xfs/xfs_bmap.c
>> index 05c698c..ae677b1 100644
>> --- a/fs/xfs/xfs_bmap.c
>> +++ b/fs/xfs/xfs_bmap.c
>> @@ -6145,3 +6145,95 @@ next_block:
>>
>>  return error;
>>  }
>> +
>> +/*
>> + * xfs_update_logical()
>> + *  Updates starting logical block of extents by subtracting
>> + *  shift from them. At max XFS_LINEAR_EXTS number extents
>> + *  will be updated and *current_ext is the extent number which
>> + *  is currently being updated.
>
> Single space after the "*" in the comments. Also, format them out to
> 80 columns.
>
A single space after the * followed by a tab gives a checkpatch warning:
WARNING: please, no space before tabs
#29: FILE: fs/xfs/xfs_bmap.c:6151:
+ * ^IUpdates starting logical block of extents by subtracting$

>> + */
>> +int
>> +xfs_update_logical(
>> +xfs_trans_t *tp,
>> +struct xfs_inode*ip,
>> +int *done,
>> +xfs_fileoff_t   start_fsb,
>> +xfs_fileoff_t   shift,
>> +xfs_extnum_t*current_ext)
>
> This belongs in xfs_bmap_btree.c, named something like
> xfs_bmbt_shift_rec_offsets().
Okay.

>
> Also, not typedefs for structures, please. (i.e. struct xfs_trans).
Okay.

>
>
>> +{
>> +xfs_btree_cur_t *cur;
>
>   struct xfs_btree_cur*cur;
>
> And for all the other structures, too.
Okay.

>
>> +xfs_bmbt_rec_host_t *gotp;
>> +xfs_mount_t *mp;
>> +xfs_ifork_t *ifp;
>> +xfs_extnum_tnexts = 0;
>> +xfs_fileoff_t   startoff;
>> +int error = 0;
>> +int i;
>> +
>> +ifp = XFS_IFORK_PTR(ip, XFS_DATA_FORK);
>> +mp = ip->i_mount;
>
> Do the assigment in the declaration on mp.
Okay.

>
>   struct xfs_mount*mp = ip->i_mount;
>
>> +
>> +if (!(ifp->if_flags & XFS_IFEXTENTS) &&
>> +(error = xfs_iread_extents(tp, ip, XFS_DATA_FORK)))
>> +return error;
>
> H - that rings alarm bells.
ok, we will remove the alarm.

>
>> +
>> +if (!*current_ext) {
>> +gotp = xfs_iext_bno_to_ext(ifp, start_fsb, current_ext);
>> +/*
>> + * gotp can be null in 2 cases: 1) if there are no extents
>> + * or 2) start_fsb lies in a hole beyond which there are
>> + * no extents. Either way, we are done.
>> + */
>> +if (!gotp) {
>> +*done = 1;
>> +return 0;
>> +}
>> +}
>
> So, you do a lookup on an extent index, which returns a incore
> extent record.
>
>> +
>> +if (ifp->if_flags & XFS_IFBROOT)
>> +cur = xfs_bmbt_init_cursor(mp, tp, ip, XFS_DATA_FORK);
>> +else
>> +cur = NULL;
>> +
>> +while (nexts++ < XFS_LINEAR_EXTS &&
>
> What has XFS_LINEAR_EXTS got to do with how many extents can be moved
> in a single transaction at a time?
We are also not sure how many extents to move in one transaction.
xfs_bunmapi deletes 3 extents in one iteration, but deletion and
updating are two different tasks.
BTW, we tested this patch with 10,000 extents and it is working fine.
Could you help us out here? What number would be appropriate?

>
>> +   *current_ext <  XFS_IFORK_NEXTENTS(ip, XFS_DATA_FORK)) {
>> +gotp = xfs_iext_get_ext(ifp, (*current_ext)++);
>> +startoff = xfs_bmbt_get_startoff(gotp);
>> +if (cur) {
>> +if ((error = xfs_bmbt_lookup_eq(cur,
>> +startoff,
>> +xfs_bmbt_get_startblock(gotp),
>> +xfs_bmbt_get_blockcount(gotp),
>> +)))
>
> Please don't put assignments inside if() statements where possible.
> i.e.
>   error = xfs_bmbt_lookup_eq(cur, 
>   if (error)
>
> Is the normal way of doing this.
Okay:)

>
>> +goto del_cursor;
>> +XFS_WANT_CORRUPTED_GOTO(i == 1, del_cursor);
>> +}
>> +startoff -= shift;
>> +xfs_bmbt_set_startoff(gotp, startoff);
>
> So, we've looked up 

linux-next: manual merge of the usb-gadget tree with Linus' tree

2013-07-31 Thread Stephen Rothwell
Hi Felipe,

Today's linux-next merge of the usb-gadget tree got conflicts in
drivers/usb/musb/omap2430.c and drivers/usb/musb/tusb6010.c between
commit c1f01be4060b ("usb: musb: fix resource passed from glue layer to
musb") from Linus' tree and commit c1a7d67c1901 ("usb: musb: use
dev_get_platdata()") from the usb-gadget tree.

I fixed it up (see below) and can carry the fix as necessary (no action
is required).

-- 
Cheers,
Stephen Rothwell  s...@canb.auug.org.au

diff --cc drivers/usb/musb/omap2430.c
index f44e8b5,ebb46ec..000
--- a/drivers/usb/musb/omap2430.c
+++ b/drivers/usb/musb/omap2430.c
@@@ -481,8 -481,8 +481,8 @@@ static u64 omap2430_dmamask = DMA_BIT_M
  
  static int omap2430_probe(struct platform_device *pdev)
  {
 -  struct resource musb_resources[2];
 +  struct resource musb_resources[3];
-   struct musb_hdrc_platform_data  *pdata = pdev->dev.platform_data;
+   struct musb_hdrc_platform_data  *pdata = dev_get_platdata(&pdev->dev);
struct omap_musb_board_data *data;
struct platform_device  *musb;
struct omap2430_glue*glue;
diff --cc drivers/usb/musb/tusb6010.c
index 6f8a9ca,2196ee6..000
--- a/drivers/usb/musb/tusb6010.c
+++ b/drivers/usb/musb/tusb6010.c
@@@ -1156,8 -1156,8 +1156,8 @@@ static u64 tusb_dmamask = DMA_BIT_MASK(
  
  static int tusb_probe(struct platform_device *pdev)
  {
 -  struct resource musb_resources[2];
 +  struct resource musb_resources[3];
-   struct musb_hdrc_platform_data  *pdata = pdev->dev.platform_data;
+   struct musb_hdrc_platform_data  *pdata = dev_get_platdata(&pdev->dev);
struct platform_device  *musb;
struct tusb6010_glue*glue;
  





Re: [PATCH 5/5] usb: musb: dsps: use proper child nodes

2013-07-31 Thread George Cherian

On 7/31/2013 1:46 AM, Sebastian Andrzej Siewior wrote:

This moves the two instances from the big node into two child nodes. The
glue layer on top does almost nothing.
There is one device containing the control module for the USB (2) phy,
(2) usb and, later, the dma engine. The usb device is the "glue device"
which contains the musb device as a child. This is what we have done ever
since. The new file musb_am335x is just here to probe the new bus and
populate child devices.
There are a lot of changes to the dsps file as a result of the changes:
- musb_core_offset
   This is gone. The device tree provides memory resource information
   for the device, so there is no need to "fix" things.
- instances
   This is gone as well. If we have two instances then we have two
   enabled child nodes in the device tree. For instance, the SoC in the
   beagle bone has two USB instances but only one has been wired up, so
   there is no need to load and init the second instance since it won't
   be used.
- dsps_glue is now per glue device
   In the past there was one of these structs, but with an array of two,
   and each instance accessed its variable depending on the platform
   device id.
- no unneeded copy of structs
   I do not know why struct dsps_musb_wrapper is copied, but it is not
   necessary. The same goes for musb_hdrc_platform_data, which is
   allocated on demand and then again by platform_device_add_data(). One
   copy is enough.

Signed-off-by: Sebastian Andrzej Siewior 




  
diff --git a/arch/arm/boot/dts/am33xx.dtsi b/arch/arm/boot/dts/am33xx.dtsi

index 38b446b..0f756ca 100644
--- a/arch/arm/boot/dts/am33xx.dtsi
+++ b/arch/arm/boot/dts/am33xx.dtsi
@@ -26,6 +26,10 @@
serial5 = 
d_can0 = 
d_can1 = 
+   usb0 = 
+   usb1 = 
+   phy0 = _phy;
+   phy1 = _phy;
};
  
  	cpus {

@@ -333,21 +337,85 @@
status = "disabled";
};
  
-		usb@4740 {

-   compatible = "ti,musb-am33xx";
-   reg = <0x4740 0x1000 /* usbss */
-  0x47401000 0x800 /* musb instance 0 */
-  0x47401800 0x800>;/* musb instance 1 */
-   interrupts = <17 /* usbss */
- 18/* musb instance 0 */
- 19>;   /* musb instance 1 */
-   multipoint = <1>;
-   num-eps = <16>;
-   ram-bits = <12>;
-   port0-mode = <3>;
-   port1-mode = <3>;
-   power = <250>;
+   usb: usb@4740 {
+   compatible = "ti,am33xx-usb";
+   reg = <0x4740 0x1000>;
+   ranges;
+   #address-cells = <1>;
+   #size-cells = <1>;
ti,hwmods = "usb_otg_hs";
+   status = "disabled";
+
+   ctrl_mod: control@44e1 {
+   compatible = "ti,am335x-ctrl-module";
+   reg = <0x44e1 0x650>;


Do you really need to map the Control Module base up to 0x650? If some
other driver does this mapping, we will always end up returning
-EPROBE_DEFER.

How about this
reg =   <0x44e10620 0x4> ,
<0x44e10648 0x1>;
reg-names = "phycontrol_dev" , "phywkup_dev";

and map for power on/off and phywkup separately in the control driver.


+   reg-names = "control_dev";
+   status = "disabled";
+   };
+





diff --git a/drivers/usb/musb/Kconfig b/drivers/usb/musb/Kconfig
index 797e3fd..b7257ae 100644
--- a/drivers/usb/musb/Kconfig
+++ b/drivers/usb/musb/Kconfig
@@ -83,6 +83,7 @@ config USB_MUSB_AM35X
  
  config USB_MUSB_DSPS

tristate "TI DSPS platforms"
+   select USB_MUSB_AM335X_CHILD


How about adding select AM335X_PHY_USB here?

  
  config USB_MUSB_BLACKFIN

tristate "Blackfin"
@@ -93,6 +94,9 @@ config USB_MUSB_UX500
  
  endchoice
  
+config USB_MUSB_AM335X_CHILD

+   tristate
+
  choice
prompt 'MUSB DMA mode'
default MUSB_PIO_ONLY if ARCH_MULTIPLATFORM



--
-George



linux-next: manual merge of the driver-core tree with the net-next tree

2013-07-31 Thread Stephen Rothwell
Hi Greg,

Today's linux-next merge of the driver-core tree got a conflict in
net/core/net-sysfs.c between commit ff80e519ab1b ("net: export physical
port id via sysfs") from the net-next tree and commit 6be8aeef348a ("net:
core: convert class code to use dev_groups") from the driver-core tree.

I fixed it up (see below) and can carry the fix as necessary (no action
is required).

-- 
Cheers,
Stephen Rothwells...@canb.auug.org.au

diff --cc net/core/net-sysfs.c
index 8826b0d,707c313..000
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@@ -237,9 -250,9 +250,31 @@@ static ssize_t operstate_show(struct de
  
return sprintf(buf, "%s\n", operstates[operstate]);
  }
+ static DEVICE_ATTR_RO(operstate);
+ 
++static ssize_t phys_port_id_show(struct device *dev,
++   struct device_attribute *attr, char *buf)
++{
++  struct net_device *netdev = to_net_dev(dev);
++  ssize_t ret = -EINVAL;
++
++  if (!rtnl_trylock())
++  return restart_syscall();
++
++  if (dev_isalive(netdev)) {
++  struct netdev_phys_port_id ppid;
++
++  ret = dev_get_phys_port_id(netdev, &ppid);
++  if (!ret)
++  ret = sprintf(buf, "%*phN\n", ppid.id_len, ppid.id);
++  }
++  rtnl_unlock();
++
++  return ret;
++}
++static DEVICE_ATTR_RO(phys_port_id);
 +
  /* read-write attributes */
- NETDEVICE_SHOW(mtu, fmt_dec);
  
  static int change_mtu(struct net_device *net, unsigned long new_mtu)
  {
@@@ -333,52 -344,32 +366,33 @@@ static ssize_t group_store(struct devic
  {
return netdev_store(dev, attr, buf, len, change_group);
  }
- 
- static ssize_t show_phys_port_id(struct device *dev,
-struct device_attribute *attr, char *buf)
- {
-   struct net_device *netdev = to_net_dev(dev);
-   ssize_t ret = -EINVAL;
- 
-   if (!rtnl_trylock())
-   return restart_syscall();
- 
-   if (dev_isalive(netdev)) {
-   struct netdev_phys_port_id ppid;
- 
-   ret = dev_get_phys_port_id(netdev, &ppid);
-   if (!ret)
-   ret = sprintf(buf, "%*phN\n", ppid.id_len, ppid.id);
-   }
-   rtnl_unlock();
- 
-   return ret;
- }
- 
- static struct device_attribute net_class_attributes[] = {
-   __ATTR(addr_assign_type, S_IRUGO, show_addr_assign_type, NULL),
-   __ATTR(addr_len, S_IRUGO, show_addr_len, NULL),
-   __ATTR(dev_id, S_IRUGO, show_dev_id, NULL),
-   __ATTR(ifalias, S_IRUGO | S_IWUSR, show_ifalias, store_ifalias),
-   __ATTR(iflink, S_IRUGO, show_iflink, NULL),
-   __ATTR(ifindex, S_IRUGO, show_ifindex, NULL),
-   __ATTR(type, S_IRUGO, show_type, NULL),
-   __ATTR(link_mode, S_IRUGO, show_link_mode, NULL),
-   __ATTR(address, S_IRUGO, show_address, NULL),
-   __ATTR(broadcast, S_IRUGO, show_broadcast, NULL),
-   __ATTR(carrier, S_IRUGO | S_IWUSR, show_carrier, store_carrier),
-   __ATTR(speed, S_IRUGO, show_speed, NULL),
-   __ATTR(duplex, S_IRUGO, show_duplex, NULL),
-   __ATTR(dormant, S_IRUGO, show_dormant, NULL),
-   __ATTR(operstate, S_IRUGO, show_operstate, NULL),
-   __ATTR(mtu, S_IRUGO | S_IWUSR, show_mtu, store_mtu),
-   __ATTR(flags, S_IRUGO | S_IWUSR, show_flags, store_flags),
-   __ATTR(tx_queue_len, S_IRUGO | S_IWUSR, show_tx_queue_len,
-  store_tx_queue_len),
-   __ATTR(netdev_group, S_IRUGO | S_IWUSR, show_group, store_group),
-   __ATTR(phys_port_id, S_IRUGO, show_phys_port_id, NULL),
-   {}
+ NETDEVICE_SHOW(group, fmt_dec);
+ static DEVICE_ATTR(netdev_group, S_IRUGO | S_IWUSR, group_show, group_store);
+ 
+ static struct attribute *net_class_attrs[] = {
+   &dev_attr_netdev_group.attr,
+   &dev_attr_type.attr,
+   &dev_attr_dev_id.attr,
+   &dev_attr_iflink.attr,
+   &dev_attr_ifindex.attr,
+   &dev_attr_addr_assign_type.attr,
+   &dev_attr_addr_len.attr,
+   &dev_attr_link_mode.attr,
+   &dev_attr_address.attr,
+   &dev_attr_broadcast.attr,
+   &dev_attr_speed.attr,
+   &dev_attr_duplex.attr,
+   &dev_attr_dormant.attr,
+   &dev_attr_operstate.attr,
++  &dev_attr_phys_port_id.attr,
+   &dev_attr_ifalias.attr,
+   &dev_attr_carrier.attr,
+   &dev_attr_mtu.attr,
+   &dev_attr_flags.attr,
+   &dev_attr_tx_queue_len.attr,
+   NULL,
  };
+ ATTRIBUTE_GROUPS(net_class);
  
  /* Show a given an attribute in the statistics group */
  static ssize_t netstat_show(const struct device *d,



Re: [PATCH 17/18] sched: Retry migration of tasks to CPU on a preferred node

2013-07-31 Thread Srikar Dronamraju
* Mel Gorman  [2013-07-15 16:20:19]:

> When a preferred node is selected for a task there is an attempt to migrate
> the task to a CPU there. This may fail, in which case the task will only
> migrate if the active load balancer takes action. This may never happen if

Apart from load imbalance or heavily loaded CPUs on the preferred node,
what could be the other reasons for migration failure with
migrate_task_to()? It looks almost identical to active load balancing,
except that it pushes tasks instead of pulling them.

If load imbalance is the only reason, do we need to retry? If the task
is really so attached to memory on that node, shouldn't we be getting a
task_numa_placement hit before the next 5 seconds?

> the conditions are not right. This patch will check at NUMA hinting fault
> time if another attempt should be made to migrate the task. It will only
> make an attempt once every five seconds.
> 
> Signed-off-by: Mel Gorman 

-- 
Thanks and Regards
Srikar Dronamraju



Re: [PATCH 1/3] fs: Introduce new flag FALLOC_FL_COLLAPSE_RANGE

2013-07-31 Thread Namjae Jeon
2013/8/1, Dave Chinner :
> On Wed, Jul 31, 2013 at 11:42:00PM +0900, Namjae Jeon wrote:
>> From: Namjae Jeon 
>>
>> Fallocate now supports the new FALLOC_FL_COLLAPSE_RANGE flag.
>> The semantics of this flag are the following:
>> 1) It collapses the range lying between offset and length by removing any
>>    data blocks which are present in this range, and then updates all the
>>    logical offsets of extents beyond "offset + len" to nullify the hole
>>    created by removing blocks. In short, it does not leave a hole.
>> 2) It should be used exclusively. No other fallocate flag in combination.
>> 3) Offset and length supplied to fallocate should be fs block size
>>    aligned.
>
Hi Dave.

> Given that the rest of fallocate() interfaces are byte based, this
> is going to cause some confusion if it's not well documented. i.e.
> this restriction needs to be documented in the header file that is
> exposed to userspace, as well as in the fallocate(2) man page.
Agree.

>
>> 4) It works beyond "EOF", so the extents which are pre-allocated beyond
>>    "EOF" are also updated.
>
> I don't think that's valid for this operation. If the range is
> beyond EOF, you are not modifying anything visible to userspace, and
> that makes it the same as a hole punch operation. So, I'd get rid of
> this nasty implementation complexity altogether.
The basic idea behind collapse range is that it doesn't leave a hole.
This is achieved by updating the starting block of each of the extents
which lie beyond the hole.
When we are updating an extent, we are actually moving the hole to the
right, towards the last allocated block in the file space.
We continue doing this for each extent, and when this operation finally
completes, the hole is gone.
Suppose the file has preallocated extent(s) beyond EOF, something like this =>
hole, extent1, extent2, extent_beyond_EOF1.
We start updating extents one by one to nullify the effect of the hole =>
STEP1: After updating extent1: extent1, hole, extent2, extent_beyond_EOF1.
STEP2: After updating extent2: extent1, extent2, hole, extent_beyond_EOF1.
STEP3: After updating extent_beyond_EOF1: extent1, extent2,
extent_beyond_EOF1, hole => And this removes the hole from the file.
If we stop updating just after extent2, this will not remove the hole from
the file space; instead, it will place the hole between EOF and the 1st
preallocated extent beyond EOF, as in STEP2.
So, if the user does an extending truncate, the hole will become visible as
it is still there.

>
>> 5) It reduces the i_size of the inode by the amount of the collapse range
>>    which lies within i_size. So, if offset >= i_size, i_size won't be
>>    changed at all.
>
> Similarly, I don't think that asking for a range that overlaps EOF
> is valid, either. The moment you overlap or go beyond EOF, you are
> effectively truncating the file as there is no data to collapse into
> the range
Yes, we could put a restriction on that, something like: if (len +
offset > i_size) { return -EINVAL;}

>
> As it is, I don't see these EOF rules enforced by this patch, nor
> are they documented at all in the patch.
The EOF rules are handled in the fs-specific patches. As in XFS =>
+   if (mode & FALLOC_FL_COLLAPSE_RANGE) {
+   error = -xfs_collapse_file_space(ip, offset + len, len);
+   if (error || offset >= i_size_read(inode))
+   goto out_unlock;
+
+   /* Shrink size in case of FALLOC_FL_COLLAPSE_RANGE */
+   if ((offset + len) > i_size_read(inode))
+   new_size = offset;
+   else
+   new_size = i_size_read(inode) - len;
+   error = -xfs_update_file_size(ip, new_size);
But it seems ok to not allow (offset + len) > i_size.

>
> Regardless of what semantics we decide on, this needs xfs_io
> support and extensive corner case tests added to xfstests so that we
> can confirm the implementation in every filesystem is correct and we
> don't introduce regressions in future. it also needs documentation
> in the fallocate(2) man page.
Yes, we are currently working on it.
With the next version of the collapse_range patches, we will also send the
xfs_io and xfstests patches, as well as the man page update.

>
>> diff --git a/include/uapi/linux/falloc.h b/include/uapi/linux/falloc.h
>> index 990c4cc..7567c8c 100644
>> --- a/include/uapi/linux/falloc.h
>> +++ b/include/uapi/linux/falloc.h
>> @@ -1,9 +1,10 @@
>>  #ifndef _UAPI_FALLOC_H_
>>  #define _UAPI_FALLOC_H_
>>
>> -#define FALLOC_FL_KEEP_SIZE 0x01 /* default is extend size */
>> -#define FALLOC_FL_PUNCH_HOLE 0x02 /* de-allocates range */
>> +#define FALLOC_FL_KEEP_SIZE 0x01 /* default is extend size */
>> +#define FALLOC_FL_PUNCH_HOLE 0x02 /* de-allocates range */
>>  #define FALLOC_FL_NO_HIDE_STALE 0x04 /* reserved codepoint */
>> +#define FALLOC_FL_COLLAPSE_RANGE 0x08 /* it does not leave a hole */
>
> I'd suggest that you need to explain what FALLOC_FL_COLLAPSE_RANGE
> does in great detail right 

Re: [PATCH v5 0/4] [SCSI] sg: fix race condition in sg_open

2013-07-31 Thread Douglas Gilbert

On 13-07-22 01:03 PM, Jörn Engel wrote:

On Mon, 22 July 2013 12:40:29 +0800, Vaughan Cao wrote:


There is a race when opening sg with the O_EXCL flag. A race may also happen
between sg_open and sg_remove.

Changes from v4:
  * [3/4] use ERR_PTR series instead of adding another parameter in sg_add_sfp
  * [4/4] fix conflict for cherry-pick from v3.

Changes from v3:
  * release o_sem in sg_release(), not in sg_remove_sfp().
  * not set exclude with sfd_lock held.

Vaughan Cao (4):
   [SCSI] sg: use rwsem to solve race during exclusive open
   [SCSI] sg: no need sg_open_exclusive_lock
   [SCSI] sg: checking sdp->detached isn't protected when open
   [SCSI] sg: push file descriptor list locking down to per-device
 locking

  drivers/scsi/sg.c | 178 +-
  1 file changed, 83 insertions(+), 95 deletions(-)


Patchset looks good to me, although I didn't test it on hardware yet.
Signed-off-by: Joern Engel 

James, care to pick this up?


Acked-by: Douglas Gilbert 

Tested O_EXCL with multiple processes and threads; passed.
The sg driver prior to this patch had "leaky" O_EXCL logic
according to the same test. The block device passed as well.

James, could you clean this up:
  drivers/scsi/sg.c:242:6: warning: unused variable ‘res’ [-Wunused-variable]

Doug Gilbert




Re: [PATCH 18/18] sched: Swap tasks when rescheduling if a CPU on a target node is imbalanced

2013-07-31 Thread Srikar Dronamraju
> @@ -904,6 +908,8 @@ static int task_numa_find_cpu(struct task_struct *p, int nid)
>   src_eff_load *= src_load + effective_load(tg, src_cpu, -weight, -weight);
> 
>   for_each_cpu(cpu, cpumask_of_node(nid)) {
> + struct task_struct *swap_candidate = NULL;
> +
>   dst_load = target_load(cpu, idx);
> 
>   /* If the CPU is idle, use it */
> @@ -922,12 +928,41 @@ static int task_numa_find_cpu(struct task_struct *p, int nid)
>* migrate to its preferred node due to load imbalances.
>*/
>   balanced = (dst_eff_load <= src_eff_load);
> - if (!balanced)
> - continue;
> + if (!balanced) {
> + struct rq *rq = cpu_rq(cpu);
> + unsigned long src_faults, dst_faults;
> +
> + /* Do not move tasks off their preferred node */
> + if (rq->curr->numa_preferred_nid == nid)
> + continue;
> +
> + /* Do not attempt an illegal migration */
> + if (!cpumask_test_cpu(cpu, tsk_cpus_allowed(rq->curr)))
> + continue;
> +
> + /*
> +  * Do not impair locality for the swap candidate.
> +  * Destination for the swap candidate is the source cpu
> +  */
> + if (rq->curr->numa_faults) {
> + src_faults = rq->curr->numa_faults[task_faults_idx(nid, 1)];
> + dst_faults = rq->curr->numa_faults[task_faults_idx(src_cpu_node, 1)];
> + if (src_faults > dst_faults)
> + continue;
> + }
> +
> + /*
> +  * The destination is overloaded but running a task
> +  * that is not running on its preferred node. Consider
> +  * swapping the CPU tasks are running on.
> +  */
> + swap_candidate = rq->curr;
> + }
> 
>   if (dst_load < min_load) {
>   min_load = dst_load;
>   dst_cpu = cpu;
> + *swap_p = swap_candidate;

Are we sometimes passing a wrong candidate?
Let's say balanced is false for the first cpu and we set the swap_candidate,
but we then find the second (or a later) cpu to be idle or to have a lower
effective load; we could be sending the task that is running on the first
cpu as the swap candidate.

Then would the preferred cpu and the swap_candidate match?

-- 
Thanks and Regards
Srikar



Re: [PATCH V2 4/6] cpuidle/pseries: Move the pseries_idle backend driver to sysdev.

2013-07-31 Thread Preeti U Murthy
Hi Dongsheng,

On 07/31/2013 11:16 AM, Wang Dongsheng-B40534 wrote:
> Hi Preeti,
> 
>> -Original Message-
>> From: Preeti U Murthy [mailto:pre...@linux.vnet.ibm.com]
>> Sent: Wednesday, July 31, 2013 12:00 PM
>> To: Wang Dongsheng-B40534
>> Cc: Deepthi Dharwar; b...@kernel.crashing.org; daniel.lezc...@linaro.org;
>> linux-kernel@vger.kernel.org; mich...@ellerman.id.au;
>> srivatsa.b...@linux.vnet.ibm.com; sva...@linux.vnet.ibm.com; linuxppc-
>> d...@lists.ozlabs.org; r...@sisk.pl; linux...@vger.kernel.org
>> Subject: Re: [PATCH V2 4/6] cpuidle/pseries: Move the pseries_idle
>> backend driver to sysdev.
>>
>> Hi Dongsheng,
>>
>> On 07/31/2013 08:52 AM, Wang Dongsheng-B40534 wrote:
>>>
>>>
 -Original Message-
 From: Deepthi Dharwar [mailto:deep...@linux.vnet.ibm.com]
 Sent: Wednesday, July 31, 2013 10:59 AM
 To: b...@kernel.crashing.org; daniel.lezc...@linaro.org; linux-
 ker...@vger.kernel.org; mich...@ellerman.id.au;
 srivatsa.b...@linux.vnet.ibm.com; pre...@linux.vnet.ibm.com;
 sva...@linux.vnet.ibm.com; linuxppc-...@lists.ozlabs.org
 Cc: r...@sisk.pl; Wang Dongsheng-B40534; linux...@vger.kernel.org
 Subject: [PATCH V2 4/6] cpuidle/pseries: Move the pseries_idle
 backend driver to sysdev.

 Move pseries_idle backend driver code to arch/powerpc/sysdev so that
 the code can be used for a common driver for powernv and pseries.
 This removes a lot of code duplication.

>>> Why not drivers/cpuidle/?
>>>
>>> I think it should be moved to drivers/cpuidle.
>>
>> Please take a look at what the cpuidle under drivers has to provide.
>> cpuidle has two parts to it. The front end and the back end. The front
>> end constitutes the cpuidle governors, registering of arch specific
>> cpuidle drivers, disabling and enabling of cpuidle feature. It is this
>> front end code which is present under drivers/cpuidle.
>>
>> The arch specific cpuidle drivers which decide what needs to be done to
>> enter a specific idle state chosen by the cpuidle governor is what
>> constitutes the back end of cpuidle. This will not be in drivers/cpuidle
>> but in an arch/ specific code.
>>
>> The cpuidle under drivers/cpuidle drives the idle power management, but
>> the low level handling of the entry into idle states should be taken care
>> of by the architecture.
>>
>> Your recent patch :
>> cpuidle: add freescale e500 family porcessors idle support IMO should
>> hook onto the backend cpuidle driver that this patchset provides.
>>
> Sorry, I don't think so; the cpuidle framework is already very generic.
> Here we just need to do the state definition and handling, and I wonder
> whether we need this extra layer.
> 
> If your handler is platform dependent, it should be in arch/platform.
> 
> If it is only for some platforms, and the operations on these platforms
> can be shared, why can't it be put into drivers/cpuidle as a driver?
> 
> If it is a general driver, I think we can put the common operations into
> drivers/cpuidle and move the platform-specific code to
> arch/powerpc/platform.
> 
> This patch includes both the front end and the back end, not just the
> back end.
> 
> This patch also includes too many states and handler functions from
> different platforms. These states and handlers should belong to their own
> platforms; that is not a general approach. If Deepthi is going to do a
> general powerpc cpuidle driver, I think it cannot just use macros to
> distinguish platforms. The front-end code (driver registration) should
> probably move to drivers/cpuidle, and the low-power states and their
> handling should move to arch/powerpc/platform/**, because different
> platforms have different low-power states and processing methods.
> The front end can provide some general methods that register into a
> general powerpc cpuidle driver.


As Daniel pointed out, with a call to cpuidle_register(), we can get the
cpuidle_driver and cpuidle_device registered through the generic cpuidle
framework. Hence we can get rid of the powerpc_idle_devices_init() routine.

We can have the hotplug notifier in the generic cpuidle framework as
well. The rest of the patchset however should be arch specific IMO.

Regards
Preeti U Murthy

> 
> -dongsheng
> 
> ___
> Linuxppc-dev mailing list
> linuxppc-...@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev
> 



Re: [PATCH 08/18] sched: Reschedule task on preferred NUMA node once selected

2013-07-31 Thread Srikar Dronamraju
* Mel Gorman  [2013-07-15 16:20:10]:

> A preferred node is selected based on the node the most NUMA hinting
> faults were incurred on. There is no guarantee that the task is running
> on that node at the time, so this patch reschedules the task to run on
> the most idle CPU of the selected node when selected. This avoids
> waiting for the balancer to make a decision.
> 
> Signed-off-by: Mel Gorman 
> ---
>  kernel/sched/core.c  | 17 +
>  kernel/sched/fair.c  | 46 +-
>  kernel/sched/sched.h |  1 +
>  3 files changed, 63 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 5e02507..b67a102 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4856,6 +4856,23 @@ fail:
>   return ret;
>  }
> 
> +#ifdef CONFIG_NUMA_BALANCING
> +/* Migrate current task p to target_cpu */
> +int migrate_task_to(struct task_struct *p, int target_cpu)
> +{
> + struct migration_arg arg = { p, target_cpu };
> + int curr_cpu = task_cpu(p);
> +
> + if (curr_cpu == target_cpu)
> + return 0;
> +
> + if (!cpumask_test_cpu(target_cpu, tsk_cpus_allowed(p)))
> + return -EINVAL;
> +
> + return stop_one_cpu(curr_cpu, migration_cpu_stop, &arg);

As I had noted earlier, this upsets schedstats badly.
Can we add a TODO for this patch which mentions that schedstats need to
be taken care of?

One alternative that I can think of is to have a per-scheduling-class
routine that gets called and does the needful.
For example, for fair share, it could update the schedstats as well as
check for cfs throttling.

But I think it's an issue that needs some fix, or we should obsolete
schedstats.



[PATCH 02/10] KVM: PPC: reserve a capability number for multitce support

2013-07-31 Thread Alexey Kardashevskiy
This is to reserve a capability number for upcoming support
of the H_PUT_TCE_INDIRECT and H_STUFF_TCE pseries hypercalls,
which support multiple DMA map/unmap operations per call.

Signed-off-by: Alexey Kardashevskiy 
---
Changes:
2013/07/16:
* changed the number

Signed-off-by: Alexey Kardashevskiy 
---
 include/uapi/linux/kvm.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index acccd08..99c2533 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -667,6 +667,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_PPC_RTAS 91
 #define KVM_CAP_IRQ_XICS 92
 #define KVM_CAP_ARM_EL1_32BIT 93
+#define KVM_CAP_SPAPR_MULTITCE 94
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
-- 
1.8.3.2



[PATCH 05/10] powerpc: add real mode support for dma operations on powernv

2013-07-31 Thread Alexey Kardashevskiy
The existing TCE machine calls (tce_build and tce_free) only support
virtual mode, as they call __raw_writeq for TCE invalidation, which
fails in real mode.

This introduces tce_build_rm and tce_free_rm real mode versions
which do mostly the same but use "Store Doubleword Caching Inhibited
Indexed" instruction for TCE invalidation.

This new feature is going to be utilized by real mode support of VFIO.

Signed-off-by: Alexey Kardashevskiy 
---
Changes:
2013/11/07:
* added comment why stdcix cannot be used in virtual mode

2013/08/07:
* tested on p7ioc and fixed a bug with realmode addresses

Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/include/asm/machdep.h| 12 
 arch/powerpc/platforms/powernv/pci-ioda.c | 47 +++
 arch/powerpc/platforms/powernv/pci.c  | 38 +
 arch/powerpc/platforms/powernv/pci.h  |  3 +-
 4 files changed, 81 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h
index 8b48090..07dd3b1 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -78,6 +78,18 @@ struct machdep_calls {
long index);
void(*tce_flush)(struct iommu_table *tbl);
 
+   /* _rm versions are for real mode use only */
+   int (*tce_build_rm)(struct iommu_table *tbl,
+long index,
+long npages,
+unsigned long uaddr,
+enum dma_data_direction direction,
+struct dma_attrs *attrs);
+   void(*tce_free_rm)(struct iommu_table *tbl,
+   long index,
+   long npages);
+   void(*tce_flush_rm)(struct iommu_table *tbl);
+
void __iomem *  (*ioremap)(phys_addr_t addr, unsigned long size,
   unsigned long flags, void *caller);
void(*iounmap)(volatile void __iomem *token);
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index d8140b1..5815f1d 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -70,6 +70,16 @@ define_pe_printk_level(pe_err, KERN_ERR);
 define_pe_printk_level(pe_warn, KERN_WARNING);
 define_pe_printk_level(pe_info, KERN_INFO);
 
+/*
+ * stdcix is only supposed to be used in hypervisor real mode as per
+ * the architecture spec
+ */
+static inline void __raw_rm_writeq(u64 val, volatile void __iomem *paddr)
+{
+   __asm__ __volatile__("stdcix %0,0,%1"
+   : : "r" (val), "r" (paddr) : "memory");
+}
+
 static int pnv_ioda_alloc_pe(struct pnv_phb *phb)
 {
unsigned long pe;
@@ -454,10 +464,13 @@ static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe, struct pci_bus *bus)
}
 }
 
-static void pnv_pci_ioda1_tce_invalidate(struct iommu_table *tbl,
-u64 *startp, u64 *endp)
+static void pnv_pci_ioda1_tce_invalidate(struct pnv_ioda_pe *pe,
+struct iommu_table *tbl,
+u64 *startp, u64 *endp, bool rm)
 {
-   u64 __iomem *invalidate = (u64 __iomem *)tbl->it_index;
+   u64 __iomem *invalidate = rm?
+   (u64 __iomem *)pe->tce_inval_reg_phys:
+   (u64 __iomem *)tbl->it_index;
unsigned long start, end, inc;
 
start = __pa(startp);
@@ -484,7 +497,10 @@ static void pnv_pci_ioda1_tce_invalidate(struct iommu_table *tbl,
 
 mb(); /* Ensure above stores are visible */
 while (start <= end) {
-__raw_writeq(start, invalidate);
+   if (rm)
+   __raw_rm_writeq(start, invalidate);
+   else
+   __raw_writeq(start, invalidate);
 start += inc;
 }
 
@@ -496,10 +512,12 @@ static void pnv_pci_ioda1_tce_invalidate(struct iommu_table *tbl,
 
 static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe,
 struct iommu_table *tbl,
-u64 *startp, u64 *endp)
+u64 *startp, u64 *endp, bool rm)
 {
unsigned long start, end, inc;
-   u64 __iomem *invalidate = (u64 __iomem *)tbl->it_index;
+   u64 __iomem *invalidate = rm?
+   (u64 __iomem *)pe->tce_inval_reg_phys:
+   (u64 __iomem *)tbl->it_index;
 
/* We'll invalidate DMA address in PE scope */
start = 0x2ul << 60;
@@ -515,22 +533,25 @@ static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe,
mb();
 
while (start <= end) {
-   __raw_writeq(start, invalidate);
+   if (rm)
+   

[PATCH 08/10] powerpc/iommu: rework to support realmode

2013-07-31 Thread Alexey Kardashevskiy
TCE table handling may differ between real and virtual modes, so
additional ppc_md.tce_build_rm/ppc_md.tce_free_rm/ppc_md.tce_flush_rm
handlers were introduced earlier.

So this adds the following:
1. support for the new ppc_md calls;
2. the ability for iommu_tce_build to process multiple entries per call;
3. an arch_spin_lock to protect the TCE table from races in both real and
virtual modes;
4. proper TCE table protection from races with the existing IOMMU code
in iommu_take_ownership/iommu_release_ownership;
5. the hwaddr variable renamed to hpa, as it better describes what it
actually represents;
6. iommu_tce_direction is now static, as it is not called from anywhere else.

This will be used by upcoming real mode support of VFIO on POWER.

Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/include/asm/iommu.h |   9 +-
 arch/powerpc/kernel/iommu.c  | 198 ++-
 2 files changed, 136 insertions(+), 71 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index c34656a..b01bde1 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -78,6 +78,7 @@ struct iommu_table {
unsigned long *it_map;   /* A simple allocation bitmap for now */
 #ifdef CONFIG_IOMMU_API
struct iommu_group *it_group;
+   arch_spinlock_t it_rm_lock;
 #endif
 };
 
@@ -152,9 +153,9 @@ extern int iommu_tce_clear_param_check(struct iommu_table *tbl,
 extern int iommu_tce_put_param_check(struct iommu_table *tbl,
unsigned long ioba, unsigned long tce);
 extern int iommu_tce_build(struct iommu_table *tbl, unsigned long entry,
-   unsigned long hwaddr, enum dma_data_direction direction);
-extern unsigned long iommu_clear_tce(struct iommu_table *tbl,
-   unsigned long entry);
+   unsigned long *hpas, unsigned long npages, bool rm);
+extern int iommu_free_tces(struct iommu_table *tbl, unsigned long entry,
+   unsigned long npages, bool rm);
 extern int iommu_clear_tces_and_put_pages(struct iommu_table *tbl,
unsigned long entry, unsigned long pages);
 extern int iommu_put_tce_user_mode(struct iommu_table *tbl,
@@ -164,7 +165,5 @@ extern void iommu_flush_tce(struct iommu_table *tbl);
 extern int iommu_take_ownership(struct iommu_table *tbl);
 extern void iommu_release_ownership(struct iommu_table *tbl);
 
-extern enum dma_data_direction iommu_tce_direction(unsigned long tce);
-
 #endif /* __KERNEL__ */
 #endif /* _ASM_IOMMU_H */
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index b20ff17..8314c80 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -903,7 +903,7 @@ void iommu_register_group(struct iommu_table *tbl,
kfree(name);
 }
 
-enum dma_data_direction iommu_tce_direction(unsigned long tce)
+static enum dma_data_direction iommu_tce_direction(unsigned long tce)
 {
if ((tce & TCE_PCI_READ) && (tce & TCE_PCI_WRITE))
return DMA_BIDIRECTIONAL;
@@ -914,7 +914,6 @@ enum dma_data_direction iommu_tce_direction(unsigned long tce)
else
return DMA_NONE;
 }
-EXPORT_SYMBOL_GPL(iommu_tce_direction);
 
 void iommu_flush_tce(struct iommu_table *tbl)
 {
@@ -972,73 +971,117 @@ int iommu_tce_put_param_check(struct iommu_table *tbl,
 }
 EXPORT_SYMBOL_GPL(iommu_tce_put_param_check);
 
-unsigned long iommu_clear_tce(struct iommu_table *tbl, unsigned long entry)
-{
-   unsigned long oldtce;
-   struct iommu_pool *pool = get_pool(tbl, entry);
-
-   spin_lock(&(pool->lock));
-
-   oldtce = ppc_md.tce_get(tbl, entry);
-   if (oldtce & (TCE_PCI_WRITE | TCE_PCI_READ))
-   ppc_md.tce_free(tbl, entry, 1);
-   else
-   oldtce = 0;
-
-   spin_unlock(&(pool->lock));
-
-   return oldtce;
-}
-EXPORT_SYMBOL_GPL(iommu_clear_tce);
-
 int iommu_clear_tces_and_put_pages(struct iommu_table *tbl,
unsigned long entry, unsigned long pages)
 {
-   unsigned long oldtce;
-   struct page *page;
-
-   for ( ; pages; --pages, ++entry) {
-   oldtce = iommu_clear_tce(tbl, entry);
-   if (!oldtce)
-   continue;
-
-   page = pfn_to_page(oldtce >> PAGE_SHIFT);
-   WARN_ON(!page);
-   if (page) {
-   if (oldtce & TCE_PCI_WRITE)
-   SetPageDirty(page);
-   put_page(page);
-   }
-   }
-
-   return 0;
+   return iommu_free_tces(tbl, entry, pages, false);
 }
 EXPORT_SYMBOL_GPL(iommu_clear_tces_and_put_pages);
 
-/*
- * hwaddr is a kernel virtual address here (0xc... bazillion),
- * tce_build converts it to a physical address.
- */
+int iommu_free_tces(struct iommu_table *tbl, unsigned long entry,
+   unsigned long npages, bool rm)
+{
+   int i, ret = 0, to_free = 0;
+
+   if (rm && !ppc_md.tce_free_rm)
+   return 

[PATCH 03/10] KVM: PPC: reserve a capability and ioctl numbers for realmode VFIO

2013-07-31 Thread Alexey Kardashevskiy
This is to reserve a capability number for upcoming support
of VFIO-IOMMU DMA operations in real mode.

Signed-off-by: Alexey Kardashevskiy 

---
Changes:
2013/07/16:
* changed the number

2013/07/11:
* changed order in a file, added comment about a gap in ioctl number

Signed-off-by: Alexey Kardashevskiy 
---
 include/uapi/linux/kvm.h | 4 
 1 file changed, 4 insertions(+)

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 99c2533..53c3f1f 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -668,6 +668,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_IRQ_XICS 92
 #define KVM_CAP_ARM_EL1_32BIT 93
 #define KVM_CAP_SPAPR_MULTITCE 94
+#define KVM_CAP_SPAPR_TCE_IOMMU 95
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -933,6 +934,9 @@ struct kvm_s390_ucas_mapping {
 #define KVM_ARM_SET_DEVICE_ADDR  _IOW(KVMIO,  0xab, struct kvm_arm_device_addr)
 /* Available with KVM_CAP_PPC_RTAS */
 #define KVM_PPC_RTAS_DEFINE_TOKEN _IOW(KVMIO,  0xac, struct kvm_rtas_token_args)
+/* 0xad and 0xaf are already taken */
+/* Available with KVM_CAP_SPAPR_TCE_IOMMU */
+#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO,  0xaf, struct kvm_create_spapr_tce_iommu)
 
 /* ioctl for vm fd */
 #define KVM_CREATE_DEVICE_IOWR(KVMIO,  0xe0, struct kvm_create_device)
-- 
1.8.3.2



[PATCH 10/10] KVM: PPC: Add hugepage support for IOMMU in-kernel handling

2013-07-31 Thread Alexey Kardashevskiy
This adds special support for huge pages (16MB) in real mode.

The reference counting cannot easily be done for such pages in real
mode (when the MMU is off), so we added a hash table of huge pages.
It is populated in virtual mode, and get_page is called just once
per huge page. Real mode handlers check if the requested page is
in the hash table; if it is, no reference counting is done, otherwise
an exit to virtual mode happens. The hash table is released at KVM
exit.

At the moment the fastest card available for tests uses up to 9 huge
pages so walking through this hash table does not cost much.
However this can change and we may want to optimize this.

Signed-off-by: Paul Mackerras 
Signed-off-by: Alexey Kardashevskiy 

---

Changes:
2013/07/12:
* removed multiple #ifdef IOMMU_API as IOMMU_API is always enabled
for KVM_BOOK3S_64

2013/06/27:
* list of huge pages replaces with hashtable for better performance
* spinlock removed from real mode; it now only protects insertion of new
huge page descriptors into the hashtable

2013/06/05:
* fixed compile error when CONFIG_IOMMU_API=n

2013/05/20:
* the real mode handler now searches for a huge page by gpa (used to be pte)
* the virtual mode handler prints a warning if it is called twice for the same
huge page, as the real mode handler is expected to fail just once - when a huge
page is not in the list yet.
* the huge page is refcounted twice - when added to the hugepage list and
when used in the virtual mode hcall handler (can be optimized but it will
make the patch less nice).

Conflicts:
arch/powerpc/kernel/iommu.c


Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/include/asm/kvm_host.h |  25 
 arch/powerpc/kernel/iommu.c |   6 +-
 arch/powerpc/kvm/book3s_64_vio.c| 121 ++--
 arch/powerpc/kvm/book3s_64_vio_hv.c |  32 +-
 4 files changed, 175 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 4eeaf7d..c57b25a 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -31,6 +31,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -183,9 +184,33 @@ struct kvmppc_spapr_tce_table {
u32 window_size;
struct iommu_group *grp;/* used for IOMMU groups */
struct vfio_group *vfio_grp;/* used for IOMMU groups */
+   DECLARE_HASHTABLE(hash_tab, ilog2(64)); /* used for IOMMU groups */
+   spinlock_t hugepages_write_lock;/* used for IOMMU groups */
struct page *pages[0];
 };
 
+/*
+ * The KVM guest can be backed with 16MB pages.
+ * In this case, we cannot do page counting from the real mode
+ * as the compound pages are used - they are linked in a list
+ * with pointers as virtual addresses which are inaccessible
+ * in real mode.
+ *
+ * The code below keeps a 16MB pages list and uses page struct
+ * in real mode if it is already locked in RAM and inserted into
+ * the list or switches to the virtual mode where it can be
+ * handled in a usual manner.
+ */
+#define KVMPPC_SPAPR_HUGEPAGE_HASH(gpa)	hash_32(gpa >> 24, 32)
+
+struct kvmppc_spapr_iommu_hugepage {
+   struct hlist_node hash_node;
+   unsigned long gpa;  /* Guest physical address */
+   unsigned long hpa;  /* Host physical address */
+   struct page *page;  /* page struct of the very first subpage */
+   unsigned long size; /* Huge page size (always 16MB at the moment) */
+};
+
 struct kvmppc_linear_info {
	void		*base_virt;
	unsigned long	 base_pfn;
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 8314c80..e4a8135 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -999,7 +999,8 @@ int iommu_free_tces(struct iommu_table *tbl, unsigned long entry,
if (!pg) {
ret = -EAGAIN;
} else if (PageCompound(pg)) {
-   ret = -EAGAIN;
+   /* Hugepages will be released at KVM exit */
+   ret = 0;
} else {
if (oldtce & TCE_PCI_WRITE)
SetPageDirty(pg);
@@ -1010,6 +1011,9 @@ int iommu_free_tces(struct iommu_table *tbl, unsigned long entry,
struct page *pg = pfn_to_page(oldtce >> PAGE_SHIFT);
if (!pg) {
ret = -EAGAIN;
+   } else if (PageCompound(pg)) {
+   /* Hugepages will be released at KVM exit */
+   ret = 0;
} else {
if (oldtce & TCE_PCI_WRITE)
SetPageDirty(pg);
diff --git 

[PATCH 07/10] KVM: PPC: Add support for multiple-TCE hcalls

2013-07-31 Thread Alexey Kardashevskiy
This adds real mode handlers for the H_PUT_TCE_INDIRECT and
H_STUFF_TCE hypercalls for user space emulated devices such as IBMVIO
devices or emulated PCI.  These calls allow adding multiple entries
(up to 512) into the TCE table in one call which saves time on
transition to/from real mode.

This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs
(copied from user and verified) before writing the whole list into
the TCE table. This cache will be utilized more in the upcoming
VFIO/IOMMU support to continue TCE list processing in the virtual
mode in the case if the real mode handler failed for some reason.

This adds a function to convert a guest physical address to a host
virtual address in order to parse a TCE list from H_PUT_TCE_INDIRECT.

This also implements the KVM_CAP_PPC_MULTITCE capability. When present,
the hypercalls mentioned above may or may not be processed successfully
in the kernel-based fast path. If they cannot be handled by the kernel,
they will get passed on to user space. So user space still has to have
an implementation for these despite the in kernel acceleration.
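The validate-then-commit flow described above can be sketched in plain C. put_tce_indirect() below is a hypothetical user-space model, not the kernel handler: the guest's list is first copied into a temporary cache (standing in for vcpu->arch.tce_tmp_hpas) and fully validated before a single entry is written, so a bad list leaves the table untouched:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TCE_VALID 1UL		/* hypothetical "entry is valid" bit */
#define MAX_TCES  512		/* H_PUT_TCE_INDIRECT limit from the patch */

/* Two-phase update: nothing is written unless every entry validates. */
static int put_tce_indirect(uint64_t *table, size_t ioba,
			    const uint64_t *list, size_t npages)
{
	uint64_t tmp[MAX_TCES];	/* models the tce_tmp cache */
	size_t i;

	if (npages > MAX_TCES)
		return -1;
	for (i = 0; i < npages; i++) {
		if (!(list[i] & TCE_VALID))
			return -1;	/* reject before touching the table */
		tmp[i] = list[i];
	}
	for (i = 0; i < npages; i++)
		table[ioba + i] = tmp[i];
	return 0;
}
```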

Signed-off-by: Paul Mackerras 
Signed-off-by: Alexey Kardashevskiy 

---
Changelog:
2013/08/01 (v7):
* realmode_get_page/realmode_put_page use was replaced with
get_page_unless_zero/put_page_unless_one

2013/07/11:
* addressed many, many comments from maintainers

2013/07/06:
* fixed number of wrong get_page()/put_page() calls

2013/06/27:
* fixed clear of BUSY bit in kvmppc_lookup_pte()
* H_PUT_TCE_INDIRECT does realmode_get_page() now
* KVM_CAP_SPAPR_MULTITCE now depends on CONFIG_PPC_BOOK3S_64
* updated doc

2013/06/05:
* fixed mistype about IBMVIO in the commit message
* updated doc and moved it to another section
* changed capability number

2013/05/21:
* added kvm_vcpu_arch::tce_tmp
* removed cleanup if put_indirect failed, instead we do not even start
writing to TCE table if we cannot get TCEs from the user and they are
invalid
* kvmppc_emulated_h_put_tce is split to kvmppc_emulated_put_tce
and kvmppc_emulated_validate_tce (for the previous item)
* fixed bug with failthrough for H_IPI
* removed all get_user() from real mode handlers
* kvmppc_lookup_pte() added (instead of making lookup_linux_pte public)

Signed-off-by: Alexey Kardashevskiy 
---
 Documentation/virtual/kvm/api.txt   |  26 
 arch/powerpc/include/asm/kvm_host.h |   9 ++
 arch/powerpc/include/asm/kvm_ppc.h  |  16 +-
 arch/powerpc/kvm/book3s_64_vio.c| 132 +++-
 arch/powerpc/kvm/book3s_64_vio_hv.c | 267 
 arch/powerpc/kvm/book3s_hv.c|  41 -
 arch/powerpc/kvm/book3s_hv_rmhandlers.S |   8 +-
 arch/powerpc/kvm/book3s_pr_papr.c   |  35 +
 arch/powerpc/kvm/powerpc.c  |   3 +
 9 files changed, 503 insertions(+), 34 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index ef925ea..1c8942a 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2382,6 +2382,32 @@ calls by the guest for that service will be passed to 
userspace to be
 handled.
 
 
+4.86 KVM_CAP_PPC_MULTITCE
+
+Capability: KVM_CAP_PPC_MULTITCE
+Architectures: ppc
+Type: vm
+
+This capability means the kernel is capable of handling hypercalls
+H_PUT_TCE_INDIRECT and H_STUFF_TCE without passing those to user
+space. This significantly accelerates DMA operations for PPC KVM guests.
+User space should expect that its handlers for these hypercalls
+are not going to be called if user space previously registered the LIOBN
+in KVM (via KVM_CREATE_SPAPR_TCE or similar calls).
+
+In order to enable H_PUT_TCE_INDIRECT and H_STUFF_TCE use in the guest,
+user space might have to advertise it for the guest. For example,
+IBM pSeries (sPAPR) guest starts using them if "hcall-multi-tce" is
+present in the "ibm,hypertas-functions" device-tree property.
+
+The hypercalls mentioned above may or may not be processed successfully
+in the kernel-based fast path. If they cannot be handled by the kernel,
+they will get passed on to user space. So user space still has to have
+an implementation for these despite the in kernel acceleration.
+
+This capability is always enabled.
+
+
 5. The kvm_run structure
 
 
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index af326cd..b8fe3de 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -30,6 +30,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -609,6 +610,14 @@ struct kvm_vcpu_arch {
spinlock_t tbacct_lock;
u64 busy_stolen;
u64 busy_preempt;
+
+	unsigned long *tce_tmp_hpas;	/* TCE cache for TCE_PUT_INDIRECT hcall */
+   enum {
+   TCERM_NONE,
+   TCERM_GETPAGE,
+   TCERM_PUTTCE,
+   TCERM_PUTLIST,
+   } tce_rm_fail;  /* failed stage of 

[PATCH 04/10] powerpc: Prepare to support kernel handling of IOMMU map/unmap

2013-07-31 Thread Alexey Kardashevskiy
The current VFIO-on-POWER implementation supports only user mode
driven mapping, i.e. QEMU is sending requests to map/unmap pages.
However this approach is really slow, so we want to move that to KVM.
Since H_PUT_TCE can be extremely performance sensitive (especially with
network adapters where each packet needs to be mapped/unmapped) we chose
to implement that as a "fast" hypercall directly in "real
mode" (processor still in the guest context but MMU off).

To be able to do that, we need to provide some facilities to
access the struct page count within that real mode environment as things
like the sparsemem vmemmap mappings aren't accessible.

This adds an API function realmode_pfn_to_page() to get page struct when
MMU is off.

This adds to MM a new function, put_page_unless_one(), which drops a page
reference if the counter is bigger than 1. It is going to be used when MMU is off
(for example, real mode on PPC64) and we want to make sure that page
release will not happen in real mode as it may crash the kernel in
a horrible way.

CONFIG_SPARSEMEM_VMEMMAP and CONFIG_FLATMEM are supported.
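The semantics of get_page_unless_zero()/put_page_unless_one() can be modelled with C11 atomics; this is a user-space sketch with struct page reduced to a bare counter. The key property is that put_page_unless_one() refuses to drop the final reference, so the freeing path can never be entered from real mode:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Minimal user-space model of the refcount helpers named in the patch. */
struct page { atomic_int count; };

static bool get_page_unless_zero(struct page *p)
{
	int c = atomic_load(&p->count);
	while (c != 0)	/* CAS loop; never resurrects a freed page */
		if (atomic_compare_exchange_weak(&p->count, &c, c + 1))
			return true;
	return false;
}

/* Drops a reference only while count > 1: the freeing path is never taken. */
static bool put_page_unless_one(struct page *p)
{
	int c = atomic_load(&p->count);
	while (c > 1)
		if (atomic_compare_exchange_weak(&p->count, &c, c - 1))
			return true;
	return false;
}
```

A caller that sees put_page_unless_one() fail must defer the final put to virtual mode, where the release path is safe to run.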

Cc: linux...@kvack.org
Cc: Benjamin Herrenschmidt 
Cc: Andrew Morton 
Reviewed-by: Paul Mackerras 
Signed-off-by: Paul Mackerras 
Signed-off-by: Alexey Kardashevskiy 

---

Changes:
2013/07/25 (v7):
* removed realmode_put_page and added put_page_unless_one() instead.
The name has been chosen to conform the already existing
get_page_unless_zero().
* removed realmode_get_page. Instead, get_page_unless_zero() should be used

2013/07/10:
* adjusted comment (removed sentence about virtual mode)
* get_page_unless_zero replaced with atomic_inc_not_zero to minimize
effect of a possible get_page_unless_zero() rework (if it ever happens).

2013/06/27:
* realmode_get_page() fixed to use get_page_unless_zero(). If failed,
the call will be passed from real to virtual mode and safely handled.
* added comment to PageCompound() in include/linux/page-flags.h.

2013/05/20:
* PageTail() is replaced by PageCompound() in order to have the same checks
for whether the page is huge in realmode_get_page() and realmode_put_page()

Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/include/asm/pgtable-ppc64.h |  2 ++
 arch/powerpc/mm/init_64.c| 50 +++-
 include/linux/mm.h   | 14 +
 include/linux/page-flags.h   |  4 ++-
 4 files changed, 68 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h
index 46db094..4a191c4 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64.h
@@ -394,6 +394,8 @@ static inline void mark_hpte_slot_valid(unsigned char *hpte_slot_array,
hpte_slot_array[index] = hidx << 4 | 0x1 << 3;
 }
 
+struct page *realmode_pfn_to_page(unsigned long pfn);
+
 static inline char *get_hpte_slot_array(pmd_t *pmdp)
 {
/*
diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index d0cd9e4..8cf345a 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -300,5 +300,53 @@ void vmemmap_free(unsigned long start, unsigned long end)
 {
 }
 
-#endif /* CONFIG_SPARSEMEM_VMEMMAP */
+/*
+ * We do not have access to the sparsemem vmemmap, so we fall back to
+ * walking the list of sparsemem blocks which we already maintain for
+ * the sake of crashdump. In the long run, we might want to maintain
+ * a tree if performance of that linear walk becomes a problem.
+ *
+ * realmode_pfn_to_page functions can fail due to:
+ * 1) As real sparsemem blocks do not lie in RAM contiguously (they
+ * are in a virtual address space which is not available in real mode),
+ * the requested page struct can be split between blocks so get_page/put_page
+ * may fail.
+ * 2) When huge pages are used, the get_page/put_page API will fail
+ * in real mode as the linked addresses in the page struct are virtual
+ * too.
+ */
+struct page *realmode_pfn_to_page(unsigned long pfn)
+{
+   struct vmemmap_backing *vmem_back;
+   struct page *page;
+   unsigned long page_size = 1 << mmu_psize_defs[mmu_vmemmap_psize].shift;
+   unsigned long pg_va = (unsigned long) pfn_to_page(pfn);
 
+   for (vmem_back = vmemmap_list; vmem_back; vmem_back = vmem_back->list) {
+   if (pg_va < vmem_back->virt_addr)
+   continue;
+
+   /* Check that page struct is not split between real pages */
+   if ((pg_va + sizeof(struct page)) >
+   (vmem_back->virt_addr + page_size))
+   return NULL;
+
+   page = (struct page *) (vmem_back->phys + pg_va -
+   vmem_back->virt_addr);
+   return page;
+   }
+
+   return NULL;
+}
+EXPORT_SYMBOL_GPL(realmode_pfn_to_page);
+
+#elif defined(CONFIG_FLATMEM)
+
+struct page *realmode_pfn_to_page(unsigned long pfn)
+{
+   struct page *page = 

[PATCH 06/10] KVM: PPC: enable IOMMU_API for KVM_BOOK3S_64 permanently

2013-07-31 Thread Alexey Kardashevskiy
It does not make much sense to have KVM in book3s-64bit and
not to have IOMMU bits for PCI pass through support as it costs little
and allows VFIO to function on book3s-kvm.

Having IOMMU_API always enabled makes it unnecessary to have a lot of
"#ifdef IOMMU_API" in arch/powerpc/kvm/book3s_64_vio*. With those
ifdefs we could only have user space emulated devices accelerated
(but not VFIO), which does not seem very useful.

Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/kvm/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index c55c538..3b2b761 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -59,6 +59,7 @@ config KVM_BOOK3S_64
depends on PPC_BOOK3S_64
select KVM_BOOK3S_64_HANDLER
select KVM
+   select SPAPR_TCE_IOMMU
---help---
  Support running unmodified book3s_64 and book3s_32 guest kernels
  in virtual machines on book3s_64 host processors.
-- 
1.8.3.2

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 01/10] hashtable: add hash_for_each_possible_rcu_notrace()

2013-07-31 Thread Alexey Kardashevskiy
This adds hash_for_each_possible_rcu_notrace() which is basically
a notrace clone of hash_for_each_possible_rcu() which cannot be
used in real mode due to its tracing/debugging capability.
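What bucket-scoped iteration means can be shown with a plain user-space hashtable; find() is a hypothetical stand-in that walks only the chain the key hashes to, which is all hash_for_each_possible_rcu_notrace() does once the RCU and tracing aspects are set aside:

```c
#include <assert.h>
#include <stddef.h>

/* User-space model of bucket-scoped lookup: only the chain the key
 * hashes to is walked, never the whole table. */
#define NBUCKETS 8

struct node { int key; struct node *next; };

static struct node *find(struct node *buckets[NBUCKETS], int key)
{
	struct node *n;

	/* hash_min() analogue: pick one bucket, then walk its chain. */
	for (n = buckets[key % NBUCKETS]; n; n = n->next)
		if (n->key == key)
			return n;
	return NULL;
}
```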

Signed-off-by: Alexey Kardashevskiy 
---
 include/linux/hashtable.h | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/include/linux/hashtable.h b/include/linux/hashtable.h
index a9df51f..af8b169 100644
--- a/include/linux/hashtable.h
+++ b/include/linux/hashtable.h
@@ -174,6 +174,21 @@ static inline void hash_del_rcu(struct hlist_node *node)
member)
 
 /**
+ * hash_for_each_possible_rcu_notrace - iterate over all possible objects
+ * hashing to the same bucket in an rcu enabled hashtable
+ * @name: hashtable to iterate
+ * @obj: the type * to use as a loop cursor for each entry
+ * @member: the name of the hlist_node within the struct
+ * @key: the key of the objects to iterate over
+ *
+ * This is the same as hash_for_each_possible_rcu() except that it does
+ * not do any RCU debugging or tracing.
+ */
+#define hash_for_each_possible_rcu_notrace(name, obj, member, key) \
+   hlist_for_each_entry_rcu_notrace(obj, [hash_min(key, 
HASH_BITS(name))],\
+   member)
+
+/**
  * hash_for_each_possible_safe - iterate over all possible objects hashing to 
the
  * same bucket safe against removals
  * @name: hashtable to iterate
-- 
1.8.3.2



[PATCH 09/10] KVM: PPC: Add support for IOMMU in-kernel handling

2013-07-31 Thread Alexey Kardashevskiy
This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
and H_STUFF_TCE requests targeted at an IOMMU TCE table without passing
them to user space, which saves time on switching to user space and back.

Both real and virtual modes are supported. The kernel first tries to
handle a TCE request in real mode; if that fails, it passes the request
to the virtual mode handler to complete the operation. If the virtual
mode handler also fails, the request is passed to user space.

The first user of this is VFIO on POWER. The external user API in VFIO
is required for this patch. The patch adds a new KVM_CAP_SPAPR_TCE_IOMMU
ioctl to associate a virtual PCI bus number (LIOBN) with a VFIO IOMMU
group fd and enable in-kernel handling of map/unmap requests.

Tests show that this patch increases transmission speed from 220MB/s
to 750..1020MB/s on 10Gb network (Chelsio CXGB3 10Gb ethernet card).

Signed-off-by: Paul Mackerras 
Signed-off-by: Alexey Kardashevskiy 

---

Changes:
2013/07/11:
* removed multiple #ifdef IOMMU_API as IOMMU_API is always enabled
for KVM_BOOK3S_64
* kvmppc_gpa_to_hva_and_get also returns host phys address. Not much sense
for this here but the next patch for hugepages support will use it more.

2013/07/06:
* added realmode arch_spin_lock to protect TCE table from races
in real and virtual modes
* POWERPC IOMMU API is changed to support real mode
* iommu_take_ownership and iommu_release_ownership are protected by
iommu_table's locks
* VFIO external user API use rewritten
* multiple small fixes

2013/06/27:
* tce_list page is referenced now in order to protect it from accident
invalidation during H_PUT_TCE_INDIRECT execution
* added use of the external user VFIO API

2013/06/05:
* changed capability number
* changed ioctl number
* update the doc article number

2013/05/20:
* removed get_user() from real mode handlers
* kvm_vcpu_arch::tce_tmp usage extended. Now real mode handler puts there
translated TCEs, tries realmode_get_page() on those and if it fails, it
passes control over the virtual mode handler which tries to finish
the request handling
* kvmppc_lookup_pte() now does realmode_get_page() protected by BUSY bit
on a page
* The only reason to pass the request to user mode now is when the user mode
did not register TCE table in the kernel, in all other cases the virtual mode
handler is expected to do the job

Signed-off-by: Alexey Kardashevskiy 
---
 Documentation/virtual/kvm/api.txt   |  26 
 arch/powerpc/include/asm/kvm_host.h |   3 +
 arch/powerpc/include/asm/kvm_ppc.h  |   2 +
 arch/powerpc/include/uapi/asm/kvm.h |   7 +
 arch/powerpc/kvm/book3s_64_vio.c| 296 +++-
 arch/powerpc/kvm/book3s_64_vio_hv.c | 122 +++
 arch/powerpc/kvm/powerpc.c  |  12 ++
 7 files changed, 463 insertions(+), 5 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 1c8942a..6ae65bd 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2408,6 +2408,32 @@ an implementation for these despite the in kernel acceleration.
 This capability is always enabled.
 
 
+4.87 KVM_CREATE_SPAPR_TCE_IOMMU
+
+Capability: KVM_CAP_SPAPR_TCE_IOMMU
+Architectures: powerpc
+Type: vm ioctl
+Parameters: struct kvm_create_spapr_tce_iommu (in)
+Returns: 0 on success, -1 on error
+
+struct kvm_create_spapr_tce_iommu {
+   __u64 liobn;
+   __u32 fd;
+   __u32 flags;
+};
+
+This creates a link between IOMMU group and a hardware TCE (translation
+control entry) table. This link lets the host kernel know what IOMMU
+group (i.e. TCE table) to use for the LIOBN number passed with
+H_PUT_TCE, H_PUT_TCE_INDIRECT, H_STUFF_TCE hypercalls.
+
+User space passes a VFIO group fd. Using the external user VFIO API,
+KVM tries to get the IOMMU id from the passed fd. If this succeeds,
+acceleration is turned on; if it fails, map/unmap requests are passed
+to user space.
+
+No flag is supported at the moment.
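As a cross-check, the proposed argument structure packs into 16 bytes with no implicit padding. The user-space mirror below (with __u64/__u32 spelled as stdint types; the struct name is copied from the proposal, while any ioctl call around it would be illustrative, since only the numbers were reserved at this point) makes the layout testable:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* User-space mirror of the proposed ioctl argument structure. */
struct kvm_create_spapr_tce_iommu {
	uint64_t liobn;		/* logical bus number the guest uses in hcalls */
	uint32_t fd;		/* VFIO group file descriptor */
	uint32_t flags;		/* no flags defined yet; must be 0 */
};
```

User space would fill this in and issue the VM ioctl once per LIOBN; if the kernel cannot resolve the VFIO group, map/unmap requests simply keep going to user space.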
+
+
 5. The kvm_run structure
 
 
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index b8fe3de..4eeaf7d 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -181,6 +181,8 @@ struct kvmppc_spapr_tce_table {
struct kvm *kvm;
u64 liobn;
u32 window_size;
+   struct iommu_group *grp;/* used for IOMMU groups */
+   struct vfio_group *vfio_grp;/* used for IOMMU groups */
struct page *pages[0];
 };
 
@@ -612,6 +614,7 @@ struct kvm_vcpu_arch {
u64 busy_preempt;
 
	unsigned long *tce_tmp_hpas;	/* TCE cache for TCE_PUT_INDIRECT hcall */
+	unsigned long tce_tmp_num;	/* Number of handled TCEs in the cache */
enum {
TCERM_NONE,
TCERM_GETPAGE,
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 0ce4691..297cab5 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ 

[PATCH 00/10 v7] KVM: PPC: IOMMU in-kernel handling

2013-07-31 Thread Alexey Kardashevskiy
This accelerates VFIO DMA operations on POWER by moving them
into the kernel.

The changes in this series are:
1. rebased on v3.11-rc3.
2. VFIO external user API will go through VFIO tree so it is
excluded from this series.
3. As nobody ever reacted to "hashtable: add hash_for_each_possible_rcu_notrace()",
Ben suggested pushing it via his tree, so I included it in the series.
4. realmode_(get|put)_page is reworked.

More details in the individual patch comments.

Alexey Kardashevskiy (10):
  hashtable: add hash_for_each_possible_rcu_notrace()
  KVM: PPC: reserve a capability number for multitce support
  KVM: PPC: reserve a capability and ioctl numbers for realmode VFIO
  powerpc: Prepare to support kernel handling of IOMMU map/unmap
  powerpc: add real mode support for dma operations on powernv
  KVM: PPC: enable IOMMU_API for KVM_BOOK3S_64 permanently
  KVM: PPC: Add support for multiple-TCE hcalls
  powerpc/iommu: rework to support realmode
  KVM: PPC: Add support for IOMMU in-kernel handling
  KVM: PPC: Add hugepage support for IOMMU in-kernel handling

 Documentation/virtual/kvm/api.txt |  52 +++
 arch/powerpc/include/asm/iommu.h  |   9 +-
 arch/powerpc/include/asm/kvm_host.h   |  37 +++
 arch/powerpc/include/asm/kvm_ppc.h|  18 +-
 arch/powerpc/include/asm/machdep.h|  12 +
 arch/powerpc/include/asm/pgtable-ppc64.h  |   2 +
 arch/powerpc/include/uapi/asm/kvm.h   |   7 +
 arch/powerpc/kernel/iommu.c   | 202 +++
 arch/powerpc/kvm/Kconfig  |   1 +
 arch/powerpc/kvm/book3s_64_vio.c  | 533 +-
 arch/powerpc/kvm/book3s_64_vio_hv.c   | 405 +--
 arch/powerpc/kvm/book3s_hv.c  |  41 ++-
 arch/powerpc/kvm/book3s_hv_rmhandlers.S   |   8 +-
 arch/powerpc/kvm/book3s_pr_papr.c |  35 ++
 arch/powerpc/kvm/powerpc.c|  15 +
 arch/powerpc/mm/init_64.c |  50 ++-
 arch/powerpc/platforms/powernv/pci-ioda.c |  47 ++-
 arch/powerpc/platforms/powernv/pci.c  |  38 ++-
 arch/powerpc/platforms/powernv/pci.h  |   3 +-
 include/linux/hashtable.h |  15 +
 include/linux/mm.h|  14 +
 include/linux/page-flags.h|   4 +-
 include/uapi/linux/kvm.h  |   5 +
 23 files changed, 1430 insertions(+), 123 deletions(-)

-- 
1.8.3.2



Re: [PATCH 4/9] dma: edma: Find missed events and issue them

2013-07-31 Thread Joel Fernandes
On 07/31/2013 09:27 PM, Joel Fernandes wrote:
> On 07/31/2013 04:18 AM, Sekhar Nori wrote:
>> On Wednesday 31 July 2013 10:19 AM, Joel Fernandes wrote:
>>> Hi Sekhar,
>>>
>>> On 07/30/2013 02:05 AM, Sekhar Nori wrote:
 On Monday 29 July 2013 06:59 PM, Joel Fernandes wrote:
> In an effort to move to using Scatter gather lists of any size with
> EDMA as discussed at [1] instead of placing limitations on the driver,
> we work through the limitations of the EDMAC hardware to find missed
> events and issue them.
>
> The sequence of events that require this are:
>
> For the scenario where MAX slots for an EDMA channel is 3:
>
> SG1 -> SG2 -> SG3 -> SG4 -> SG5 -> SG6 -> Null
>
> The above SG list will have to be DMA'd in 2 sets:
>
> (1) SG1 -> SG2 -> SG3 -> Null
> (2) SG4 -> SG5 -> SG6 -> Null
>
> After (1) is successfully transferred, the events from the MMC controller
> do not stop coming and are missed by the time we have set up the transfer
> for (2). So here, we catch the missed events as an error condition and
> issue them manually.
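The 2-set split described in the quoted changelog is ceiling division over the per-channel slot limit; a minimal sketch, where MAX_NR_SG and nr_sg_sets() are illustrative names rather than driver symbols:

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NR_SG 3	/* slots assumed available per EDMA channel above */

/* Number of NULL-terminated sets needed to walk an SG list of nents
 * entries when at most MAX_NR_SG entries can be linked per set. */
static size_t nr_sg_sets(size_t nents)
{
	return (nents + MAX_NR_SG - 1) / MAX_NR_SG;
}
```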

 Are you sure there won't be any effect of these missed events on the
 peripheral side? For example, won't McASP get into an underrun condition
 when it encounters a null PaRAM set? Even UART has to transmit to a
>>>
>>> But it will not encounter null PaRAM set because McASP uses contiguous
>>> buffers for transfer which are not scattered across physical memory.
>>> This can be accomplished with an SG of size 1. For such SGs, this patch
>>> series leaves it linked Dummy and does not link to Null set. Null set is
>>> only used for SG lists that are > MAX_NR_SG in size such as those
>>> created for example by MMC and Crypto.
>>>
 particular baud so I guess it cannot wait like the way MMC/SD can.
>>>
>>> Existing driver have to wait anyway if they hit MAX SG limit today. If
>>> they don't want to wait, they would have allocated a contiguous block of
>>> memory and DMA that in one stretch so they don't lose any events, and in
>>> such cases we are not linking to Null.
>>
>> As long as DMA driver can advertize its MAX SG limit, peripherals can
>> always work around that by limiting the number of sync events they
>> generate so as to not having any of the events getting missed. With this
>> series, I am worried that EDMA drivers is advertizing that it can handle
>> any length SG list while not taking care of missing any events while
>> doing so. This will break the assumptions that driver writers make.
> 
> This is already being done by some other DMA engine drivers ;). We can
> advertise more than we can handle at a time, that's the basis of this
> whole idea.
> 
> I understand what you're saying but events are not something that have
> be serviced immediately, they can be queued etc and the actually
> transfer from the DMA controller can be delayed. As long as we don't
> miss the event we are fine which my series takes care off.
> 
> So far I have tested this series on following modules in various
> configurations and have seen no issues:
> - Crypto AES
> - MMC/SD
> - SPI (128x160 display)
> 
 Also, won't this lead to under-utilization of the peripheral bandwidth?
 Meaning, MMC/SD is ready with data but cannot transfer because the DMA
 is waiting to be set up.
>>>
>>> But it is waiting anyway even today. Currently based on MAX segs, MMC
>>> driver/subsystem will make SG list of size max_segs. Between these
>>> sessions of creating such smaller SG-lists, if for some reason the MMC
>>> controller is sending events, these will be lost anyway.
>>
>> But if MMC/SD driver knows how many events it should generate if it
>> knows the MAX SG limit. So there should not be any missed events in
>> current code. And I am not claiming that your solution is making matters
>> worse. But it's not making it much better either.
> 
> This is not true for crypto, the events are not deasserted and crypto
> continues to send events. This is what led to the "don't trigger in
> Null" patch where I'm setting the missed flag to avoid recursion.
> 
>>> This can be used only for buffers that are contiguous in memory, not
>>> those that are scattered across memory.
>>
>> I was hinting at using the linking facility of EDMA to achieve this.
>> Each PaRAM set has full 32-bit source and destination pointers so I see
>> no reason why non-contiguous case cannot be handled.
>>
>> Lets say you need to transfer SG[0..6] on channel C. Now, PaRAM sets are
>> typically 4 times the number of channels. In this case we use one DMA
>> PaRAM set and two Link PaRAM sets per channel. P0 is the DMA PaRAM set
>> and P1 and P2 are the Link sets.
>>
>> Initial setup:
>>
>> SG0 -> SG1 -> SG2 -> SG3 -> SG4 -> SG5 -> SG6 -> NULL
>>  ^  ^  ^
>>  |  |  |
>> P0  -> P1  -> P2  -> NULL
>>
>> P[0..2].TCINTEN = 1, so get an interrupt after each SG element
>> completion. On each completion interrupt, hardware 

Re: [patch 3/3] mm: page_alloc: fair zone allocator policy

2013-07-31 Thread Rik van Riel

On 07/31/2013 10:56 PM, Minchan Kim wrote:


Yes, it's not really a slow path because it can return to normal status
without calling significant slow functions, by resetting the batch count
in prepare_slowpath.

I think it's a tradeoff, and I am biased toward your approach even though
we would lose a little performance, because fair aging would recover the
loss from the fastpath's overhead. But who knows? Someone may have a concern.

So we should mention such problems.


If the atomic operation in the fast path turns out to be a problem,
I suspect we may be able to fix it by using per-cpu counters, and
consolidating those every once in a while.

However, it may be good to see whether there is a problem in the
first place, before adding complexity.
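The per-cpu counter idea can be sketched in user space. counter_add(), the fold threshold, and NR_CPUS are all hypothetical here; the point is that the shared counter is touched only when a CPU-local delta grows large, while an exact read folds the deltas in:

```c
#include <assert.h>

#define NR_CPUS 4
#define FOLD_THRESHOLD 32	/* consolidate once a local delta grows this big */

static long percpu_delta[NR_CPUS];
static long global_count;

/* Fast path: bump a CPU-local counter; touch the shared one only rarely. */
static void counter_add(int cpu, long v)
{
	percpu_delta[cpu] += v;
	if (percpu_delta[cpu] >= FOLD_THRESHOLD ||
	    percpu_delta[cpu] <= -FOLD_THRESHOLD) {
		global_count += percpu_delta[cpu];	/* would be atomic_add() */
		percpu_delta[cpu] = 0;
	}
}

/* Slow path: fold everything for an exact value. */
static long counter_read_exact(void)
{
	long sum = global_count;
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		sum += percpu_delta[cpu];
	return sum;
}
```

This mirrors the structure of the kernel's percpu_counter; whether the approximate global value is good enough for the fairness check is exactly the tradeoff under discussion.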

--
All rights reversed


Re: [PATCH 1/3] fs: Introduce new flag FALLOC_FL_COLLAPSE_RANGE

2013-07-31 Thread Dave Chinner
On Thu, Aug 01, 2013 at 12:06:05AM -0400, Theodore Ts'o wrote:
> On Thu, Aug 01, 2013 at 12:59:14PM +1000, Dave Chinner wrote:
> > 
> > This functionality is not introducing any new problems w.r.t. mmap().
> > In terms of truncating a mmap'd file, that can already occur and
> > the behaviour is well known.
> 
> That's not the race I'm worried about.  Consider the following scenario.
> We have a 10 meg file, and the first five megs are mapped read/write.
> We do a collapse range of one meg starting at the one meg boundary.
> Menwhile another process is constantly modifying the pages between the
> 3 meg and 4 meg mark, via a mmap'ed region of memory.
> 
> You can have the code which does the collapse range call
> filemap_fdatawrite_range() to flush out to disk all of the inode's
> dirty pages in the page cache, but between the call to
> filemap_fdatawrite_range() and truncate_page_cache_range(), more pages
> could have gotten dirtied, and those changes will get lost.

That's one of the many race conditions that Jan's patches are
intended to solve. Locking the range and then unmapping the pages
will then cause the refault to block until the range is unlocked.

While the range is locked, we can then writeback all the dirty
pages, invalidate the page cache and modify the underlying extents
without having to care about user page faults racing with the
operation.

Let's face it, without that we currently cannot serialise *any*
multi-page page cache operation against mmap() - direct IO has
exactly the same problems as you are describing, as does hole
punching, and anything else that relies on the i_mutex or other
IO-based locks for serialisation.

> The difference between punch_hole and collapse_range is that
> previously, we only needed to call truncate_page_cache_range() on the
> pages that were going to be disappearing from the page cache, so it
> didn't matter if you had someone dirtying the pages racing with the
> punch_hole operation.  But in the collapse_range case, we need to call
> truncate_page_cache_range() on data that is not going to disappear,
> but rather be _renumbered_.

Sure - as I said up front, "the range of the pages being operated
on" needs to be invalidated. And, if you look at the ext4 patch, it
actually does exactly that.

> I'll also note that depending on the use case, doing the renumbering
> of the pages by throwing all of the pages from the page cache and
> forcing them to be read back in from disk might not all be friendly to
> a performance sensitive application.

Application problem.

Besides, this is for accelerating video stream editing (i.e. NLE
workflows). Such applications tend to use direct IO for recording
and playback, and they most certainly won't have concurrent access
to the video stream either by mmap or direct/buffered IO while the
non-linear editor is punching the advertisements out of the recorded
video stream.

> In the case where the page size == the fs block size, instead of
> throwing away all of the pages, we could also have VM code which
> remaps the pages in the page cache (including potentially editing the
> page tables for any mmap'ed pages).

That way lies insanity. Not to mention that it's completely
unnecessary for the use case this ioctl is actually accelerating.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 1/3] fs: Introduce new flag FALLOC_FL_COLLAPSE_RANGE

2013-07-31 Thread Theodore Ts'o
On Thu, Aug 01, 2013 at 12:59:14PM +1000, Dave Chinner wrote:
> 
> This functionality is not introducing any new problems w.r.t. mmap().
> In terms of truncating a mmap'd file, that can already occur and
> the behaviour is well known.

That's not the race I'm worried about.  Consider the following scenario.
We have a 10 meg file, and the first five megs are mapped read/write.
We do a collapse range of one meg starting at the one meg boundary.
Meanwhile another process is constantly modifying the pages between the
3 meg and 4 meg mark, via a mmap'ed region of memory.

You can have the code which does the collapse range call
filemap_fdatawrite_range() to flush out to disk all of the inode's
dirty pages in the page cache, but between the call to
filemap_fdatawrite_range() and truncate_page_cache_range(), more pages
could have gotten dirtied, and those changes will get lost.

The difference between punch_hole and collapse_range is that
previously, we only needed to call truncate_page_cache_range() on the
pages that were going to be disappearing from the page cache, so it
didn't matter if you had someone dirtying the pages racing with the
punch_hole operation.  But in the collapse_range case, we need to call
truncate_page_cache_range() on data that is not going to disappear,
but rather be _renumbered_.

I'll also note that depending on the use case, doing the renumbering
of the pages by throwing all of the pages from the page cache and
forcing them to be read back in from disk might not be all that friendly
to a performance-sensitive application.

In the case where the page size == the fs block size, instead of
throwing away all of the pages, we could also have VM code which
remaps the pages in the page cache (including potentially editing the
page tables for any mmap'ed pages).  This of course gets _much_ more
complicated, and we still have to deal with the case where the
collapse_range call is aligned with the fs block size but is
not aligned with, or not a multiple of, the page size.

- Ted


Re: [PATCH v2 1/3] ALSA: Added jack detection kcontrol support

2013-07-31 Thread Felipe Ferreri Tonello

On 07/29/2013 07:19 AM, Takashi Iwai wrote:

At Fri, 26 Jul 2013 15:45:11 -0700,
Felipe F. Tonello wrote:


From: "Felipe F. Tonello" 

This patch adds jack support for alsa kcontrol.

This support is necessary since the new kcontrol is used by user-space
daemons, such as PulseAudio (>=2.0), to do jack detection.

Signed-off-by: Felipe F. Tonello 


This patch breaks the build.


I'm planing to send a v3 in few days.

Felipe Tonello



Re: [PATCH 4/4] ALSA: oxygen: Updating jack implementation according new ALSA Jacks

2013-07-31 Thread Felipe Ferreri Tonello

Hi Takashi,

On 07/29/2013 05:10 AM, Takashi Iwai wrote:

At Fri, 26 Jul 2013 12:02:51 -0700,
Felipe Tonello wrote:


Hi Mark,

On Fri, Jul 26, 2013 at 11:56 AM, Mark Brown  wrote:

On Fri, Jul 26, 2013 at 11:25:33AM -0700, Felipe F. Tonello wrote:

From: "Felipe F. Tonello" 

ALSA standard jacks already are implemented using ALSA KControl.
So there is no need implement that itself or to use snd_jack
for input events only.


Similar changlog comment as on the ASoC patch and...


Ok.




   snd_jack_new(chip->card, "Headphone",
-  SND_JACK_HEADPHONE, >hp_jack);
+  SND_JACK_HEADPHONE, 0, >hp_jack);
   xonar_ds_handle_hp_jack(chip);


...this really ought to be done as part of the commit that adds the
parameter since it breaks the build until this patch is applied.


But that's why it is a patch series. But as you say, are you suggesting
that I propose these changes in a single patch?


The basic rule of the patch series is:
- they are split in a logical manner
- each commit must not break the build

The second rule is important especially for bisection and
backporting.  You could have done it in a different way (e.g. adding a
new function like snd_jack_new_with_index() instead of changing
snd_jack_new() itself).  Which way is better depends on the
implementation details.


In this case I think this will not solve the issue, because my
proposal is to make the control creation standard for jacks.


Felipe Tonello


Re: Re: [RFC][PATCH 3/4] tracing/kprobes: Fail to unregister if probe event files are open

2013-07-31 Thread Masami Hiramatsu
(2013/08/01 11:50), Steven Rostedt wrote:
> Here's the new change log, but the same patch. Does this sound ok to you
> guys?

Great! This looks good to me. ;)

Thank you very much!

> 
> -- Steve
> 
>>From 40c32592668b727cbfcf7b1c0567f581bd62a5e4 Mon Sep 17 00:00:00 2001
> From: "Steven Rostedt (Red Hat)" 
> Date: Wed, 3 Jul 2013 23:33:50 -0400
> Subject: [PATCH] tracing/kprobes: Fail to unregister if probe event files are
>  in use
> 
> When a probe is being removed, it cleans up the event files that correspond
> to the probe. But there is a race between writing to one of these files
> and deleting the probe. This is especially true for the "enable" file.
> 
>   CPU 0   CPU 1
>   -   -
> 
> fd = open("enable",O_WRONLY);
> 
>   probes_open()
>   release_all_trace_probes()
>   unregister_trace_probe()
>   if (trace_probe_is_enabled(tp))
>   return -EBUSY
> 
>  write(fd, "1", 1)
>  __ftrace_set_clr_event()
>  call->class->reg()
>   (kprobe_register)
>enable_trace_probe(tp)
> 
>   __unregister_trace_probe(tp);
>   list_del(>list)
>   unregister_probe_event(tp) <-- fails!
>   free_trace_probe(tp)
> 
>  write(fd, "0", 1)
>  __ftrace_set_clr_event()
>  call->class->unreg
>   (kprobe_register)
>   disable_trace_probe(tp) <-- BOOM!
> 
> A test program was written that used two threads to simulate the
> above scenario adding a nanosleep() interval to change the timings
> and after several thousand runs, it was able to trigger this bug
> and crash:
> 
> BUG: unable to handle kernel paging request at 000500f9
> IP: [] probes_open+0x3b/0xa7
> PGD 7808a067 PUD 0
> Oops:  [#1] PREEMPT SMP
> Dumping ftrace buffer:
> -
> Modules linked in: ipt_MASQUERADE sunrpc ip6t_REJECT nf_conntrack_ipv6
> CPU: 1 PID: 2070 Comm: test-kprobe-rem Not tainted 3.11.0-rc3-test+ #47
> Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by 
> O.E.M., BIOS SDBLI944.86P 05/08/2007
> task: 880077756440 ti: 880076e52000 task.ti: 880076e52000
> RIP: 0010:[]  [] probes_open+0x3b/0xa7
> RSP: 0018:880076e53c38  EFLAGS: 00010203
> RAX: 00050001 RBX: 88007844f440 RCX: 0003
> RDX: 0003 RSI: 0003 RDI: 880076e52000
> RBP: 880076e53c58 R08: 880076e53bd8 R09: 
> R10: 880077756440 R11: 0006 R12: 810dee35
> R13: 880079250418 R14:  R15: 88007844f450
> FS:  7f87a276f700() GS:88007d48() knlGS:
> CS:  0010 DS:  ES:  CR0: 8005003b
> CR2: 000500f9 CR3: 77262000 CR4: 07e0
> Stack:
>  880076e53c58 81219ea0 88007844f440 810dee35
>  880076e53ca8 81130f78 8800772986c0 8800796f93a0
>  81d1b5d8 880076e53e04  88007844f440
> Call Trace:
>  [] ? security_file_open+0x2c/0x30
>  [] ? unregister_trace_probe+0x4b/0x4b
>  [] do_dentry_open+0x162/0x226
>  [] finish_open+0x46/0x54
>  [] do_last+0x7f6/0x996
>  [] ? inode_permission+0x42/0x44
>  [] path_openat+0x232/0x496
>  [] do_filp_open+0x3a/0x8a
>  [] ? __alloc_fd+0x168/0x17a
>  [] do_sys_open+0x70/0x102
>  [] ? trace_hardirqs_on_caller+0x160/0x197
>  [] SyS_open+0x1e/0x20
>  [] system_call_fastpath+0x16/0x1b
> Code: e5 41 54 53 48 89 f3 48 83 ec 10 48 23 56 78 48 39 c2 75 6c 31 f6 48 c7
> RIP  [] probes_open+0x3b/0xa7
>  RSP 
> CR2: 000500f9
> ---[ end trace 35f17d68fc569897 ]---
> 
> The unregister_trace_probe() must be done first, and if it fails it must
> fail the removal of the kprobe.
> 
> Several changes have already been made by Oleg Nesterov and Masami Hiramatsu
> to allow moving the unregister_probe_event() before the removal of
> the probe and exit the function if it fails. This prevents the tp
> structure from being used after it is freed.
> 
> Link: http://lkml.kernel.org/r/20130704034038.819592...@goodmis.org
> 
> Acked-by: Masami Hiramatsu 
> Signed-off-by: Steven Rostedt 
> ---
>  kernel/trace/trace_kprobe.c | 21 +++--
>  1 file changed, 15 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
> index 3811487..243f683 100644
> --- a/kernel/trace/trace_kprobe.c
> +++ b/kernel/trace/trace_kprobe.c
> @@ -95,7 +95,7 @@ static __kprobes bool trace_probe_is_on_module(struct 
> trace_probe *tp)
>  }
>  
>  static int register_probe_event(struct trace_probe *tp);
> -static void unregister_probe_event(struct trace_probe *tp);
> +static int unregister_probe_event(struct 

Re: [PATCH 1/4] ALSA: Added jack detection kcontrol support

2013-07-31 Thread Felipe Ferreri Tonello

Hi Takashi,

On 07/29/2013 05:05 AM, Takashi Iwai wrote:

At Sat, 27 Jul 2013 13:25:24 +0100,
Mark Brown wrote:


On Fri, Jul 26, 2013 at 04:13:40PM -0700, Felipe Tonello wrote:

On Fri, Jul 26, 2013 at 3:48 PM, Mark Brown  wrote:



What I'd expect to happen here is that for multi function jacks we
create a control per function if the controls are valid.



Ok, so the idea is just to change the control to type integer instead
of boolean, right?
Because as you say, the user will be able to check the type of jack
based on the status value, right?


It might be more idiomatic and more compatible with userspace to create
multiple controls for the jack, there was some discussion of this in the
past I think but I can't remember the result.


I also forget the end result :)

Another option would be to make an enum control, but multiple boolean
controls are easier in the end.  The headset controls of HD-audio are
represented in that way, too (because HD-audio specification can't
tell better).


Yes. Thanks for the comment.

Felipe Tonello


Re: [PATCH 1/4] ALSA: Added jack detection kcontrol support

2013-07-31 Thread Felipe Ferreri Tonello

On 07/27/2013 05:25 AM, Mark Brown wrote:

On Fri, Jul 26, 2013 at 04:13:40PM -0700, Felipe Tonello wrote:

On Fri, Jul 26, 2013 at 3:48 PM, Mark Brown  wrote:



What I'd expect to happen here is that for multi function jacks we
create a control per function if the controls are valid.



Ok, so the idea is just to change the control to type integer instead
of boolean, right?
Because as you say, the user will be able to check the type of jack
based on the status value, right?


It might be more idiomatic and more compatible with userspace to create
multiple controls for the jack, there was some discussion of this in the
past I think but I can't remember the result.


Yes.  If there's only one function supported the current code is fine
but for multiple functions it's going to discard useful information.



So, what do you suggest to do that? I'm not sure if I understand what
you are saying.
When you mean function, do you mean the SND_JACK_BTN_n or the jack
types, such as SND_JACK_HEADPHONE, and so on?


The jack types; the buttons definitely should be going up as input
events.


If a codec creates a jack type SND_JACK_HEADSET (= SND_JACK_HEADPHONE
| SND_JACK_MICROPHONE). It should be created two controls, name +
"Headphone Jack" and name + "Microphone Jack"? If so, what about the
status to report? How to know which control to report?


The drivers report a bitmask for status.



I did that, but I'm not happy with the control name. Usually drivers add
jacks like: "Headset" for a headset, "Headphone" for a headphone, and so on.
I did the following: the control name is jack name + (jack type) + Jack. If
jack type == jack name, don't add (jack type) to the name.


Any suggestion how it should be?


Re: [PATCH 4/9] dma: edma: Find missed events and issue them

2013-07-31 Thread Joel Fernandes
On 07/31/2013 09:27 PM, Joel Fernandes wrote:
> On 07/31/2013 04:18 AM, Sekhar Nori wrote:
>> On Wednesday 31 July 2013 10:19 AM, Joel Fernandes wrote:
>>> Hi Sekhar,
>>>
>>> On 07/30/2013 02:05 AM, Sekhar Nori wrote:
 On Monday 29 July 2013 06:59 PM, Joel Fernandes wrote:
> In an effort to move to using Scatter gather lists of any size with
> EDMA as discussed at [1] instead of placing limitations on the driver,
> we work through the limitations of the EDMAC hardware to find missed
> events and issue them.
>
> The sequence of events that require this are:
>
> For the scenario where MAX slots for an EDMA channel is 3:
>
> SG1 -> SG2 -> SG3 -> SG4 -> SG5 -> SG6 -> Null
>
> The above SG list will have to be DMA'd in 2 sets:
>
> (1) SG1 -> SG2 -> SG3 -> Null
> (2) SG4 -> SG5 -> SG6 -> Null
>
> After (1) is successfully transferred, the events from the MMC controller
> do not stop coming and are missed by the time we have set up the transfer
> for (2). So here, we catch the missed events as an error condition and
> issue them manually.

 Are you sure there won't be any effect of these missed events on the
 peripheral side. For example, won't McASP get into an underrun condition
 when it encounters a null PaRAM set? Even UART has to transmit at a
>>>
>>> But it will not encounter null PaRAM set because McASP uses contiguous
>>> buffers for transfer which are not scattered across physical memory.
>>> This can be accomplished with an SG of size 1. For such SGs, this patch
>>> series leaves it linked Dummy and does not link to Null set. Null set is
>>> only used for SG lists that are > MAX_NR_SG in size such as those
>>> created for example by MMC and Crypto.
>>>
>>> particular baud rate so I guess it cannot wait the way MMC/SD can.
>>>
>>> Existing drivers have to wait anyway if they hit the MAX SG limit today. If
>>> they don't want to wait, they would have allocated a contiguous block of
>>> memory and DMA'd that in one stretch so they don't lose any events, and in
>>> such cases we are not linking to Null.
>>
>> As long as the DMA driver can advertise its MAX SG limit, peripherals can
>> always work around that by limiting the number of sync events they
>> generate so as to not having any of the events getting missed. With this
>> series, I am worried that the EDMA driver is advertising that it can handle
>> any length SG list while not taking care of missing any events while
>> doing so. This will break the assumptions that driver writers make.

Sorry, just forgot to respond to "not taking care of missing any events
while doing so". Can you clarify this? DMA engine driver is taking care
of missed events.

Also- missing of events doesn't result in feedback to the peripheral.
Peripheral sends an event to the DMA controller, and the event is missed. The
peripheral doesn't know anything about what happened and is waiting for transfer
from the DMA controller.

Thanks,

-Joel



Re: [PATCH] cpufreq: Re-arrange header files

2013-07-31 Thread Viresh Kumar
On 1 August 2013 03:08, Rafael J. Wysocki  wrote:
> Normally, I'd regard that as pointless churn, but why don't you actually
> remove direct #includes already included through cpufreq.h while you're at it?
>
> For example, what's the point of including cpumask.h directly from
> cpufreq_governor.c if it includes cpufreq.h too?
>
> Also it doesn't look like all of the includes are really necessary.

Okay, all cleaned up now.. (attached for applying):

--x-x

From: Viresh Kumar 
Date: Wed, 31 Jul 2013 20:47:01 +0530
Subject: [PATCH V2] cpufreq: Cleanup header files included in core

This patch intends to clean up the following issues in the header files included in
the cpufreq core layers:
- Include headers in ascending order, so that we don't add the same header
  multiple times by mistake.
-  must be included after , so that they override whatever they
  need.
- Remove unnecessary header files
- Don't include files already included by cpufreq.h or cpufreq_governor.h

Signed-off-by: Viresh Kumar 
---
Changes since V1
- Remove unnecessary header files.

 drivers/cpufreq/cpufreq.c  | 20 ++--
 drivers/cpufreq/cpufreq_conservative.c | 12 
 drivers/cpufreq/cpufreq_governor.c |  6 --
 drivers/cpufreq/cpufreq_governor.h |  5 ++---
 drivers/cpufreq/cpufreq_ondemand.c | 12 +---
 drivers/cpufreq/cpufreq_performance.c  |  3 +--
 drivers/cpufreq/cpufreq_powersave.c|  3 +--
 drivers/cpufreq/cpufreq_stats.c|  9 +
 drivers/cpufreq/freq_table.c   |  4 +---
 include/linux/cpufreq.h| 11 +++
 10 files changed, 16 insertions(+), 69 deletions(-)

diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 170d344..473f2ad 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -17,24 +17,16 @@

 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt

-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
+#include 
 #include 
-#include 
-#include 
-#include 
-#include 
 #include 
-#include 
-#include 
-#include 
+#include 
+#include 
+#include 
 #include 
+#include 
 #include 
-
+#include 
 #include 

 /**
diff --git a/drivers/cpufreq/cpufreq_conservative.c
b/drivers/cpufreq/cpufreq_conservative.c
index 0ceb2ef..841e256 100644
--- a/drivers/cpufreq/cpufreq_conservative.c
+++ b/drivers/cpufreq/cpufreq_conservative.c
@@ -11,19 +11,7 @@
  * published by the Free Software Foundation.
  */

-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
 #include 
-#include 
-#include 
-
 #include "cpufreq_governor.h"

 /* Conservative governor macros */
diff --git a/drivers/cpufreq/cpufreq_governor.c
b/drivers/cpufreq/cpufreq_governor.c
index 7409dbd..556064e 100644
--- a/drivers/cpufreq/cpufreq_governor.c
+++ b/drivers/cpufreq/cpufreq_governor.c
@@ -16,15 +16,9 @@

 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt

-#include 
-#include 
-#include 
 #include 
 #include 
-#include 
 #include 
-#include 
-#include 

 #include "cpufreq_governor.h"

diff --git a/drivers/cpufreq/cpufreq_governor.h
b/drivers/cpufreq/cpufreq_governor.h
index 0e0dd4c..cf0b7a4 100644
--- a/drivers/cpufreq/cpufreq_governor.h
+++ b/drivers/cpufreq/cpufreq_governor.h
@@ -18,10 +18,9 @@
 #define _CPUFREQ_GOVERNOR_H

 #include 
-#include 
+#include 
+#include 
 #include 
-#include 
-#include 

 /*
  * The polling frequency depends on the capability of the processor. Default
diff --git a/drivers/cpufreq/cpufreq_ondemand.c
b/drivers/cpufreq/cpufreq_ondemand.c
index a3c5574..27c732e 100644
--- a/drivers/cpufreq/cpufreq_ondemand.c
+++ b/drivers/cpufreq/cpufreq_ondemand.c
@@ -12,20 +12,10 @@

 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt

-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
+#include 
 #include 
 #include 
-#include 
 #include 
-#include 
-#include 
-
 #include "cpufreq_governor.h"

 /* On-demand governor macros */
diff --git a/drivers/cpufreq/cpufreq_performance.c
b/drivers/cpufreq/cpufreq_performance.c
index 9fef7d6..cf117de 100644
--- a/drivers/cpufreq/cpufreq_performance.c
+++ b/drivers/cpufreq/cpufreq_performance.c
@@ -12,10 +12,9 @@

 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt

-#include 
-#include 
 #include 
 #include 
+#include 

 static int cpufreq_governor_performance(struct cpufreq_policy *policy,
unsigned int event)
diff --git a/drivers/cpufreq/cpufreq_powersave.c
b/drivers/cpufreq/cpufreq_powersave.c
index 32109a1..e3b874c 100644
--- a/drivers/cpufreq/cpufreq_powersave.c
+++ b/drivers/cpufreq/cpufreq_powersave.c
@@ -12,10 +12,9 @@

 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt

-#include 
-#include 
 #include 
 #include 
+#include 

 static int cpufreq_governor_powersave(struct cpufreq_policy *policy,
unsigned int event)
diff --git a/drivers/cpufreq/cpufreq_stats.c b/drivers/cpufreq/cpufreq_stats.c
index cb38413..a7143b0 100644
--- 

Re: [RFC][PATCH 4/4] tracing/uprobes: Fail to unregister if probe event files are open

2013-07-31 Thread Steven Rostedt
Srikar,

Can you give your Acked-by to this patch.

Thanks!

-- Steve


On Wed, 2013-07-03 at 23:33 -0400, Steven Rostedt wrote:
> plain text document attachment
> (0004-tracing-uprobes-Fail-to-unregister-if-probe-event-fi.patch)
> From: "Steven Rostedt (Red Hat)" 
> 
> When one of the event files is opened, we need to prevent them from
> being removed. Modules do this with the module owner set (automated
> from the VFS layer).  The ftrace buffer instances have a ref count added
> to the trace_array when the enabled file is opened (the others are not
> that big of a deal, as they only reference the event calls which
> still exist when an instance disappears). But kprobes and uprobes
> do not have any protection.
> 
> Have the unregister probe fail when the event files are open or in
> use by perf.
> 
> Signed-off-by: Steven Rostedt 
> ---
>  kernel/trace/trace_uprobe.c |   48 
> ---
>  1 file changed, 36 insertions(+), 12 deletions(-)
> 
> diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
> index d5d0cd3..4916da5 100644
> --- a/kernel/trace/trace_uprobe.c
> +++ b/kernel/trace/trace_uprobe.c
> @@ -70,7 +70,7 @@ struct trace_uprobe {
>   (sizeof(struct probe_arg) * (n)))
>  
>  static int register_uprobe_event(struct trace_uprobe *tu);
> -static void unregister_uprobe_event(struct trace_uprobe *tu);
> +static int unregister_uprobe_event(struct trace_uprobe *tu);
>  
>  static DEFINE_MUTEX(uprobe_lock);
>  static LIST_HEAD(uprobe_list);
> @@ -164,11 +164,17 @@ static struct trace_uprobe *find_probe_event(const char 
> *event, const char *grou
>  }
>  
>  /* Unregister a trace_uprobe and probe_event: call with locking uprobe_lock 
> */
> -static void unregister_trace_uprobe(struct trace_uprobe *tu)
> +static int unregister_trace_uprobe(struct trace_uprobe *tu)
>  {
> + int ret;
> +
> + ret = unregister_uprobe_event(tu);
> + if (ret)
> + return ret;
> +
>   list_del(>list);
> - unregister_uprobe_event(tu);
>   free_trace_uprobe(tu);
> + return 0;
>  }
>  
>  /* Register a trace_uprobe and probe_event */
> @@ -181,9 +187,12 @@ static int register_trace_uprobe(struct trace_uprobe *tu)
>  
>   /* register as an event */
>   old_tp = find_probe_event(tu->call.name, tu->call.class->system);
> - if (old_tp)
> + if (old_tp) {
>   /* delete old event */
> - unregister_trace_uprobe(old_tp);
> + ret = unregister_trace_uprobe(old_tp);
> + if (ret)
> + goto end;
> + }
>  
>   ret = register_uprobe_event(tu);
>   if (ret) {
> @@ -256,6 +265,8 @@ static int create_trace_uprobe(int argc, char **argv)
>   group = UPROBE_EVENT_SYSTEM;
>  
>   if (is_delete) {
> + int ret;
> +
>   if (!event) {
>   pr_info("Delete command needs an event name.\n");
>   return -EINVAL;
> @@ -269,9 +280,9 @@ static int create_trace_uprobe(int argc, char **argv)
>   return -ENOENT;
>   }
>   /* delete an event */
> - unregister_trace_uprobe(tu);
> + ret = unregister_trace_uprobe(tu);
>   mutex_unlock(_lock);
> - return 0;
> + return ret;
>   }
>  
>   if (argc < 2) {
> @@ -408,16 +419,20 @@ fail_address_parse:
>   return ret;
>  }
>  
> -static void cleanup_all_probes(void)
> +static int cleanup_all_probes(void)
>  {
>   struct trace_uprobe *tu;
> + int ret = 0;
>  
>   mutex_lock(_lock);
>   while (!list_empty(_list)) {
>   tu = list_entry(uprobe_list.next, struct trace_uprobe, list);
> - unregister_trace_uprobe(tu);
> + ret = unregister_trace_uprobe(tu);
> + if (ret)
> + break;
>   }
>   mutex_unlock(_lock);
> + return ret;
>  }
>  
>  /* Probes listing interfaces */
> @@ -462,8 +477,12 @@ static const struct seq_operations probes_seq_op = {
>  
>  static int probes_open(struct inode *inode, struct file *file)
>  {
> + int ret = 0;
> +
>   if ((file->f_mode & FMODE_WRITE) && (file->f_flags & O_TRUNC))
> - cleanup_all_probes();
> + ret = cleanup_all_probes();
> + if (ret)
> + return ret;
>  
>   return seq_open(file, _seq_op);
>  }
> @@ -970,12 +989,17 @@ static int register_uprobe_event(struct trace_uprobe 
> *tu)
>   return ret;
>  }
>  
> -static void unregister_uprobe_event(struct trace_uprobe *tu)
> +static int unregister_uprobe_event(struct trace_uprobe *tu)
>  {
> + int ret;
> +
>   /* tu->event is unregistered in trace_remove_event_call() */
> - trace_remove_event_call(>call);
> + ret = trace_remove_event_call(>call);
> + if (ret)
> + return ret;
>   kfree(tu->call.print_fmt);
>   tu->call.print_fmt = NULL;
> + return 0;
>  }
>  
>  /* Make a trace 

Re: [PATCH] Xen: Fix retry calls into PRIVCMD_MMAPBATCH*.

2013-07-31 Thread Andres Lagar-Cavilla
-- Resend as I haven't seen this hit the lists. Maybe some smtp misconfig. 
Apologies. Also expanded cc --

When a foreign mapper attempts to map guest frames that are paged out,
the mapper receives an ENOENT response and will have to try again
while a helper process pages the target frame back in.

Gating checks on PRIVCMD_MMAPBATCH* ioctl args were preventing retries
of mapping calls.

Signed-off-by: Andres Lagar-Cavilla 
---
drivers/xen/privcmd.c |   32 +++-
1 files changed, 27 insertions(+), 5 deletions(-)

diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c
index f8e5dd7..44a26c6 100644
--- a/drivers/xen/privcmd.c
+++ b/drivers/xen/privcmd.c
@@ -43,9 +43,12 @@ MODULE_LICENSE("GPL");

#define PRIV_VMA_LOCKED ((void *)1)

-#ifndef HAVE_ARCH_PRIVCMD_MMAP
static int privcmd_enforce_singleshot_mapping(struct vm_area_struct *vma);
-#endif
+
+static int privcmd_enforce_singleshot_mapping_granular(
+   struct vm_area_struct *vma,
+   unsigned long addr,
+   unsigned long nr_pages);

static long privcmd_ioctl_hypercall(void __user *udata)
{
@@ -422,9 +425,9 @@ static long privcmd_ioctl_mmap_batch(void __user *udata, 
int version)
vma = find_vma(mm, m.addr);
if (!vma ||
vma->vm_ops != _vm_ops ||
-   (m.addr != vma->vm_start) ||
-   ((m.addr + (nr_pages << PAGE_SHIFT)) != vma->vm_end) ||
-   !privcmd_enforce_singleshot_mapping(vma)) {
+   (m.addr < vma->vm_start) ||
+   ((m.addr + (nr_pages << PAGE_SHIFT)) > vma->vm_end) ||
+   !privcmd_enforce_singleshot_mapping_granular(vma, m.addr, 
nr_pages)) {
up_write(>mmap_sem);
ret = -EINVAL;
goto out;
@@ -540,11 +543,30 @@ static int privcmd_mmap(struct file *file, struct 
vm_area_struct *vma)
return 0;
}

+/* For MMAP */
static int privcmd_enforce_singleshot_mapping(struct vm_area_struct *vma)
{
return !cmpxchg(>vm_private_data, NULL, PRIV_VMA_LOCKED);
}

+/* For MMAPBATCH*. This allows asserting the singleshot mapping
+ * on a per pfn/pte basis. Mapping calls that fail with ENOENT
+ * can be then retried until success. */
+static int enforce_singleshot_mapping_fn(pte_t *pte, struct page *pmd_page,
+   unsigned long addr, void *data)
+{
+   return pte_none(*pte) ? 0 : -EBUSY;
+}
+
+static int privcmd_enforce_singleshot_mapping_granular(
+  struct vm_area_struct *vma,
+  unsigned long addr,
+  unsigned long nr_pages)
+{
+   return apply_to_page_range(vma->vm_mm, addr, nr_pages << PAGE_SHIFT,
+ enforce_singleshot_mapping_fn, NULL) == 0;
+}
+
const struct file_operations xen_privcmd_fops = {
.owner = THIS_MODULE,
.unlocked_ioctl = privcmd_ioctl,
-- 
1.7.1
> 



Re: [PATCH 2/2] f2fs: add sysfs entries to select the gc policy

2013-07-31 Thread Gu Zheng
Hi Jeon,

On 07/31/2013 10:33 PM, Namjae Jeon wrote:

> From: Namjae Jeon 
> 
> Add sysfs entries namely gc_long_idle and gc_short_idle to control the
> gc policy, where long idle corresponds to selecting a cost-benefit approach,
> while short idle corresponds to selecting a greedy approach to garbage
> collection. The selection is mutually exclusive; only one approach will be in
> effect at any point.
> 
> Signed-off-by: Namjae Jeon 
> Signed-off-by: Pankaj Kumar 
> ---
>  Documentation/ABI/testing/sysfs-fs-f2fs |   12 +++
>  Documentation/filesystems/f2fs.txt  |8 +
>  fs/f2fs/gc.c|   22 ++--
>  fs/f2fs/gc.h|4 +++
>  fs/f2fs/super.c |   59 
> +--
>  5 files changed, 99 insertions(+), 6 deletions(-)
> 
> diff --git a/Documentation/ABI/testing/sysfs-fs-f2fs 
> b/Documentation/ABI/testing/sysfs-fs-f2fs
> index 5f44095..96b62ea 100644
> --- a/Documentation/ABI/testing/sysfs-fs-f2fs
> +++ b/Documentation/ABI/testing/sysfs-fs-f2fs
> @@ -19,4 +19,16 @@ Description:
>Controls the default sleep time for gc_thread. Time
>is in milliseconds.
>  
> +What:/sys/fs/f2fs//gc_long_idle
> +Date:July 2013
> +Contact: "Namjae Jeon" 
> +Description:
> +  Controls the selection of gc policy. long_idle is used
> +  to select the cost benefit approach for garbage collection.
>  
> +What:/sys/fs/f2fs//gc_short_idle
> +Date:July 2013
> +Contact: "Namjae Jeon" 
> +Description:
> +  Controls the selection of gc policy. short_idle is used
> +  to select the greedy approach for garbage collection.
> diff --git a/Documentation/filesystems/f2fs.txt 
> b/Documentation/filesystems/f2fs.txt
> index 2e9e873..06dd5d7 100644
> --- a/Documentation/filesystems/f2fs.txt
> +++ b/Documentation/filesystems/f2fs.txt
> @@ -158,6 +158,14 @@ Files in /sys/fs/f2fs/
>time for the garbage collection thread. Time is
>in milliseconds.
>  
> + gc_long_idle This parameter controls the selection of cost
> +  benefit approach for garbage collection. Writing
> +  1 to this file will select the cost benefit policy.
> +
> + gc_short_idleThis parameter controls the selection of greedy
> +  approach for the garbage collection. Writing 1
> +  to this file will select the greedy policy.

Why introduce two opposite attributes? It'll cause a confusing condition if we
enable or disable both of them.

> +
>  
> 
>  USAGE
>  
> 
> diff --git a/fs/f2fs/gc.c b/fs/f2fs/gc.c
> index 60d4f67..af2d9d7 100644
> --- a/fs/f2fs/gc.c
> +++ b/fs/f2fs/gc.c
> @@ -106,6 +106,8 @@ int start_gc_thread(struct f2fs_sb_info *sbi)
>   gc_th->max_sleep_time = DEF_GC_THREAD_MAX_SLEEP_TIME;
>   gc_th->no_gc_sleep_time = DEF_GC_THREAD_NOGC_SLEEP_TIME;
>  
> + gc_th->long_idle = gc_th->short_idle = 0;
> +
>   sbi->gc_thread = gc_th;
>   init_waitqueue_head(>gc_thread->gc_wait_queue_head);
>   sbi->gc_thread->f2fs_gc_task = kthread_run(gc_thread_func, sbi,
> @@ -130,9 +132,23 @@ void stop_gc_thread(struct f2fs_sb_info *sbi)
>   sbi->gc_thread = NULL;
>  }
>  
> -static int select_gc_type(int gc_type)
> +static int select_gc_type(struct f2fs_gc_kthread *gc_th, int gc_type)
>  {
> - return (gc_type == BG_GC) ? GC_CB : GC_GREEDY;
> + int gc_mode;
> +
> + if (gc_th) {
> + if (gc_th->long_idle) {
> + gc_mode = GC_CB;
> + goto out;
> + } else if (gc_th->short_idle) {
> + gc_mode = GC_GREEDY;
> + goto out;
> + }
> + }
> +
> + gc_mode = (gc_type == BG_GC) ? GC_CB : GC_GREEDY;
> +out:
> + return gc_mode;
>  }
>  
>  static void select_policy(struct f2fs_sb_info *sbi, int gc_type,
> @@ -145,7 +161,7 @@ static void select_policy(struct f2fs_sb_info *sbi, int 
> gc_type,
>   p->dirty_segmap = dirty_i->dirty_segmap[type];
>   p->ofs_unit = 1;
>   } else {
> - p->gc_mode = select_gc_type(gc_type);
> + p->gc_mode = select_gc_type(sbi->gc_thread, gc_type);
>   p->dirty_segmap = dirty_i->dirty_segmap[DIRTY];
>   p->ofs_unit = sbi->segs_per_sec;
>   }
> diff --git a/fs/f2fs/gc.h b/fs/f2fs/gc.h
> index f4bf44c..b2faae5 100644
> --- a/fs/f2fs/gc.h
> +++ b/fs/f2fs/gc.h
> @@ -30,6 +30,10 @@ struct f2fs_gc_kthread {
>   unsigned int min_sleep_time;
>   unsigned int max_sleep_time;
>   unsigned int no_gc_sleep_time;
> +
> + /* 

Re: [PATCH resend] drop_caches: add some documentation and info message

2013-07-31 Thread Andrew Morton
On Wed, 31 Jul 2013 23:11:50 -0400 KOSAKI Motohiro 
 wrote:

> >> --- a/fs/drop_caches.c
> >> +++ b/fs/drop_caches.c
> >> @@ -59,6 +59,8 @@ int drop_caches_sysctl_handler(ctl_table *table, int 
> >> write,
> >>if (ret)
> >>return ret;
> >>if (write) {
> >> +  printk(KERN_INFO "%s (%d): dropped kernel caches: %d\n",
> >> + current->comm, task_pid_nr(current), sysctl_drop_caches);
> >>if (sysctl_drop_caches & 1)
> >>iterate_supers(drop_pagecache_sb, NULL);
> >>if (sysctl_drop_caches & 2)
> > 
> > How about we do
> > 
> > if (!(sysctl_drop_caches & 4))
> > printk()
> > 
> > so people can turn it off if it's causing problems?
> 
> The best interface depends on the purpose. If you want to detect a crazy
> application, we can't assume the application will co-operate with us. So I
> doubt this works.

You missed the "!".  I'm proposing that setting the new bit 2 will
permit people to prevent the new printk if it is causing them problems.
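Andrew's suggestion can be condensed into a tiny user-space model. The pagecache/slab bit assignments come from the quoted code; the "quiet" bit (value 4) is the hypothetical new one under discussion, so this is a sketch of the proposal, not kernel code:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Model of the proposed semantics:
 * bit 0 (1) - drop page cache, bit 1 (2) - drop slab objects,
 * bit 2 (4) - proposed "quiet" bit suppressing the info message. */
static int drop_caches_would_log(int sysctl_drop_caches)
{
	/* Andrew's proposal: log unless the new bit is set. */
	return !(sysctl_drop_caches & 4);
}

static void drop_caches_describe(int sysctl_drop_caches, char *buf, size_t n)
{
	/* Which caches would actually be dropped for this value. */
	snprintf(buf, n, "%s%s",
		 (sysctl_drop_caches & 1) ? "pagecache " : "",
		 (sysctl_drop_caches & 2) ? "slab" : "");
}
```

So `echo 3` keeps the message while `echo 7` drops the same caches silently.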
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/2] f2fs: add sysfs support for controlling the gc_thread

2013-07-31 Thread Gu Zheng
Hi Jeon,

On 07/31/2013 10:33 PM, Namjae Jeon wrote:

> From: Namjae Jeon 
> 
> Add sysfs entries to control the timing parameters for
> f2fs gc thread.
> 
> Various Sysfs options introduced are:
> gc_min_sleep_time: Min Sleep time for GC in ms
> gc_max_sleep_time: Max Sleep time for GC in ms
> gc_no_gc_sleep_time: Default Sleep time for GC in ms
> 
> Signed-off-by: Namjae Jeon 
> Signed-off-by: Pankaj Kumar 
> ---
>  Documentation/ABI/testing/sysfs-fs-f2fs |   22 ++
>  Documentation/filesystems/f2fs.txt  |   26 +++
>  fs/f2fs/f2fs.h  |4 +
>  fs/f2fs/gc.c|   17 +++--
>  fs/f2fs/gc.h|   33 
>  fs/f2fs/super.c |  124 
> +++
>  6 files changed, 206 insertions(+), 20 deletions(-)
>  create mode 100644 Documentation/ABI/testing/sysfs-fs-f2fs
> 
> diff --git a/Documentation/ABI/testing/sysfs-fs-f2fs 
> b/Documentation/ABI/testing/sysfs-fs-f2fs
> new file mode 100644
> index 000..5f44095
> --- /dev/null
> +++ b/Documentation/ABI/testing/sysfs-fs-f2fs
> @@ -0,0 +1,22 @@
> +What: /sys/fs/f2fs/<disk>/gc_max_sleep_time
> +Date: July 2013
> +Contact: "Namjae Jeon" 
> +Description:
> +  Controls the maximum sleep time for gc_thread. Time
> +  is in milliseconds.
> +
> +What: /sys/fs/f2fs/<disk>/gc_min_sleep_time
> +Date: July 2013
> +Contact: "Namjae Jeon" 
> +Description:
> +  Controls the minimum sleep time for gc_thread. Time
> +  is in milliseconds.
> +
> +What: /sys/fs/f2fs/<disk>/gc_no_gc_sleep_time
> +Date: July 2013
> +Contact: "Namjae Jeon" 
> +Description:
> +  Controls the default sleep time for gc_thread. Time
> +  is in milliseconds.
> +
> +
> diff --git a/Documentation/filesystems/f2fs.txt 
> b/Documentation/filesystems/f2fs.txt
> index 0500c19..2e9e873 100644
> --- a/Documentation/filesystems/f2fs.txt
> +++ b/Documentation/filesystems/f2fs.txt
> @@ -133,6 +133,32 @@ f2fs. Each file shows the whole f2fs information.
>   - current memory footprint consumed by f2fs.
>  
>  
> 
> +SYSFS ENTRIES
> +
> +
> +Information about mounted f2fs file systems can be found in
> +/sys/fs/f2fs.  Each mounted filesystem will have a directory in
> +/sys/fs/f2fs based on its device name (i.e., /sys/fs/f2fs/sda).
> +The files in each per-device directory are shown in table below.
> +
> +Files in /sys/fs/f2fs/<devname>
> +(see also Documentation/ABI/testing/sysfs-fs-f2fs)
> +..
> + File Content
> +
> + gc_max_sleep_timeThis tuning parameter controls the maximum 
> sleep
> +  time for the garbage collection thread. Time is
> +  in milliseconds.
> +
> + gc_min_sleep_timeThis tuning parameter controls the minimum 
> sleep
> +  time for the garbage collection thread. Time is
> +  in milliseconds.
> +
> + gc_no_gc_sleep_time  This tuning parameter controls the default 
> sleep
> +  time for the garbage collection thread. Time is
> +  in milliseconds.
> +
> +
>  USAGE
>  
> 
>  
> diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
> index 78777cd..63813be 100644
> --- a/fs/f2fs/f2fs.h
> +++ b/fs/f2fs/f2fs.h
> @@ -430,6 +430,10 @@ struct f2fs_sb_info {
>  #endif
>   unsigned int last_victim[2];/* last victim segment # */
>   spinlock_t stat_lock;   /* lock for stat operations */
> +
> + /* For sysfs suppport */
> + struct kobject s_kobj;
> + struct completion s_kobj_unregister;

What is this completion used for? Or is it designed for future use? I don't
see any synchronization routines using it. Am I missing something?


>  };
>  
>  /*
> diff --git a/fs/f2fs/gc.c b/fs/f2fs/gc.c
> index 35f9b1a..60d4f67 100644
> --- a/fs/f2fs/gc.c
> +++ b/fs/f2fs/gc.c
> @@ -29,10 +29,11 @@ static struct kmem_cache *winode_slab;
>  static int gc_thread_func(void *data)
>  {
>   struct f2fs_sb_info *sbi = data;
> + struct f2fs_gc_kthread *gc_th = sbi->gc_thread;
>   wait_queue_head_t *wq = &sbi->gc_thread->gc_wait_queue_head;
>   long wait_ms;
>  
> - wait_ms = GC_THREAD_MIN_SLEEP_TIME;
> + wait_ms = gc_th->min_sleep_time;
>  
>   do {
>   if (try_to_freeze())
> @@ -45,7 +46,7 @@ static int gc_thread_func(void *data)
>   break;
>  
>   if 

Re: [PATCH resend] drop_caches: add some documentation and info message

2013-07-31 Thread KOSAKI Motohiro
>> --- a/fs/drop_caches.c
>> +++ b/fs/drop_caches.c
>> @@ -59,6 +59,8 @@ int drop_caches_sysctl_handler(ctl_table *table, int write,
>>  if (ret)
>>  return ret;
>>  if (write) {
>> +printk(KERN_INFO "%s (%d): dropped kernel caches: %d\n",
>> +   current->comm, task_pid_nr(current), sysctl_drop_caches);
>>  if (sysctl_drop_caches & 1)
>>  iterate_supers(drop_pagecache_sb, NULL);
>>  if (sysctl_drop_caches & 2)
> 
> How about we do
> 
>   if (!(sysctl_drop_caches & 4))
>   printk()
> 
> so people can turn it off if it's causing problems?

The best interface depends on the purpose. If you want to detect a crazy
application, we can't assume the application will co-operate with us. So I
doubt this works.



Re: [PATCH 4/9] syslog_ns: make syslog handling per namespace

2013-07-31 Thread Rui Xiang
On 2013/8/1 9:36, Gao feng wrote:
> On 07/29/2013 10:31 AM, Rui Xiang wrote:
>> This patch makes syslog buf and other fields per
>> namespace.
>>
>> Here use ns->log_buf(log_buf_len, logbuf_lock,
>> log_first_seq, logbuf_lock, and so on) fields
>> instead of global ones to handle syslog.
>>
>> Syslog interfaces such as /dev/kmsg, /proc/kmsg,
>> and syslog syscall are all containerized for
>> container users.
>>
> 
> /dev/kmsg is used by the syslog APIs closelog, openlog, syslog, vsyslog,
> so this should be per user namespace, but it seems in your patch

Yes, /dev/kmsg is per user namespace, and per syslog ns, too.

> the syslog messages generated through these APIs on the host can be exported
> to the /dev/kmsg of the container; is this what we want?
>
Ah.. I think your question targets the devkmsg_writev function, right?
You've reminded me that it's really an issue. printk_emit in devkmsg_writev
should not use init_syslog_ns as its syslog_ns but current_user_ns->syslog_ns.

In the 1st version, current_syslog_ns was used in vprintk_emit. In this
version, the vprintk_emit interface has changed, but this patch missed that.
I will fix it.


Thanks for your reminder. :)




[PATCH] asymmetric keys: explicitly add the leading zero byte to encoded message

2013-07-31 Thread Lee, Chun-Yi
From: Chun-Yi Lee 

Per the PKCS#1 spec, an EMSA-PKCS1-v1_5 encoded message begins with 0x00 0x01
in its first 2 bytes. The leading zero byte is suppressed by MPI, so the
original code passes a pointer to the _preceding_ byte to RSA_verify(); this is
risky because that byte is outside the EM buffer's scope and may not be zero,
and neither RSA_verify() nor mpi_get_buffer() takes care of the leading byte.

To avoid the risk, it is better to explicitly add the leading zero byte to EM
before passing it to RSA_verify(). This patch allocates an _EM buffer to
capture the result from RSA_I2OSP(), then sets the first byte of EM to zero and
copies the remaining bytes from _EM.
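For context, the EM layout the patch restores is defined by RFC 3447 sec 9.2 (EMSA-PKCS1-v1_5). This stand-alone sketch of an encoder, not the kernel code itself, makes the role of the leading 0x00 byte explicit (T stands for the DigestInfo octets):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Build an EMSA-PKCS1-v1_5 encoded message of k octets:
 *   EM = 0x00 || 0x01 || PS (0xff padding, >= 8 bytes) || 0x00 || T
 * MPI strips the leading 0x00, which is exactly the byte the patch
 * above re-inserts before handing EM to RSA_verify().
 */
static unsigned char *emsa_pkcs1_v15_encode(const unsigned char *T,
					    size_t tlen, size_t k)
{
	size_t ps_len;
	unsigned char *em;

	if (k < tlen + 11)	/* room for 0x00 0x01, 8+ bytes PS, 0x00 */
		return NULL;
	em = malloc(k);
	if (!em)
		return NULL;
	ps_len = k - tlen - 3;
	em[0] = 0x00;
	em[1] = 0x01;
	memset(em + 2, 0xff, ps_len);
	em[2 + ps_len] = 0x00;
	memcpy(em + 3 + ps_len, T, tlen);
	return em;
}
```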

Cc: Rusty Russell 
Cc: Josh Boyer 
Cc: Randy Dunlap 
Cc: Herbert Xu 
Cc: "David S. Miller" 
Cc: David Howells 
Signed-off-by: Chun-Yi Lee 
---
 crypto/asymmetric_keys/rsa.c |   14 ++
 1 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/crypto/asymmetric_keys/rsa.c b/crypto/asymmetric_keys/rsa.c
index ca1a4f3..7bc99d2 100644
--- a/crypto/asymmetric_keys/rsa.c
+++ b/crypto/asymmetric_keys/rsa.c
@@ -303,6 +303,7 @@ static int RSA_verify_signature(const struct public_key 
*key,
/* Variables as per RFC3447 sec 8.2.2 */
const u8 *H = sig->digest;
u8 *EM = NULL;
+   u8 *_EM = NULL;
MPI m = NULL;
size_t k;
 
@@ -337,14 +338,19 @@ static int RSA_verify_signature(const struct public_key 
*key,
/* (2c) Convert the message representative (m) to an encoded message
 *  (EM) of length k octets.
 *
-*  NOTE!  The leading zero byte is suppressed by MPI, so we pass a
-*  pointer to the _preceding_ byte to RSA_verify()!
+*  NOTE!  The leading zero byte is suppressed by MPI, so we add it
+*  back to EM before input to RSA_verify()!
 */
-   ret = RSA_I2OSP(m, k, &EM);
+   ret = RSA_I2OSP(m, k, &_EM);
if (ret < 0)
goto error;
 
-   ret = RSA_verify(H, EM - 1, k, sig->digest_size,
+   EM = kmalloc(k, GFP_KERNEL);
+   memset(EM, 0, 1);
+   memcpy(EM + 1, _EM, k-1);
+   kfree(_EM);
+
+   ret = RSA_verify(H, EM, k, sig->digest_size,
 RSA_ASN1_templates[sig->pkey_hash_algo].data,
 RSA_ASN1_templates[sig->pkey_hash_algo].size);
 
-- 
1.6.4.2



Re: [Ksummit-2013-discuss] [ARM ATTEND] catching up on exploit mitigations

2013-07-31 Thread Dave Jones
On Wed, Jul 31, 2013 at 07:47:30PM -0700, Olof Johansson wrote:

 > >  > Hmm, really?  Did you reported these bugs?  I'm not aware of mainline
 > >  > having any changes related to bug reports on PTEs on ARM.
 > >
 > > I wasn't sure if it was a googleism, or happens on mainline, so no.
 > 
 > As of 3.10, it's not actually hard to run a mainline kernel on the
 > chromebook, but you have limited functionality (no wifi, no USB3, no
 > accellerated graphics). Easiest is to do it by booting from SD card.
 > See instructions at
 > http://www.chromium.org/chromium-os/u-boot-porting-guide/using-nv-u-boot-on-the-samsung-arm-chromebook.
 
Yeah, that's how I got Fedora running on it (I think the people who did the
Fedora spin for the chromebook chose the google kernel for exactly
the reasons you mention above).  It felt a little sluggish for a native
kernel build though, and I'm not setup for cross-building so I stuck with
the 3.4 build.
 
 > How long would a useful run of trinity take?

It blew up in a minute or so of runtime when I tried last week.

 > I've got an autobuilder/autobooter setup here with a cross-section of 
 > current ARM
 > hardware (7 platforms and counting) that I mostly do boot testing on,
 > and I should add a smallish suite of testcases to the same. Trinity
 > would be a good candidate for that.

Feel free to mail me off-list if I can do anything to help out, though
hopefully things should build & run 'out of the box'.  You shouldn't
need any special command-line args (though you might get inspiration
from the scripts/test-* harnesses).

Dave



Re: [PATCH 1/3] fs: Introduce new flag FALLOC_FL_COLLAPSE_RANGE

2013-07-31 Thread Dave Chinner
On Wed, Jul 31, 2013 at 09:07:52PM -0400, Theodore Ts'o wrote:
> On Thu, Aug 01, 2013 at 10:54:47AM +1000, Dave Chinner wrote:
> > > It's not just the range that it's operating on, but also the region
> > > beyond the range that's been collapsed out.
> > 
> > Yes, that's part of "the range that it is operating over".
> > 
> > > A quick eyeball of the patch didn't seem to show any code that handled
> > > this, which is why I asked the question.
> > 
> > Right, but really it's the least of the problems I've noticed - the
> > XFS code is fundamentally broken in many ways - once I've finished
> > commenting on it, I'll have a quick look to see if the ext4 code has
> > the same fundamental flaws
> 
> Well, the fundamental flaw of potential races if the file being
> collapsed has been mmap'ed and there is another process making changes
> beyond the range that is being collapsed is I suspect one that is
> going to be very hard to solve, short of not allowing the collapse
> while there are any read/write mmap's for the file in question.

This functionality is not introducing any new problems w.r.t. mmap().
In terms of truncating a mmap'd file, that can already occur and
the behaviour is well known. The current patches don't do the
operations in the correct order to ensure this is handled correctly
(i.e. the inode size has to be changed before the page cache
invalidation, not after), but otherwise there are no new issues
introduced here.

The real problem is that it has the same problem as hole punching
w.r.t. not being able to serialise page faults against the extent
manipulations after the page cache invalidation has occurred.
That's what Jan Kara's range locking patches we talked about at
LSFMM will address, and so is beyond the scope of this patchset.

> And I would think these sorts of VM issues should be handled with some
> generic library functions, instead of reimplementing them from scratch
> in each file system.

Well, yes, we have one - truncate_pagecache_range() - and it's used
correctly in the ext4 patch.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


[PATCH 3/9] perf hists: Free srcline when freeing hist_entry

2013-07-31 Thread Namhyung Kim
From: Namhyung Kim 

We've been leaking the srcline of hist_entry; it should be freed as well.

Signed-off-by: Namhyung Kim 
---
 tools/perf/util/hist.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/perf/util/hist.c b/tools/perf/util/hist.c
index 46a0d35a05e1..e6c11ccf138c 100644
--- a/tools/perf/util/hist.c
+++ b/tools/perf/util/hist.c
@@ -530,6 +530,7 @@ void hist_entry__free(struct hist_entry *he)
 {
free(he->branch_info);
free(he->mem_info);
+   free(he->srcline);
free(he);
 }
 
-- 
1.7.11.7



[PATCH 1/9] perf sort: Fix a memory leak on srcline

2013-07-31 Thread Namhyung Kim
From: Namhyung Kim 

In hist_entry__srcline_snprintf(), path and self->srcline end up holding
the same content, but they are allocated separately, leaking one of them.

Signed-off-by: Namhyung Kim 
---
 tools/perf/util/sort.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
index 5f118a089519..5ba29fc679c5 100644
--- a/tools/perf/util/sort.c
+++ b/tools/perf/util/sort.c
@@ -269,10 +269,7 @@ static int hist_entry__srcline_snprintf(struct hist_entry 
*self, char *bf,
if (!fp)
goto out_ip;
 
-   if (getline(&path, &line_len, fp) < 0 || !line_len)
-   goto out_ip;
-   self->srcline = strdup(path);
-   if (self->srcline == NULL)
+   if (getline(&self->srcline, &line_len, fp) < 0 || !line_len)
goto out_ip;
 
nl = strchr(self->srcline, '\n');
-- 
1.7.11.7



[PATCH 0/9] perf tools: Enhance and correct srcline behavior

2013-07-31 Thread Namhyung Kim
Hello,

This patchset tries to fix and enhance current srcline behavior.

Firstly, it doesn't actually sort by srcline info but by ip.  I suspect
this was for performance reasons, to avoid running the external addr2line
utility too often.  The srcline info was shown only after hist entries
were collapsed.  Thanks to Roberto, we now have an internal implementation
of addr2line using libbfd, so we can sort/compare entries by srcline.

The second problem for me was that it sometimes printed "??:0" and
sometimes printed a raw ip value for unknown srcline info.  I changed it
to print the former consistently.

While at it, I found some bugs/leaks in the srcline handling.  Patches 1-3
are fixes for those and can be merged separately.

You can get this series on my 'perf/srcline-v1' branch in my tree at:

  git://git.kernel.org/pub/scm/linux/kernel/git/namhyung/linux-perf.git

Any comments are welcome, thanks.
Namhyung


Namhyung Kim (8):
  perf sort: Fix a memory leak on srcline
  perf annotate: Reuse path from the result of addr2line
  perf hists: Free srcline when freeing hist_entry
  perf tools: Factor out get/free_srcline()
  perf tools: Do not try to call addr2line for non-binary files
  perf tools: Pass dso instead of dso_name to get_srcline()
  perf tools: Save failed result of get_srcline()
  perf tools: Fix srcline sort key behavior

Roberto Vitillo (1):
  perf tools: Implement addr2line directly using libbfd

 tools/perf/Makefile|   1 +
 tools/perf/config/Makefile |   4 +
 tools/perf/util/annotate.c |  33 +-
 tools/perf/util/dso.c  |   1 +
 tools/perf/util/dso.h  |   2 +
 tools/perf/util/hist.c |   1 +
 tools/perf/util/sort.c |  58 --
 tools/perf/util/srcline.c  | 265 +
 tools/perf/util/util.h |   8 ++
 9 files changed, 308 insertions(+), 65 deletions(-)
 create mode 100644 tools/perf/util/srcline.c

-- 
1.7.11.7



[PATCH 7/9] perf tools: Save failed result of get_srcline()

2013-07-31 Thread Namhyung Kim
From: Namhyung Kim 

Some DSOs lack srcline info, so there's no point in repeatedly trying
them.  Just save the failure status and skip them.
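The negative-caching idea in miniature: dso_stub and resolve() are hypothetical stand-ins for struct dso and get_srcline(), and lookup_count exists only to show the repeated work being skipped:

```c
#include <assert.h>
#include <string.h>

struct dso_stub {
	const char *name;
	unsigned char has_srcline:1;	/* starts as 1, cleared on failure */
};

static int lookup_count;

/* Pretend resolver: only real paths (starting with '/') resolve. */
static const char *resolve(struct dso_stub *dso)
{
	if (!dso->has_srcline)
		return "??:0";		/* cached failure, no work done */
	lookup_count++;			/* the expensive lookup happens here */
	if (dso->name[0] != '/') {
		dso->has_srcline = 0;	/* remember the failure */
		return "??:0";
	}
	return "src.c:1";
}
```

A synthetic map like "[kernel.kallsyms]" pays the lookup cost once; every later call returns the cached "??:0" immediately.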

Signed-off-by: Namhyung Kim 
---
 tools/perf/util/dso.c |  1 +
 tools/perf/util/dso.h |  1 +
 tools/perf/util/srcline.c | 10 --
 3 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/tools/perf/util/dso.c b/tools/perf/util/dso.c
index c4374f07603c..cc959d2fa5da 100644
--- a/tools/perf/util/dso.c
+++ b/tools/perf/util/dso.c
@@ -421,6 +421,7 @@ struct dso *dso__new(const char *name)
dso->loaded = 0;
dso->sorted_by_name = 0;
dso->has_build_id = 0;
+   dso->has_srcline = 1;
dso->kernel = DSO_TYPE_USER;
dso->needs_swap = DSO_SWAP__UNSET;
INIT_LIST_HEAD(>node);
diff --git a/tools/perf/util/dso.h b/tools/perf/util/dso.h
index 94c2f1a8b116..2a1ae2a8d39c 100644
--- a/tools/perf/util/dso.h
+++ b/tools/perf/util/dso.h
@@ -79,6 +79,7 @@ struct dso {
enum dso_binary_typedata_type;
u8   adjust_symbols:1;
u8   has_build_id:1;
+   u8   has_srcline:1;
u8   hit:1;
u8   annotate_warned:1;
u8   sname_alloc:1;
diff --git a/tools/perf/util/srcline.c b/tools/perf/util/srcline.c
index c736d9428cf2..dcff10bed7da 100644
--- a/tools/perf/util/srcline.c
+++ b/tools/perf/util/srcline.c
@@ -58,10 +58,13 @@ char *get_srcline(struct dso *dso, unsigned long addr)
 {
char *file;
unsigned line;
-   char *srcline = SRCLINE_UNKNOWN;
+   char *srcline;
char *dso_name = dso->long_name;
size_t size;
 
+   if (!dso->has_srcline)
+   return SRCLINE_UNKNOWN;
+
if (dso_name[0] == '[')
goto out;
 
@@ -81,8 +84,11 @@ char *get_srcline(struct dso *dso, unsigned long addr)
srcline = SRCLINE_UNKNOWN;
 
free(file);
-out:
return srcline;
+
+out:
+   dso->has_srcline = 0;
+   return SRCLINE_UNKNOWN;
 }
 
 void free_srcline(char *srcline)
-- 
1.7.11.7



[PATCH 8/9] perf tools: Implement addr2line directly using libbfd

2013-07-31 Thread Namhyung Kim
From: Roberto Vitillo 

When the srcline sort key is used, the external addr2line utility needs
to be run for each hist entry to get the srcline info.  This can take
quite some time if one has a huge perf.data file.

So rather than executing the external utility, implement it internally
and just call it.  We can do it since we've linked with libbfd
already.

Signed-off-by: Roberto Agostino Vitillo 
[namhy...@kernel.org: Use a2l_data struct instead of static globals]
Signed-off-by: Namhyung Kim 
---
 tools/perf/config/Makefile |   4 ++
 tools/perf/util/srcline.c  | 167 +
 2 files changed, 171 insertions(+)

diff --git a/tools/perf/config/Makefile b/tools/perf/config/Makefile
index 214e17e97e5c..128d580ca397 100644
--- a/tools/perf/config/Makefile
+++ b/tools/perf/config/Makefile
@@ -396,6 +396,10 @@ else
   endif
 endif
 
+ifneq ($(filter -lbfd,$(EXTLIBS)),)
+  CFLAGS += -DLIBBFD_SUPPORT
+endif
+
 ifndef NO_STRLCPY
   ifeq ($(call try-cc,$(SOURCE_STRLCPY),,-DHAVE_STRLCPY),y)
 CFLAGS += -DHAVE_STRLCPY
diff --git a/tools/perf/util/srcline.c b/tools/perf/util/srcline.c
index dcff10bed7da..10983a92bb5e 100644
--- a/tools/perf/util/srcline.c
+++ b/tools/perf/util/srcline.c
@@ -8,6 +8,172 @@
 #include "util/util.h"
 #include "util/debug.h"
 
+#ifdef LIBBFD_SUPPORT
+
+/*
+ * Implement addr2line using libbfd.
+ */
+#define PACKAGE "perf"
+#include <bfd.h>
+
+struct a2l_data {
+   const char  *input;
+   unsigned long   addr;
+
+   boolfound;
+   const char  *filename;
+   const char  *funcname;
+   unsignedline;
+
+   bfd *abfd;
+   asymbol **syms;
+};
+
+static int bfd_error(const char *string)
+{
+   const char *errmsg;
+
+   errmsg = bfd_errmsg(bfd_get_error());
+   fflush(stdout);
+
+   if (string)
+   pr_debug("%s: %s\n", string, errmsg);
+   else
+   pr_debug("%s\n", errmsg);
+
+   return -1;
+}
+
+static int slurp_symtab(bfd *abfd, struct a2l_data *a2l)
+{
+   long storage;
+   long symcount;
+   asymbol **syms;
+   bfd_boolean dynamic = FALSE;
+
+   if ((bfd_get_file_flags(abfd) & HAS_SYMS) == 0)
+   return bfd_error(bfd_get_filename(abfd));
+
+   storage = bfd_get_symtab_upper_bound(abfd);
+   if (storage == 0L) {
+   storage = bfd_get_dynamic_symtab_upper_bound(abfd);
+   dynamic = TRUE;
+   }
+   if (storage < 0L)
+   return bfd_error(bfd_get_filename(abfd));
+
+   syms = malloc(storage);
+   if (dynamic)
+   symcount = bfd_canonicalize_dynamic_symtab(abfd, syms);
+   else
+   symcount = bfd_canonicalize_symtab(abfd, syms);
+
+   if (symcount < 0) {
+   free(syms);
+   return bfd_error(bfd_get_filename(abfd));
+   }
+
+   a2l->syms = syms;
+   return 0;
+}
+
+static void find_address_in_section(bfd *abfd, asection *section, void *data)
+{
+   bfd_vma pc, vma;
+   bfd_size_type size;
+   struct a2l_data *a2l = data;
+
+   if (a2l->found)
+   return;
+
+   if ((bfd_get_section_flags(abfd, section) & SEC_ALLOC) == 0)
+   return;
+
+   pc = a2l->addr;
+   vma = bfd_get_section_vma(abfd, section);
+   size = bfd_get_section_size(section);
+
+   if (pc < vma || pc >= vma + size)
+   return;
+
+   a2l->found = bfd_find_nearest_line(abfd, section, a2l->syms, pc - vma,
+  &a2l->filename, &a2l->funcname,
+  &a2l->line);
+}
+
+static struct a2l_data *addr2line_init(const char *path)
+{
+   bfd *abfd;
+   struct a2l_data *a2l = NULL;
+
+   abfd = bfd_openr(path, NULL);
+   if (abfd == NULL)
+   return NULL;
+
+   if (!bfd_check_format(abfd, bfd_object))
+   goto out;
+
+   a2l = zalloc(sizeof(*a2l));
+   if (a2l == NULL)
+   goto out;
+
+   a2l->abfd = abfd;
+   a2l->input = strdup(path);
+   if (a2l->input == NULL)
+   goto out;
+
+   if (slurp_symtab(abfd, a2l))
+   goto out;
+
+   return a2l;
+
+out:
+   if (a2l) {
+   free((void *)a2l->input);
+   free(a2l);
+   }
+   bfd_close(abfd);
+   return NULL;
+}
+
+static void addr2line_cleanup(struct a2l_data *a2l)
+{
+   if (a2l->abfd)
+   bfd_close(a2l->abfd);
+   free((void *)a2l->input);
+   free(a2l->syms);
+   free(a2l);
+}
+
+static int addr2line(const char *dso_name, unsigned long addr,
+char **file, unsigned int *line)
+{
+   int ret = 0;
+   struct a2l_data *a2l;
+
+   a2l = addr2line_init(dso_name);
+   if (a2l == NULL) {
+   pr_warning("addr2line_init failed for %s\n", dso_name);
+   return 0;
+   }
+
+   a2l->addr = addr;
+   

[PATCH 2/9] perf annotate: Reuse path from the result of addr2line

2013-07-31 Thread Namhyung Kim
From: Namhyung Kim 

In symbol__get_source_line(), path and src_line->path hold the same
value, but they were allocated separately, leaking one of them.
Just share path with src_line->path.

Signed-off-by: Namhyung Kim 
---
 tools/perf/util/annotate.c | 6 +-
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/tools/perf/util/annotate.c b/tools/perf/util/annotate.c
index d102716c43a1..8a5babc71d53 100644
--- a/tools/perf/util/annotate.c
+++ b/tools/perf/util/annotate.c
@@ -1091,11 +1091,7 @@ static int symbol__get_source_line(struct symbol *sym, 
struct map *map,
if (getline(&path, &line_len, fp) < 0 || !line_len)
goto next_close;
 
-   src_line->path = malloc(sizeof(char) * line_len + 1);
-   if (!src_line->path)
-   goto next_close;
-
-   strcpy(src_line->path, path);
+   src_line->path = path;
insert_source_line(&tmp_root, src_line);
 
next_close:
-- 
1.7.11.7



[PATCH 4/9] perf tools: Factor out get/free_srcline()

2013-07-31 Thread Namhyung Kim
From: Namhyung Kim 

Currently external addr2line tool is used for srcline sort key and
annotate with srcline info.  Separate the common code to prepare
upcoming enhancements.
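The factored-out helper in util/srcline.c sizes its "file:line" result with the two-pass snprintf() idiom (measure with snprintf(NULL, 0, ...), then allocate and format). A stand-alone sketch of that pattern, not the perf code verbatim:

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Format "file:line" into a freshly allocated buffer, measuring the
 * required size first with snprintf(NULL, 0, ...) as srcline.c does. */
static char *format_srcline(const char *file, unsigned line)
{
	size_t size = snprintf(NULL, 0, "%s:%u", file, line) + 1;
	char *srcline = malloc(size);

	if (srcline)
		snprintf(srcline, size, "%s:%u", file, line);
	return srcline;
}
```

The matching free_srcline() in the patch exists so callers never free() the shared SRCLINE_UNKNOWN sentinel by mistake.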

Signed-off-by: Namhyung Kim 
---
 tools/perf/Makefile|  1 +
 tools/perf/util/annotate.c | 20 ++-
 tools/perf/util/hist.c |  2 +-
 tools/perf/util/sort.c | 17 ++
 tools/perf/util/srcline.c  | 83 ++
 tools/perf/util/util.h |  6 
 6 files changed, 97 insertions(+), 32 deletions(-)
 create mode 100644 tools/perf/util/srcline.c

diff --git a/tools/perf/Makefile b/tools/perf/Makefile
index bfd12d02a304..e9e496e01914 100644
--- a/tools/perf/Makefile
+++ b/tools/perf/Makefile
@@ -360,6 +360,7 @@ LIB_OBJS += $(OUTPUT)util/rblist.o
 LIB_OBJS += $(OUTPUT)util/intlist.o
 LIB_OBJS += $(OUTPUT)util/vdso.o
 LIB_OBJS += $(OUTPUT)util/stat.o
+LIB_OBJS += $(OUTPUT)util/srcline.o
 
 LIB_OBJS += $(OUTPUT)ui/setup.o
 LIB_OBJS += $(OUTPUT)ui/helpline.o
diff --git a/tools/perf/util/annotate.c b/tools/perf/util/annotate.c
index 8a5babc71d53..c8a0d5762b1d 100644
--- a/tools/perf/util/annotate.c
+++ b/tools/perf/util/annotate.c
@@ -1018,7 +1018,7 @@ static void symbol__free_source_line(struct symbol *sym, 
int len)
  (sizeof(src_line->p) * (src_line->nr_pcnt - 1));
 
for (i = 0; i < len; i++) {
-   free(src_line->path);
+   free_srcline(src_line->path);
src_line = (void *)src_line + sizeof_src_line;
}
 
@@ -1035,7 +1035,6 @@ static int symbol__get_source_line(struct symbol *sym, 
struct map *map,
u64 start;
int i, k;
int evidx = evsel->idx;
-   char cmd[PATH_MAX * 2];
struct source_line *src_line;
struct annotation *notes = symbol__annotation(sym);
struct sym_hist *h = annotation__histogram(notes, evidx);
@@ -1063,10 +1062,7 @@ static int symbol__get_source_line(struct symbol *sym, 
struct map *map,
start = map__rip_2objdump(map, sym->start);
 
for (i = 0; i < len; i++) {
-   char *path = NULL;
-   size_t line_len;
u64 offset;
-   FILE *fp;
double percent_max = 0.0;
 
src_line->nr_pcnt = nr_pcnt;
@@ -1083,19 +1079,9 @@ static int symbol__get_source_line(struct symbol *sym, 
struct map *map,
goto next;
 
offset = start + i;
-   sprintf(cmd, "addr2line -e %s %016" PRIx64, filename, offset);
-   fp = popen(cmd, "r");
-   if (!fp)
-   goto next;
-
-   if (getline(&path, &line_len, fp) < 0 || !line_len)
-   goto next_close;
-
-   src_line->path = path;
+   src_line->path = get_srcline(filename, offset);
insert_source_line(&tmp_root, src_line);
 
-   next_close:
-   pclose(fp);
next:
src_line = (void *)src_line + sizeof_src_line;
}
@@ -1136,7 +1122,7 @@ static void print_summary(struct rb_root *root, const 
char *filename)
 
path = src_line->path;
color = get_percent_color(percent_max);
-   color_fprintf(stdout, color, " %s", path);
+   color_fprintf(stdout, color, " %s\n", path);
 
node = rb_next(node);
}
diff --git a/tools/perf/util/hist.c b/tools/perf/util/hist.c
index e6c11ccf138c..fa695ce1662a 100644
--- a/tools/perf/util/hist.c
+++ b/tools/perf/util/hist.c
@@ -530,7 +530,7 @@ void hist_entry__free(struct hist_entry *he)
 {
free(he->branch_info);
free(he->mem_info);
-   free(he->srcline);
+   free_srcline(he->srcline);
free(he);
 }
 
diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
index 5ba29fc679c5..0d633a059df4 100644
--- a/tools/perf/util/sort.c
+++ b/tools/perf/util/sort.c
@@ -251,8 +251,7 @@ static int hist_entry__srcline_snprintf(struct hist_entry 
*self, char *bf,
unsigned int width __maybe_unused)
 {
FILE *fp = NULL;
-   char cmd[PATH_MAX + 2], *path = self->srcline, *nl;
-   size_t line_len;
+   char *path = self->srcline;
 
if (path != NULL)
goto out_path;
@@ -263,19 +262,9 @@ static int hist_entry__srcline_snprintf(struct hist_entry 
*self, char *bf,
if (!strncmp(self->ms.map->dso->long_name, "/tmp/perf-", 10))
goto out_ip;
 
-   snprintf(cmd, sizeof(cmd), "addr2line -e %s %016" PRIx64,
-self->ms.map->dso->long_name, self->ip);
-   fp = popen(cmd, "r");
-   if (!fp)
-   goto out_ip;
-
-   if (getline(&self->srcline, &line_len, fp) < 0 || !line_len)
-   goto out_ip;
+   path = get_srcline(self->ms.map->dso->long_name, self->ip);
+   self->srcline = path;
 
-   nl = strchr(self->srcline, '\n');
-   if (nl != NULL)
-   *nl = '\0';
-   

[PATCH 9/9] perf tools: Fix srcline sort key behavior

2013-07-31 Thread Namhyung Kim
From: Namhyung Kim 

Currently the srcline sort key compares ip rather than srcline info.
I guess this was for performance reasons, to avoid running the external
addr2line utility.  Now that the functionality is implemented internally,
use the srcline info when comparing hist entries.

Also, consistently print the "??:0" string for unknown srclines rather
than printing the ip.
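Condensed, the new comparator lazily fills in the srcline and then compares strings; here entry and compute() are stand-ins for struct hist_entry and the map__rip_2objdump()/get_srcline() pair:

```c
#include <assert.h>
#include <string.h>

#define SRCLINE_UNKNOWN ((char *)"??:0")

struct entry {
	unsigned long ip;
	char *srcline;		/* NULL until the first comparison */
};

/* Stand-in for get_srcline(); pretend nothing resolves. */
static char *compute(unsigned long ip)
{
	(void)ip;
	return SRCLINE_UNKNOWN;
}

static int srcline_cmp(struct entry *l, struct entry *r)
{
	/* Resolve lazily, then compare the strings, not the ips. */
	if (!l->srcline)
		l->srcline = compute(l->ip);
	if (!r->srcline)
		r->srcline = compute(r->ip);
	return strcmp(l->srcline, r->srcline);
}
```

One visible behavior change: two entries with different ips but the same (or unknown) srcline now compare equal and collapse into one hist entry, whereas the old ip comparison kept them apart.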

Signed-off-by: Namhyung Kim 
---
 tools/perf/util/sort.c | 41 -
 1 file changed, 20 insertions(+), 21 deletions(-)

diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
index 58cfe69d3a6c..082480579861 100644
--- a/tools/perf/util/sort.c
+++ b/tools/perf/util/sort.c
@@ -243,33 +243,32 @@ struct sort_entry sort_sym = {
 static int64_t
 sort__srcline_cmp(struct hist_entry *left, struct hist_entry *right)
 {
-   return (int64_t)(right->ip - left->ip);
+   if (!left->srcline) {
+   if (!left->ms.map)
+   left->srcline = SRCLINE_UNKNOWN;
+   else {
+   struct map *map = left->ms.map;
+   left->srcline = get_srcline(map->dso,
+   map__rip_2objdump(map, left->ip));
+   }
+   }
+   if (!right->srcline) {
+   if (!right->ms.map)
+   right->srcline = SRCLINE_UNKNOWN;
+   else {
+   struct map *map = right->ms.map;
+   right->srcline = get_srcline(map->dso,
+   map__rip_2objdump(map, right->ip));
+   }
+   }
+   return strcmp(left->srcline, right->srcline);
 }
 
 static int hist_entry__srcline_snprintf(struct hist_entry *self, char *bf,
size_t size,
unsigned int width __maybe_unused)
 {
-   FILE *fp = NULL;
-   char *path = self->srcline;
-
-   if (path != NULL)
-   goto out_path;
-
-   if (!self->ms.map)
-   goto out_ip;
-
-   path = get_srcline(self->ms.map->dso, self->ip);
-   self->srcline = path;
-
-out_path:
-   if (fp)
-   pclose(fp);
-   return repsep_snprintf(bf, size, "%s", path);
-out_ip:
-   if (fp)
-   pclose(fp);
-   return repsep_snprintf(bf, size, "%-#*llx", BITS_PER_LONG / 4, 
self->ip);
+   return repsep_snprintf(bf, size, "%s", self->srcline);
 }
 
 struct sort_entry sort_srcline = {
-- 
1.7.11.7

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 5/9] perf tools: Do not try to call addr2line for non-binary files

2013-07-31 Thread Namhyung Kim
From: Namhyung Kim 

No need to call addr2line since they don't have such information.

Signed-off-by: Namhyung Kim 
---
 tools/perf/util/sort.c|  3 ---
 tools/perf/util/srcline.c | 11 +--
 2 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
index 0d633a059df4..113aa92b1191 100644
--- a/tools/perf/util/sort.c
+++ b/tools/perf/util/sort.c
@@ -259,9 +259,6 @@ static int hist_entry__srcline_snprintf(struct hist_entry 
*self, char *bf,
if (!self->ms.map)
goto out_ip;
 
-   if (!strncmp(self->ms.map->dso->long_name, "/tmp/perf-", 10))
-   goto out_ip;
-
path = get_srcline(self->ms.map->dso->long_name, self->ip);
self->srcline = path;
 
diff --git a/tools/perf/util/srcline.c b/tools/perf/util/srcline.c
index 7e92cca6f502..777f91880cdb 100644
--- a/tools/perf/util/srcline.c
+++ b/tools/perf/util/srcline.c
@@ -57,11 +57,17 @@ char *get_srcline(const char *dso_name, unsigned long addr)
 {
char *file;
unsigned line;
-   char *srcline;
+   char *srcline = SRCLINE_UNKNOWN;
size_t size;
 
+   if (dso_name[0] == '[')
+   goto out;
+
+   if (!strncmp(dso_name, "/tmp/perf-", 10))
+   goto out;
+
if (!addr2line(dso_name, addr, , ))
-   return SRCLINE_UNKNOWN;
+   goto out;
 
/* just calculate actual length */
size = snprintf(NULL, 0, "%s:%u", file, line) + 1;
@@ -73,6 +79,7 @@ char *get_srcline(const char *dso_name, unsigned long addr)
srcline = SRCLINE_UNKNOWN;
 
free(file);
+out:
return srcline;
 }
 
-- 
1.7.11.7



Re: [patch 3/3] mm: page_alloc: fair zone allocator policy

2013-07-31 Thread Minchan Kim
Hi Hannes,

On Fri, Jul 19, 2013 at 04:55:25PM -0400, Johannes Weiner wrote:
> Each zone that holds userspace pages of one workload must be aged at a
> speed proportional to the zone size.  Otherwise, the time an
> individual page gets to stay in memory depends on the zone it happened
> to be allocated in.  Asymmetry in the zone aging creates rather
> unpredictable aging behavior and results in the wrong pages being
> reclaimed, activated etc.
> 
> But exactly this happens right now because of the way the page
> allocator and kswapd interact.  The page allocator uses per-node lists
> of all zones in the system, ordered by preference, when allocating a
> new page.  When the first iteration does not yield any results, kswapd
> is woken up and the allocator retries.  Due to the way kswapd reclaims
> zones below the high watermark while a zone can be allocated from when
> it is above the low watermark, the allocator may keep kswapd running
> while kswapd reclaim ensures that the page allocator can keep
> allocating from the first zone in the zonelist for extended periods of
> time.  Meanwhile the other zones rarely see new allocations and thus
> get aged much slower in comparison.
> 
> The result is that the occasional page placed in lower zones gets
> relatively more time in memory, even get promoted to the active list
> after its peers have long been evicted.  Meanwhile, the bulk of the
> working set may be thrashing on the preferred zone even though there
> may be significant amounts of memory available in the lower zones.
> 
> Even the most basic test -- repeatedly reading a file slightly bigger
> than memory -- shows how broken the zone aging is.  In this scenario,
> no single page should be able to stay in memory long enough to get
> referenced twice and activated, but activation happens in spades:
> 
>   $ grep active_file /proc/zoneinfo
>   nr_inactive_file 0
>   nr_active_file 0
>   nr_inactive_file 0
>   nr_active_file 8
>   nr_inactive_file 1582
>   nr_active_file 11994
>   $ cat data data data data >/dev/null
>   $ grep active_file /proc/zoneinfo
>   nr_inactive_file 0
>   nr_active_file 70
>   nr_inactive_file 258753
>   nr_active_file 443214
>   nr_inactive_file 149793
>   nr_active_file 12021
> 
> Fix this with a very simple round robin allocator.  Each zone is
> allowed a batch of allocations that is proportional to the zone's
> size, after which it is treated as full.  The batch counters are reset
> when all zones have been tried and the allocator enters the slowpath
> and kicks off kswapd reclaim:
> 
>   $ grep active_file /proc/zoneinfo
>   nr_inactive_file 0
>   nr_active_file 0
>   nr_inactive_file 174
>   nr_active_file 4865
>   nr_inactive_file 53
>   nr_active_file 860
>   $ cat data data data data >/dev/null
>   $ grep active_file /proc/zoneinfo
>   nr_inactive_file 0
>   nr_active_file 0
>   nr_inactive_file 22
>   nr_active_file 4988
>   nr_inactive_file 190969
>   nr_active_file 937

First of all, I appreciate your great work!
It's amazing, and I saw that Zlatko proved it enhances real workloads.
Thanks, Zlatko, too!

I don't want to block merging, but I think we should at least discuss
some issues.

One concern I have is that it could aggravate low-memory pinning
problems like mlock. Actually, I don't have a workload that pins lots
of pages, but that is why we introduced lowmem_reserve_ratio, as you
know well, so we should at least cover this issue.

My other concern is the overhead added to the fast path.
Sometimes we are really reluctant to add even a simple "if" condition
to the fast path, but you are adding an atomic op whenever a page is
allocated, and entering the slowpath whenever all of the given zones'
batch counts are zero. Granted, it's not really a slow path, because
it can return to normal operation without calling the significantly
slower functions, thanks to prepare_slowpath() resetting the batch
counts.

I think it's a tradeoff, and I am biased toward your approach:
although we would lose a little performance to the fast path's
overhead, fair aging should recover the loss. But who knows? Someone
may have a concern.

So we should at least mention such problems.

> 
> Signed-off-by: Johannes Weiner 
> ---
>  include/linux/mmzone.h |  1 +
>  mm/page_alloc.c| 39 +--
>  2 files changed, 30 insertions(+), 10 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index af4a3b7..0c41d59 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -352,6 +352,7 @@ struct zone {
>* free areas of different sizes
>*/
>   spinlock_t  lock;
> + atomic_talloc_batch;
>   int all_unreclaimable; /* All pages pinned */
>  #if defined CONFIG_COMPACTION || defined CONFIG_CMA
>   /* Set to true when the PG_migrate_skip bits should be cleared */
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> 

[PATCH 6/9] perf tools: Pass dso instead of dso_name to get_srcline()

2013-07-31 Thread Namhyung Kim
From: Namhyung Kim 

This is a preparation for the next change.  No functional changes are
intended.

Signed-off-by: Namhyung Kim 
---
 tools/perf/util/annotate.c | 11 ---
 tools/perf/util/dso.h  |  1 +
 tools/perf/util/sort.c |  2 +-
 tools/perf/util/srcline.c  |  4 +++-
 tools/perf/util/util.h |  4 +++-
 5 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/tools/perf/util/annotate.c b/tools/perf/util/annotate.c
index c8a0d5762b1d..0137f028c0d6 100644
--- a/tools/perf/util/annotate.c
+++ b/tools/perf/util/annotate.c
@@ -1029,8 +1029,7 @@ static void symbol__free_source_line(struct symbol *sym, 
int len)
 /* Get the filename:line for the colored entries */
 static int symbol__get_source_line(struct symbol *sym, struct map *map,
   struct perf_evsel *evsel,
-  struct rb_root *root, int len,
-  const char *filename)
+  struct rb_root *root, int len)
 {
u64 start;
int i, k;
@@ -1079,7 +1078,7 @@ static int symbol__get_source_line(struct symbol *sym, 
struct map *map,
goto next;
 
offset = start + i;
-   src_line->path = get_srcline(filename, offset);
+   src_line->path = get_srcline(map->dso, offset);
insert_source_line(_root, src_line);
 
next:
@@ -1286,7 +1285,6 @@ int symbol__tty_annotate(struct symbol *sym, struct map 
*map,
 bool full_paths, int min_pcnt, int max_lines)
 {
struct dso *dso = map->dso;
-   const char *filename = dso->long_name;
struct rb_root source_line = RB_ROOT;
u64 len;
 
@@ -1296,9 +1294,8 @@ int symbol__tty_annotate(struct symbol *sym, struct map 
*map,
len = symbol__size(sym);
 
if (print_lines) {
-   symbol__get_source_line(sym, map, evsel, _line,
-   len, filename);
-   print_summary(_line, filename);
+   symbol__get_source_line(sym, map, evsel, _line, len);
+   print_summary(_line, dso->long_name);
}
 
symbol__annotate_printf(sym, map, evsel, full_paths,
diff --git a/tools/perf/util/dso.h b/tools/perf/util/dso.h
index d51aaf272c68..94c2f1a8b116 100644
--- a/tools/perf/util/dso.h
+++ b/tools/perf/util/dso.h
@@ -5,6 +5,7 @@
 #include 
 #include "types.h"
 #include "map.h"
+#include "build-id.h"
 
 enum dso_binary_type {
DSO_BINARY_TYPE__KALLSYMS = 0,
diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
index 113aa92b1191..58cfe69d3a6c 100644
--- a/tools/perf/util/sort.c
+++ b/tools/perf/util/sort.c
@@ -259,7 +259,7 @@ static int hist_entry__srcline_snprintf(struct hist_entry 
*self, char *bf,
if (!self->ms.map)
goto out_ip;
 
-   path = get_srcline(self->ms.map->dso->long_name, self->ip);
+   path = get_srcline(self->ms.map->dso, self->ip);
self->srcline = path;
 
 out_path:
diff --git a/tools/perf/util/srcline.c b/tools/perf/util/srcline.c
index 777f91880cdb..c736d9428cf2 100644
--- a/tools/perf/util/srcline.c
+++ b/tools/perf/util/srcline.c
@@ -4,6 +4,7 @@
 
 #include 
 
+#include "util/dso.h"
 #include "util/util.h"
 #include "util/debug.h"
 
@@ -53,11 +54,12 @@ out:
return ret;
 }
 
-char *get_srcline(const char *dso_name, unsigned long addr)
+char *get_srcline(struct dso *dso, unsigned long addr)
 {
char *file;
unsigned line;
char *srcline = SRCLINE_UNKNOWN;
+   char *dso_name = dso->long_name;
size_t size;
 
if (dso_name[0] == '[')
diff --git a/tools/perf/util/util.h b/tools/perf/util/util.h
index 9a0bbe45f642..20dd24d8f338 100644
--- a/tools/perf/util/util.h
+++ b/tools/perf/util/util.h
@@ -282,7 +282,9 @@ void get_term_dimensions(struct winsize *ws);
 
 #define SRCLINE_UNKNOWN  ((char *) "??:0")
 
-char *get_srcline(const char *dso_name, unsigned long addr);
+struct dso;
+
+char *get_srcline(struct dso *dso, unsigned long addr);
 void free_srcline(char *srcline);
 
 #endif /* GIT_COMPAT_UTIL_H */
-- 
1.7.11.7



Re: [RFC][PATCH 3/4] tracing/kprobes: Fail to unregister if probe event files are open

2013-07-31 Thread Steven Rostedt
Here's the new change log, but the same patch. Does this sound ok to you
guys?

-- Steve

>From 40c32592668b727cbfcf7b1c0567f581bd62a5e4 Mon Sep 17 00:00:00 2001
From: "Steven Rostedt (Red Hat)" 
Date: Wed, 3 Jul 2013 23:33:50 -0400
Subject: [PATCH] tracing/kprobes: Fail to unregister if probe event files are
 in use

When a probe is being removed, it cleans up the event files that correspond
to the probe. But there is a race between writing to one of these files
and deleting the probe. This is especially true for the "enable" file.

CPU 0   CPU 1
-   -

  fd = open("enable",O_WRONLY);

  probes_open()
  release_all_trace_probes()
  unregister_trace_probe()
  if (trace_probe_is_enabled(tp))
return -EBUSY

   write(fd, "1", 1)
   __ftrace_set_clr_event()
   call->class->reg()
(kprobe_register)
 enable_trace_probe(tp)

  __unregister_trace_probe(tp);
  list_del(>list)
  unregister_probe_event(tp) <-- fails!
  free_trace_probe(tp)

   write(fd, "0", 1)
   __ftrace_set_clr_event()
   call->class->unreg
(kprobe_register)
disable_trace_probe(tp) <-- BOOM!

A test program was written that used two threads to simulate the
above scenario, adding a nanosleep() interval to vary the timings.
After several thousand runs, it was able to trigger this bug
and crash:

BUG: unable to handle kernel paging request at 000500f9
IP: [] probes_open+0x3b/0xa7
PGD 7808a067 PUD 0
Oops:  [#1] PREEMPT SMP
Dumping ftrace buffer:
-
Modules linked in: ipt_MASQUERADE sunrpc ip6t_REJECT nf_conntrack_ipv6
CPU: 1 PID: 2070 Comm: test-kprobe-rem Not tainted 3.11.0-rc3-test+ #47
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by 
O.E.M., BIOS SDBLI944.86P 05/08/2007
task: 880077756440 ti: 880076e52000 task.ti: 880076e52000
RIP: 0010:[]  [] probes_open+0x3b/0xa7
RSP: 0018:880076e53c38  EFLAGS: 00010203
RAX: 00050001 RBX: 88007844f440 RCX: 0003
RDX: 0003 RSI: 0003 RDI: 880076e52000
RBP: 880076e53c58 R08: 880076e53bd8 R09: 
R10: 880077756440 R11: 0006 R12: 810dee35
R13: 880079250418 R14:  R15: 88007844f450
FS:  7f87a276f700() GS:88007d48() knlGS:
CS:  0010 DS:  ES:  CR0: 8005003b
CR2: 000500f9 CR3: 77262000 CR4: 07e0
Stack:
 880076e53c58 81219ea0 88007844f440 810dee35
 880076e53ca8 81130f78 8800772986c0 8800796f93a0
 81d1b5d8 880076e53e04  88007844f440
Call Trace:
 [] ? security_file_open+0x2c/0x30
 [] ? unregister_trace_probe+0x4b/0x4b
 [] do_dentry_open+0x162/0x226
 [] finish_open+0x46/0x54
 [] do_last+0x7f6/0x996
 [] ? inode_permission+0x42/0x44
 [] path_openat+0x232/0x496
 [] do_filp_open+0x3a/0x8a
 [] ? __alloc_fd+0x168/0x17a
 [] do_sys_open+0x70/0x102
 [] ? trace_hardirqs_on_caller+0x160/0x197
 [] SyS_open+0x1e/0x20
 [] system_call_fastpath+0x16/0x1b
Code: e5 41 54 53 48 89 f3 48 83 ec 10 48 23 56 78 48 39 c2 75 6c 31 f6 48 c7
RIP  [] probes_open+0x3b/0xa7
 RSP 
CR2: 000500f9
---[ end trace 35f17d68fc569897 ]---

The unregister_trace_probe() must be done first, and if it fails it must
fail the removal of the kprobe.

Several changes have already been made by Oleg Nesterov and Masami Hiramatsu
to allow moving the unregister_probe_event() before the removal of
the probe and exit the function if it fails. This prevents the tp
structure from being used after it is freed.

Link: http://lkml.kernel.org/r/20130704034038.819592...@goodmis.org

Acked-by: Masami Hiramatsu 
Signed-off-by: Steven Rostedt 
---
 kernel/trace/trace_kprobe.c | 21 +++--
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index 3811487..243f683 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -95,7 +95,7 @@ static __kprobes bool trace_probe_is_on_module(struct 
trace_probe *tp)
 }
 
 static int register_probe_event(struct trace_probe *tp);
-static void unregister_probe_event(struct trace_probe *tp);
+static int unregister_probe_event(struct trace_probe *tp);
 
 static DEFINE_MUTEX(probe_lock);
 static LIST_HEAD(probe_list);
@@ -351,9 +351,12 @@ static int unregister_trace_probe(struct trace_probe *tp)
if (trace_probe_is_enabled(tp))
return -EBUSY;
 
+   /* Will fail if probe is being used by ftrace or perf */
+   if 

Re: [PATCH 3/3] ext4: Implement FALLOC_FL_COLLAPSE_RANGE

2013-07-31 Thread Dave Chinner
On Wed, Jul 31, 2013 at 11:42:26PM +0900, Namjae Jeon wrote:
> From: Namjae Jeon 
> 
> New fallocate flag FALLOC_FL_COLLAPSE_RANGE implementation for Ext4
.
> +
> + punch_start = offset >> EXT4_BLOCK_SIZE_BITS(sb);
> + punch_stop = (offset + len) >> EXT4_BLOCK_SIZE_BITS(sb);
> +
> + rounding = max_t(uint, 1 << EXT4_BLOCK_SIZE_BITS(sb), PAGE_CACHE_SIZE);
> + ioffset = offset & ~(rounding - 1);
> +
> + /* Write out all dirty pages */
> + ret = filemap_write_and_wait_range(inode->i_mapping, ioffset, -1);
> + if (ret)
> + return ret;
> +
> + /* Take mutex lock */
> + mutex_lock(>i_mutex);
> +
> + truncate_pagecache_range(inode, ioffset, -1);

Ted, that's invalidating the page cache from the start of the
collapse range to the end of the file. So the ext4 code is doing
this bit correctly.  Why isn't this in the XFS patches? Clearly the
need for this was understood, and, well, this code is obviously
copied from the XFS hole punching code. i.e. from
xfs_free_file_space().

> + /* Wait for existing dio to complete */
> + ext4_inode_block_unlocked_dio(inode);
> + inode_dio_wait(inode);

That should be done before invalidating the pagecache.

> + credits = ext4_writepage_trans_blocks(inode);
> + handle = ext4_journal_start(inode, EXT4_HT_TRUNCATE, credits);
> + if (IS_ERR(handle)) {
> + ret = PTR_ERR(handle);
> + goto out_dio;
> + }
> +
> + down_write(_I(inode)->i_data_sem);
> +
> + ext4_discard_preallocations(inode);
> +
> + ret = ext4_es_remove_extent(inode, punch_start,
> + EXT_MAX_BLOCKS - punch_start - 1);
> + if (ret)
> + goto journal_stop;
> +
> + ret = ext4_ext_remove_space(inode, punch_start, punch_stop - 1);
> + if (ret)
> + goto journal_stop;

So, this code punches out the existing space in the file so that the
extent shifting is moving extents into a hole. Why is this in the
ext4 code, but not the XFS code?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v14 0/6] LSM: Multiple concurrent LSMs

2013-07-31 Thread Balbir Singh
On Thu, Jul 25, 2013 at 11:52 PM, Casey Schaufler
 wrote:
> Subject: [PATCH v14 0/6] LSM: Multiple concurrent LSMs
>
> Version 14 of this patchset is based on v3.10.
> It required significant change from version 13 due to changes
> in the audit code. It came out cleaner, especially in the changes
> to NetLabel. This version supports all existing LSMs running
> together at the same time. The combinations tested most completely
> are:
>
> apparmor,tomoyo,smack,yama  - Ubuntu
> apparmor,selinux,smack,yama - Fedora
>

Does this change the way one would develop a new LSM module? I presume
it does not.

> I have been unable to figure out how to configure SELinux on
> Ubuntu and TOMOYO on Fedora. That's the only reason the list
> does not include all five LSMs at once. Combining LSMs that
> use networking is tricky, but can be done. There are changes
> coming from AppArmor that might make it even trickier, but
> that's a problem for the future.
>
>
> Change the infrastructure for Linux Security Modules (LSM)s from a
> single vector of hook handlers to a list based method for handling
> multiple concurrent modules. All combinations of existing LSMs
> are supported.
>
> The "security=" boot option takes a comma separated list of LSMs,
> registering them in the order presented. The LSM hooks will be
> executed in the order registered. Hooks that return errors are
> not short circuited. All hooks are called even if one of the LSM
> hooks fails. The result returned will be that of the last LSM
> hook that failed.
>

This is an important design trade-off, and from my perspective you
might want to revisit it: today it sounds like effective security ==
all hooks process and allow the operation. In this world, the lack of a
proper policy/setting can make hooks fail. I've not looked at the code
yet, but it is worth revisiting.

> All behavior from security/capability.c has been moved into
> the hook handling.  The security/commoncap functions used
> to get called from the LSM specific code. The handling of the
> capability functions has been moved out of the LSMs and into the
> hook handling.
>
> A level of indirection has been introduced in the handling of
> security blobs. LSMs no longer access ->security fields directly,
> instead they use an abstraction provided by lsm_[gs]et field
> functions.
>
> The notion that "the security context" can be represented as a
> single u32 "secid" does not scale to the case where multiple LSMs
> want to provide "the security context". The XFRM and secmark
> facilities appear unlikely to ever allow for more than the existing
> 32 bit values. The NetLabel scheme might possibly be used to
> represent more than one labeling scheme (CIPSO does allow for
> multiple tags) although there is no plan to do so at this time.
> The SO_PEERSEC scheme is capable of providing information from
> multiple LSMs. Auditing can deal with multiple secids.
>
> The NetLabel, XFRM and secmark facilities are restricted to use
> by one LSM at a time. The SO_PEERSEC facility can provide information
> from multiple LSMs, but existing user space tools don't understand
> that. The default behavior is to assign each of these facilities
> to the first registered LSM that uses them. They can be configured
> for use by any of the LSMs that provide hooks for them. SO_PEERSEC
> can be configured to provide information from all of the LSMs that
> provide hooks.
>
> The /proc/*/attr interfaces are given to one LSM. This can be
> done by setting CONFIG_SECURITY_PRESENT. Additional interfaces
> have been created in /proc/*/attr so that each LSM has its own
> named interfaces. The name of the presenting LSM can be read from
> /sys/kernel/security/present. The list of LSMs being used can be
> read from /sys/kernel/security/lsm.
>
> A "security context" may now contain information processed by
> more than one LSM. The proper form of a security context identifies
> the information it contains by LSM:
>
> smack='Pop'selinux='system_u:object_r:etc_r:s0'
>
> A security context without the LSM identifying lsm='' gets
> passed through to all of the LSMs that use a security context. This
> maintains compatibility in the case where there is only one LSM
> using the security context.
>
> Signed-off-by: Casey Schaufler 

Balbir Singh


Re: [Ksummit-2013-discuss] [ARM ATTEND] catching up on exploit mitigations

2013-07-31 Thread Olof Johansson
On Wed, Jul 31, 2013 at 7:24 AM, Dave Jones  wrote:
> On Wed, Jul 31, 2013 at 10:40:12AM +0100, Russell King - ARM Linux wrote:
>  > On Tue, Jul 30, 2013 at 08:04:44PM -0400, Dave Jones wrote:
>  > > To use ARM as an example, the bugs I've seen have mostly been in arch 
> specific
>  > > code that does things like page-table manipulation.  The chromebook bugs 
> I
>  > > was hitting for eg were various kinds of PTE corruption warnings.
>  >
>  > Hmm, really?  Did you reported these bugs?  I'm not aware of mainline
>  > having any changes related to bug reports on PTEs on ARM.
>
> I wasn't sure if it was a googleism, or happens on mainline, so no.

We'd definitely be interested in hearing about what you've found. The
3.4 kernel has a lot of out-of-tree code for exynos, we've reduced it
quite a bit on the upcoming 3.8 one but it needs a bit more baking.

As of 3.10, it's not actually hard to run a mainline kernel on the
chromebook, but you have limited functionality (no wifi, no USB3, no
accelerated graphics). Easiest is to do it by booting from SD card.
See instructions at
http://www.chromium.org/chromium-os/u-boot-porting-guide/using-nv-u-boot-on-the-samsung-arm-chromebook.


How long would a useful run of trinity take? I've got an
autobuilder/autobooter setup here with a cross-section of current ARM
hardware (7 platforms and counting) that I mostly do boot testing on,
and I should add a smallish suite of testcases to the same. Trinity
would be a good candidate for that.


-Olof


[PATCH RFC 2/2] qspinlock x86: Enable x86 to use queue spinlock

2013-07-31 Thread Waiman Long
This patch makes the necessary changes at the x86 architecture specific
layer to enable the presence of the CONFIG_QUEUE_SPINLOCK kernel option
which can be used to replace the ticket spinlock by the queue spinlock.

Turning on the ARCH_QUEUE_SPINLOCK config option does not mean
that the ticket spinlock will be replaced. It only means that the
architecture supports the replacement. Actual replacement will only
happen if the QUEUE_SPINLOCK config option is also set.

Signed-off-by: Waiman Long 
---
 arch/x86/Kconfig  |3 +++
 arch/x86/include/asm/spinlock.h   |2 ++
 arch/x86/include/asm/spinlock_types.h |4 
 arch/x86/kernel/paravirt-spinlocks.c  |   10 ++
 4 files changed, 19 insertions(+), 0 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index b32ebf9..8531e47 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2344,6 +2344,9 @@ config X86_DMA_REMAP
bool
depends on STA2X11
 
+config ARCH_QUEUE_SPINLOCK
+   def_bool y
+
 source "net/Kconfig"
 
 source "drivers/Kconfig"
diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 33692ea..0ea805d 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -34,6 +34,7 @@
 # define UNLOCK_LOCK_PREFIX
 #endif
 
+#ifndef CONFIG_QUEUE_SPINLOCK
 /*
  * Ticket locks are conceptually two parts, one indicating the current head of
  * the queue, and the other indicating the current tail. The lock is acquired
@@ -130,6 +131,7 @@ static __always_inline void 
arch_spin_lock_flags(arch_spinlock_t *lock,
 }
 
 #endif /* CONFIG_PARAVIRT_SPINLOCKS */
+#endif /* !CONFIG_QUEUE_SPINLOCK */
 
 static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
 {
diff --git a/arch/x86/include/asm/spinlock_types.h 
b/arch/x86/include/asm/spinlock_types.h
index ad0ad07..8a7721e 100644
--- a/arch/x86/include/asm/spinlock_types.h
+++ b/arch/x86/include/asm/spinlock_types.h
@@ -7,6 +7,9 @@
 
 #include 
 
+#ifdef CONFIG_QUEUE_SPINLOCK
+# include 
+#else
 #if (CONFIG_NR_CPUS < 256)
 typedef u8  __ticket_t;
 typedef u16 __ticketpair_t;
@@ -27,6 +30,7 @@ typedef struct arch_spinlock {
 } arch_spinlock_t;
 
 #define __ARCH_SPIN_LOCK_UNLOCKED  { { 0 } }
+#endif
 
 #include 
 
diff --git a/arch/x86/kernel/paravirt-spinlocks.c 
b/arch/x86/kernel/paravirt-spinlocks.c
index 676b8c7..75569cb 100644
--- a/arch/x86/kernel/paravirt-spinlocks.c
+++ b/arch/x86/kernel/paravirt-spinlocks.c
@@ -15,6 +15,15 @@ default_spin_lock_flags(arch_spinlock_t *lock, unsigned long 
flags)
 
 struct pv_lock_ops pv_lock_ops = {
 #ifdef CONFIG_SMP
+# ifdef CONFIG_QUEUE_SPINLOCK
+   .spin_is_locked = queue_spin_is_locked,
+   .spin_is_contended = queue_spin_is_contended,
+
+   .spin_lock = queue_spin_lock,
+   .spin_lock_flags = default_spin_lock_flags,
+   .spin_trylock = queue_spin_trylock,
+   .spin_unlock = queue_spin_unlock,
+# else
.spin_is_locked = __ticket_spin_is_locked,
.spin_is_contended = __ticket_spin_is_contended,
 
@@ -22,6 +31,7 @@ struct pv_lock_ops pv_lock_ops = {
.spin_lock_flags = default_spin_lock_flags,
.spin_trylock = __ticket_spin_trylock,
.spin_unlock = __ticket_spin_unlock,
+# endif
 #endif
 };
 EXPORT_SYMBOL(pv_lock_ops);
-- 
1.7.1



[PATCH RFC 1/2] qspinlock: Introducing a 4-byte queue spinlock implementation

2013-07-31 Thread Waiman Long
This patch introduces a new queue spinlock implementation that can
serve as an alternative to the default ticket spinlock. This queue
spinlock should be just as fair as the ticket spinlock, and it is
faster both single-threaded and under high contention. Only in light
to moderate contention, where the average queue depth is around 1-2,
will this queue spinlock potentially be a bit slower, due to the
higher slowpath overhead.

This queue spinlock is especially suited to NUMA machines with a large
number of cores, as the chance of spinlock contention is much higher
on those machines.

The idea behind this spinlock implementation is the fact that spinlocks
are acquired with preemption disabled. In other words, the process
will not be migrated to another CPU while it is trying to get a
spinlock. Ignoring interrupt handling, a CPU can only be contending
in one spinlock at any one time. Of course, an interrupt handler can try
to acquire one spinlock while the interrupted user process is in the
process of getting another spinlock. By allocating a set of per-cpu
queue nodes and using them to form a waiting queue, we can encode the
queue node address into a much smaller 16-bit size. Together with
the 1-byte lock bit, this queue spinlock implementation will only
need 4 bytes to hold all the information that it needs.

The current queue node address encoding is as follows:
Bits 0-1 : queue node index in the per-cpu array (4 entries)
Bits 2-16: cpu number + 1 (max cpus = 16383)

By also using the unused byte, we could support more CPUs and queue
node entries in the array, at the expense of more code and probably
some performance overhead.

In the extremely unlikely case that all the queue node entries are
used up, the current code will fall back to busy spinning without
waiting in a queue, and print a warning message.

With minor modification to the lock fastpath, we could also make a
less fair version of queue spinlock that allows lock stealing. This
can potentially provide better performance for some workloads.

This patch allows the optional replacement of architecture specific
implementation of ticket spinlock by this generic version of queue
spinlock. Two new config parameters are introduced:

1. QUEUE_SPINLOCK
   A select-able option that enables the building and replacement of
   architecture specific spinlock by queue spinlock.
2. ARCH_QUEUE_SPINLOCK
   Have to be defined in arch/$(ARCH)/Kconfig to enable QUEUE_SPINLOCK
   option. This option, by itself, will not enable the queue spinlock
   feature.

For single-thread performance (no contention), a 256K lock/unlock
loop was run on a 2.93Ghz Westmere x86-64 CPU.  The following table
shows the average time (in ns) for a single lock/unlock sequence
(including the looping and timing overhead):

Lock Type   Time (ns)
-   -
Ticket spinlock   12.1
Queue spinlock10.9

So the queue spinlock is slightly faster.

The AIM7 benchmark was run on a 8-socket 80-core DL980 with Westmere
x86-64 CPUs to evaluate the performance impact of this patch on the
3.10.1 kernel. The jobs per minute (JPM) results for the 1100-2000
user ranges are shown below:

++++++++
| Kernel | 3.10.1 | 3.10.1 |%Change | 3.10.1 | 3.10.1 |%Change |
||| qlock  ||| qlock  ||
| HT |  off   |  off   ||   on   |   on   ||
++++++++
|alltests| 403590 | 397622 |  -1.5% | 411479 | 411428 |   0.0% |
|custom  | 304581 | 307165 |  +0.9% | 230335 | 254798 | +10.6% |
|dbase   | 896252 | 900597 |  +0.5% |1134659 |1137730 |  +0.3% |
|disk| 213679 | 270597 | +26.6% | 203974 | 249998 | +22.6% |
|five_sec| 144012 | 157286 |  +9.2% | 132599 | 150812 | +13.7% |
|fserver | 472741 | 467213 |  -1.2% | 270601 | 285737 |  +5.6% |
|high_systime| 131106 | 139015 |  +6.0% | 120294 | 125802 |  +4.6% |
|new_dbase   | 934871 | 936860 |  +0.2  |1177640 |1181679 |  +0.3% |
|new_fserver | 452544 | 444967 |  -1.7% | 262368 | 275827 |  +5.1% |
|shared  | 366750 | 369901 |  +0.9% | 403645 | 415840 |  +3.0% |
|short   |1061663 |3003942 |+183.0% |1040392 |2834437 |+172.4% |
++++++++

There are cases where the performance drops 1-2%, but the majority
of them are performance gains and some of them are pretty significant.

For the disk workload, the spinlock bottleneck is the standalone
mb_cache_spinlock in fs/mbcache.c. For the short workload, the spinlock
bottleneck is the d_lock in the dentry structure.

The following table shows the %time used up by the locks as reported
by the perf command at 1000 users for disk workload & 1500 users for
short workload with HT off.

-----------------+----------+----------+----------+
       Lock      |  ticket  |  qlock   |  qlock   |
 

Re: [PATCH V2 1/9] perf tools: add test for reading object code

2013-07-31 Thread Namhyung Kim
Hi Peter,

On Wed, Jul 31, 2013 at 11:24 PM, Peter Zijlstra  wrote:
> On Wed, Jul 31, 2013 at 11:17:32AM -0300, Arnaldo Carvalho de Melo wrote:
>> Em Wed, Jul 31, 2013 at 12:13:50AM +0300, Adrian Hunter escreveu:
>> > Using the information in mmap events, perf tools can read object
>> > code associated with sampled addresses.  A test is added that
>> > compares bytes read by perf with the same bytes read using
>> > objdump.
>>
>> So this parses objdump output, and we also already have the annotation
>> logic that does that too, have you thought about having common routines
>> for these two cases?
>>
>
> Or better yet, stop using objdump like this and start using libbfd
> directly. The only reason we did horrible things like parsing objdump
> output is because nobody knew how the underlying stuff actually worked
> and we wanted to have something quick.

I have a similar patch for getting srcline info (using addr2line) based
on Roberto Vitillo's patch.  I'll send the series soon and then look
at the objdump part too.


-- 
Thanks,
Namhyung Kim
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] mmc: omap_hsmmc: Fix sleep too long in ISR context.

2013-07-31 Thread majianpeng
We found a problem: when a working SD card is removed, the irqaction
of omap_hsmmc can sleep for up to 3.6s. This caused our watchdog to fire.
In omap_hsmmc_reset_controller_fsm, the code should watch for a 0->1
transition. It used loops_per_jiffy as the timeout:
> while ((!(OMAP_HSMMC_READ(host->base, SYSCTL) & bit))
>&& (i++ < limit))
>cpu_relax();
But loops_per_jiffy is calibrated against a bare loop:
>  while(i++ < limit)
>   cpu_relax();
The extra work in the polling loop body makes the actual wait much longer.
Because this code runs in ISR context, it can't use time_before/time_after.
Instead, I divided the timeout into 1us steps and used udelay(1).
This adds the cost of the extra udelay(1) calls, but from my testing it
looks good.
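
The bounded-polling idea above can be sketched as a user-space analogue (illustrative only, not the driver code; the register, the test knob, and the 100ms budget are assumptions made for this sketch):

```c
#include <assert.h>

/* Sketch of bounded polling: instead of counting loops_per_jiffy
 * iterations of cpu_relax() (whose duration depends on the loop body),
 * poll at a known 1us period so the total wait is bounded in
 * wall-clock time.
 */

#define MMC_TIMEOUT_US  (100 * 1000)    /* assumed 100ms budget */

static unsigned int fake_sysctl;        /* stands in for the SYSCTL register */
static unsigned int reads_until_set;    /* test knob: bit sets after N reads */

static unsigned int read_sysctl(void)
{
    if (reads_until_set && --reads_until_set == 0)
        fake_sysctl |= 0x1;
    return fake_sysctl;
}

static void udelay(unsigned int us) { (void)us; /* no-op in this sketch */ }

/* Returns 0 if the bit became set within the budget, -1 on timeout. */
static int wait_for_bit_set(unsigned int bit)
{
    unsigned int i = MMC_TIMEOUT_US;    /* one udelay(1) per iteration */

    while (i--) {
        if (read_sysctl() & bit)
            return 0;
        udelay(1);
    }
    return -1;
}
```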

Reported-by: Yuzheng Ma 
Tested-by: Yuzheng Ma 
Signed-off-by: Jianpeng Ma 
---
 drivers/mmc/host/omap_hsmmc.c | 20 +---
 1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/drivers/mmc/host/omap_hsmmc.c b/drivers/mmc/host/omap_hsmmc.c
index 1865321..96daca1 100644
--- a/drivers/mmc/host/omap_hsmmc.c
+++ b/drivers/mmc/host/omap_hsmmc.c
@@ -977,6 +977,8 @@ static inline void omap_hsmmc_reset_controller_fsm(struct 
omap_hsmmc_host *host,
unsigned long limit = (loops_per_jiffy *
msecs_to_jiffies(MMC_TIMEOUT_MS));
 
+   /* Convert the timeout into 1us steps so udelay(1) can be used */
+   i = limit / (MMC_TIMEOUT_MS * 1000);
OMAP_HSMMC_WRITE(host->base, SYSCTL,
 OMAP_HSMMC_READ(host->base, SYSCTL) | bit);
 
@@ -985,15 +987,19 @@ static inline void omap_hsmmc_reset_controller_fsm(struct 
omap_hsmmc_host *host,
 * Monitor a 0->1 transition first
 */
if (mmc_slot(host).features & HSMMC_HAS_UPDATED_RESET) {
-   while ((!(OMAP_HSMMC_READ(host->base, SYSCTL) & bit))
-   && (i++ < limit))
-   cpu_relax();
+   while (i--) {
+   if ((OMAP_HSMMC_READ(host->base, SYSCTL) & bit))
+   break;
+   udelay(1);
+   }
}
-   i = 0;
 
-   while ((OMAP_HSMMC_READ(host->base, SYSCTL) & bit) &&
-   (i++ < limit))
-   cpu_relax();
+   i = limit / (MMC_TIMEOUT_MS * 1000);
+   while (i--) {
+   if (!(OMAP_HSMMC_READ(host->base, SYSCTL) & bit))
+   break;
+   udelay(1);
+   }
 
if (OMAP_HSMMC_READ(host->base, SYSCTL) & bit)
dev_err(mmc_dev(host->mmc),
-- 
1.8.1.2


Thanks!
Jianpeng Ma

Re: [PATCH 4/9] dma: edma: Find missed events and issue them

2013-07-31 Thread Joel Fernandes
On 07/31/2013 04:18 AM, Sekhar Nori wrote:
> On Wednesday 31 July 2013 10:19 AM, Joel Fernandes wrote:
>> Hi Sekhar,
>>
>> On 07/30/2013 02:05 AM, Sekhar Nori wrote:
>>> On Monday 29 July 2013 06:59 PM, Joel Fernandes wrote:
 In an effort to move to using Scatter gather lists of any size with
 EDMA as discussed at [1] instead of placing limitations on the driver,
 we work through the limitations of the EDMAC hardware to find missed
 events and issue them.

 The sequence of events that require this are:

 For the scenario where MAX slots for an EDMA channel is 3:

 SG1 -> SG2 -> SG3 -> SG4 -> SG5 -> SG6 -> Null

 The above SG list will have to be DMA'd in 2 sets:

 (1) SG1 -> SG2 -> SG3 -> Null
 (2) SG4 -> SG5 -> SG6 -> Null

 After (1) is successfully transferred, the events from the MMC controller
 do not stop coming and are missed by the time we have set up the transfer
 for (2). So here, we catch the missed events as an error condition and
 issue them manually.
>>>
>>> Are you sure there wont be any effect of these missed events on the
>>> peripheral side. For example, wont McASP get into an underrun condition
>>> when it encounters a null PaRAM set? Even UART has to transmit to a
>>
>> But it will not encounter null PaRAM set because McASP uses contiguous
>> buffers for transfer which are not scattered across physical memory.
>> This can be accomplished with an SG of size 1. For such SGs, this patch
>> series leaves it linked Dummy and does not link to Null set. Null set is
>> only used for SG lists that are > MAX_NR_SG in size such as those
>> created for example by MMC and Crypto.
>>
>>> particular baud so I guess it cannot wait like the way MMC/SD can.
>>
>> Existing driver have to wait anyway if they hit MAX SG limit today. If
>> they don't want to wait, they would have allocated a contiguous block of
>> memory and DMA that in one stretch so they don't lose any events, and in
>> such cases we are not linking to Null.
> 
> As long as DMA driver can advertize its MAX SG limit, peripherals can
> always work around that by limiting the number of sync events they
> generate so as to not having any of the events getting missed. With this
> series, I am worried that EDMA drivers is advertizing that it can handle
> any length SG list while not taking care of missing any events while
> doing so. This will break the assumptions that driver writers make.

This is already being done by some other DMA engine drivers ;). We can
advertise more than we can handle at a time; that's the basis of this
whole idea.

I understand what you're saying, but events do not have to be serviced
immediately; they can be queued etc. and the actual transfer from the
DMA controller can be delayed. As long as we don't miss the event we
are fine, which my series takes care of.

So far I have tested this series on following modules in various
configurations and have seen no issues:
- Crypto AES
- MMC/SD
- SPI (128x160 display)

>>> Also, wont this lead to under-utilization of the peripheral bandwith?
>>> Meaning, MMC/SD is ready with data but cannot transfer because the DMA
>>> is waiting to be set-up.
>>
>> But it is waiting anyway even today. Currently based on MAX segs, MMC
>> driver/subsystem will make SG list of size max_segs. Between these
>> sessions of creating such smaller SG-lists, if for some reason the MMC
>> controller is sending events, these will be lost anyway.
> 
> But if MMC/SD driver knows how many events it should generate if it
> knows the MAX SG limit. So there should not be any missed events in
> current code. And I am not claiming that your solution is making matters
> worse. But its not making it much better as well.

This is not true for crypto: the events are not deasserted and crypto
continues to send events. This is what led to the "don't trigger in
Null" patch, where I'm setting the missed flag to avoid recursion.

>> This can be used only for buffers that are contiguous in memory, not
>> those that are scattered across memory.
> 
> I was hinting at using the linking facility of EDMA to achieve this.
> Each PaRAM set has full 32-bit source and destination pointers so I see
> no reason why non-contiguous case cannot be handled.
> 
> Lets say you need to transfer SG[0..6] on channel C. Now, PaRAM sets are
> typically 4 times the number of channels. In this case we use one DMA
> PaRAM set and two Link PaRAM sets per channel. P0 is the DMA PaRAM set
> and P1 and P2 are the Link sets.
> 
> Initial setup:
> 
> SG0 -> SG1 -> SG2 -> SG3 -> SG4 -> SG5 -> SG6 -> NULL
>  ^  ^  ^
>  |  |  |
> P0  -> P1  -> P2  -> NULL
> 
> P[0..2].TCINTEN = 1, so get an interrupt after each SG element
> completion. On each completion interrupt, hardware automatically copies
> the linked PaRAM set into the DMA PaRAM set so after SG0 is transferred
> out, the state of hardware is:
> 
> SG1 -> SG2 -> SG3 -> SG4 -> SG5 -> SG6 -> NULL
>  ^  

Re: [PATCH v2] dma: mmp_pdma: fix a memory alloc error

2013-07-31 Thread Xiang Wang
2013/7/29 Vinod Koul :
> On Thu, Jul 25, 2013 at 10:22:50AM +0800, Xiang Wang wrote:
>
> Btw its generally a good idea to remove the parts not required for further
> discussion
>
>> Would you please explain a little more about why we should use
>> GFP_NOWAIT here? This memory is not dedicated for DMA controller. Do
>> we have special reasons to use GFP_NOWAIT? Thanks!
> There are few reasons
> - the dma requests can be triggered from atomic context
> - so for above you can argue to use the GFP_ATOMIC and the guideline for dma
>   drivers [1] was discussed briefly
>
> ~Vinod
> [1]: http://lkml.indiana.edu/hypermail/linux/kernel/0911.1/02316.html
>
> --
Hi, Vinod
Thanks for your explanation. But want to double confirm with you that
the memory alloc in my patch is in probe function, still need to use
GFP_NOWAIT? Thanks!

-- 
Regards,
Xiang
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH -v2] sched,x86: optimize switch_mm for multi-threaded workloads

2013-07-31 Thread Linus Torvalds
On Wed, Jul 31, 2013 at 7:14 PM, Rik van Riel  wrote:
>
> Is this better?  Not that I really care which version gets applied :)

I'm ok with this.

That said, I'm assuming this goes in through Ingo, since it is both
x86 and scheduler-related. I can take it directly, but see no real
reason to. Ingo?

Talking about x86-specific - I'm assuming other architectures have
similar patterns?

   Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC/RFT PATCH 4/5] ARM: mm: change max*pfn to include the physical offset of memory

2013-07-31 Thread Santosh Shilimkar
On Wednesday 31 July 2013 06:56 AM, Russell King - ARM Linux wrote:
> On Fri, Jul 12, 2013 at 05:48:13PM -0400, Santosh Shilimkar wrote:
>> diff --git a/arch/arm/include/asm/dma-mapping.h 
>> b/arch/arm/include/asm/dma-mapping.h
>> index 5b579b9..b2d5937 100644
>> --- a/arch/arm/include/asm/dma-mapping.h
>> +++ b/arch/arm/include/asm/dma-mapping.h
>> @@ -47,12 +47,12 @@ static inline int dma_set_mask(struct device *dev, u64 
>> mask)
>>  #ifndef __arch_pfn_to_dma
>>  static inline dma_addr_t pfn_to_dma(struct device *dev, unsigned long pfn)
>>  {
>> -return (dma_addr_t)__pfn_to_bus(pfn);
>> +return (dma_addr_t)__pfn_to_bus(pfn + PHYS_PFN_OFFSET);
>>  }
>>  
>>  static inline unsigned long dma_to_pfn(struct device *dev, dma_addr_t addr)
>>  {
>> -return __bus_to_pfn(addr);
>> +return __bus_to_pfn(addr) - PHYS_PFN_OFFSET;
>>  }
>>  
>>  static inline void *dma_to_virt(struct device *dev, dma_addr_t addr)
>> @@ -64,15 +64,16 @@ static inline dma_addr_t virt_to_dma(struct device *dev, 
>> void *addr)
>>  {
>>  return (dma_addr_t)__virt_to_bus((unsigned long)(addr));
>>  }
>> +
>>  #else
>>  static inline dma_addr_t pfn_to_dma(struct device *dev, unsigned long pfn)
>>  {
>> -return __arch_pfn_to_dma(dev, pfn);
>> +return __arch_pfn_to_dma(dev, pfn + PHYS_PFN_OFFSET);
>>  }
>>  
>>  static inline unsigned long dma_to_pfn(struct device *dev, dma_addr_t addr)
>>  {
>> -return __arch_dma_to_pfn(dev, addr);
>> +return __arch_dma_to_pfn(dev, addr) - PHYS_PFN_OFFSET;
>>  }
>>  
>>  static inline void *dma_to_virt(struct device *dev, dma_addr_t addr)
>> @@ -86,6 +87,13 @@ static inline dma_addr_t virt_to_dma(struct device *dev, 
>> void *addr)
>>  }
>>  #endif
> 
> Note that I don't think the above are correct - the 'pfn' argument to the
> above functions already includes the PFN offset of physical memory -
> they're all physical_address >> PAGE_SHIFT values.
> 
Right.
Updated patch pushed into the patch system. (patch 7805/1)

Regards,
Santosh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH -v2] sched,x86: optimize switch_mm for multi-threaded workloads

2013-07-31 Thread Rik van Riel
On Wed, 31 Jul 2013 17:41:39 -0700
Linus Torvalds  wrote:

> However, as Rik points out, activate_mm() is different in that we
> shouldn't have any preexisting MMU state anyway. And besides, that
> should never trigger the "prev == next" case.
> 
> But it does look a bit messy, and even your comment is a bit
> misleading (it might make somebody think that all of switch_mm() is
> protected from interrupts)

Is this better?  Not that I really care which version gets applied :)

---8<---

Subject: sched,x86: optimize switch_mm for multi-threaded workloads

Dick Fowles, Don Zickus and Joe Mario have been working on improvements 
to perf, and noticed heavy cache line contention on the mm_cpumask,
running linpack on a 60 core / 120 thread system.

The cause turned out to be unnecessary atomic accesses to the
mm_cpumask. When in lazy TLB mode, the CPU is only removed from
the mm_cpumask if there is a TLB flush event.

Most of the time, no such TLB flush happens, and the kernel
skips the TLB reload.  It can also skip the atomic memory
set & test.
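
The test-before-set idea can be sketched as a user-space analogue (illustrative only; in the kernel patch below the set can be a plain cpumask_set_cpu because interrupts are off, whereas this standalone sketch uses an atomic OR):

```c
#include <assert.h>
#include <stdatomic.h>

/* On the hot path, do a plain read of the bitmask first and only fall
 * back to the atomic read-modify-write when the bit is not already
 * set.  In the common case (two threads of the same mm switching) the
 * bit is already set, so the cache line is never written and stays
 * shared instead of bouncing between cores.
 */

static atomic_ulong mm_cpumask_word;

static int test_bit(int cpu)
{
    return (atomic_load_explicit(&mm_cpumask_word,
                                 memory_order_relaxed) >> cpu) & 1;
}

static void set_bit_if_clear(int cpu)
{
    if (!test_bit(cpu))                             /* cheap shared read */
        atomic_fetch_or(&mm_cpumask_word, 1UL << cpu);  /* rare write */
}
```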

Here is a summary of Joe's test results:

 * The __schedule function dropped from 24% of all program cycles down
   to 5.5%.
 * The cacheline contention/hotness for accesses to that bitmask went
   from being the 1st/2nd hottest - down to the 84th hottest (0.3% of
   all shared misses which is now quite cold)
 * The average load latency for the bit-test-n-set instruction in
   __schedule dropped from 10k-15k cycles down to an average of 600 cycles.
 * The linpack program results improved from 133 GFlops to 144 GFlops.
   Peak GFlops rose from 133 to 153. 

Reported-by: Don Zickus 
Reported-by: Joe Mario 
Tested-by: Joe Mario 
Signed-off-by: Rik van Riel 

Signed-off-by: Rik van Riel 
---
 arch/x86/include/asm/mmu_context.h | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/mmu_context.h 
b/arch/x86/include/asm/mmu_context.h
index cdbf367..3ac6089 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -59,7 +59,13 @@ static inline void switch_mm(struct mm_struct *prev, struct 
mm_struct *next,
this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
BUG_ON(this_cpu_read(cpu_tlbstate.active_mm) != next);
 
-   if (!cpumask_test_and_set_cpu(cpu, mm_cpumask(next))) {
+   if (!cpumask_test_cpu(cpu, mm_cpumask(next))) {
+   /* On established mms, the mm_cpumask is only changed
+* from irq context, from ptep_clear_flush while in
+* lazy tlb mode, and here. Irqs are blocked during
+* schedule, protecting us from simultaneous changes.
+*/
+   cpumask_set_cpu(cpu, mm_cpumask(next));
/* We were in lazy tlb mode and leave_mm disabled
 * tlb flush IPI delivery. We must reload CR3
 * to make sure to use no freed page tables.


Re: [PATCH] sched,x86: optimize switch_mm for multi-threaded workloads

2013-07-31 Thread Linus Torvalds
On Wed, Jul 31, 2013 at 6:58 PM, Rik van Riel  wrote:
>
> Would you like me to document the things we found in the comment,
> and resend a patch, or is the patch good as-is?

Please re-send with commentary..

 Linus


Re: [RFC][PATCH 3/4] tracing/kprobes: Fail to unregister if probe event files are open

2013-07-31 Thread Steven Rostedt
On Wed, 2013-07-31 at 18:42 -0400, Steven Rostedt wrote:
>  
> > This should be fine. Either event_remove() path takes event_mutex
> > first and then ->write() fails, or ftrace_event_enable_disable()
> > actually disables this even successfully.
> 
> Actually I meant while in unregister_trace_probe(), it gets by the
> trace_probe_is_enabled() part first, then the write succeeds (as the
> event_mutex isn't taken till unregister_probe_event()). The the
> unregister_probe_event fails, but the tp was freed. The event files
> still reference the tp and this is where a crash can happen without this
> patch set.

And it's not just theoretical. I worked on a program to try to trigger
this bug, and succeeded :-)  (I've never been so happy being able to
crash my kernel)

[  128.999772] BUG: unable to handle kernel paging request at 000500f9
[  129.15] IP: [] probes_open+0x3b/0xa7
[  129.15] PGD 7808a067 PUD 0 
[  129.15] Oops:  [#1] PREEMPT SMP 
[  129.15] Dumping ftrace buffer:
   
[  129.15] -
[  129.15] Modules linked in: ipt_MASQUERADE sunrpc ip6t_REJECT 
nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables uinput 
snd_hda_codec_idt kvm_intel snd_hda_intel snd_hda_codec snd_hwdep snd_seq 
snd_seq_device snd_pcm kvm snd_page_alloc snd_timer shpchp snd microcode 
i2c_i801 soundcore pata_acpi firewire_ohci firewire_core crc_itu_t ata_generic 
i915 drm_kms_helper drm i2c_algo_bit i2c_core video
[  129.15] CPU: 1 PID: 2070 Comm: test-kprobe-rem Not tainted 
3.11.0-rc3-test+ #47
[  129.15] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To 
be filled by O.E.M., BIOS SDBLI944.86P 05/08/2007
[  129.15] task: 880077756440 ti: 880076e52000 task.ti: 
880076e52000
[  129.15] RIP: 0010:[]  [] 
probes_open+0x3b/0xa7
[  129.15] RSP: 0018:880076e53c38  EFLAGS: 00010203
[  129.15] RAX: 00050001 RBX: 88007844f440 RCX: 0003
[  129.15] RDX: 0003 RSI: 0003 RDI: 880076e52000
[  129.15] RBP: 880076e53c58 R08: 880076e53bd8 R09: 
[  129.15] R10: 880077756440 R11: 0006 R12: 810dee35
[  129.15] R13: 880079250418 R14:  R15: 88007844f450
[  129.15] FS:  7f87a276f700() GS:88007d48() 
knlGS:
[  129.15] CS:  0010 DS:  ES:  CR0: 8005003b
[  129.15] CR2: 000500f9 CR3: 77262000 CR4: 07e0
[  129.15] Stack:
[  129.15]  880076e53c58 81219ea0 88007844f440 
810dee35
[  129.15]  880076e53ca8 81130f78 8800772986c0 
8800796f93a0
[  129.15]  81d1b5d8 880076e53e04  
88007844f440
[  129.15] Call Trace:
[  129.15]  [] ? security_file_open+0x2c/0x30
[  129.15]  [] ? unregister_trace_probe+0x4b/0x4b
[  129.15]  [] do_dentry_open+0x162/0x226
[  129.15]  [] finish_open+0x46/0x54
[  129.15]  [] do_last+0x7f6/0x996
[  129.15]  [] ? inode_permission+0x42/0x44
[  129.15]  [] path_openat+0x232/0x496
[  129.15]  [] do_filp_open+0x3a/0x8a
[  129.15]  [] ? __alloc_fd+0x168/0x17a
[  129.15]  [] do_sys_open+0x70/0x102
[  129.15]  [] ? trace_hardirqs_on_caller+0x160/0x197
[  129.15]  [] SyS_open+0x1e/0x20
[  129.15]  [] system_call_fastpath+0x16/0x1b
[  129.15] Code: e5 41 54 53 48 89 f3 48 83 ec 10 48 23 56 78 48 39 c2 75 
6c 31 f6 48 c7 c7 d0 e6 a4 81 e8 3a 91 43 00 48 8b 05 f2 f8 96 00 eb 0c  80 
f8 00 00 00 03 75 31 48 8b 00 48 3d 60 e7 a4 81 75 ec eb 
[  129.15] RIP  [] probes_open+0x3b/0xa7
[  129.15]  RSP 
[  129.15] CR2: 000500f9
[  130.21] ---[ end trace 35f17d68fc569897 ]---

Attached is the program I used. It took lots of tweaks and doesn't
always trigger the bug on the first run; it can take several runs to
trigger. Each bug I've seen has been rather random. Twice it crashed due
to memory errors; once it just screwed up the kprobes, but the system
still ran fine. Another time I got a kprobes error, and when I went to
do more tracing, it crashed.

What my program does is create two threads (well, one thread plus the
main thread). They pin themselves to CPU 0 and CPU 1. Then the following
occurs:

CPU 0   CPU 1
-   -
  create sigprocmask probe

loop:   loop:

fd = open(events/kprobes/sigprocmask/enable)

pthread_barrier_wait()

  nanosleep(interval)

  pthread_barrier_wait()

write(fd, "1", 1);

write(fd, "0", 1);

  truncate kprobe_events
  // deletes all probes

pthread_barrier_wait();
  pthread_barrier_wait();


write(fd, "0", 1); // just in case

  create sigprocmask probe

pthread_barrier_wait()

  

[PATCH] mfd: max8997: cast void pointer to data of max8997_pmic_dt_match

2013-07-31 Thread Jingoo Han
A cast to (void *) is necessary for the .data field of
max8997_pmic_dt_match, because the 'data' member of struct
'of_device_id' is declared as 'const void *'. Thus, a pointer value
must be stored rather than a plain integer.
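
The pattern can be sketched in user-space C (the struct layout and the enum value here are illustrative stand-ins, not the real kernel definitions; the extra cast through unsigned long keeps the widths matched on 64-bit):

```c
#include <assert.h>

/* A small integer constant is smuggled through a 'const void *' field
 * by casting on the way in, and recovered with a cast via unsigned
 * long on the way out, as a probe() routine would do.
 */

enum max8997_types { TYPE_MAX8997 = 1 };    /* value assumed */

struct of_device_id_like {
    const char *compatible;
    const void *data;
};

static const struct of_device_id_like match_table[] = {
    { .compatible = "maxim,max8997-pmic",
      .data = (void *)(unsigned long)TYPE_MAX8997 },
    { }
};

/* Recover the integer from the matched entry. */
static enum max8997_types match_data_to_type(const void *data)
{
    return (enum max8997_types)(unsigned long)data;
}
```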

Signed-off-by: Jingoo Han 
---
 drivers/mfd/max8997.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/mfd/max8997.c b/drivers/mfd/max8997.c
index 1471405..e6d668a 100644
--- a/drivers/mfd/max8997.c
+++ b/drivers/mfd/max8997.c
@@ -51,7 +51,7 @@ static struct mfd_cell max8997_devs[] = {
 
 #ifdef CONFIG_OF
 static struct of_device_id max8997_pmic_dt_match[] = {
-   { .compatible = "maxim,max8997-pmic", .data = TYPE_MAX8997 },
+   { .compatible = "maxim,max8997-pmic", .data = (void *)TYPE_MAX8997 },
{},
 };
 #endif
-- 
1.7.10.4




Re: [patch 1/3] raid5: offload stripe handle to workqueue

2013-07-31 Thread Shaohua Li
On Wed, Jul 31, 2013 at 06:33:32AM -0400, Tejun Heo wrote:
> Hello,
> 
> On Wed, Jul 31, 2013 at 09:24:34AM +0800, Shaohua Li wrote:
> > stripe is the work unit actually. As I said, if I queue a work for each 
> > stripe,
> > just queue_work() will make the system blast because of the pwq->pool->lock
> > contention. dispatching one work has another side effect that I can't add 
> > block
> 
> Hmmm I see.  I'm not familiar with the code base and could be
> missing something but how does the custom stripe dispatch queue
> synchronize?  Doesn't that make use of a lock anyway?  If so, how
> would scheduling separate work items be worse?  

It does have a lock, but no lock is required when a stripe is queued
for handling, whereas the workqueue lock would be highly contended.

> Also, can you please
> elaborate the block plug part?

Basically I do:

blk_start_plug()
handle_stripe() //may dispatch request
blk_end_plug()

If only one stripe is handled per plug window, the plug is useless, so
I need to handle several stripes.
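
The batching point can be sketched as a user-space analogue (names and structure are illustrative, not the raid5 code): handling one stripe per plug window flushes per stripe, while draining a batch inside one window amortizes the flush.

```c
#include <assert.h>

static int flushes;

static void start_plug(void) { }               /* stands in for blk_start_plug */
static void end_plug(void)   { flushes++; }    /* unplug flushes queued requests */

static void handle_stripe(int s) { (void)s; }  /* would queue block requests */

static void handle_batch(const int *stripes, int n)
{
    start_plug();
    for (int i = 0; i < n; i++)
        handle_stripe(stripes[i]);
    end_plug();                                /* one flush for n stripes */
}
```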

Thanks,
Shaohua


Re: [PATCH 7/9] ARM: edma: Don't clear EMR of channel in edma_stop

2013-07-31 Thread Joel Fernandes
On 07/31/2013 04:35 AM, Sekhar Nori wrote:
> On Wednesday 31 July 2013 10:35 AM, Joel Fernandes wrote:
>> On 07/30/2013 03:29 AM, Sekhar Nori wrote:
>>> On Monday 29 July 2013 06:59 PM, Joel Fernandes wrote:
 We certainly don't want error conditions to be cleared anywhere
>>>
>>> 'anywhere' is a really loaded term.
>>>
 as this will make us 'forget' about missed events. We depend on
 knowing which events were missed in order to be able to reissue them.
>>>
 This fixes a race condition where the EMR was being cleared
 by the transfer completion interrupt handler.

 Basically, what was happening was:

 Missed event
  |
  |
  V
 SG1-SG2-SG3-Null
  \
   \__TC Interrupt (Almost same time as ARM is executing
 TC interrupt handler, an event got missed and also forgotten
 by clearing the EMR).
>>>
>>> Sorry, but I dont see how edma_stop() is coming into picture in the race
>>> you describe?
>>
>> In edma_callback function, for the case of DMA_COMPLETE (Transfer
>> completion interrupt), edma_stop() is called when all sets have been
>> processed. This had the effect of clearing the EMR.
> 
> Ah, thanks. I was missing the fact that the race comes into picture only
> when using the DMA engine driver. I guess that should be mentioned
> somewhere since it is not immediately obvious.
> 
> The patch looks good to me. So if you respin just this one with some
> updated explanation based on what you wrote below, I will take it.

Sure I'll do that. Also the trigger_channel patch, will you be taking
that one too? I can send these 2 in a series as they touch
arch/arm/common/edma.c

Thanks,

-Joel



> 
> Thanks,
> Sekhar
> 



[PATCH] mfd: lpc_ich: Staticize struct 'lpc_chipset_info'

2013-07-31 Thread Jingoo Han
'lpc_chipset_info' is used only in this file.

Fix the following sparse warning:

drivers/mfd/lpc_ich.c:216:21: warning: symbol 'lpc_chipset_info' was not 
declared. Should it be static?

Signed-off-by: Jingoo Han 
---
 drivers/mfd/lpc_ich.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/mfd/lpc_ich.c b/drivers/mfd/lpc_ich.c
index 2403332..9483bc8 100644
--- a/drivers/mfd/lpc_ich.c
+++ b/drivers/mfd/lpc_ich.c
@@ -213,7 +213,7 @@ enum lpc_chipsets {
LPC_COLETO, /* Coleto Creek */
 };
 
-struct lpc_ich_info lpc_chipset_info[] = {
+static struct lpc_ich_info lpc_chipset_info[] = {
[LPC_ICH] = {
.name = "ICH",
.iTCO_version = 1,
-- 
1.7.10.4




Re: [PATCH] sched,x86: optimize switch_mm for multi-threaded workloads

2013-07-31 Thread Rik van Riel

On 07/31/2013 08:41 PM, Linus Torvalds wrote:

On Wed, Jul 31, 2013 at 4:14 PM, Paul Turner  wrote:

We attached the following explanatory comment to our version of the patch:

/*
* In the common case (two user threads sharing mm
* switching) the bit will be set; avoid doing a write
* (via atomic test & set) unless we have to.  This is
* safe, because no other CPU ever writes to our bit
* in the mask, and interrupts are off (so we can't
* take a TLB IPI here.)  If we don't do this, then
* switching threads will pingpong the cpumask
* cacheline.
*/


So as mentioned, the "interrupts will be off" is actually dubious.
It's true for the context switch case, but not for the activate_mm().

However, as Rik points out, activate_mm() is different in that we
shouldn't have any preexisting MMU state anyway. And besides, that
should never trigger the "prev == next" case.

But it does look a bit messy, and even your comment is a bit
misleading (it might make somebody think that all of switch_mm() is
protected from interrupts)
.
Anyway, I'm perfectly ok with the patch itself, but I just wanted to
make sure people had thought about these things.


Would you like me to document the things we found in the comment,
and resend a patch, or is the patch good as-is?

--
All rights reversed


Re: [patch]workqueue: copy attr with all fields.

2013-07-31 Thread Shaohua Li
On Wed, Jul 31, 2013 at 06:27:52AM -0400, Tejun Heo wrote:
> Hello,
> 
> On Wed, Jul 31, 2013 at 09:39:29AM +0800, Shaohua Li wrote:
> > Hmm, I didn't agree it's more confusing to change copy_workqueue_attrs(), 
> > the
> 
> We're talking past each other.  I'm not saying copy_workqueue_attrs()
> shouldn't copy no_numa.  I'm saying get_unbound_pool() should clear
> no_numa so that pools don't have random no_numa settings.
> 
> > name of the function suggests it is a 'copy'. And clearing no_numa in
> > apply_workqueue_attrs() after copy_workqueue_attrs() looks like a hack to 
> > me.
> 
> Why would apply_workqueue_attrs() modify no_numa when it *is* dealing
> with an actual workqueue attr.  I've been talking about
> get_unbound_pool() not apply_workqueue_attrs() the whole time.
> 
> > But it depends on you, feel free to fix it by yourself.
> 
> Please update the patch to add no_numa clearing to get_unbound_pool()
> and explain what's going on.

Something like this?


Subject: workqueue: copy workqueue_attrs with all fields

$echo '0' > /sys/bus/workqueue/devices/xxx/numa
$cat /sys/bus/workqueue/devices/xxx/numa
I got 1. It should be 0. The reason is that copy_workqueue_attrs(),
called from apply_workqueue_attrs(), doesn't copy the no_numa field.

Signed-off-by: Shaohua Li 

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 0b72e81..daf3f03 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3411,6 +3411,7 @@ static void copy_workqueue_attrs(struct workqueue_attrs 
*to,
 {
to->nice = from->nice;
cpumask_copy(to->cpumask, from->cpumask);
+   to->no_numa = from->no_numa;
 }
 
 /* hash value of the content of @attr */
@@ -3578,6 +3579,12 @@ static struct worker_pool *get_unbound_pool(const struct 
workqueue_attrs *attrs)
lockdep_set_subclass(&pool->lock, 1);   /* see put_pwq() */
copy_workqueue_attrs(pool->attrs, attrs);
 
+   /*
+* no_numa isn't a worker_pool attribute, always clear it. See 'struct
+* workqueue_attrs' comments for detail.
+*/
+   pool->attrs->no_numa = 0;
+
/* if cpumask is contained inside a NUMA node, we belong to that node */
if (wq_numa_enabled) {
for_each_node(node) {


[PATCH] mfd: twl4030-power: Staticize local functions

2013-07-31 Thread Jingoo Han
twl4030_power_configure_scripts(), twl4030_power_configure_resources(),
twl4030_power_probe() are used only in this file.

Fix the following sparse warnings:

drivers/mfd/twl4030-power.c:496:5: warning: symbol 
'twl4030_power_configure_scripts' was not declared. Should it be static?
drivers/mfd/twl4030-power.c:512:5: warning: symbol 
'twl4030_power_configure_resources' was not declared. Should it be static?
drivers/mfd/twl4030-power.c:556:5: warning: symbol 'twl4030_power_probe' was 
not declared. Should it be static?

Signed-off-by: Jingoo Han 
---
 drivers/mfd/twl4030-power.c |6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/mfd/twl4030-power.c b/drivers/mfd/twl4030-power.c
index a5fd3c7..adce79d 100644
--- a/drivers/mfd/twl4030-power.c
+++ b/drivers/mfd/twl4030-power.c
@@ -493,7 +493,7 @@ int twl4030_remove_script(u8 flags)
return err;
 }
 
-int twl4030_power_configure_scripts(struct twl4030_power_data *pdata)
+static int twl4030_power_configure_scripts(struct twl4030_power_data *pdata)
 {
int err;
int i;
@@ -509,7 +509,7 @@ int twl4030_power_configure_scripts(struct twl4030_power_data *pdata)
return 0;
 }
 
-int twl4030_power_configure_resources(struct twl4030_power_data *pdata)
+static int twl4030_power_configure_resources(struct twl4030_power_data *pdata)
 {
struct twl4030_resconfig *resconfig = pdata->resource_config;
int err;
@@ -553,7 +553,7 @@ static bool twl4030_power_use_poweroff(struct twl4030_power_data *pdata,
return false;
 }
 
-int twl4030_power_probe(struct platform_device *pdev)
+static int twl4030_power_probe(struct platform_device *pdev)
 {
struct twl4030_power_data *pdata = pdev->dev.platform_data;
struct device_node *node = pdev->dev.of_node;
-- 
1.7.10.4




[PATCH v2 3/7] cgroup: restructure the failure path in cgroup_write_event_control()

2013-07-31 Thread Li Zefan
It uses a single label and checks the validity of each pointer. This
is error-prone, and we actually had a bug because one of the checks was
insufficient.

Use multiple labels as we do in other places.

v2:
- drop initializations of local variables.

Signed-off-by: Li Zefan 
---
 kernel/cgroup.c | 39 ++-
 1 file changed, 18 insertions(+), 21 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index fb11569..ae905a5 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -3934,11 +3934,11 @@ static void cgroup_event_ptable_queue_proc(struct file *file,
 static int cgroup_write_event_control(struct cgroup *cgrp, struct cftype *cft,
  const char *buffer)
 {
-   struct cgroup_event *event = NULL;
+   struct cgroup_event *event;
struct cgroup *cgrp_cfile;
unsigned int efd, cfd;
-   struct file *efile = NULL;
-   struct file *cfile = NULL;
+   struct file *efile;
+   struct file *cfile;
char *endp;
int ret;
 
@@ -3964,31 +3964,31 @@ static int cgroup_write_event_control(struct cgroup *cgrp, struct cftype *cft,
efile = eventfd_fget(efd);
if (IS_ERR(efile)) {
ret = PTR_ERR(efile);
-   goto fail;
+   goto out_kfree;
}
 
event->eventfd = eventfd_ctx_fileget(efile);
if (IS_ERR(event->eventfd)) {
ret = PTR_ERR(event->eventfd);
-   goto fail;
+   goto out_put_efile;
}
 
cfile = fget(cfd);
if (!cfile) {
ret = -EBADF;
-   goto fail;
+   goto out_put_eventfd;
}
 
/* the process need read permission on control file */
/* AV: shouldn't we check that it's been opened for read instead? */
ret = inode_permission(file_inode(cfile), MAY_READ);
if (ret < 0)
-   goto fail;
+   goto out_put_cfile;
 
event->cft = __file_cft(cfile);
if (IS_ERR(event->cft)) {
ret = PTR_ERR(event->cft);
-   goto fail;
+   goto out_put_cfile;
}
 
/*
@@ -3998,18 +3998,18 @@ static int cgroup_write_event_control(struct cgroup *cgrp, struct cftype *cft,
cgrp_cfile = __d_cgrp(cfile->f_dentry->d_parent);
if (cgrp_cfile != cgrp) {
ret = -EINVAL;
-   goto fail;
+   goto out_put_cfile;
}
 
if (!event->cft->register_event || !event->cft->unregister_event) {
ret = -EINVAL;
-   goto fail;
+   goto out_put_cfile;
}
 
ret = event->cft->register_event(cgrp, event->cft,
event->eventfd, buffer);
if (ret)
-   goto fail;
+   goto out_put_cfile;
 
efile->f_op->poll(efile, &event->pt);
 
@@ -4029,16 +4029,13 @@ static int cgroup_write_event_control(struct cgroup *cgrp, struct cftype *cft,
 
return 0;
 
-fail:
-   if (cfile)
-   fput(cfile);
-
-   if (event && event->eventfd && !IS_ERR(event->eventfd))
-   eventfd_ctx_put(event->eventfd);
-
-   if (!IS_ERR_OR_NULL(efile))
-   fput(efile);
-
+out_put_cfile:
+   fput(cfile);
+out_put_eventfd:
+   eventfd_ctx_put(event->eventfd);
+out_put_efile:
+   fput(efile);
+out_kfree:
kfree(event);
 
return ret;
-- 
1.8.0.2
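The multi-label unwind the patch switches to can be shown in a self-contained userspace sketch (acquire/release are toy stand-ins for eventfd_fget()/fput() and friends):

```c
#include <assert.h>

/* Toy acquire/release; 'fail' forces that acquisition to fail. */
static int nacquired;
static int acquire(int fail)
{
	if (fail)
		return -1;
	nacquired++;
	return 0;
}
static void release(void)
{
	nacquired--;
}

/*
 * Each failure jumps to the label that undoes exactly what has been
 * acquired so far, in reverse order, so no "is this pointer valid?"
 * checks are needed on the way out.
 */
static int setup(int fail_a, int fail_b, int fail_c)
{
	int ret;

	ret = acquire(fail_a);
	if (ret)
		goto out;
	ret = acquire(fail_b);
	if (ret)
		goto out_release_a;
	ret = acquire(fail_c);
	if (ret)
		goto out_release_b;
	return 0;

out_release_b:
	release();
out_release_a:
	release();
out:
	return ret;
}
```

Compared with the single-label version, the error path never has to guess which resources are live, which is precisely the class of bug the commit message mentions.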


[PATCH v2 4/7] cgroup: rename cgroup_pidlist->mutex

2013-07-31 Thread Li Zefan
It's an rw_semaphore, not a mutex.

Signed-off-by: Li Zefan 
---
 kernel/cgroup.c | 22 +++---
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index ae905a5..6662811 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -3436,7 +3436,7 @@ struct cgroup_pidlist {
/* pointer to the cgroup we belong to, for list removal purposes */
struct cgroup *owner;
/* protects the other fields */
-   struct rw_semaphore mutex;
+   struct rw_semaphore rwsem;
 };
 
 /*
@@ -3509,7 +3509,7 @@ static struct cgroup_pidlist *cgroup_pidlist_find(struct cgroup *cgrp,
struct pid_namespace *ns = task_active_pid_ns(current);
 
/*
-* We can't drop the pidlist_mutex before taking the l->mutex in case
+* We can't drop the pidlist_mutex before taking the l->rwsem in case
 * the last ref-holder is trying to remove l from the list at the same
 * time. Holding the pidlist_mutex precludes somebody taking whichever
 * list we find out from under us - compare release_pid_array().
@@ -3518,7 +3518,7 @@ static struct cgroup_pidlist *cgroup_pidlist_find(struct cgroup *cgrp,
list_for_each_entry(l, &cgrp->pidlists, links) {
if (l->key.type == type && l->key.ns == ns) {
/* make sure l doesn't vanish out from under us */
-   down_write(&l->mutex);
+   down_write(&l->rwsem);
mutex_unlock(&cgrp->pidlist_mutex);
return l;
}
@@ -3529,8 +3529,8 @@ static struct cgroup_pidlist *cgroup_pidlist_find(struct cgroup *cgrp,
mutex_unlock(&cgrp->pidlist_mutex);
return l;
}
-   init_rwsem(&l->mutex);
-   down_write(&l->mutex);
+   init_rwsem(&l->rwsem);
+   down_write(&l->rwsem);
l->key.type = type;
l->key.ns = get_pid_ns(ns);
l->owner = cgrp;
@@ -3591,7 +3591,7 @@ static int pidlist_array_load(struct cgroup *cgrp, enum cgroup_filetype type,
l->list = array;
l->length = length;
l->use_count++;
-   up_write(&l->mutex);
+   up_write(&l->rwsem);
*lp = l;
return 0;
 }
@@ -3669,7 +3669,7 @@ static void *cgroup_pidlist_start(struct seq_file *s, loff_t *pos)
int index = 0, pid = *pos;
int *iter;
 
-   down_read(&l->mutex);
+   down_read(&l->rwsem);
if (pid) {
int end = l->length;
 
@@ -3696,7 +3696,7 @@ static void *cgroup_pidlist_start(struct seq_file *s, loff_t *pos)
 static void cgroup_pidlist_stop(struct seq_file *s, void *v)
 {
struct cgroup_pidlist *l = s->private;
-   up_read(&l->mutex);
+   up_read(&l->rwsem);
 }
 
 static void *cgroup_pidlist_next(struct seq_file *s, void *v, loff_t *pos)
@@ -3742,7 +3742,7 @@ static void cgroup_release_pid_array(struct cgroup_pidlist *l)
 * pidlist_mutex, we have to take pidlist_mutex first.
 */
mutex_lock(&l->owner->pidlist_mutex);
-   down_write(&l->mutex);
+   down_write(&l->rwsem);
BUG_ON(!l->use_count);
if (!--l->use_count) {
/* we're the last user if refcount is 0; remove and free */
@@ -3750,12 +3750,12 @@ static void cgroup_release_pid_array(struct cgroup_pidlist *l)
mutex_unlock(&l->owner->pidlist_mutex);
pidlist_free(l->list);
put_pid_ns(l->key.ns);
-   up_write(&l->mutex);
+   up_write(&l->rwsem);
kfree(l);
return;
}
mutex_unlock(&l->owner->pidlist_mutex);
-   up_write(&l->mutex);
+   up_write(&l->rwsem);
 }
 
 static int cgroup_pidlist_release(struct inode *inode, struct file *file)
-- 
1.8.0.2
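The distinction behind the rename — an rw_semaphore admits many concurrent readers but only one writer, unlike a mutex — can be modeled in a toy, single-threaded sketch. The names and the 0/-1 return convention here are hypothetical, deliberately unlike the real kernel API:

```c
#include <assert.h>

/* Toy rw-semaphore state:
 *   > 0  -> that many readers hold it
 *    -1  -> one writer holds it
 *     0  -> free
 * Try-lock variants return 0 on success, -1 if the lock would block. */
static int state;

static int toy_down_read_try(void)
{
	if (state < 0)
		return -1;	/* a writer is active: readers must wait */
	state++;
	return 0;
}
static void toy_up_read(void) { state--; }

static int toy_down_write_try(void)
{
	if (state != 0)
		return -1;	/* any reader or writer blocks a writer */
	state = -1;
	return 0;
}
static void toy_up_write(void) { state = 0; }
```

Calling such an object a "mutex" hides the reader-sharing behavior, which is why the rename improves readability even though no behavior changes.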



Re: [PATCH] crypto: caam - Remove unused functions from Job Ring

2013-07-31 Thread Herbert Xu
On Wed, Jul 31, 2013 at 03:48:56PM +0530, Ruchika Gupta wrote:
> Signed-off-by: Ruchika Gupta 

Patch applied.
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH] hw_random: mxc-rnga: Check the return value from clk_prepare_enable()

2013-07-31 Thread Herbert Xu
On Sun, Jul 21, 2013 at 02:41:38PM -0300, Fabio Estevam wrote:
> From: Fabio Estevam 
> 
> clk_prepare_enable() may fail, so let's check its return value and propagate 
> it
> in the case of error.
> 
> Signed-off-by: Fabio Estevam 

Patch applied.  Thanks!
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH] crypto: caam - RNG instantiation by directly programming DECO

2013-07-31 Thread Herbert Xu
On Thu, Jul 04, 2013 at 11:26:03AM +0530, Ruchika Gupta wrote:
> Remove the dependency of RNG instantiation on Job Ring. Now
> RNG instantiation for devices with RNG version > 4 is done
> by directly programming DECO 0.
> 
> Signed-off-by: Ruchika Gupta 

Patch applied.  Thanks a lot!
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: crypto: ux500: Fix logging, make arrays const, neatening

2013-07-31 Thread Herbert Xu
On Mon, Jul 22, 2013 at 04:19:35PM -0700, Joe Perches wrote:
> On Tue, 2013-07-23 at 00:35 +0200, Linus Walleij wrote:
> > Have you tested this on hardware or is it compile-tested only?
> 
> Hi Linus.
> 
> Compile tested only.  Not tested on real devices.  -ENOHARDWARE.

Patch applied.  Thanks!
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


RE: [PATCH] block: support embedded device command line partition

2013-07-31 Thread Caizhiyong
> From: Karel Zak [mailto:k...@redhat.com]
> Sent: Wednesday, July 31, 2013 9:25 PM
> To: Caizhiyong
> Cc: Andrew Morton; linux-kernel@vger.kernel.org; Wanglin (Albert); Quyaxin
> Subject: Re: [PATCH] block: support embedded device command line partition
> 
> On Sat, Jul 27, 2013 at 01:56:24PM +, Caizhiyong wrote:
> > +static int parse_partitions(struct parsed_partitions *state,
> > +   struct cmdline_parts *parts)
> > +{
> > +   int slot;
> > +   uint64_t from = 0;
> > +   uint64_t disk_size;
> > +   char buf[BDEVNAME_SIZE];
> > +   struct cmdline_subpart *subpart;
> > +
> > +   bdevname(state->bdev, buf);
> > +
> > +   while (parts && strncmp(buf, parts->name, BDEVNAME_SIZE))
> > +   parts = parts->next_parts;
> > +
> > +   if (!parts)
> > +   return 0;
> > +
> > +   disk_size = (uint64_t) get_capacity(state->bdev->bd_disk) << 9;
> > +
> > +   for (slot = 1, subpart = parts->subpart;
> > +subpart && slot < state->limit;
> > +subpart = subpart->next_subpart, slot++) {
> > +
> > +   unsigned label_max;
> > +   struct partition_meta_info *info;
> > +
> > +   if (subpart->from == OFFSET_CONTINUOUS)
> > +   subpart->from = from;
> > +   else
> > +   from = subpart->from;
> > +
> > +   if (from >= disk_size)
> > +   break;
> > +
> > +   if (subpart->size > (disk_size - from))
> > +   subpart->size = disk_size - from;
> > +
> > +   from += subpart->size;
> > +
> > +   put_partition(state, slot, le64_to_cpu(subpart->from >> 9),
> > + le64_to_cpu(subpart->size >> 9));
> 
>  Why le64_to_cpu()?
> 
>  I guess that memparse() does not return explicitly LE and it also
>  seems that your code  on another places uses ->size and ->from
>  without any extra care.

Right. I will remove.
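Karel's point: memparse() returns a value already in CPU byte order, so wrapping it in le64_to_cpu() is wrong — on a big-endian machine that helper byte-swaps and corrupts the offset, while on little-endian it is a no-op and the bug hides. A userspace sketch of what the swap does (a hypothetical helper, not the kernel's implementation):

```c
#include <stdint.h>

/* What le64_to_cpu() amounts to on a big-endian CPU: a full byte swap.
 * On little-endian CPUs it is the identity, which is why misusing it on
 * a value that is already in CPU order goes unnoticed there. */
static uint64_t bswap64(uint64_t v)
{
	v = ((v & 0x00ff00ff00ff00ffULL) << 8)  | ((v >> 8)  & 0x00ff00ff00ff00ffULL);
	v = ((v & 0x0000ffff0000ffffULL) << 16) | ((v >> 16) & 0x0000ffff0000ffffULL);
	return (v << 32) | (v >> 32);
}
```

The rule of thumb: le64_to_cpu()/cpu_to_le64() belong only at the boundary where fixed-endian on-disk or on-wire data enters or leaves; values that never left the CPU must not pass through them.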

> 
> Karel
> 
> --
>  Karel Zak  
>  http://karelzak.blogspot.com


Re: [PATCH 2/2] Convert PowerPC macro spin_event_timeout() to architecture independent macro

2013-07-31 Thread Stephen Boyd
On 07/31/13 17:20, Timur Tabi wrote:
> On 07/31/2013 07:16 PM, Stephen Boyd wrote:
>> cpu_relax() is usually just a compiler barrier or an instruction hint to
>> the cpu that it should cool down because we're spinning in a tight loop.
>> It certainly shouldn't be calling into the  scheduler.
> Ah yes, I remember now.  So it does seem that if we can fix the problem
> of non-incrementing 'jiffies', then this macro can be used in interrupts.

That's encouraging. It looks like you introduced it for use in interrupt
context, but then it got shot down [1]? I lost track in all the versions.

>
> Of course, that assumes that spinning in interrupt context is a good
> idea to begin with.  Maybe we shouldn't be encouraging it?

I read through the v5 discussion and it seems I'm about to walk through
some tall grass on the way to Cerulean City.

Andrew Morton, I choose you! Use your mind-power move to convince
everyone that having a macro for spinning on a register in interrupt
context is a good thing. At least it will be more obvious.

>
 FYI, you might want to look at the code reviews for spin_event_timeout()
 on the linuxppc-dev mailing list, back in March 2009.

>> Sure. Any pointers? Otherwise I'll go digging around the archives.
> https://lists.ozlabs.org/pipermail/linuxppc-dev/2009-March/thread.html
>

[1] https://lists.ozlabs.org/pipermail/linuxppc-dev/2009-May/072521.html
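The pattern under discussion — spin on a register/condition until it changes or a deadline passes — can be sketched in userspace. In the kernel the budget would be a jiffies deadline and the barrier would be cpu_relax(); here an iteration count and a plain compiler barrier keep the sketch deterministic (all names hypothetical):

```c
/* Poll cond() until it returns true or 'budget' iterations elapse.
 * Returns 0 on success, -1 on timeout. */
static int spin_until(int (*cond)(void), unsigned long budget)
{
	while (budget--) {
		if (cond())
			return 0;	/* condition met */
		__asm__ __volatile__("" ::: "memory");	/* compiler barrier */
	}
	return -1;	/* timed out */
}

/* Test condition: becomes true on the fourth poll. */
static int polls;
static int ready_after_three(void)
{
	return ++polls > 3;
}
```

The barrier matters because without it the compiler may hoist the condition load out of the loop; cpu_relax() provides at least that guarantee, plus an architecture hint, without ever entering the scheduler — which is what makes the pattern usable (if not always advisable) in interrupt context.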

-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by The Linux Foundation



Re: [PATCH] mm/hotplug: remove unnecessary BUG_ON in __offline_pages()

2013-07-31 Thread Xishi Qiu
On 2013/8/1 0:55, Dave Hansen wrote:

> On 07/29/2013 11:49 PM, Xishi Qiu wrote:
>> I think we can remove "BUG_ON(start_pfn >= end_pfn)" in __offline_pages(),
>> because in memory_block_action() "nr_pages = PAGES_PER_SECTION * 
>> sections_per_block" 
>> is always greater than 0.
> ...
>> --- a/mm/memory_hotplug.c
>> +++ b/mm/memory_hotplug.c
>> @@ -1472,7 +1472,6 @@ static int __ref __offline_pages(unsigned long 
>> start_pfn,
>>  struct zone *zone;
>>  struct memory_notify arg;
>>  
>> -BUG_ON(start_pfn >= end_pfn);
>>  /* at least, alignment against pageblock is necessary */
>>  if (!IS_ALIGNED(start_pfn, pageblock_nr_pages))
>>  return -EINVAL;
> 
> I think you're saying that you don't see a way to hit this BUG_ON() in
> practice.  That does appear to be true, unless sections_per_block ended
> up 0 or negative.  The odds of getting in to this code if
> 'sections_per_block' was bogus are pretty small.
> 

Yes, I find there is no way to hit this BUG_ON() in v3.11, and
"sections_per_block" seems to be always greater than 0.

> Or, is this a theoretical thing that folks might run in to when adding
> new features or developing?  It's in a cold path and the cost of the
> check is miniscule.  The original author (cc'd) also saw a need to put
> this in probably because he actually ran in to this.
> 

In v2.6.32, if info->length == 0, this path may hit this BUG_ON():
acpi_memory_disable_device()
remove_memory(info->start_addr, info->length)
offline_pages()

Later, Fujitsu's patch renamed this function, and the BUG_ON() became unnecessary.

Thanks,
Xishi Qiu

> In any case, it looks fairly safe to me:
> 
> Reviewed-by: Dave Hansen 
> .
> 




Re: [PATCH 4/9] syslog_ns: make syslog handling per namespace

2013-07-31 Thread Gao feng
On 07/29/2013 10:31 AM, Rui Xiang wrote:
> This patch makes syslog buf and other fields per
> namespace.
> 
> Here use ns->log_buf(log_buf_len, logbuf_lock,
> log_first_seq, logbuf_lock, and so on) fields
> instead of global ones to handle syslog.
> 
> Syslog interfaces such as /dev/kmsg, /proc/kmsg,
> and syslog syscall are all containerized for
> container users.
> 

/dev/kmsg is used by the syslog APIs closelog, openlog, syslog, and
vsyslog. This should be per user namespace, but it seems that with your
patch the syslog messages generated through these APIs on the host can
be exported to the /dev/kmsg of a container. Is this what we want?

Thanks


Re: [PATCH 2/2] chipidea: Use devm_request_irq()

2013-07-31 Thread Peter Chen
On Wed, Jul 31, 2013 at 10:15:12AM -0400, Tejun Heo wrote:
> hello,
> 
> On Wed, Jul 31, 2013 at 09:55:26PM +0800, Peter Chen wrote:
> > I think the main point is we should allocate managed resource which is used
> > at interrupt handler before devm_request_irq, and all resources used
> > at interrupt
> > handler should be managed.
> > 
> > If we use non-managed resource at interrupt handler, but using managed 
> > interrupt
> > handler, things still will go wrong if there is an odd (unexpected)
> > interrupt after
> > we finish deactivation at removal.
> 
> In general, applying devm partially isn't a good idea.  It's very easy
> to get into trouble thanks to release order dependency which isn't
> immediately noticeable and there have been actual bugs caused by that.
> The strategies which seem to work are either
> 
> * Convert everything over to devm by wrapping deactivation in a devres
>   callback too.  As long as your init sequence is sane (ie. irq
>   shouldn't be request before things used by irq are ready).
> 
> * Allocate all resources using devres but shut down the execution
>   engine in the remove_one().  Again, as all releases are controlled
>   by devres, you won't have to worry about messing up the release
>   sequencing.
> 

Thanks, Tejun. So, Alex and Fabio, this patch may not be suitable currently,
since many resources on both the EHCI and device side are non-managed.
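Tejun's ordering argument can be illustrated with a toy devres list: managed resources are recorded at allocation and released in reverse (LIFO) order on removal, so a managed IRQ requested last is released first, before the managed resources its handler uses. An unmanaged resource freed by hand in remove(), by contrast, can disappear while a managed IRQ is still live (all names here are hypothetical stand-ins, not the real devres API):

```c
/* Toy devres list: managed resources are pushed at allocation time and
 * released LIFO when the device goes away. */
#define MAX_RES 8
static int res_stack[MAX_RES];
static int res_top;
static int released[MAX_RES];
static int nreleased;

static void devm_track(int id)
{
	res_stack[res_top++] = id;
}

/* Release everything in reverse order of allocation. */
static void devres_release_all(void)
{
	while (res_top > 0)
		released[nreleased++] = res_stack[--res_top];
}
```

With a sane init sequence (resources the IRQ handler needs allocated before devm_request_irq()), LIFO release tears the IRQ down first automatically — which is why Tejun recommends converting everything to devm rather than mixing managed and unmanaged lifetimes.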
-- 

Best Regards,
Peter Chen



Re: [PATCH] PCI: update device mps when doing pci hotplug

2013-07-31 Thread Yijing Wang
>> Yes, I think that would be safe.  If the switch is set to a larger MPS
>> than the hot-added device supports, I don't think we can safely use
>> the device.
> 
> I opened a bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=60671
> for this problem.  Please correct any mistakes in my summary and
> reference it in your changelog.

OK, thanks!

> 
> Bjorn
> 
> 


-- 
Thanks!
Yijing



Re: [PATCH] PCI: update device mps when doing pci hotplug

2013-07-31 Thread Yijing Wang
> Yes, I think that would be safe.  If the switch is set to a larger MPS
> than the hot-added device supports, I don't think we can safely use
> the device.

OK, I will refresh my patch, after test in my machine, I will send it out.

Thanks!
Yijing.

> 
> Bjorn
> 
> 


-- 
Thanks!
Yijing



Re: [PATCH 2/3] xfs: Implement FALLOC_FL_COLLAPSE_RANGE

2013-07-31 Thread Dave Chinner
On Wed, Jul 31, 2013 at 11:42:14PM +0900, Namjae Jeon wrote:
> From: Namjae Jeon 
> 
> New fallocate flag FALLOC_FL_COLLAPSE_RANGE implementation for XFS.

A good start, but there's lots of work needed to make this safe for
general use.

> Signed-off-by: Namjae Jeon 
> Signed-off-by: Ashish Sangwan 
> ---
>  fs/xfs/xfs_bmap.c |   92 
> +
>  fs/xfs/xfs_bmap.h |3 ++
>  fs/xfs/xfs_file.c |   26 --
>  fs/xfs/xfs_iops.c |   35 +++
>  fs/xfs/xfs_vnodeops.c |   62 +
>  fs/xfs/xfs_vnodeops.h |2 ++
>  6 files changed, 217 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/xfs/xfs_bmap.c b/fs/xfs/xfs_bmap.c
> index 05c698c..ae677b1 100644
> --- a/fs/xfs/xfs_bmap.c
> +++ b/fs/xfs/xfs_bmap.c
> @@ -6145,3 +6145,95 @@ next_block:
>  
>   return error;
>  }
> +
> +/*
> + * xfs_update_logical()
> + *   Updates starting logical block of extents by subtracting
> + *   shift from them. At max XFS_LINEAR_EXTS number extents
> + *   will be updated and *current_ext is the extent number which
> + *   is currently being updated.

Single space after the "*" in the comments. Also, format them out to
80 columns.

> + */
> +int
> +xfs_update_logical(
> + xfs_trans_t *tp,
> + struct xfs_inode*ip,
> + int *done,
> + xfs_fileoff_t   start_fsb,
> + xfs_fileoff_t   shift,
> + xfs_extnum_t*current_ext)

This belongs in xfs_bmap_btree.c, named something like
xfs_bmbt_shift_rec_offsets().

Also, not typedefs for structures, please. (i.e. struct xfs_trans).


> +{
> + xfs_btree_cur_t *cur;

struct xfs_btree_cur*cur;

And for all the other structures, too.

> + xfs_bmbt_rec_host_t *gotp;
> + xfs_mount_t *mp;
> + xfs_ifork_t *ifp;
> + xfs_extnum_tnexts = 0;
> + xfs_fileoff_t   startoff;
> + int error = 0;
> + int i;
> +
> + ifp = XFS_IFORK_PTR(ip, XFS_DATA_FORK);
> + mp = ip->i_mount;

Do the assignment in the declaration on mp.

struct xfs_mount*mp = ip->i_mount;

> +
> + if (!(ifp->if_flags & XFS_IFEXTENTS) &&
> + (error = xfs_iread_extents(tp, ip, XFS_DATA_FORK)))
> + return error;

Hmmm - that rings alarm bells.

> +
> + if (!*current_ext) {
> + gotp = xfs_iext_bno_to_ext(ifp, start_fsb, current_ext);
> + /*
> +  * gotp can be null in 2 cases: 1) if there are no extents
> +  * or 2) start_fsb lies in a hole beyond which there are
> +  * no extents. Either way, we are done.
> +  */
> + if (!gotp) {
> + *done = 1;
> + return 0;
> + }
> + }

So, you do a lookup on an extent index, which returns an incore
extent record.

> +
> + if (ifp->if_flags & XFS_IFBROOT)
> + cur = xfs_bmbt_init_cursor(mp, tp, ip, XFS_DATA_FORK);
> + else
> + cur = NULL;
> +
> + while (nexts++ < XFS_LINEAR_EXTS &&

What has XFS_LINEAR_EXTS got to do with how many extents can be moved
in a single transaction at a time?

> +*current_ext <  XFS_IFORK_NEXTENTS(ip, XFS_DATA_FORK)) {
> + gotp = xfs_iext_get_ext(ifp, (*current_ext)++);
> + startoff = xfs_bmbt_get_startoff(gotp);
> + if (cur) {
> + if ((error = xfs_bmbt_lookup_eq(cur,
> + startoff,
> + xfs_bmbt_get_startblock(gotp),
> + xfs_bmbt_get_blockcount(gotp),
> + )))

Please don't put assignments inside if() statements where possible.
i.e.
error = xfs_bmbt_lookup_eq(cur, 
if (error)

Is the normal way of doing this.

> + goto del_cursor;
> + XFS_WANT_CORRUPTED_GOTO(i == 1, del_cursor);
> + }
> + startoff -= shift;
> + xfs_bmbt_set_startoff(gotp, startoff);

So, we've looked up and extent, and reduced it's offset by shift.

> + if (cur) {
> + error = xfs_bmbt_update(cur, startoff,
> + xfs_bmbt_get_startblock(gotp),
> + xfs_bmbt_get_blockcount(gotp),
> + xfs_bmbt_get_state(gotp));
> + if (error)
> + goto del_cursor;

And then we update the btree record in place.

Ok, major alarm bells are going off at this point. Is the record
still in the right place? Where did you validate that the shift
downwards didn't overlap the previous extent in the tree?

What happens if the start or end of 

Re: i915 backlight

2013-07-31 Thread Aaron Lu
On 08/01/2013 12:36 AM, Borislav Petkov wrote:
> On Wed, Jul 31, 2013 at 06:22:52PM +0200, Borislav Petkov wrote:
>> Dudes,
>>
>> has anyone already reported this (happens on Linus of today +
>> tip/master):
> 
> Oh, one more thing: I can't control the backlight anymore on this x230
> with the Fn-Fx keys and this is most probably related to that recent
> backlight revert story. If there is a new approach I can test, please
> let me know.

Can you please run acpi_listen and then press the Fn-Fx key, see if the
events are correctly sent out?

> 
> Btw, it worked before on this machine with "acpi_backlight=vendor" on
> the cmdline.

From the bug page:
https://bugzilla.kernel.org/show_bug.cgi?id=51231#c80
I got the impression that both the acpi_video interface and the vendor
interface thinkpad_screen are broken. So adding this cmdline here works
suggests that either thinkpad_screen works or thinkpad vendor driver
doesn't get loaded or doesn't create that interface for some reason.

Alternatively, if the intel_backlight interface works (highly possible),
you can use xorg.conf to specify that backlight interface for X.

Section "Device"
Option "Backlight"  "intel_backlight"
Identifier  "Card0"
Driver  "intel"
BusID   "PCI:0:2:0"
EndSection

Thanks,
Aaron

