Re: [patch 1/2] [PATCH] mm: Save soft-dirty bits on swapped pages
On Thu, Aug 01, 2013 at 09:51:32AM +0900, Minchan Kim wrote:
> > Index: linux-2.6.git/include/linux/swapops.h
> > ===
> > --- linux-2.6.git.orig/include/linux/swapops.h
> > +++ linux-2.6.git/include/linux/swapops.h
> > @@ -67,6 +67,8 @@ static inline swp_entry_t pte_to_swp_ent
> >  	swp_entry_t arch_entry;
> >
> >  	BUG_ON(pte_file(pte));
> > +	if (pte_swp_soft_dirty(pte))
> > +		pte = pte_swp_clear_soft_dirty(pte);
>
> Why do you remove the soft-dirty flag whenever pte_to_swp_entry is called?
> Isn't there any problem if we use mincore?

No, there is no problem. pte_to_swp_entry is called when we know the pte being decoded has swap format (except for the case in the swap code which figures out the number of bits allowed for the offset). Still, since this bit is set at a "higher" level than the __swp_type/__swp_offset helpers, it should be cleared before the pte value is passed down to those "one level down" helper functions.

> > +static inline int maybe_same_pte(pte_t pte, pte_t swp_pte)
>
> Nitpick.
> If maybe_same_pte is used widely, it looks good to me
> but it's used for only swapoff at the moment so I think pte_swap_same
> would be a better name.

I don't see much difference, but sure, let's rename it on top once the series is in the -mm tree, sounds good?

	Cyrill
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [26/84] ARM: OMAP: RX-51: change probe order of touchscreen and panel SPI devices
Hi,

On 31/07/13 16:23, Ben Hutchings wrote:
> 3.2.50-rc1 review patch. If anyone has any objections, please let me know.
>
> --
>
> From: Aaro Koskinen
>
> commit e65f131a14726e5f1b880a528271a52428e5b3a5 upstream.
>
> Commit 9fdca9df (spi: omap2-mcspi: convert to module_platform_driver)
> broke the SPI display/panel driver probe on RX-51/N900. The exact cause is
> not fully understood, but it seems to be related to the probe order. SPI
> communication to the panel driver (spi1.2) fails unless the touchscreen
> (spi1.0) has been probed/initialized before. When the omap2-mcspi driver
> was converted to a platform driver, it resulted in that the devices are
> probed immediately after the board registers them in the order they are
> listed in the board file.

I'm not really familiar with this issue, but: commit 9fdca9df was merged in v3.5. Has it also been backported to 3.2? If not, I don't think this patch is needed for 3.2, as the patch causing the problem is not in 3.2.

Tomi
Re: [PATCH 01/11] writeback: plug writeback at a high level
On Wed, Jul 31, 2013 at 04:40:19PM +0200, Jan Kara wrote:
> On Wed 31-07-13 14:15:40, Dave Chinner wrote:
> > From: Dave Chinner
> >
> > Doing writeback on lots of little files causes terrible IOPS storms
> > because of the per-mapping writeback plugging we do. This
> > essentially causes immediate dispatch of IO for each mapping,
> > regardless of the context in which writeback is occurring.
> >
> > IOWs, running a concurrent write-lots-of-small 4k files using fsmark
> > on XFS results in a huge number of IOPS being issued for data
> > writes. Metadata writes are sorted and plugged at a high level by
> > XFS, so aggregate nicely into large IOs. However, data writeback IOs
> > are dispatched in individual 4k IOs, even when the blocks of two
> > consecutively written files are adjacent.
> >
> > Test VM: 8p, 8GB RAM, 4xSSD in RAID0, 100TB sparse XFS filesystem,
> > metadata CRCs enabled.
> >
> > Kernel: 3.10-rc5 + xfsdev + my 3.11 xfs queue (~70 patches)
> >
> > Test:
> >
> > $ ./fs_mark -D 1 -S0 -n 1 -s 4096 -L 120 -d /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/2 -d /mnt/scratch/3 -d /mnt/scratch/4 -d /mnt/scratch/5 -d /mnt/scratch/6 -d /mnt/scratch/7
> >
> > Result:
> >
> >              wall     sys      create rate     Physical write IO
> >              time     CPU      (avg files/s)   IOPS      Bandwidth
> >              -----    -----    ------------    ------    ---------
> > unpatched    6m56s    15m47s   24,000+/-500    26,000    130MB/s
> > patched      5m06s    13m28s   32,800+/-600     1,500    180MB/s
> > improvement  -26.44%  -14.68%     +36.67%      -94.23%   +38.46%
> >
> > If I use zero length files, this workload runs at about 500 IOPS, so
> > plugging drops the data IOs from roughly 25,500/s to 1000/s.
> > 3 lines of code, 35% better throughput for 15% less CPU.
> >
> > The benefits of plugging at this layer are likely to be higher for
> > spinning media as the IO patterns for this workload are going to make a
> > much bigger difference on high IO latency devices.
> >
> > Signed-off-by: Dave Chinner
> Just one question: Won't this cause a regression when files are say 2 MB
> large?
> Then we generate maximum sized requests for these files with
> per-inode plugging anyway and they will unnecessarily sit in the plug list
> until the plug list gets full (that is after 16 requests). Granted it
> shouldn't be too long but with fast storage it may be measurable...

Latency of IO dispatch only matters for the initial IOs being queued. This, however, is not a latency sensitive IO path - writeback is our bulk throughput IO engine, and in those cases low latency dispatch is precisely what we don't want. We want to optimise IO patterns for maximum *bandwidth*, not minimal latency.

The problem is that fast storage with immediate dispatch and deep queues can keep ahead of IO dispatch, preventing throughput optimisations like IO aggregation from being made because there is never any IO queued to aggregate. That's why I'm seeing a couple of orders of magnitude higher IOPS than I should. Sure, the hardware can do that, but it's not the *most efficient* method of dispatching background IO.

Allowing IOs a chance to aggregate in the scheduler for a short while before dispatch allows existing bulk throughput optimisations to be made to the IO stream, and as we can see, where a delayed allocation filesystem is optimised for adjacent allocation across sequentially written inodes, such opportunities for IO aggregation make a big difference to performance.

So, to test your 2MB IO case, I ran an fsmark test using 40,000 2MB files instead of 10 million 4k files.

             wall time    IOPS    BW
  mmotm      170s         1000    350MB/s
  patched    167s         1000    350MB/s

The IO profiles are near enough to be identical, and the wall time is basically the same. I just don't see any particular concern about larger IOs and initial dispatch latency here from either a theoretical or an observed POV. Indeed, I haven't seen a performance degradation as a result of this patch in any of the testing I've done since I first posted it...
> Now if we have maximum sized request in the plug list, maybe we could just
> dispatch it right away but that's another story.

That, in itself, is potentially an issue too, as it prevents seek minimisation optimisations from being made when we batch up multiple IOs on the plug list...

Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [PATCH 03/18] ARM: davinci: Switch to sched_clock_register()
On Thursday 01 August 2013 04:01 AM, Stephen Boyd wrote: > The 32 bit sched_clock interface now supports 64 bits. Upgrade to > the 64 bit function to allow us to remove the 32 bit registration > interface. > > Cc: Sekhar Nori > Cc: Kevin Hilman > Signed-off-by: Stephen Boyd > --- > arch/arm/mach-davinci/time.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/arch/arm/mach-davinci/time.c b/arch/arm/mach-davinci/time.c > index 7a55b5c..6d5440a 100644 > --- a/arch/arm/mach-davinci/time.c > +++ b/arch/arm/mach-davinci/time.c > @@ -285,7 +285,7 @@ static struct clocksource clocksource_davinci = { > /* > * Overwrite weak default sched_clock with something more precise > */ > -static u32 notrace davinci_read_sched_clock(void) > +static u64 notrace davinci_read_sched_clock(void) > { > return timer32_read([TID_CLOCKSOURCE]); > } > @@ -392,7 +392,7 @@ void __init davinci_timer_init(void) > davinci_clock_tick_rate)) > printk(err, clocksource_davinci.name); > > - setup_sched_clock(davinci_read_sched_clock, 32, > + sched_clock_register(davinci_read_sched_clock, 32, > davinci_clock_tick_rate); Please fix the line break to align with open parens. Otherwise looks good. I tested it on DA850 EVM as well. With above fixed: Acked-by: Sekhar Nori Thanks, Sekhar -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] scripts/mod/modpost.c: permit '.cranges' section for sh64 architecture.
We need to permit the '.cranges' section for the sh64 architecture, or modpost will report a warning:

  LD      init/built-in.o
  WARNING: init/built-in.o (.cranges): unexpected non-allocatable section.
  Did you forget to use "ax"/"aw" in a .S file?
  Note that for example contains section definitions for use in .S files.

Signed-off-by: Chen Gang
---
 scripts/mod/modpost.c | 1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/scripts/mod/modpost.c b/scripts/mod/modpost.c
index 6216434..8247979 100644
--- a/scripts/mod/modpost.c
+++ b/scripts/mod/modpost.c
@@ -821,6 +821,7 @@ static const char *section_white_list[] = {
 	".comment*",
 	".debug*",
+	".cranges",		/* sh64 */
 	".zdebug*",		/* Compressed debug sections. */
 	".GCC-command-line",	/* mn10300 */
 	".GCC.command.line",	/* record-gcc-switches, non mn10300 */
--
1.7.7.6
Re: [PATCH 3/5] usb: phy: Add AM335x PHY driver
On 7/31/2013 1:46 AM, Sebastian Andrzej Siewior wrote: This driver is a redo of my earlier attempt. It uses parts of the generic PHY driver and uses the new control driver for the register the phy needs to power on/off the phy. It also enables easy access for the wakeup register which is not yet implemented. The difference between the omap attempt is: - no static holding variable - one global visible function which exports a struct with callbacks to access the "control" registers. Signed-off-by: Sebastian Andrzej Siewior --- +static const struct of_device_id omap_control_usb_id_table[] = { + { .compatible = "ti,am335x-ctrl-module", .data = _am335x }, + {} +}; +MODULE_DEVICE_TABLE(of, omap_control_usb_id_table); + +static struct platform_driver am335x_control_driver; +static int match(struct device *dev, void *data) +{ + struct device_node *node = (struct device_node *)data; + return dev->of_node == node && + dev->driver == _control_driver.driver; +} + +struct phy_control *am335x_get_phy_control(struct device *dev) +{ + struct device_node *node; + struct am335x_control_usb *ctrl_usb; + + node = of_parse_phandle(dev->of_node, "ti,ctrl_mod", 0); + if (!node) + return NULL; + + dev = bus_find_device(_bus_type, NULL, node, match); of_find_device_by_node doesn't work??? 
+ ctrl_usb = dev_get_drvdata(dev); + if (!ctrl_usb) + return NULL; + return _usb->phy_ctrl; +} +EXPORT_SYMBOL_GPL(am335x_get_phy_control); + +static int am335x_control_usb_probe(struct platform_device *pdev) +{ + struct resource *res; + struct am335x_control_usb *ctrl_usb; + const struct of_device_id *of_id; + const struct phy_control *phy_ctrl; + + of_id = of_match_node(omap_control_usb_id_table, pdev->dev.of_node); + if (!of_id) + return -EINVAL; + + phy_ctrl = of_id->data; + + ctrl_usb = devm_kzalloc(>dev, sizeof(*ctrl_usb), GFP_KERNEL); + if (!ctrl_usb) { + dev_err(>dev, "unable to alloc memory for control usb\n"); + return -ENOMEM; + } + + ctrl_usb->dev = >dev; + + res = platform_get_resource_byname(pdev, IORESOURCE_MEM, "control_dev"); + ctrl_usb->reset_reg = devm_ioremap_resource(>dev, res); As per current DTSI in patch 5 you are mapping from control module base till 0x650. Cant it be split to 2 registers one mapping for phy on/off and one for phy wkup? + if (IS_ERR(ctrl_usb->reset_reg)) + return PTR_ERR(ctrl_usb->reset_reg); + spin_lock_init(_usb->lock); + ctrl_usb->phy_ctrl = *phy_ctrl; + + dev_set_drvdata(ctrl_usb->dev, ctrl_usb); + return 0; +} + +static struct platform_driver am335x_control_driver = { + .probe = am335x_control_usb_probe, + .driver = { + .name = "am335x-control-usb", + .owner = THIS_MODULE, + .of_match_table = of_match_ptr(omap_control_usb_id_table), + }, +}; + +module_platform_driver(am335x_control_driver); +MODULE_LICENSE("GPL v2"); -- -George -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 4/9] syslog_ns: make syslog handling per namespace
On 08/01/2013 11:10 AM, Rui Xiang wrote:
> On 2013/8/1 9:36, Gao feng wrote:
>> On 07/29/2013 10:31 AM, Rui Xiang wrote:
>>> This patch makes syslog buf and other fields per
>>> namespace.
>>>
>>> Here use ns->log_buf (log_buf_len, logbuf_lock,
>>> log_first_seq, and so on) fields
>>> instead of global ones to handle syslog.
>>>
>>> Syslog interfaces such as /dev/kmsg, /proc/kmsg,
>>> and syslog syscall are all containerized for
>>> container users.
>>>
>> /dev/kmsg is used by the syslog api closelog, openlog, syslog, vsyslog,
>> this should be per user namespace, but seems in your patch,
>
> Yes, /dev/kmsg is per user namespace, and per syslog ns, too.
>
>> the syslog message generated through these APIs on host can be exported
>> to the /dev/kmsg of container, is this what we want?
>
> Ah.. I think your question targets the devkmsg_writev function, right?

yep, another small problem, you forgot to remove the global logbuf_lock.

> You remind me that it's really an issue. Printk_emit in devkmsg_writev
> should not use init_syslog_ns as its syslog_ns but current_user_ns->syslog_ns.
>
> In 1st version, current_syslog_ns was used in vprintk_emit. In this version,
> the interface vprintk_emit has changed, but this patch misses that.
> I will fix it.

Ok, thanks!
[GIT PULL] more ext4 bug fixes
(Sorry for the repeated resends to the kernel.org mailing lists; I'm on the road, and my attempts to send this have been screwing up the e-mail send to Linus in various different ways, most of it caused by the fact that emacs message mode is helpfully trying to colorize the e-mail addresses in a way that makes them impossible to read when using ssh from a ChromeOS laptop...)

The following changes since commit 3b2f64d00c46e1e4e9bd0bb9bb12619adac27a4b:

  Linux 3.11-rc2 (2013-07-21 12:05:29 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git tags/ext4_for_linus

for you to fetch changes up to 44fb851dfb2f8e3462617e19a7b3b9025db9d919:

  ext4: add WARN_ON to check the length of allocated blocks (2013-07-29 12:51:42 -0400)

ext4 bugfixes for v3.11-rc4

Eric Sandeen (1):
      ext4: destroy ext4_es_cachep on module unload

Theodore Ts'o (2):
      ext4: make sure group number is bumped after a inode allocation race
      ext4: fix retry handling in ext4_ext_truncate()

Zheng Liu (1):
      ext4: add WARN_ON to check the length of allocated blocks

 fs/ext4/extents.c | 2 +-
 fs/ext4/ialloc.c  | 10 +-
 fs/ext4/inode.c   | 39 ++-
 fs/ext4/super.c   | 1 +

 4 files changed, 25 insertions(+), 27 deletions(-)
Re: [PATCH 2/3] xfs: Implement FALLOC_FL_COLLAPSE_RANGE
2013/8/1, Dave Chinner : > On Wed, Jul 31, 2013 at 11:42:14PM +0900, Namjae Jeon wrote: >> From: Namjae Jeon >> >> New fallocate flag FALLOC_FL_COLLAPSE_RANGE implementation for XFS. > Hi Dave. > A good start, but there's lots of work needed to make this safe for > general use. First, Thanks for your advice and help. > >> Signed-off-by: Namjae Jeon >> Signed-off-by: Ashish Sangwan >> --- >> fs/xfs/xfs_bmap.c | 92 >> + >> fs/xfs/xfs_bmap.h |3 ++ >> fs/xfs/xfs_file.c | 26 -- >> fs/xfs/xfs_iops.c | 35 +++ >> fs/xfs/xfs_vnodeops.c | 62 + >> fs/xfs/xfs_vnodeops.h |2 ++ >> 6 files changed, 217 insertions(+), 3 deletions(-) >> >> diff --git a/fs/xfs/xfs_bmap.c b/fs/xfs/xfs_bmap.c >> index 05c698c..ae677b1 100644 >> --- a/fs/xfs/xfs_bmap.c >> +++ b/fs/xfs/xfs_bmap.c >> @@ -6145,3 +6145,95 @@ next_block: >> >> return error; >> } >> + >> +/* >> + * xfs_update_logical() >> + * Updates starting logical block of extents by subtracting >> + * shift from them. At max XFS_LINEAR_EXTS number extents >> + * will be updated and *current_ext is the extent number which >> + * is currently being updated. > > Single space after the "*" in the comments. Also, format them out to > 80 columns. > Single space after * followed by a tab gives checkpath warning: WARNING: please, no space before tabs #29: FILE: fs/xfs/xfs_bmap.c:6151: + * ^IUpdates starting logical block of extents by subtracting$ >> + */ >> +int >> +xfs_update_logical( >> +xfs_trans_t *tp, >> +struct xfs_inode*ip, >> +int *done, >> +xfs_fileoff_t start_fsb, >> +xfs_fileoff_t shift, >> +xfs_extnum_t*current_ext) > > This belongs in xfs_bmap_btree.c, named something like > xfs_bmbt_shift_rec_offsets(). Okay. > > Also, not typedefs for structures, please. (i.e. struct xfs_trans). Okay. > > >> +{ >> +xfs_btree_cur_t *cur; > > struct xfs_btree_cur*cur; > > And for all the other structures, too. Okay. 
> >> +xfs_bmbt_rec_host_t *gotp; >> +xfs_mount_t *mp; >> +xfs_ifork_t *ifp; >> +xfs_extnum_tnexts = 0; >> +xfs_fileoff_t startoff; >> +int error = 0; >> +int i; >> + >> +ifp = XFS_IFORK_PTR(ip, XFS_DATA_FORK); >> +mp = ip->i_mount; > > Do the assigment in the declaration on mp. Okay. > > struct xfs_mount*mp = ip->i_mount; > >> + >> +if (!(ifp->if_flags & XFS_IFEXTENTS) && >> +(error = xfs_iread_extents(tp, ip, XFS_DATA_FORK))) >> +return error; > > H - that rings alarm bells. ok, we will remove the alarm. > >> + >> +if (!*current_ext) { >> +gotp = xfs_iext_bno_to_ext(ifp, start_fsb, current_ext); >> +/* >> + * gotp can be null in 2 cases: 1) if there are no extents >> + * or 2) start_fsb lies in a hole beyond which there are >> + * no extents. Either way, we are done. >> + */ >> +if (!gotp) { >> +*done = 1; >> +return 0; >> +} >> +} > > So, you do a lookup on an extent index, which returns a incore > extent record. > >> + >> +if (ifp->if_flags & XFS_IFBROOT) >> +cur = xfs_bmbt_init_cursor(mp, tp, ip, XFS_DATA_FORK); >> +else >> +cur = NULL; >> + >> +while (nexts++ < XFS_LINEAR_EXTS && > > What has XFS_LINEAR_EXTS got to do with how many extents can be moved > in a single transaction at a time? We are also not sure how many extents to move in 1 transaction. xfs_bunmapi deletes 3 extents in 1 iteration. But deletion and updation are 2 different tasks. BTW, we tested this patch with 10,000 extents and it is working fine. Could you help us out here? What number would be appropriate? > >> + *current_ext < XFS_IFORK_NEXTENTS(ip, XFS_DATA_FORK)) { >> +gotp = xfs_iext_get_ext(ifp, (*current_ext)++); >> +startoff = xfs_bmbt_get_startoff(gotp); >> +if (cur) { >> +if ((error = xfs_bmbt_lookup_eq(cur, >> +startoff, >> +xfs_bmbt_get_startblock(gotp), >> +xfs_bmbt_get_blockcount(gotp), >> +))) > > Please don't put assignments inside if() statements where possible. > i.e. > error = xfs_bmbt_lookup_eq(cur, > if (error) > > Is the normal way of doing this. 
Okay:) > >> +goto del_cursor; >> +XFS_WANT_CORRUPTED_GOTO(i == 1, del_cursor); >> +} >> +startoff -= shift; >> +xfs_bmbt_set_startoff(gotp, startoff); > > So, we've looked up
linux-next: manual merge of the usb-gadget tree with Linus' tree
Hi Felipe, Today's linux-next merge of the usb-gadget tree got conflicts in drivers/usb/musb/omap2430.c and drivers/usb/musb/tusb6010.c between commit c1f01be4060b ("usb: musb: fix resource passed from glue layer to musb") from Linus' tree and commit c1a7d67c1901 ("usb: musb: use dev_get_platdata()") from the usb-gadget tree. I fixed it up (see below) and can carry the fix as necessary (no action is required). -- Cheers, Stephen Rothwells...@canb.auug.org.au diff --cc drivers/usb/musb/omap2430.c index f44e8b5,ebb46ec..000 --- a/drivers/usb/musb/omap2430.c +++ b/drivers/usb/musb/omap2430.c @@@ -481,8 -481,8 +481,8 @@@ static u64 omap2430_dmamask = DMA_BIT_M static int omap2430_probe(struct platform_device *pdev) { - struct resource musb_resources[2]; + struct resource musb_resources[3]; - struct musb_hdrc_platform_data *pdata = pdev->dev.platform_data; + struct musb_hdrc_platform_data *pdata = dev_get_platdata(>dev); struct omap_musb_board_data *data; struct platform_device *musb; struct omap2430_glue*glue; diff --cc drivers/usb/musb/tusb6010.c index 6f8a9ca,2196ee6..000 --- a/drivers/usb/musb/tusb6010.c +++ b/drivers/usb/musb/tusb6010.c @@@ -1156,8 -1156,8 +1156,8 @@@ static u64 tusb_dmamask = DMA_BIT_MASK( static int tusb_probe(struct platform_device *pdev) { - struct resource musb_resources[2]; + struct resource musb_resources[3]; - struct musb_hdrc_platform_data *pdata = pdev->dev.platform_data; + struct musb_hdrc_platform_data *pdata = dev_get_platdata(>dev); struct platform_device *musb; struct tusb6010_glue*glue; pgpwzXdAZeVie.pgp Description: PGP signature
Re: [PATCH 5/5] usb: musb: dsps: use proper child nodes
On 7/31/2013 1:46 AM, Sebastian Andrzej Siewior wrote:

This moves the two instances from the big node into two child nodes. The glue layer on top does almost nothing. There is one device containing the control module for USB, (2) phy, (2) usb and later the dma engine. The usb device is the "glue device" which contains the musb device as a child. This is what we have done ever since. The new file musb_am335x is just here to probe the new bus and populate child devices.

There are a lot of changes to the dsps file as a result of the changes:

- musb_core_offset
  This is gone. The device tree provides memory resource information for the device, so there is no need to "fix" things.

- instances
  This is gone as well. If we have two instances then we have two child enabled nodes in the device tree. For instance, the SoC in the Beagle Bone has two USB instances but only one has been wired up, so there is no need to load and init the second instance since it won't be used.

- dsps_glue is now per glue device
  In the past there was one of these structs, but with an array of two, and each instance accessed its variable depending on the platform device id.

- no unneeded copy of structs
  I do not know why struct dsps_musb_wrapper is copied, but it is not necessary. The same goes for musb_hdrc_platform_data, which is allocated on demand and then again by platform_device_add_data(). One copy is enough.
Signed-off-by: Sebastian Andrzej Siewior diff --git a/arch/arm/boot/dts/am33xx.dtsi b/arch/arm/boot/dts/am33xx.dtsi index 38b446b..0f756ca 100644 --- a/arch/arm/boot/dts/am33xx.dtsi +++ b/arch/arm/boot/dts/am33xx.dtsi @@ -26,6 +26,10 @@ serial5 = d_can0 = d_can1 = + usb0 = + usb1 = + phy0 = _phy; + phy1 = _phy; }; cpus { @@ -333,21 +337,85 @@ status = "disabled"; }; - usb@4740 { - compatible = "ti,musb-am33xx"; - reg = <0x4740 0x1000 /* usbss */ - 0x47401000 0x800 /* musb instance 0 */ - 0x47401800 0x800>;/* musb instance 1 */ - interrupts = <17 /* usbss */ - 18/* musb instance 0 */ - 19>; /* musb instance 1 */ - multipoint = <1>; - num-eps = <16>; - ram-bits = <12>; - port0-mode = <3>; - port1-mode = <3>; - power = <250>; + usb: usb@4740 { + compatible = "ti,am33xx-usb"; + reg = <0x4740 0x1000>; + ranges; + #address-cells = <1>; + #size-cells = <1>; ti,hwmods = "usb_otg_hs"; + status = "disabled"; + + ctrl_mod: control@44e1 { + compatible = "ti,am335x-ctrl-module"; + reg = <0x44e1 0x650>; Do you really need to map Control Module base to 0x650? If some other driver does this mapping we will always end returning -EPROBE_DEFER. How about this reg = <0x44e10620 0x4> , <0x44e10648 0x1>; reg-names = "phycontrol_dev" , "phywkup_dev"; and map for power on/off and phywkup separately in the control driver. + reg-names = "control_dev"; + status = "disabled"; + }; + diff --git a/drivers/usb/musb/Kconfig b/drivers/usb/musb/Kconfig index 797e3fd..b7257ae 100644 --- a/drivers/usb/musb/Kconfig +++ b/drivers/usb/musb/Kconfig @@ -83,6 +83,7 @@ config USB_MUSB_AM35X config USB_MUSB_DSPS tristate "TI DSPS platforms" + select USB_MUSB_AM335X_CHILD How about adding select AM335X_PHY_USB here? 
config USB_MUSB_BLACKFIN tristate "Blackfin" @@ -93,6 +94,9 @@ config USB_MUSB_UX500 endchoice +config USB_MUSB_AM335X_CHILD + tristate + choice prompt 'MUSB DMA mode' default MUSB_PIO_ONLY if ARCH_MULTIPLATFORM -- -George -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
linux-next: manual merge of the driver-core tree with the net-next tree
Hi Greg, Today's linux-next merge of the driver-core tree got a conflict in net/core/net-sysfs.c between commit ff80e519ab1b ("net: export physical port id via sysfs") from the net-next tree and commit 6be8aeef348a ("net: core: convert class code to use dev_groups") from the driver-core tree. I fixed it up (see below) and can carry the fix as necessary (no action is required). -- Cheers, Stephen Rothwells...@canb.auug.org.au diff --cc net/core/net-sysfs.c index 8826b0d,707c313..000 --- a/net/core/net-sysfs.c +++ b/net/core/net-sysfs.c @@@ -237,9 -250,9 +250,31 @@@ static ssize_t operstate_show(struct de return sprintf(buf, "%s\n", operstates[operstate]); } + static DEVICE_ATTR_RO(operstate); + ++static ssize_t phys_port_id_show(struct device *dev, ++ struct device_attribute *attr, char *buf) ++{ ++ struct net_device *netdev = to_net_dev(dev); ++ ssize_t ret = -EINVAL; ++ ++ if (!rtnl_trylock()) ++ return restart_syscall(); ++ ++ if (dev_isalive(netdev)) { ++ struct netdev_phys_port_id ppid; ++ ++ ret = dev_get_phys_port_id(netdev, ); ++ if (!ret) ++ ret = sprintf(buf, "%*phN\n", ppid.id_len, ppid.id); ++ } ++ rtnl_unlock(); ++ ++ return ret; ++} ++static DEVICE_ATTR_RO(phys_port_id); + /* read-write attributes */ - NETDEVICE_SHOW(mtu, fmt_dec); static int change_mtu(struct net_device *net, unsigned long new_mtu) { @@@ -333,52 -344,32 +366,33 @@@ static ssize_t group_store(struct devic { return netdev_store(dev, attr, buf, len, change_group); } - - static ssize_t show_phys_port_id(struct device *dev, -struct device_attribute *attr, char *buf) - { - struct net_device *netdev = to_net_dev(dev); - ssize_t ret = -EINVAL; - - if (!rtnl_trylock()) - return restart_syscall(); - - if (dev_isalive(netdev)) { - struct netdev_phys_port_id ppid; - - ret = dev_get_phys_port_id(netdev, ); - if (!ret) - ret = sprintf(buf, "%*phN\n", ppid.id_len, ppid.id); - } - rtnl_unlock(); - - return ret; - } - - static struct device_attribute net_class_attributes[] = { - 
__ATTR(addr_assign_type, S_IRUGO, show_addr_assign_type, NULL), - __ATTR(addr_len, S_IRUGO, show_addr_len, NULL), - __ATTR(dev_id, S_IRUGO, show_dev_id, NULL), - __ATTR(ifalias, S_IRUGO | S_IWUSR, show_ifalias, store_ifalias), - __ATTR(iflink, S_IRUGO, show_iflink, NULL), - __ATTR(ifindex, S_IRUGO, show_ifindex, NULL), - __ATTR(type, S_IRUGO, show_type, NULL), - __ATTR(link_mode, S_IRUGO, show_link_mode, NULL), - __ATTR(address, S_IRUGO, show_address, NULL), - __ATTR(broadcast, S_IRUGO, show_broadcast, NULL), - __ATTR(carrier, S_IRUGO | S_IWUSR, show_carrier, store_carrier), - __ATTR(speed, S_IRUGO, show_speed, NULL), - __ATTR(duplex, S_IRUGO, show_duplex, NULL), - __ATTR(dormant, S_IRUGO, show_dormant, NULL), - __ATTR(operstate, S_IRUGO, show_operstate, NULL), - __ATTR(mtu, S_IRUGO | S_IWUSR, show_mtu, store_mtu), - __ATTR(flags, S_IRUGO | S_IWUSR, show_flags, store_flags), - __ATTR(tx_queue_len, S_IRUGO | S_IWUSR, show_tx_queue_len, - store_tx_queue_len), - __ATTR(netdev_group, S_IRUGO | S_IWUSR, show_group, store_group), - __ATTR(phys_port_id, S_IRUGO, show_phys_port_id, NULL), - {} + NETDEVICE_SHOW(group, fmt_dec); + static DEVICE_ATTR(netdev_group, S_IRUGO | S_IWUSR, group_show, group_store); + + static struct attribute *net_class_attrs[] = { + _attr_netdev_group.attr, + _attr_type.attr, + _attr_dev_id.attr, + _attr_iflink.attr, + _attr_ifindex.attr, + _attr_addr_assign_type.attr, + _attr_addr_len.attr, + _attr_link_mode.attr, + _attr_address.attr, + _attr_broadcast.attr, + _attr_speed.attr, + _attr_duplex.attr, + _attr_dormant.attr, + _attr_operstate.attr, ++ _attr_phys_port_id.attr, + _attr_ifalias.attr, + _attr_carrier.attr, + _attr_mtu.attr, + _attr_flags.attr, + _attr_tx_queue_len.attr, + NULL, }; + ATTRIBUTE_GROUPS(net_class); /* Show a given an attribute in the statistics group */ static ssize_t netstat_show(const struct device *d, pgpLzs2vJkKrd.pgp Description: PGP signature
Re: [PATCH 17/18] sched: Retry migration of tasks to CPU on a preferred node
* Mel Gorman [2013-07-15 16:20:19]:

> When a preferred node is selected for a task there is an attempt to migrate
> the task to a CPU there. This may fail in which case the task will only
> migrate if the active load balancer takes action. This may never happen if

Apart from load imbalance or heavily loaded cpus on the preferred node, what could be the other reasons for migration failure with migrate_task_to()? I see it as almost similar to active load balance, except for pushing instead of pulling tasks. If load imbalance is the only reason, do we need to retry? If the task is really so attached to memory on that node, shouldn't we be getting a task_numa_placement hit before the next 5 seconds?

> the conditions are not right. This patch will check at NUMA hinting fault
> time if another attempt should be made to migrate the task. It will only
> make an attempt once every five seconds.
>
> Signed-off-by: Mel Gorman

--
Thanks and Regards
Srikar Dronamraju
Re: [PATCH 1/3] fs: Introduce new flag FALLOC_FL_COLLAPSE_RANGE
2013/8/1, Dave Chinner : > On Wed, Jul 31, 2013 at 11:42:00PM +0900, Namjae Jeon wrote: >> From: Namjae Jeon >> >> Fallocate now supports new FALLOC_FL_COLLAPSE_RANGE flag. >> The semantics of this flag are following: >> 1) It collapses the range lying between offset and length by removing any >> data >>blocks which are present in this range and than updates all the >> logical >>offsets of extents beyond "offset + len" to nullify the hole created >> by >>removing blocks. In short, it does not leave a hole. >> 1) It should be used exclusively. No other fallocate flag in combination. >> 2) Offset and length supplied to fallocate should be fs block size >> aligned. > Hi Dave. > Given that the rest of fallocate() interfaces are byte based, this > is going to cause some confusion if it's not well documented. i.e. > this restriction needs to be documented in the header file that is > exposed to userspace, as well as in the fallocate(2) man page. Agree. > >> 3) It wokrs beyond "EOF", so the extents which are pre-allocated beyond >> "EOF" >>are also updated. > > I don't think that's valid for this operation. If the range is > beyond EOF, you are not modifying anything visible to userspace, and > that makes it the same as a hole punch operation. So, I'd get rid of > this nasty implementation complexity altogether. The basic idea behind collapse range is that it doesn't leave a hole. This is achieved by updating the starting block of each of the extents which lie beyond the hole. When we are updating an extent, we are actually moving the hole to the right, towards the last allocated block in the file space. We continue doing this for each extent, and finally, when this operation completes, the hole is gone. If the file has preallocated extent(s) beyond EOF, something like this => hole, extent1, extent2, extent_beyond_EOF1. And we start updating extents one by one to nullify the effect of the hole => STEP1: After updating extent1: extent1, hole, extent2, extent_beyond_EOF1.
STEP2: After updating extent2: extent1, extent2, hole, extent_beyond_EOF1. STEP3: After updating extent_beyond_EOF1: extent1, extent2, extent_beyond_EOF1, hole => And this removes the hole from the file. If we stop updating just after extent2, this will not remove the hole from the file space; instead, it will place the hole between EOF and the 1st preallocated extent beyond EOF, as in STEP2. So, if the user does an extending truncate, the hole will become visible as it is still there. > >> 4) It reduces the i_size of inode by the amount of collapse range which >> lies >>within i_size. So, if offset >= i_size, i_size won't be changed at >> all. > > Similarly, I don't think that asking for a range that overlaps EOF > is valid, either. The moment you overlap or go beyond EOF, you are > effectively truncating the file as there is no data to collapse into > the range Yes, we could put a restriction on that, something like: if (len + offset > i_size) { return -EINVAL; } > > As it is, I don't see these EOF rules enforced by this patch, nor > are they documented at all in the patch. The EOF rules are handled in the fs specific patches. As in XFS => + if (mode & FALLOC_FL_COLLAPSE_RANGE) { + error = -xfs_collapse_file_space(ip, offset + len, len); + if (error || offset >= i_size_read(inode)) + goto out_unlock; + + /* Shrink size in case of FALLOC_FL_COLLAPSE_RANGE */ + if ((offset + len) > i_size_read(inode)) + new_size = offset; + else + new_size = i_size_read(inode) - len; + error = -xfs_update_file_size(ip, new_size); But it seems ok to not allow (offset + len) > i_size. > > Regardless of what semantics we decide on, this needs xfs_io > support and extensive corner case tests added to xfstests so that we > can confirm the implementation in every filesystem is correct and we > don't introduce regressions in future. it also needs documentation > in the fallocate(2) man page. Yes, we are currently working on it. 
With the next version of the collapse_range patches, we will also send the xfs_io and xfstests patches, along with the man page update. > >> diff --git a/include/uapi/linux/falloc.h b/include/uapi/linux/falloc.h >> index 990c4cc..7567c8c 100644 >> --- a/include/uapi/linux/falloc.h >> +++ b/include/uapi/linux/falloc.h >> @@ -1,9 +1,10 @@ >> #ifndef _UAPI_FALLOC_H_ >> #define _UAPI_FALLOC_H_ >> >> -#define FALLOC_FL_KEEP_SIZE 0x01 /* default is extend size */ >> -#define FALLOC_FL_PUNCH_HOLE0x02 /* de-allocates range */ >> +#define FALLOC_FL_KEEP_SIZE 0x01 /* default is extend size */ >> +#define FALLOC_FL_PUNCH_HOLE0x02 /* de-allocates range */ >> #define FALLOC_FL_NO_HIDE_STALE 0x04 /* reserved codepoint */ >> +#define FALLOC_FL_COLLAPSE_RANGE0x08 /* it does not leave a hole */ > > I'd suggest that you need to explain what FALLOC_FL_COLLAPSE_RANGE > does in great detail right
Re: [PATCH v5 0/4] [SCSI] sg: fix race condition in sg_open
On 13-07-22 01:03 PM, Jörn Engel wrote: On Mon, 22 July 2013 12:40:29 +0800, Vaughan Cao wrote: There is a race when open sg with O_EXCL flag. Also a race may happen between sg_open and sg_remove. Changes from v4: * [3/4] use ERR_PTR series instead of adding another parameter in sg_add_sfp * [4/4] fix conflict for cherry-pick from v3. Changes from v3: * release o_sem in sg_release(), not in sg_remove_sfp(). * not set exclude with sfd_lock held. Vaughan Cao (4): [SCSI] sg: use rwsem to solve race during exclusive open [SCSI] sg: no need sg_open_exclusive_lock [SCSI] sg: checking sdp->detached isn't protected when open [SCSI] sg: push file descriptor list locking down to per-device locking drivers/scsi/sg.c | 178 +- 1 file changed, 83 insertions(+), 95 deletions(-) Patchset looks good to me, although I didn't test it on hardware yet. Signed-off-by: Joern Engel James, care to pick this up? Acked-by: Douglas Gilbert Tested O_EXCL with multiple processes and threads; passed. sg driver prior to this patch had "leaky" O_EXCL logic according to the same test. Block device passed. James, could you clean this up: drivers/scsi/sg.c:242:6: warning: unused variable ‘res’ [-Wunused-variable] Doug Gilbert
Re: [PATCH 18/18] sched: Swap tasks when rescheduling if a CPU on a target node is imbalanced
> @@ -904,6 +908,8 @@ static int task_numa_find_cpu(struct task_struct *p, int > nid) > src_eff_load *= src_load + effective_load(tg, src_cpu, -weight, > -weight); > > for_each_cpu(cpu, cpumask_of_node(nid)) { > + struct task_struct *swap_candidate = NULL; > + > dst_load = target_load(cpu, idx); > > /* If the CPU is idle, use it */ > @@ -922,12 +928,41 @@ static int task_numa_find_cpu(struct task_struct *p, > int nid) >* migrate to its preferred node due to load imbalances. >*/ > balanced = (dst_eff_load <= src_eff_load); > - if (!balanced) > - continue; > + if (!balanced) { > + struct rq *rq = cpu_rq(cpu); > + unsigned long src_faults, dst_faults; > + > + /* Do not move tasks off their preferred node */ > + if (rq->curr->numa_preferred_nid == nid) > + continue; > + > + /* Do not attempt an illegal migration */ > + if (!cpumask_test_cpu(cpu, tsk_cpus_allowed(rq->curr))) > + continue; > + > + /* > + * Do not impair locality for the swap candidate. > + * Destination for the swap candidate is the source cpu > + */ > + if (rq->curr->numa_faults) { > + src_faults = > rq->curr->numa_faults[task_faults_idx(nid, 1)]; > + dst_faults = > rq->curr->numa_faults[task_faults_idx(src_cpu_node, 1)]; > + if (src_faults > dst_faults) > + continue; > + } > + > + /* > + * The destination is overloaded but running a task > + * that is not running on its preferred node. Consider > + * swapping the CPU tasks are running on. > + */ > + swap_candidate = rq->curr; > + } > > if (dst_load < min_load) { > min_load = dst_load; > dst_cpu = cpu; > + *swap_p = swap_candidate; Are we sometimes passing a wrong candidate? Let's say balanced is false for the first cpu and we set the swap_candidate, but we then find the second cpu (or a later cpu) to be idle or to have a lower effective load; we could end up sending the task that is running on the first cpu as the swap candidate. Would the preferred cpu and swap_candidate still match then? 
-- Thanks and Regards Srikar
Re: [PATCH V2 4/6] cpuidle/pseries: Move the pseries_idle backend driver to sysdev.
Hi Dongsheng, On 07/31/2013 11:16 AM, Wang Dongsheng-B40534 wrote: > Hi Preeti, > >> -Original Message- >> From: Preeti U Murthy [mailto:pre...@linux.vnet.ibm.com] >> Sent: Wednesday, July 31, 2013 12:00 PM >> To: Wang Dongsheng-B40534 >> Cc: Deepthi Dharwar; b...@kernel.crashing.org; daniel.lezc...@linaro.org; >> linux-kernel@vger.kernel.org; mich...@ellerman.id.au; >> srivatsa.b...@linux.vnet.ibm.com; sva...@linux.vnet.ibm.com; linuxppc- >> d...@lists.ozlabs.org; r...@sisk.pl; linux...@vger.kernel.org >> Subject: Re: [PATCH V2 4/6] cpuidle/pseries: Move the pseries_idle >> backend driver to sysdev. >> >> Hi Dongsheng, >> >> On 07/31/2013 08:52 AM, Wang Dongsheng-B40534 wrote: >>> >>> -Original Message- From: Deepthi Dharwar [mailto:deep...@linux.vnet.ibm.com] Sent: Wednesday, July 31, 2013 10:59 AM To: b...@kernel.crashing.org; daniel.lezc...@linaro.org; linux- ker...@vger.kernel.org; mich...@ellerman.id.au; srivatsa.b...@linux.vnet.ibm.com; pre...@linux.vnet.ibm.com; sva...@linux.vnet.ibm.com; linuxppc-...@lists.ozlabs.org Cc: r...@sisk.pl; Wang Dongsheng-B40534; linux...@vger.kernel.org Subject: [PATCH V2 4/6] cpuidle/pseries: Move the pseries_idle backend driver to sysdev. Move pseries_idle backend driver code to arch/powerpc/sysdev so that the code can be used for a common driver for powernv and pseries. This removes a lot of code duplicacy. >>> Why not drivers/cpuidle/? >>> >>> I think it should be move to drivers/cpuidle. >> >> Please take a look at what the cpuidle under drivers has to provide. >> cpuidle has two parts to it. The front end and the back end. The front >> end constitutes the cpuidle governors, registering of arch specific >> cpuidle drivers, disabling and enabling of cpuidle feature. It is this >> front end code which is present under drivers/cpuidle. >> >> The arch specific cpuidle drivers which decide what needs to be done to >> enter a specific idle state chosen by the cpuidle governor is what >> constitutes the back end of cpuidle. 
This will not be in drivers/cpuidle >> but in an arch/ specific code. >> >> The cpuidle under drivers/cpuidle drives the idle power management, but >> the low level handling of the entry into idle states should be taken care >> of by the architecture. >> >> Your recent patch : >> cpuidle: add freescale e500 family porcessors idle support IMO should >> hook onto the backend cpuidle driver that this patchset provides. >> > Sorry, I don't think so, cpuidle framework has been already very common. > Here we just need to do state definition and handling. I wonder whether > we need this layer. > > If your handle is platform dependent, it should be in arch/platform. > > If it is only for some platforms and the operation of these platforms can be > multiplexed, Why cannot as a driver to put into driver/cpuidle? > > If it a general driver, I think we can put some common operating to > driver/cpuidle > and make the platform specific code to arch/powerpc/platform. > > This patch include front end and back end, not just back end. > > This patch include too many state of different platforms and handle function. > This state > and handle that should belong to itself platforms. Not a general way. If > Deepthi will do > a general powerpc cpuidle, I think, it's cannot just using the macro to > distinguish > platform. the front end code maybe move to driver/cpuidle(drvier register) > should be better, > make the Low Power State and what should be handle to > arch/powerpc/platform/**, because different > platforms have different state of low power consumption, and the processing > method. > The front end can provide some general methods to register into general > powerpc cpuidle driver. As Daniel pointed out, with a call to cpuidle_register(), we can get the cpuidle_driver and cpuidle_device registered through the generic cpuidle framework. Hence we can get rid of the powerpc_idle_devices_init() routine. We can have the hotplug notifier in the generic cpuidle framework as well. 
The rest of the patchset, however, should be arch specific IMO. Regards Preeti U Murthy > > -dongsheng
Re: [PATCH 08/18] sched: Reschedule task on preferred NUMA node once selected
* Mel Gorman [2013-07-15 16:20:10]: > A preferred node is selected based on the node the most NUMA hinting > faults was incurred on. There is no guarantee that the task is running > on that node at the time so this patch rescheules the task to run on > the most idle CPU of the selected node when selected. This avoids > waiting for the balancer to make a decision. > > Signed-off-by: Mel Gorman > --- > kernel/sched/core.c | 17 + > kernel/sched/fair.c | 46 +- > kernel/sched/sched.h | 1 + > 3 files changed, 63 insertions(+), 1 deletion(-) > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index 5e02507..b67a102 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -4856,6 +4856,23 @@ fail: > return ret; > } > > +#ifdef CONFIG_NUMA_BALANCING > +/* Migrate current task p to target_cpu */ > +int migrate_task_to(struct task_struct *p, int target_cpu) > +{ > + struct migration_arg arg = { p, target_cpu }; > + int curr_cpu = task_cpu(p); > + > + if (curr_cpu == target_cpu) > + return 0; > + > + if (!cpumask_test_cpu(target_cpu, tsk_cpus_allowed(p))) > + return -EINVAL; > + > + return stop_one_cpu(curr_cpu, migration_cpu_stop, &arg); As I had noted earlier, this upsets schedstats badly. Can we add a TODO for this patch, which mentions that schedstats needs to be taken care of? One alternative that I can think of is to have a per scheduling class routine that gets called and does the needful. For example: for fair share, it could update the schedstats as well as check for cfs_throttling. But I think it's an issue that needs some fix, or we should obsolete schedstats.
[PATCH 02/10] KVM: PPC: reserve a capability number for multitce support
This is to reserve a capability number for upcoming support of the H_PUT_TCE_INDIRECT and H_STUFF_TCE pseries hypercalls which support multiple DMA map/unmap operations per call. Signed-off-by: Alexey Kardashevskiy --- Changes: 2013/07/16: * changed the number Signed-off-by: Alexey Kardashevskiy --- include/uapi/linux/kvm.h | 1 + 1 file changed, 1 insertion(+) diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index acccd08..99c2533 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -667,6 +667,7 @@ struct kvm_ppc_smmu_info { #define KVM_CAP_PPC_RTAS 91 #define KVM_CAP_IRQ_XICS 92 #define KVM_CAP_ARM_EL1_32BIT 93 +#define KVM_CAP_SPAPR_MULTITCE 94 #ifdef KVM_CAP_IRQ_ROUTING -- 1.8.3.2
[PATCH 05/10] powerpc: add real mode support for dma operations on powernv
The existing TCE machine calls (tce_build and tce_free) only support virtual mode as they call __raw_writeq for TCE invalidation, which fails in real mode. This introduces tce_build_rm and tce_free_rm real mode versions which do mostly the same but use the "Store Doubleword Caching Inhibited Indexed" instruction for TCE invalidation. This new feature is going to be utilized by real mode support of VFIO. Signed-off-by: Alexey Kardashevskiy --- Changes: 2013/11/07: * added comment why stdcix cannot be used in virtual mode 2013/08/07: * tested on p7ioc and fixed a bug with realmode addresses Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/include/asm/machdep.h| 12 arch/powerpc/platforms/powernv/pci-ioda.c | 47 +++ arch/powerpc/platforms/powernv/pci.c | 38 + arch/powerpc/platforms/powernv/pci.h | 3 +- 4 files changed, 81 insertions(+), 19 deletions(-) diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h index 8b48090..07dd3b1 100644 --- a/arch/powerpc/include/asm/machdep.h +++ b/arch/powerpc/include/asm/machdep.h @@ -78,6 +78,18 @@ struct machdep_calls { long index); void(*tce_flush)(struct iommu_table *tbl); + /* _rm versions are for real mode use only */ + int (*tce_build_rm)(struct iommu_table *tbl, +long index, +long npages, +unsigned long uaddr, +enum dma_data_direction direction, +struct dma_attrs *attrs); + void(*tce_free_rm)(struct iommu_table *tbl, + long index, + long npages); + void(*tce_flush_rm)(struct iommu_table *tbl); + void __iomem * (*ioremap)(phys_addr_t addr, unsigned long size, unsigned long flags, void *caller); void(*iounmap)(volatile void __iomem *token); diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index d8140b1..5815f1d 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -70,6 +70,16 @@ define_pe_printk_level(pe_err, KERN_ERR); define_pe_printk_level(pe_warn, KERN_WARNING); define_pe_printk_level(pe_info, 
KERN_INFO); +/* + * stdcix is only supposed to be used in hypervisor real mode as per + * the architecture spec + */ +static inline void __raw_rm_writeq(u64 val, volatile void __iomem *paddr) +{ + __asm__ __volatile__("stdcix %0,0,%1" + : : "r" (val), "r" (paddr) : "memory"); +} + static int pnv_ioda_alloc_pe(struct pnv_phb *phb) { unsigned long pe; @@ -454,10 +464,13 @@ static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe, struct pci_bus *bus) } } -static void pnv_pci_ioda1_tce_invalidate(struct iommu_table *tbl, -u64 *startp, u64 *endp) +static void pnv_pci_ioda1_tce_invalidate(struct pnv_ioda_pe *pe, +struct iommu_table *tbl, +u64 *startp, u64 *endp, bool rm) { - u64 __iomem *invalidate = (u64 __iomem *)tbl->it_index; + u64 __iomem *invalidate = rm? + (u64 __iomem *)pe->tce_inval_reg_phys: + (u64 __iomem *)tbl->it_index; unsigned long start, end, inc; start = __pa(startp); @@ -484,7 +497,10 @@ static void pnv_pci_ioda1_tce_invalidate(struct iommu_table *tbl, mb(); /* Ensure above stores are visible */ while (start <= end) { -__raw_writeq(start, invalidate); + if (rm) + __raw_rm_writeq(start, invalidate); + else + __raw_writeq(start, invalidate); start += inc; } @@ -496,10 +512,12 @@ static void pnv_pci_ioda1_tce_invalidate(struct iommu_table *tbl, static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe, struct iommu_table *tbl, -u64 *startp, u64 *endp) +u64 *startp, u64 *endp, bool rm) { unsigned long start, end, inc; - u64 __iomem *invalidate = (u64 __iomem *)tbl->it_index; + u64 __iomem *invalidate = rm? + (u64 __iomem *)pe->tce_inval_reg_phys: + (u64 __iomem *)tbl->it_index; /* We'll invalidate DMA address in PE scope */ start = 0x2ul << 60; @@ -515,22 +533,25 @@ static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe, mb(); while (start <= end) { - __raw_writeq(start, invalidate); + if (rm) +
[PATCH 08/10] powerpc/iommu: rework to support realmode
The TCE tables handling may differ for real and virtual modes, so additional ppc_md.tce_build_rm/ppc_md.tce_free_rm/ppc_md.tce_flush_rm handlers were introduced earlier. So this adds the following: 1. support for the new ppc_md calls; 2. the ability of iommu_tce_build to process multiple entries per call; 3. arch_spin_lock to protect TCE table from races in both real and virtual modes; 4. proper TCE table protection from races with the existing IOMMU code in iommu_take_ownership/iommu_release_ownership; 5. hwaddr variable renamed to hpa as it better describes what it actually represents; 6. iommu_tce_direction is static now as it is not called from anywhere else. This will be used by upcoming real mode support of VFIO on POWER. Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/include/asm/iommu.h | 9 +- arch/powerpc/kernel/iommu.c | 198 ++- 2 files changed, 136 insertions(+), 71 deletions(-) diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index c34656a..b01bde1 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -78,6 +78,7 @@ struct iommu_table { unsigned long *it_map; /* A simple allocation bitmap for now */ #ifdef CONFIG_IOMMU_API struct iommu_group *it_group; + arch_spinlock_t it_rm_lock; #endif }; @@ -152,9 +153,9 @@ extern int iommu_tce_clear_param_check(struct iommu_table *tbl, extern int iommu_tce_put_param_check(struct iommu_table *tbl, unsigned long ioba, unsigned long tce); extern int iommu_tce_build(struct iommu_table *tbl, unsigned long entry, - unsigned long hwaddr, enum dma_data_direction direction); -extern unsigned long iommu_clear_tce(struct iommu_table *tbl, - unsigned long entry); + unsigned long *hpas, unsigned long npages, bool rm); +extern int iommu_free_tces(struct iommu_table *tbl, unsigned long entry, + unsigned long npages, bool rm); extern int iommu_clear_tces_and_put_pages(struct iommu_table *tbl, unsigned long entry, unsigned long pages); extern int 
iommu_put_tce_user_mode(struct iommu_table *tbl, @@ -164,7 +165,5 @@ extern void iommu_flush_tce(struct iommu_table *tbl); extern int iommu_take_ownership(struct iommu_table *tbl); extern void iommu_release_ownership(struct iommu_table *tbl); -extern enum dma_data_direction iommu_tce_direction(unsigned long tce); - #endif /* __KERNEL__ */ #endif /* _ASM_IOMMU_H */ diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index b20ff17..8314c80 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -903,7 +903,7 @@ void iommu_register_group(struct iommu_table *tbl, kfree(name); } -enum dma_data_direction iommu_tce_direction(unsigned long tce) +static enum dma_data_direction iommu_tce_direction(unsigned long tce) { if ((tce & TCE_PCI_READ) && (tce & TCE_PCI_WRITE)) return DMA_BIDIRECTIONAL; @@ -914,7 +914,6 @@ enum dma_data_direction iommu_tce_direction(unsigned long tce) else return DMA_NONE; } -EXPORT_SYMBOL_GPL(iommu_tce_direction); void iommu_flush_tce(struct iommu_table *tbl) { @@ -972,73 +971,117 @@ int iommu_tce_put_param_check(struct iommu_table *tbl, } EXPORT_SYMBOL_GPL(iommu_tce_put_param_check); -unsigned long iommu_clear_tce(struct iommu_table *tbl, unsigned long entry) -{ - unsigned long oldtce; - struct iommu_pool *pool = get_pool(tbl, entry); - - spin_lock(&(pool->lock)); - - oldtce = ppc_md.tce_get(tbl, entry); - if (oldtce & (TCE_PCI_WRITE | TCE_PCI_READ)) - ppc_md.tce_free(tbl, entry, 1); - else - oldtce = 0; - - spin_unlock(&(pool->lock)); - - return oldtce; -} -EXPORT_SYMBOL_GPL(iommu_clear_tce); - int iommu_clear_tces_and_put_pages(struct iommu_table *tbl, unsigned long entry, unsigned long pages) { - unsigned long oldtce; - struct page *page; - - for ( ; pages; --pages, ++entry) { - oldtce = iommu_clear_tce(tbl, entry); - if (!oldtce) - continue; - - page = pfn_to_page(oldtce >> PAGE_SHIFT); - WARN_ON(!page); - if (page) { - if (oldtce & TCE_PCI_WRITE) - SetPageDirty(page); - put_page(page); - } - } - - return 
0; + return iommu_free_tces(tbl, entry, pages, false); } EXPORT_SYMBOL_GPL(iommu_clear_tces_and_put_pages); -/* - * hwaddr is a kernel virtual address here (0xc... bazillion), - * tce_build converts it to a physical address. - */ +int iommu_free_tces(struct iommu_table *tbl, unsigned long entry, + unsigned long npages, bool rm) +{ + int i, ret = 0, to_free = 0; + + if (rm && !ppc_md.tce_free_rm) + return
[PATCH 03/10] KVM: PPC: reserve a capability and ioctl numbers for realmode VFIO
This is to reserve a capability number for upcoming support of VFIO-IOMMU DMA operations in real mode. Signed-off-by: Alexey Kardashevskiy --- Changes: 2013/07/16: * changed the number 2013/07/11: * changed order in a file, added comment about a gap in ioctl number Signed-off-by: Alexey Kardashevskiy --- include/uapi/linux/kvm.h | 4 1 file changed, 4 insertions(+) diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index 99c2533..53c3f1f 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -668,6 +668,7 @@ struct kvm_ppc_smmu_info { #define KVM_CAP_IRQ_XICS 92 #define KVM_CAP_ARM_EL1_32BIT 93 #define KVM_CAP_SPAPR_MULTITCE 94 +#define KVM_CAP_SPAPR_TCE_IOMMU 95 #ifdef KVM_CAP_IRQ_ROUTING @@ -933,6 +934,9 @@ struct kvm_s390_ucas_mapping { #define KVM_ARM_SET_DEVICE_ADDR _IOW(KVMIO, 0xab, struct kvm_arm_device_addr) /* Available with KVM_CAP_PPC_RTAS */ #define KVM_PPC_RTAS_DEFINE_TOKEN _IOW(KVMIO, 0xac, struct kvm_rtas_token_args) +/* 0xad and 0xaf are already taken */ +/* Available with KVM_CAP_SPAPR_TCE_IOMMU */ +#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO, 0xaf, struct kvm_create_spapr_tce_iommu) /* ioctl for vm fd */ #define KVM_CREATE_DEVICE_IOWR(KVMIO, 0xe0, struct kvm_create_device) -- 1.8.3.2
[PATCH 10/10] KVM: PPC: Add hugepage support for IOMMU in-kernel handling
This adds special support for huge pages (16MB) in real mode. The reference counting cannot be easily done for such pages in real mode (when the MMU is off), so we added a hash table of huge pages. It is populated in virtual mode and get_page is called just once per huge page. Real mode handlers check if the requested page is in the hash table; if so, no reference counting is done, otherwise an exit to virtual mode happens. The hash table is released at KVM exit. At the moment the fastest card available for tests uses up to 9 huge pages so walking through this hash table does not cost much. However this can change and we may want to optimize this. Signed-off-by: Paul Mackerras Signed-off-by: Alexey Kardashevskiy --- Changes: 2013/07/12: * removed multiple #ifdef IOMMU_API as IOMMU_API is always enabled for KVM_BOOK3S_64 2013/06/27: * list of huge pages replaced with hashtable for better performance * spinlock removed from real mode and only protects insertion of new huge page descriptors into the hashtable 2013/06/05: * fixed compile error when CONFIG_IOMMU_API=n 2013/05/20: * the real mode handler now searches for a huge page by gpa (used to be pte) * the virtual mode handler prints a warning if it is called twice for the same huge page as the real mode handler is expected to fail just once - when a huge page is not in the list yet. * the huge page is refcounted twice - when added to the hugepage list and when used in the virtual mode hcall handler (can be optimized but it will make the patch less nice). 
Conflicts: arch/powerpc/kernel/iommu.c Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/include/asm/kvm_host.h | 25 arch/powerpc/kernel/iommu.c | 6 +- arch/powerpc/kvm/book3s_64_vio.c| 121 ++-- arch/powerpc/kvm/book3s_64_vio_hv.c | 32 +- 4 files changed, 175 insertions(+), 9 deletions(-) diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index 4eeaf7d..c57b25a 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -31,6 +31,7 @@ #include #include #include +#include #include #include #include @@ -183,9 +184,33 @@ struct kvmppc_spapr_tce_table { u32 window_size; struct iommu_group *grp;/* used for IOMMU groups */ struct vfio_group *vfio_grp;/* used for IOMMU groups */ + DECLARE_HASHTABLE(hash_tab, ilog2(64)); /* used for IOMMU groups */ + spinlock_t hugepages_write_lock;/* used for IOMMU groups */ struct page *pages[0]; }; +/* + * The KVM guest can be backed with 16MB pages. + * In this case, we cannot do page counting from the real mode + * as the compound pages are used - they are linked in a list + * with pointers as virtual addresses which are inaccessible + * in real mode. + * + * The code below keeps a 16MB pages list and uses page struct + * in real mode if it is already locked in RAM and inserted into + * the list or switches to the virtual mode where it can be + * handled in a usual manner. 
+ */ +#define KVMPPC_SPAPR_HUGEPAGE_HASH(gpa)hash_32(gpa >> 24, 32) + +struct kvmppc_spapr_iommu_hugepage { + struct hlist_node hash_node; + unsigned long gpa; /* Guest physical address */ + unsigned long hpa; /* Host physical address */ + struct page *page; /* page struct of the very first subpage */ + unsigned long size; /* Huge page size (always 16MB at the moment) */ +}; + struct kvmppc_linear_info { void*base_virt; unsigned longbase_pfn; diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index 8314c80..e4a8135 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -999,7 +999,8 @@ int iommu_free_tces(struct iommu_table *tbl, unsigned long entry, if (!pg) { ret = -EAGAIN; } else if (PageCompound(pg)) { - ret = -EAGAIN; + /* Hugepages will be released at KVM exit */ + ret = 0; } else { if (oldtce & TCE_PCI_WRITE) SetPageDirty(pg); @@ -1010,6 +1011,9 @@ int iommu_free_tces(struct iommu_table *tbl, unsigned long entry, struct page *pg = pfn_to_page(oldtce >> PAGE_SHIFT); if (!pg) { ret = -EAGAIN; + } else if (PageCompound(pg)) { + /* Hugepages will be released at KVM exit */ + ret = 0; } else { if (oldtce & TCE_PCI_WRITE) SetPageDirty(pg); diff --git
[PATCH 07/10] KVM: PPC: Add support for multiple-TCE hcalls
This adds real mode handlers for the H_PUT_TCE_INDIRECT and H_STUFF_TCE hypercalls for user space emulated devices such as IBMVIO devices or emulated PCI. These calls allow adding multiple entries (up to 512) into the TCE table in one call which saves time on transition to/from real mode. This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs (copied from user and verified) before writing the whole list into the TCE table. This cache will be utilized more in the upcoming VFIO/IOMMU support to continue TCE list processing in the virtual mode in the case if the real mode handler failed for some reason. This adds a function to convert a guest physical address to a host virtual address in order to parse a TCE list from H_PUT_TCE_INDIRECT. This also implements the KVM_CAP_PPC_MULTITCE capability. When present, the hypercalls mentioned above may or may not be processed successfully in the kernel based fast path. If they can not be handled by the kernel, they will get passed on to user space. So user space still has to have an implementation for these despite the in kernel acceleration. 
Signed-off-by: Paul Mackerras Signed-off-by: Alexey Kardashevskiy --- Changelog: 2013/08/01 (v7): * realmode_get_page/realmode_put_page use was replaced with get_page_unless_zero/put_page_unless_one 2013/07/11: * addressed many, many comments from maintainers 2013/07/06: * fixed number of wrong get_page()/put_page() calls 2013/06/27: * fixed clear of BUSY bit in kvmppc_lookup_pte() * H_PUT_TCE_INDIRECT does realmode_get_page() now * KVM_CAP_SPAPR_MULTITCE now depends on CONFIG_PPC_BOOK3S_64 * updated doc 2013/06/05: * fixed mistype about IBMVIO in the commit message * updated doc and moved it to another section * changed capability number 2013/05/21: * added kvm_vcpu_arch::tce_tmp * removed cleanup if put_indirect failed, instead we do not even start writing to TCE table if we cannot get TCEs from the user and they are invalid * kvmppc_emulated_h_put_tce is split to kvmppc_emulated_put_tce and kvmppc_emulated_validate_tce (for the previous item) * fixed bug with failthrough for H_IPI * removed all get_user() from real mode handlers * kvmppc_lookup_pte() added (instead of making lookup_linux_pte public) Signed-off-by: Alexey Kardashevskiy --- Documentation/virtual/kvm/api.txt | 26 arch/powerpc/include/asm/kvm_host.h | 9 ++ arch/powerpc/include/asm/kvm_ppc.h | 16 +- arch/powerpc/kvm/book3s_64_vio.c| 132 +++- arch/powerpc/kvm/book3s_64_vio_hv.c | 267 arch/powerpc/kvm/book3s_hv.c| 41 - arch/powerpc/kvm/book3s_hv_rmhandlers.S | 8 +- arch/powerpc/kvm/book3s_pr_papr.c | 35 + arch/powerpc/kvm/powerpc.c | 3 + 9 files changed, 503 insertions(+), 34 deletions(-) diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index ef925ea..1c8942a 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -2382,6 +2382,32 @@ calls by the guest for that service will be passed to userspace to be handled. 
+4.86 KVM_CAP_PPC_MULTITCE + +Capability: KVM_CAP_PPC_MULTITCE +Architectures: ppc +Type: vm + +This capability means the kernel is capable of handling hypercalls +H_PUT_TCE_INDIRECT and H_STUFF_TCE without passing those into the user +space. This significantly accelerates DMA operations for PPC KVM guests. +User space should expect that its handlers for these hypercalls +are not going to be called if user space previously registered LIOBN +in KVM (via KVM_CREATE_SPAPR_TCE or similar calls). + +In order to enable H_PUT_TCE_INDIRECT and H_STUFF_TCE use in the guest, +user space might have to advertise it for the guest. For example, +IBM pSeries (sPAPR) guest starts using them if "hcall-multi-tce" is +present in the "ibm,hypertas-functions" device-tree property. + +The hypercalls mentioned above may or may not be processed successfully +in the kernel based fast path. If they can not be handled by the kernel, +they will get passed on to user space. So user space still has to have +an implementation for these despite the in kernel acceleration. + +This capability is always enabled. + + 5. The kvm_run structure diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index af326cd..b8fe3de 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -30,6 +30,7 @@ #include #include #include +#include #include #include #include @@ -609,6 +610,14 @@ struct kvm_vcpu_arch { spinlock_t tbacct_lock; u64 busy_stolen; u64 busy_preempt; + + unsigned long *tce_tmp_hpas;/* TCE cache for TCE_PUT_INDIRECT hcall */ + enum { + TCERM_NONE, + TCERM_GETPAGE, + TCERM_PUTTCE, + TCERM_PUTLIST, + } tce_rm_fail; /* failed stage of
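The two-phase handling the commit message describes — copy the guest's TCE list into a per-vcpu scratch buffer, validate every entry, and only then write the whole list into the TCE table — can be sketched in plain userspace C. All names below (tce_validate, h_put_tce_indirect's signature, the permission-bit layout) are invented stand-ins for illustration, not the kernel's:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TCE_PAGE_SHIFT 12
#define TCE_READ  1UL
#define TCE_WRITE 2UL
#define MAX_TCES  512           /* H_PUT_TCE_INDIRECT caps the list at 512 */

struct tce_table {
    uint64_t entries[1024];
    size_t   size;
};

/* A TCE is valid if nothing but the permission bits is set below the RPN. */
static int tce_validate(uint64_t tce)
{
    uint64_t mask = (1UL << TCE_PAGE_SHIFT) - 1;
    return (tce & mask & ~(TCE_READ | TCE_WRITE)) == 0;
}

/* Returns 0 on success; on any invalid entry nothing is written, mirroring
 * "we do not even start writing to TCE table if ... they are invalid". */
static int h_put_tce_indirect(struct tce_table *tbl, size_t ioba_index,
                              const uint64_t *guest_list, size_t npages,
                              uint64_t *tce_tmp /* per-vcpu scratch cache */)
{
    size_t i;

    if (npages > MAX_TCES || ioba_index + npages > tbl->size)
        return -1;

    /* Phase 1: copy from the guest list and validate into the cache. */
    for (i = 0; i < npages; i++) {
        if (!tce_validate(guest_list[i]))
            return -1;
        tce_tmp[i] = guest_list[i];
    }

    /* Phase 2: commit the whole verified list in one pass. */
    memcpy(&tbl->entries[ioba_index], tce_tmp, npages * sizeof(*tce_tmp));
    return 0;
}
```

The point of the split is that a failure in phase 1 leaves the table untouched, so the hypercall can be retried or completed in virtual mode without cleanup.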
[PATCH 04/10] powerpc: Prepare to support kernel handling of IOMMU map/unmap
The current VFIO-on-POWER implementation supports only user mode driven mapping, i.e. QEMU is sending requests to map/unmap pages. However this approach is really slow, so we want to move that to KVM. Since H_PUT_TCE can be extremely performance sensitive (especially with network adapters where each packet needs to be mapped/unmapped) we chose to implement that as a "fast" hypercall directly in "real mode" (processor still in the guest context but MMU off).

To be able to do that, we need to provide some facilities to access the struct page count within that real mode environment, as things like the sparsemem vmemmap mappings aren't accessible.

This adds an API function realmode_pfn_to_page() to get the page struct when the MMU is off.

This adds to MM a new function put_page_unless_one() which drops a page if its counter is bigger than 1. It is going to be used when the MMU is off (for example, real mode on PPC64) and we want to make sure that the page release will not happen in real mode, as it may crash the kernel in a horrible way.

CONFIG_SPARSEMEM_VMEMMAP and CONFIG_FLATMEM are supported.

Cc: linux...@kvack.org
Cc: Benjamin Herrenschmidt
Cc: Andrew Morton
Reviewed-by: Paul Mackerras
Signed-off-by: Paul Mackerras
Signed-off-by: Alexey Kardashevskiy
---
Changes:
2013/07/25 (v7):
* removed realmode_put_page and added put_page_unless_one() instead. The name has been chosen to conform to the already existing get_page_unless_zero().
* removed realmode_get_page. Instead, get_page_unless_zero() should be used

2013/07/10:
* adjusted comment (removed sentence about virtual mode)
* get_page_unless_zero replaced with atomic_inc_not_zero to minimize the effect of a possible get_page_unless_zero() rework (if it ever happens)

2013/06/27:
* realmode_get_page() fixed to use get_page_unless_zero(). If failed, the call will be passed from real to virtual mode and safely handled.
* added comment to PageCompound() in include/linux/page-flags.h.
2013/05/20: * PageTail() is replaced by PageCompound() in order to have the same checks for whether the page is huge in realmode_get_page() and realmode_put_page() Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/include/asm/pgtable-ppc64.h | 2 ++ arch/powerpc/mm/init_64.c| 50 +++- include/linux/mm.h | 14 + include/linux/page-flags.h | 4 ++- 4 files changed, 68 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h index 46db094..4a191c4 100644 --- a/arch/powerpc/include/asm/pgtable-ppc64.h +++ b/arch/powerpc/include/asm/pgtable-ppc64.h @@ -394,6 +394,8 @@ static inline void mark_hpte_slot_valid(unsigned char *hpte_slot_array, hpte_slot_array[index] = hidx << 4 | 0x1 << 3; } +struct page *realmode_pfn_to_page(unsigned long pfn); + static inline char *get_hpte_slot_array(pmd_t *pmdp) { /* diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c index d0cd9e4..8cf345a 100644 --- a/arch/powerpc/mm/init_64.c +++ b/arch/powerpc/mm/init_64.c @@ -300,5 +300,53 @@ void vmemmap_free(unsigned long start, unsigned long end) { } -#endif /* CONFIG_SPARSEMEM_VMEMMAP */ +/* + * We do not have access to the sparsemem vmemmap, so we fallback to + * walking the list of sparsemem blocks which we already maintain for + * the sake of crashdump. In the long run, we might want to maintain + * a tree if performance of that linear walk becomes a problem. + * + * realmode_pfn_to_page functions can fail due to: + * 1) As real sparsemem blocks do not lay in RAM continously (they + * are in virtual address space which is not available in the real mode), + * the requested page struct can be split between blocks so get_page/put_page + * may fail. + * 2) When huge pages are used, the get_page/put_page API will fail + * in real mode as the linked addresses in the page struct are virtual + * too. 
+ */ +struct page *realmode_pfn_to_page(unsigned long pfn) +{ + struct vmemmap_backing *vmem_back; + struct page *page; + unsigned long page_size = 1 << mmu_psize_defs[mmu_vmemmap_psize].shift; + unsigned long pg_va = (unsigned long) pfn_to_page(pfn); + for (vmem_back = vmemmap_list; vmem_back; vmem_back = vmem_back->list) { + if (pg_va < vmem_back->virt_addr) + continue; + + /* Check that page struct is not split between real pages */ + if ((pg_va + sizeof(struct page)) > + (vmem_back->virt_addr + page_size)) + return NULL; + + page = (struct page *) (vmem_back->phys + pg_va - + vmem_back->virt_addr); + return page; + } + + return NULL; +} +EXPORT_SYMBOL_GPL(realmode_pfn_to_page); + +#elif defined(CONFIG_FLATMEM) + +struct page *realmode_pfn_to_page(unsigned long pfn) +{ + struct page *page =
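The linear walk in realmode_pfn_to_page() above can be modelled in userspace. The block size, the 64-byte "page struct" size, and the names below are invented for illustration; an explicit upper-bound check is also added here, which the kernel version gets implicitly from the ordering of vmemmap_list:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE       0x10000UL  /* one vmemmap backing page (assumed) */
#define PAGE_STRUCT_SIZE 64UL       /* stand-in for sizeof(struct page) */

/* Analogue of struct vmemmap_backing: one backing block per node. */
struct vmem_block {
    uintptr_t virt_addr;            /* start of block in vmemmap space */
    uintptr_t phys;                 /* corresponding real-mode address */
    struct vmem_block *next;
};

/* Returns the real-mode address of the page struct, or 0 if it cannot be
 * resolved: either no block covers it, or the struct is split between two
 * real pages (failure case 1 in the comment above). */
static uintptr_t realmode_lookup(const struct vmem_block *list, uintptr_t pg_va)
{
    for (; list; list = list->next) {
        if (pg_va < list->virt_addr)
            continue;
        if (pg_va >= list->virt_addr + BLOCK_SIZE)
            continue;               /* belongs to a later block */
        /* Check that the page struct is not split between real pages. */
        if (pg_va + PAGE_STRUCT_SIZE > list->virt_addr + BLOCK_SIZE)
            return 0;
        return list->phys + (pg_va - list->virt_addr);
    }
    return 0;
}
```

As the patch comment notes, the walk is O(number of blocks); a tree would only be worth it if that linear scan ever showed up in profiles.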
[PATCH 06/10] KVM: PPC: enable IOMMU_API for KVM_BOOK3S_64 permanently
It does not make much sense to have KVM on book3s 64-bit and not have the IOMMU bits for PCI pass-through support, as it costs little and allows VFIO to function on book3s KVM.

Having IOMMU_API always enabled makes it unnecessary to have a lot of "#ifdef IOMMU_API" in arch/powerpc/kvm/book3s_64_vio*. With those ifdefs we could have only user space emulated devices accelerated (but not VFIO), which does not seem to be very useful.

Signed-off-by: Alexey Kardashevskiy
---
 arch/powerpc/kvm/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index c55c538..3b2b761 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -59,6 +59,7 @@ config KVM_BOOK3S_64
 	depends on PPC_BOOK3S_64
 	select KVM_BOOK3S_64_HANDLER
 	select KVM
+	select SPAPR_TCE_IOMMU
 	---help---
 	  Support running unmodified book3s_64 and book3s_32 guest kernels
 	  in virtual machines on book3s_64 host processors.
--
1.8.3.2
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
[PATCH 01/10] hashtable: add hash_for_each_possible_rcu_notrace()
This adds hash_for_each_possible_rcu_notrace(), which is basically a notrace clone of hash_for_each_possible_rcu(), which cannot be used in real mode due to its tracing/debugging capability.

Signed-off-by: Alexey Kardashevskiy
---
 include/linux/hashtable.h | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/include/linux/hashtable.h b/include/linux/hashtable.h
index a9df51f..af8b169 100644
--- a/include/linux/hashtable.h
+++ b/include/linux/hashtable.h
@@ -174,6 +174,21 @@ static inline void hash_del_rcu(struct hlist_node *node)
 			member)
 
 /**
+ * hash_for_each_possible_rcu_notrace - iterate over all possible objects hashing
+ * to the same bucket in an rcu enabled hashtable
+ * @name: hashtable to iterate
+ * @obj: the type * to use as a loop cursor for each entry
+ * @member: the name of the hlist_node within the struct
+ * @key: the key of the objects to iterate over
+ *
+ * This is the same as hash_for_each_possible_rcu() except that it does
+ * not do any RCU debugging or tracing.
+ */
+#define hash_for_each_possible_rcu_notrace(name, obj, member, key)	\
+	hlist_for_each_entry_rcu_notrace(obj,				\
+		&name[hash_min(key, HASH_BITS(name))], member)
+
+/**
  * hash_for_each_possible_safe - iterate over all possible objects hashing to the
  * same bucket safe against removals
  * @name: hashtable to iterate
--
1.8.3.2
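For readers unfamiliar with the macro family, the iteration pattern that hash_for_each_possible*() provides can be modelled in plain userspace C. The hash function, table layout, and all names below are invented stand-ins; the real macro additionally performs RCU-safe loads — the _notrace variant simply does so without the lockdep/tracing hooks that make the plain _rcu variant unusable in real mode:

```c
#include <assert.h>
#include <stddef.h>

#define HASH_BITS 4
#define HASH_SIZE (1u << HASH_BITS)

struct entry {
    unsigned long key;
    int value;
    struct entry *next;          /* bucket chain */
};

/* Toy stand-in for the kernel's hash_min(): multiplicative hash. */
static unsigned hash_min(unsigned long key)
{
    return (unsigned)(key * 2654435761u) >> (32 - HASH_BITS);
}

static void hash_add(struct entry **table, struct entry *e)
{
    unsigned b = hash_min(e->key);
    e->next = table[b];
    table[b] = e;
}

/* Analogue of hash_for_each_possible: walk only the key's bucket;
 * the key comparison is done by the caller, as with the macro. */
static struct entry *hash_find(struct entry **table, unsigned long key)
{
    struct entry *e;
    for (e = table[hash_min(key)]; e; e = e->next)
        if (e->key == key)
            return e;
    return NULL;
}
```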
[PATCH 09/10] KVM: PPC: Add support for IOMMU in-kernel handling
This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT and H_STUFF_TCE requests targeted at an IOMMU TCE table without passing them to user space, which saves time on switching to user space and back.

Both real and virtual modes are supported. The kernel tries to handle a TCE request in real mode; if that fails, it passes the request to virtual mode to complete the operation. If the virtual mode handler fails, the request is passed to user space.

The first user of this is VFIO on POWER. The external user API in VFIO is required for this patch. The patch adds a new KVM_CAP_SPAPR_TCE_IOMMU ioctl to associate a virtual PCI bus number (LIOBN) with a VFIO IOMMU group fd and enable in-kernel handling of map/unmap requests.

Tests show that this patch increases transmission speed from 220MB/s to 750..1020MB/s on a 10Gb network (Chelsio CXGB3 10Gb ethernet card).

Signed-off-by: Paul Mackerras
Signed-off-by: Alexey Kardashevskiy
---
Changes:
2013/07/11:
* removed multiple #ifdef IOMMU_API as IOMMU_API is always enabled for KVM_BOOK3S_64
* kvmppc_gpa_to_hva_and_get also returns the host phys address. Not much sense for this here, but the next patch for hugepages support will use it more.

2013/07/06:
* added realmode arch_spin_lock to protect the TCE table from races in real and virtual modes
* POWERPC IOMMU API is changed to support real mode
* iommu_take_ownership and iommu_release_ownership are protected by iommu_table's locks
* VFIO external user API use rewritten
* multiple small fixes

2013/06/27:
* tce_list page is referenced now in order to protect it from accidental invalidation during H_PUT_TCE_INDIRECT execution
* added use of the external user VFIO API

2013/06/05:
* changed capability number
* changed ioctl number
* update the doc article number

2013/05/20:
* removed get_user() from real mode handlers
* kvm_vcpu_arch::tce_tmp usage extended.
Now the real mode handler puts translated TCEs there, tries realmode_get_page() on them and, if that fails, passes control to the virtual mode handler, which tries to finish the request handling
* kvmppc_lookup_pte() now does realmode_get_page() protected by the BUSY bit on a page
* The only reason to pass the request to user mode now is when user mode did not register a TCE table in the kernel; in all other cases the virtual mode handler is expected to do the job

Signed-off-by: Alexey Kardashevskiy
---
 Documentation/virtual/kvm/api.txt   |  26
 arch/powerpc/include/asm/kvm_host.h |   3 +
 arch/powerpc/include/asm/kvm_ppc.h  |   2 +
 arch/powerpc/include/uapi/asm/kvm.h |   7 +
 arch/powerpc/kvm/book3s_64_vio.c    | 296 +++-
 arch/powerpc/kvm/book3s_64_vio_hv.c | 122 +++
 arch/powerpc/kvm/powerpc.c          |  12 ++
 7 files changed, 463 insertions(+), 5 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 1c8942a..6ae65bd 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2408,6 +2408,32 @@ an implementation for these despite the in kernel acceleration.
 
 This capability is always enabled.
 
+4.87 KVM_CREATE_SPAPR_TCE_IOMMU
+
+Capability: KVM_CAP_SPAPR_TCE_IOMMU
+Architectures: powerpc
+Type: vm ioctl
+Parameters: struct kvm_create_spapr_tce_iommu (in)
+Returns: 0 on success, -1 on error
+
+struct kvm_create_spapr_tce_iommu {
+	__u64 liobn;
+	__u32 fd;
+	__u32 flags;
+};
+
+This creates a link between an IOMMU group and a hardware TCE (translation
+control entry) table. This link lets the host kernel know which IOMMU
+group (i.e. TCE table) to use for the LIOBN number passed with the
+H_PUT_TCE, H_PUT_TCE_INDIRECT, H_STUFF_TCE hypercalls.
+
+User space passes a VFIO group fd. Using the external user VFIO API,
+KVM tries to get the IOMMU id from the passed fd. If it succeeds,
+acceleration turns on. If it fails, map/unmap requests are passed to
+user space.
+
+No flag is supported at the moment.
+
+
 5.
The kvm_run structure diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index b8fe3de..4eeaf7d 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -181,6 +181,8 @@ struct kvmppc_spapr_tce_table { struct kvm *kvm; u64 liobn; u32 window_size; + struct iommu_group *grp;/* used for IOMMU groups */ + struct vfio_group *vfio_grp;/* used for IOMMU groups */ struct page *pages[0]; }; @@ -612,6 +614,7 @@ struct kvm_vcpu_arch { u64 busy_preempt; unsigned long *tce_tmp_hpas;/* TCE cache for TCE_PUT_INDIRECT hcall */ + unsigned long tce_tmp_num; /* Number of handled TCEs in the cache */ enum { TCERM_NONE, TCERM_GETPAGE, diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h index 0ce4691..297cab5 100644 --- a/arch/powerpc/include/asm/kvm_ppc.h +++
[PATCH 00/10 v7] KVM: PPC: IOMMU in-kernel handling
This accelerates VFIO DMA operations on POWER by moving them into the kernel.

The changes in this series are:
1. rebased on v3.11-rc3.
2. The VFIO external user API will go through the VFIO tree, so it is excluded from this series.
3. As nobody ever reacted to "hashtable: add hash_for_each_possible_rcu_notrace()", Ben suggested pushing it via his tree, so I included it in the series.
4. realmode_(get|put)_page is reworked. More details in the individual patch comments.

Alexey Kardashevskiy (10):
  hashtable: add hash_for_each_possible_rcu_notrace()
  KVM: PPC: reserve a capability number for multitce support
  KVM: PPC: reserve a capability and ioctl numbers for realmode VFIO
  powerpc: Prepare to support kernel handling of IOMMU map/unmap
  powerpc: add real mode support for dma operations on powernv
  KVM: PPC: enable IOMMU_API for KVM_BOOK3S_64 permanently
  KVM: PPC: Add support for multiple-TCE hcalls
  powerpc/iommu: rework to support realmode
  KVM: PPC: Add support for IOMMU in-kernel handling
  KVM: PPC: Add hugepage support for IOMMU in-kernel handling

 Documentation/virtual/kvm/api.txt         |  52 +++
 arch/powerpc/include/asm/iommu.h          |   9 +-
 arch/powerpc/include/asm/kvm_host.h       |  37 +++
 arch/powerpc/include/asm/kvm_ppc.h        |  18 +-
 arch/powerpc/include/asm/machdep.h        |  12 +
 arch/powerpc/include/asm/pgtable-ppc64.h  |   2 +
 arch/powerpc/include/uapi/asm/kvm.h       |   7 +
 arch/powerpc/kernel/iommu.c               | 202 +++
 arch/powerpc/kvm/Kconfig                  |   1 +
 arch/powerpc/kvm/book3s_64_vio.c          | 533 +-
 arch/powerpc/kvm/book3s_64_vio_hv.c       | 405 +--
 arch/powerpc/kvm/book3s_hv.c              |  41 ++-
 arch/powerpc/kvm/book3s_hv_rmhandlers.S   |   8 +-
 arch/powerpc/kvm/book3s_pr_papr.c         |  35 ++
 arch/powerpc/kvm/powerpc.c                |  15 +
 arch/powerpc/mm/init_64.c                 |  50 ++-
 arch/powerpc/platforms/powernv/pci-ioda.c |  47 ++-
 arch/powerpc/platforms/powernv/pci.c      |  38 ++-
 arch/powerpc/platforms/powernv/pci.h      |   3 +-
 include/linux/hashtable.h                 |  15 +
 include/linux/mm.h                        |  14 +
 include/linux/page-flags.h                |   4 +-
 include/uapi/linux/kvm.h                  |   5 +
 23 files changed, 1430 insertions(+),
123 deletions(-)

--
1.8.3.2
Re: [PATCH 4/9] dma: edma: Find missed events and issue them
On 07/31/2013 09:27 PM, Joel Fernandes wrote: > On 07/31/2013 04:18 AM, Sekhar Nori wrote: >> On Wednesday 31 July 2013 10:19 AM, Joel Fernandes wrote: >>> Hi Sekhar, >>> >>> On 07/30/2013 02:05 AM, Sekhar Nori wrote: On Monday 29 July 2013 06:59 PM, Joel Fernandes wrote: > In an effort to move to using Scatter gather lists of any size with > EDMA as discussed at [1] instead of placing limitations on the driver, > we work through the limitations of the EDMAC hardware to find missed > events and issue them. > > The sequence of events that require this are: > > For the scenario where MAX slots for an EDMA channel is 3: > > SG1 -> SG2 -> SG3 -> SG4 -> SG5 -> SG6 -> Null > > The above SG list will have to be DMA'd in 2 sets: > > (1) SG1 -> SG2 -> SG3 -> Null > (2) SG4 -> SG5 -> SG6 -> Null > > After (1) is succesfully transferred, the events from the MMC controller > donot stop coming and are missed by the time we have setup the transfer > for (2). So here, we catch the events missed as an error condition and > issue them manually. Are you sure there wont be any effect of these missed events on the peripheral side. For example, wont McASP get into an underrun condition when it encounters a null PaRAM set? Even UART has to transmit to a >>> >>> But it will not encounter null PaRAM set because McASP uses contiguous >>> buffers for transfer which are not scattered across physical memory. >>> This can be accomplished with an SG of size 1. For such SGs, this patch >>> series leaves it linked Dummy and does not link to Null set. Null set is >>> only used for SG lists that are > MAX_NR_SG in size such as those >>> created for example by MMC and Crypto. >>> particular baud so I guess it cannot wait like the way MMC/SD can. >>> >>> Existing driver have to wait anyway if they hit MAX SG limit today. 
If >>> they don't want to wait, they would have allocated a contiguous block of >>> memory and DMA that in one stretch so they don't lose any events, and in >>> such cases we are not linking to Null. >> >> As long as DMA driver can advertize its MAX SG limit, peripherals can >> always work around that by limiting the number of sync events they >> generate so as to not having any of the events getting missed. With this >> series, I am worried that EDMA drivers is advertizing that it can handle >> any length SG list while not taking care of missing any events while >> doing so. This will break the assumptions that driver writers make. > > This is already being done by some other DMA engine drivers ;). We can > advertise more than we can handle at a time, that's the basis of this > whole idea. > > I understand what you're saying but events are not something that have > be serviced immediately, they can be queued etc and the actually > transfer from the DMA controller can be delayed. As long as we don't > miss the event we are fine which my series takes care off. > > So far I have tested this series on following modules in various > configurations and have seen no issues: > - Crypto AES > - MMC/SD > - SPI (128x160 display) > Also, wont this lead to under-utilization of the peripheral bandwith? Meaning, MMC/SD is ready with data but cannot transfer because the DMA is waiting to be set-up. >>> >>> But it is waiting anyway even today. Currently based on MAX segs, MMC >>> driver/subsystem will make SG list of size max_segs. Between these >>> sessions of creating such smaller SG-lists, if for some reason the MMC >>> controller is sending events, these will be lost anyway. >> >> But if MMC/SD driver knows how many events it should generate if it >> knows the MAX SG limit. So there should not be any missed events in >> current code. And I am not claiming that your solution is making matters >> worse. But its not making it much better as well. 
> > This is not true for crypto, the events are not deasserted and crypto > continues to send events. This is what led to the "don't trigger in > Null" patch where I'm setting the missed flag to avoid recursion. > >>> This can be used only for buffers that are contiguous in memory, not >>> those that are scattered across memory. >> >> I was hinting at using the linking facility of EDMA to achieve this. >> Each PaRAM set has full 32-bit source and destination pointers so I see >> no reason why non-contiguous case cannot be handled. >> >> Lets say you need to transfer SG[0..6] on channel C. Now, PaRAM sets are >> typically 4 times the number of channels. In this case we use one DMA >> PaRAM set and two Link PaRAM sets per channel. P0 is the DMA PaRAM set >> and P1 and P2 are the Link sets. >> >> Initial setup: >> >> SG0 -> SG1 -> SG2 -> SG3 -> SG4 -> SG5 -> SG6 -> NULL >> ^ ^ ^ >> | | | >> P0 -> P1 -> P2 -> NULL >> >> P[0..2].TCINTEN = 1, so get an interrupt after each SG element >> completion. On each completion interrupt, hardware
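Sekhar's linked-PaRAM recycling scheme (one DMA set plus link sets, each reloaded from its completion interrupt with the next pending SG element) can be modelled with a toy simulation. The slot bookkeeping below is invented for illustration — real EDMA reloads hardware PaRAM registers, not an array, and completions arrive as interrupts rather than loop iterations:

```c
#include <assert.h>

#define NSLOTS 3   /* e.g. one DMA PaRAM set P0 plus two link sets P1, P2 */

/* Walks an SG list of `nsg` elements through NSLOTS recycled slots and
 * returns how many elements were transferred. */
static int edma_run(int nsg)
{
    int slot_sg[NSLOTS];         /* which SG element each slot holds */
    int next = 0, done = 0, i;

    for (i = 0; i < NSLOTS; i++)
        slot_sg[i] = -1;         /* -1 models a link to the NULL set */

    /* Initial setup: load the first min(nsg, NSLOTS) elements. */
    for (i = 0; i < NSLOTS && next < nsg; i++)
        slot_sg[i] = next++;

    /* Each pass models one per-slot completion interrupt (TCINTEN=1). */
    for (i = 0; done < nsg; i = (i + 1) % NSLOTS) {
        if (slot_sg[i] < 0)
            continue;            /* NULL link: nothing was pending here */
        done++;                  /* slot i completed element slot_sg[i] */
        /* Recycle: reload with the next element, or NULL-link the slot. */
        slot_sg[i] = (next < nsg) ? next++ : -1;
    }
    return done;
}
```

The key property being argued about in the thread is visible here: an SG list longer than the number of slots still completes, because each completion hands its slot to the next pending element instead of terminating the chain.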
Re: [patch 3/3] mm: page_alloc: fair zone allocator policy
On 07/31/2013 10:56 PM, Minchan Kim wrote:

> Yes, it's not really a slow path because it could return to normal status
> without calling significant slow functions, by resetting the batch count in
> prepare_slowpath. I think it's a tradeoff and I am biased toward your
> approach, although we would lose a little performance, because fair aging
> would recover the loss from the fastpath's overhead. But who knows?
> Someone may have a concern. So we should mention such problems.

If the atomic operation in the fast path turns out to be a problem, I suspect we may be able to fix it by using per-cpu counters, and consolidating those every once in a while.

However, it may be good to see whether there is a problem in the first place, before adding complexity.

--
All rights reversed
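The per-cpu-counter idea mentioned above can be sketched as follows. NR_CPUS, the fold threshold, and all names are invented for illustration; the kernel would use this_cpu operations rather than an explicit CPU index, and the fold point would be tuned, not fixed:

```c
#include <assert.h>

#define NR_CPUS        4
#define FOLD_THRESHOLD 8

struct zone_batch {
    long percpu[NR_CPUS];        /* fast path: no atomic ops */
    long global;                 /* consolidated every once in a while */
};

/* Fast-path decrement: purely local until the local delta grows large,
 * then fold it into the shared counter in one step. */
static void batch_dec(struct zone_batch *z, int cpu)
{
    if (--z->percpu[cpu] <= -FOLD_THRESHOLD) {
        z->global += z->percpu[cpu];
        z->percpu[cpu] = 0;
    }
}

/* Slow-path read: global value plus outstanding per-cpu deltas. */
static long batch_read(const struct zone_batch *z)
{
    long v = z->global;
    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        v += z->percpu[cpu];
    return v;
}
```

The tradeoff is exactly the one discussed: the fast path touches only a cacheline-local counter, at the cost of the shared value lagging by at most NR_CPUS * FOLD_THRESHOLD.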
Re: [PATCH 1/3] fs: Introduce new flag FALLOC_FL_COLLAPSE_RANGE
On Thu, Aug 01, 2013 at 12:06:05AM -0400, Theodore Ts'o wrote: > On Thu, Aug 01, 2013 at 12:59:14PM +1000, Dave Chinner wrote: > > > > This funtionality is not introducing any new problems w.r.t. mmap(). > > In terms of truncating a mmap'd file, that can already occur and > > the behaviour is well known. > > That's not race I'm worried about. Consider the following scenario. > We have a 10 meg file, and the first five megs are mapped read/write. > We do a collapse range of one meg starting at the one meg boundary. > Menwhile another process is constantly modifying the pages between the > 3 meg and 4 meg mark, via a mmap'ed region of memory. > > You can have the code which does the collapse range call > filemap_fdatawrite_range() to flush out to disk all of the inode's > dirty pages in the page cache, but between the call to > filemap_fdatawrite_range() and truncate_page_cache_range(), more pages > could have gotten dirtied, and those changes will get lost. That's one of the many race conditions that Jan's patches are intended to solve. Locking the range and then unmapping the pages will then cause the refault to block until the range is unlocked. While the range is locked, we can then writeback all the dirty pages, invalidate the page cache and modify the underlying extents without having to care about user page faults racing with the operation. Let's face it, without that we current cannot serialise *any* multi-page page cache operation against mmap() - direct IO has exactly the same problems as you are describing, as does hole punching, and anything else that relies on the i_mutex or other IO-based locks for serialisation. > The difference between punch_hole and collapse_range is that > previously, we only needed to call truncate_page_cache_range() on the > pages that were going to be disappearing from the page cache, so it > didn't matter if you had someone dirtying the pages racing with the > punch_hole operation. 
> But in the collapse_range case, we need to call
> truncate_page_cache_range() on data that is not going to disappear,
> but rather be _renumbered_.

Sure - as I said up front, "the range of the pages being operated on" needs to be invalidated. And, if you look at the ext4 patch, it actually does exactly that.

> I'll also note that depending on the use case, doing the renumbering
> of the pages by throwing all of the pages from the page cache and
> forcing them to be read back in from disk might not be at all friendly
> to a performance sensitive application.

Application problem. Besides, this is for accelerating video stream editing (i.e. NLE workflows). Such applications tend to use direct IO for recording and playback, and they most certainly won't have concurrent access to the video stream either by mmap or direct/buffered IO while the non-linear editor is punching the advertisements out of the recorded video stream.

> In the case where the page size == the fs block size, instead of
> throwing away all of the pages, we could also have VM code which
> remaps the pages in the page cache (including potentially editing the
> page tables for any mmap'ed pages).

That way lies insanity. Not to mention that it's completely unnecessary for the use case this ioctl is actually accelerating.

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com
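The ordering difference this thread argues about can be made concrete with a sequential toy model — everything below is invented for illustration, not VFS code. A store via mmap that lands between writeback and invalidation is silently dropped, while taking the range lock first (lock, unmap, write back, invalidate) holds the store off until the operation finishes:

```c
#include <assert.h>

enum { NPAGES = 8 };

struct file_model {
    int disk[NPAGES];            /* data last written back */
    int cache[NPAGES];           /* page cache contents */
    int dirty[NPAGES];
    int range_locked;            /* models Jan's range lock */
};

/* Models filemap_fdatawrite_range() over the whole file. */
static void writeback(struct file_model *f)
{
    for (int i = 0; i < NPAGES; i++)
        if (f->dirty[i]) {
            f->disk[i] = f->cache[i];
            f->dirty[i] = 0;
        }
}

/* A racing mmap store; with the range locked, the page fault would block,
 * modelled here as the store simply not landing. */
static int mmap_store(struct file_model *f, int page, int val)
{
    if (f->range_locked)
        return 0;
    f->cache[page] = val;
    f->dirty[page] = 1;
    return 1;
}

/* Unsafe ordering: write back, let the racer dirty a page, then drop
 * dirty state while invalidating the range -- the late store is lost. */
static void collapse_unlocked(struct file_model *f, int page, int val)
{
    writeback(f);
    mmap_store(f, page, val);    /* slips into the window */
    for (int i = 0; i < NPAGES; i++)
        f->dirty[i] = 0;         /* invalidate: dirty data dropped */
}

/* Range-locked ordering: the racer cannot store until we are done.
 * Returns whether the racing store landed (0 = race excluded). */
static int collapse_locked(struct file_model *f, int page, int val)
{
    f->range_locked = 1;
    int landed = mmap_store(f, page, val);
    writeback(f);
    f->range_locked = 0;
    return landed;
}
```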
Re: [PATCH 1/3] fs: Introduce new flag FALLOC_FL_COLLAPSE_RANGE
On Thu, Aug 01, 2013 at 12:59:14PM +1000, Dave Chinner wrote:

> This functionality is not introducing any new problems w.r.t. mmap().
> In terms of truncating a mmap'd file, that can already occur and
> the behaviour is well known.

That's not the race I'm worried about. Consider the following scenario. We have a 10 meg file, and the first five megs are mapped read/write. We do a collapse range of one meg starting at the one meg boundary. Meanwhile another process is constantly modifying the pages between the 3 meg and 4 meg mark, via a mmap'ed region of memory.

You can have the code which does the collapse range call filemap_fdatawrite_range() to flush out to disk all of the inode's dirty pages in the page cache, but between the call to filemap_fdatawrite_range() and truncate_page_cache_range(), more pages could have gotten dirtied, and those changes will get lost.

The difference between punch_hole and collapse_range is that previously, we only needed to call truncate_page_cache_range() on the pages that were going to be disappearing from the page cache, so it didn't matter if you had someone dirtying the pages racing with the punch_hole operation. But in the collapse_range case, we need to call truncate_page_cache_range() on data that is not going to disappear, but rather be _renumbered_.

I'll also note that depending on the use case, doing the renumbering of the pages by throwing all of the pages from the page cache and forcing them to be read back in from disk might not be at all friendly to a performance sensitive application.

In the case where the page size == the fs block size, instead of throwing away all of the pages, we could also have VM code which remaps the pages in the page cache (including potentially editing the page tables for any mmap'ed pages). This of course gets _much_ more complicated, and we still have to deal with the case where the collapse_range call is aligned with the fs block size, but which is not aligned or is not a multiple of the page size.
- Ted
Re: [PATCH v2 1/3] ALSA: Added jack detection kcontrol support
On 07/29/2013 07:19 AM, Takashi Iwai wrote:

> At Fri, 26 Jul 2013 15:45:11 -0700, Felipe F. Tonello wrote:
>> From: "Felipe F. Tonello"
>>
>> This patch adds jack support for the ALSA kcontrol. This support is
>> necessary since the new kcontrol is used by user-space daemons, such
>> as PulseAudio (>= 2.0), to do jack detection.
>>
>> Signed-off-by: Felipe F. Tonello
>
> This patch breaks the build.

I'm planning to send a v3 in a few days.

Felipe Tonello
Re: [PATCH 4/4] ALSA: oxygen: Updating jack implementation according new ALSA Jacks
Hi Takashi,

On 07/29/2013 05:10 AM, Takashi Iwai wrote:
> At Fri, 26 Jul 2013 12:02:51 -0700, Felipe Tonello wrote:
>> Hi Mark,
>>
>> On Fri, Jul 26, 2013 at 11:56 AM, Mark Brown wrote:
>>> On Fri, Jul 26, 2013 at 11:25:33AM -0700, Felipe F. Tonello wrote:
>>>> From: "Felipe F. Tonello"
>>>>
>>>> ALSA standard jacks are already implemented using an ALSA kcontrol,
>>>> so there is no need to implement that itself or to use snd_jack for
>>>> input events only.
>>>
>>> Similar changelog comment as on the ASoC patch and...
>>
>> Ok.
>>
>>>> 	snd_jack_new(chip->card, "Headphone",
>>>> -		SND_JACK_HEADPHONE, &chip->hp_jack);
>>>> +		SND_JACK_HEADPHONE, 0, &chip->hp_jack);
>>>> 	xonar_ds_handle_hp_jack(chip);
>>>
>>> ...this really ought to be done as part of the commit that adds the
>>> parameter since it breaks the build until this patch is applied.
>>
>> But that's why it is a patch series. But as you say, are you
>> suggesting I propose these changes in one patch only?
>
> The basic rule of a patch series is:
> - the patches are split in a logical manner
> - each commit must not break the build
>
> The second rule is important especially for bisection and backporting.
> You could have done this in a different way (e.g. adding a new function
> like snd_jack_new_with_index() instead of changing snd_jack_new()
> itself). Which way is better depends on the implementation details.

In this case I think this will not solve the issue, because my proposal is to make control creation standard for jacks.

Felipe Tonello
Re: Re: [RFC][PATCH 3/4] tracing/kprobes: Fail to unregister if probe event files are open
(2013/08/01 11:50), Steven Rostedt wrote: > Here's the new change log, but the same patch. Does this sound ok to you > guys? Great! This looks good for me. ;) Thank you very much! > > -- Steve > >>From 40c32592668b727cbfcf7b1c0567f581bd62a5e4 Mon Sep 17 00:00:00 2001 > From: "Steven Rostedt (Red Hat)" > Date: Wed, 3 Jul 2013 23:33:50 -0400 > Subject: [PATCH] tracing/kprobes: Fail to unregister if probe event files are > in use > > When a probe is being removed, it cleans up the event files that correspond > to the probe. But there is a race between writing to one of these files > and deleting the probe. This is especially true for the "enable" file. > > CPU 0 CPU 1 > - - > > fd = open("enable",O_WRONLY); > > probes_open() > release_all_trace_probes() > unregister_trace_probe() > if (trace_probe_is_enabled(tp)) > return -EBUSY > > write(fd, "1", 1) > __ftrace_set_clr_event() > call->class->reg() > (kprobe_register) >enable_trace_probe(tp) > > __unregister_trace_probe(tp); > list_del(>list) > unregister_probe_event(tp) <-- fails! > free_trace_probe(tp) > > write(fd, "0", 1) > __ftrace_set_clr_event() > call->class->unreg > (kprobe_register) > disable_trace_probe(tp) <-- BOOM! > > A test program was written that used two threads to simulate the > above scenario adding a nanosleep() interval to change the timings > and after several thousand runs, it was able to trigger this bug > and crash: > > BUG: unable to handle kernel paging request at 000500f9 > IP: [] probes_open+0x3b/0xa7 > PGD 7808a067 PUD 0 > Oops: [#1] PREEMPT SMP > Dumping ftrace buffer: > - > Modules linked in: ipt_MASQUERADE sunrpc ip6t_REJECT nf_conntrack_ipv6 > CPU: 1 PID: 2070 Comm: test-kprobe-rem Not tainted 3.11.0-rc3-test+ #47 > Hardware name: To Be Filled By O.E.M. 
To Be Filled By O.E.M./To be filled by > O.E.M., BIOS SDBLI944.86P 05/08/2007 > task: 880077756440 ti: 880076e52000 task.ti: 880076e52000 > RIP: 0010:[] [] probes_open+0x3b/0xa7 > RSP: 0018:880076e53c38 EFLAGS: 00010203 > RAX: 00050001 RBX: 88007844f440 RCX: 0003 > RDX: 0003 RSI: 0003 RDI: 880076e52000 > RBP: 880076e53c58 R08: 880076e53bd8 R09: > R10: 880077756440 R11: 0006 R12: 810dee35 > R13: 880079250418 R14: R15: 88007844f450 > FS: 7f87a276f700() GS:88007d48() knlGS: > CS: 0010 DS: ES: CR0: 8005003b > CR2: 000500f9 CR3: 77262000 CR4: 07e0 > Stack: > 880076e53c58 81219ea0 88007844f440 810dee35 > 880076e53ca8 81130f78 8800772986c0 8800796f93a0 > 81d1b5d8 880076e53e04 88007844f440 > Call Trace: > [] ? security_file_open+0x2c/0x30 > [] ? unregister_trace_probe+0x4b/0x4b > [] do_dentry_open+0x162/0x226 > [] finish_open+0x46/0x54 > [] do_last+0x7f6/0x996 > [] ? inode_permission+0x42/0x44 > [] path_openat+0x232/0x496 > [] do_filp_open+0x3a/0x8a > [] ? __alloc_fd+0x168/0x17a > [] do_sys_open+0x70/0x102 > [] ? trace_hardirqs_on_caller+0x160/0x197 > [] SyS_open+0x1e/0x20 > [] system_call_fastpath+0x16/0x1b > Code: e5 41 54 53 48 89 f3 48 83 ec 10 48 23 56 78 48 39 c2 75 6c 31 f6 48 c7 > RIP [] probes_open+0x3b/0xa7 > RSP > CR2: 000500f9 > ---[ end trace 35f17d68fc569897 ]--- > > The unregister_trace_probe() must be done first, and if it fails it must > fail the removal of the kprobe. > > Several changes have already been made by Oleg Nesterov and Masami Hiramatsu > to allow moving the unregister_probe_event() before the removal of > the probe and exit the function if it fails. This prevents the tp > structure from being used after it is freed. 
> > Link: http://lkml.kernel.org/r/20130704034038.819592...@goodmis.org > > Acked-by: Masami Hiramatsu > Signed-off-by: Steven Rostedt > --- > kernel/trace/trace_kprobe.c | 21 +++-- > 1 file changed, 15 insertions(+), 6 deletions(-) > > diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c > index 3811487..243f683 100644 > --- a/kernel/trace/trace_kprobe.c > +++ b/kernel/trace/trace_kprobe.c > @@ -95,7 +95,7 @@ static __kprobes bool trace_probe_is_on_module(struct > trace_probe *tp) > } > > static int register_probe_event(struct trace_probe *tp); > -static void unregister_probe_event(struct trace_probe *tp); > +static int unregister_probe_event(struct
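The ordering fix above, refuse to unregister while the event is in use rather than tearing down first, can be modeled with a small userspace sketch. These are hypothetical types, not the trace_kprobe code, and -16 stands in for -EBUSY:

```c
#include <assert.h>

#define MY_EBUSY 16  /* stand-in for the kernel's EBUSY */

struct probe {
    int enabled;     /* an "enable" file user is active */
    int registered;
};

int probe_enable(struct probe *p)  { p->enabled = 1; return 0; }
int probe_disable(struct probe *p) { p->enabled = 0; return 0; }

/* Like the patched unregister_trace_probe(): check for users *before*
 * any teardown, so a concurrent user can never see a freed probe. */
int probe_unregister(struct probe *p)
{
    if (p->enabled)
        return -MY_EBUSY;   /* caller retries after disable */
    p->registered = 0;      /* only now tear down and free */
    return 0;
}
```

The point of the ordering is that the failure path leaves the probe fully intact, exactly as if the removal had never been attempted.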
Re: [PATCH 1/4] ALSA: Added jack detection kcontrol support
Hi Takashi, On 07/29/2013 05:05 AM, Takashi Iwai wrote: At Sat, 27 Jul 2013 13:25:24 +0100, Mark Brown wrote: On Fri, Jul 26, 2013 at 04:13:40PM -0700, Felipe Tonello wrote: On Fri, Jul 26, 2013 at 3:48 PM, Mark Brown wrote: What I'd expect to happen here is that for multi function jacks we create a control per function if the controls are valid. Ok, so the idea is just to change the control to type integer instead of boolean, right? Because as you say, the user will be able to check the type of jack based on the status value, right? It might be more idiomatic and more compatible with userspace to create multiple controls for the jack, there was some discussion of this in the past I think but I can't remember the result. I also forget the end result :) Another option would be to make an enum control, but multiple boolean controls are easier in the end. The headset controls of HD-audio are represented in that way, too (because HD-audio specification can't tell better). Yes. Thanks for the comment. Felipe Tonello
Re: [PATCH 1/4] ALSA: Added jack detection kcontrol support
On 07/27/2013 05:25 AM, Mark Brown wrote: On Fri, Jul 26, 2013 at 04:13:40PM -0700, Felipe Tonello wrote: On Fri, Jul 26, 2013 at 3:48 PM, Mark Brown wrote: What I'd expect to happen here is that for multi function jacks we create a control per function if the controls are valid. Ok, so the idea is just to change the control to type integer instead of boolean, right? Because as you say, the user will be able to check the type of jack based on the status value, right? It might be more idiomatic and more compatible with userspace to create multiple controls for the jack, there was some discussion of this in the past I think but I can't remember the result. Yes. If there's only one function supported the current code is fine but for multiple functions it's going to discard useful information. So, what do you suggest we do about that? I'm not sure I understand what you are saying. When you say function, do you mean the SND_JACK_BTN_n or the jack types, such as SND_JACK_HEADPHONE, and so on? The jack types; the buttons definitely should be going up as input events. If a codec creates a jack of type SND_JACK_HEADSET (= SND_JACK_HEADPHONE | SND_JACK_MICROPHONE), should two controls be created, name + "Headphone Jack" and name + "Microphone Jack"? If so, what about the status to report? How do we know which control to report? The drivers report a bitmask for status. I did that but I'm not happy with the control name. Usually drivers add jacks like: "Headset" for a headset, "Headphone" for a headphone and so on. I did the following: the control name is jack name + (jack type) + Jack. If jack type == jack name, don't add (jack type) to the name. Any suggestion for how it should be?
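The naming scheme Felipe describes, one boolean control per function bit with the function name collapsed when it equals the jack name, could look roughly like this in userspace C. The JACK_* values are illustrative, not the real SND_JACK_* constants:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define JACK_HEADPHONE  0x1
#define JACK_MICROPHONE 0x2
#define JACK_HEADSET    (JACK_HEADPHONE | JACK_MICROPHONE)

/* Build one control name per function bit in 'type'.  Returns the number
 * of controls generated; each name is "<jack> <function> Jack", or just
 * "<jack> Jack" when the function name equals the jack name. */
int make_control_names(const char *jack_name, unsigned int type,
                       char out[][64], int max)
{
    static const struct { unsigned int bit; const char *fn; } tbl[] = {
        { JACK_HEADPHONE,  "Headphone"  },
        { JACK_MICROPHONE, "Microphone" },
    };
    int i, n = 0;

    for (i = 0; i < 2 && n < max; i++) {
        if (!(type & tbl[i].bit))
            continue;
        if (strcmp(jack_name, tbl[i].fn) == 0)
            snprintf(out[n], 64, "%s Jack", jack_name);
        else
            snprintf(out[n], 64, "%s %s Jack", jack_name, tbl[i].fn);
        n++;
    }
    return n;
}
```

With this scheme a "Headset" jack of type JACK_HEADSET yields "Headset Headphone Jack" and "Headset Microphone Jack", while a plain "Headphone" jack yields just "Headphone Jack".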
Re: [PATCH 4/9] dma: edma: Find missed events and issue them
On 07/31/2013 09:27 PM, Joel Fernandes wrote: > On 07/31/2013 04:18 AM, Sekhar Nori wrote: >> On Wednesday 31 July 2013 10:19 AM, Joel Fernandes wrote: >>> Hi Sekhar, >>> >>> On 07/30/2013 02:05 AM, Sekhar Nori wrote: On Monday 29 July 2013 06:59 PM, Joel Fernandes wrote: > In an effort to move to using scatter gather lists of any size with > EDMA as discussed at [1] instead of placing limitations on the driver, > we work through the limitations of the EDMAC hardware to find missed > events and issue them. > > The sequence of events that require this are: > > For the scenario where MAX slots for an EDMA channel is 3: > > SG1 -> SG2 -> SG3 -> SG4 -> SG5 -> SG6 -> Null > > The above SG list will have to be DMA'd in 2 sets: > > (1) SG1 -> SG2 -> SG3 -> Null > (2) SG4 -> SG5 -> SG6 -> Null > > After (1) is successfully transferred, the events from the MMC controller > do not stop coming and are missed by the time we have set up the transfer > for (2). So here, we catch the missed events as an error condition and > issue them manually. Are you sure there won't be any effect of these missed events on the peripheral side? For example, won't McASP get into an underrun condition when it encounters a null PaRAM set? Even UART has to transmit to a >>> >>> But it will not encounter a null PaRAM set because McASP uses contiguous >>> buffers for transfer which are not scattered across physical memory. >>> This can be accomplished with an SG of size 1. For such SGs, this patch >>> series leaves it linked Dummy and does not link to the Null set. The Null set is >>> only used for SG lists that are > MAX_NR_SG in size such as those >>> created for example by MMC and Crypto. >>> particular baud so I guess it cannot wait like the way MMC/SD can. >>> >>> Existing drivers have to wait anyway if they hit the MAX SG limit today.
If >>> they don't want to wait, they would have allocated a contiguous block of >>> memory and DMA'd that in one stretch so they don't lose any events, and in >>> such cases we are not linking to Null. >> >> As long as the DMA driver can advertise its MAX SG limit, peripherals can >> always work around that by limiting the number of sync events they >> generate so that none of the events get missed. With this >> series, I am worried that the EDMA driver is advertising that it can handle >> any length SG list while not taking care of missing any events while >> doing so. This will break the assumptions that driver writers make. Sorry, I just forgot to respond to "not taking care of missing any events while doing so". Can you clarify this? The DMA engine driver is taking care of missed events. Also, missing events doesn't result in feedback to the peripheral. The peripheral sends an event to the DMA controller, and the event is missed. The peripheral doesn't know anything about what happened and is waiting for a transfer from the DMA controller. Thanks, -Joel
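The splitting described in the changelog above can be modeled with two tiny helpers, assuming MAX_NR_SG slots per set. These are hypothetical functions, not the edma driver itself: only lists longer than one set get linked to the Null set, which is where a missed event is later caught and reissued.

```c
#include <assert.h>

#define MAX_NR_SG 3

/* Number of hardware sets needed for a scatterlist of sg_len entries;
 * e.g. the 6-entry list in the changelog needs 2 sets. */
int nr_sets(int sg_len)
{
    return (sg_len + MAX_NR_SG - 1) / MAX_NR_SG;
}

/* Single-set transfers stay linked to the Dummy set and never miss
 * events; only multi-set lists need the Null-set termination. */
int needs_null_link(int sg_len)
{
    return nr_sets(sg_len) > 1;
}
```

This matches the thread's point about McASP: a contiguous buffer is an SG of size 1, nr_sets() is 1, and no Null link (and hence no missed-event window) is involved.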
Re: [PATCH] cpufreq: Re-arrange header files
On 1 August 2013 03:08, Rafael J. Wysocki wrote: > Normally, I'd regard that as pointless churn, but why don't you actually > remove direct #includes already included through cpufreq.h when you're at it? > > For example, what's the point of including cpumask.h directly from > cpufreq_governor.c if it includes cpufreq.h too? > > Also it doesn't look like all of the includes are really necessary. Okay, all cleaned up now.. (attached for applying): --x-x From: Viresh Kumar Date: Wed, 31 Jul 2013 20:47:01 +0530 Subject: [PATCH V2] cpufreq: Cleanup header files included in core This patch intends to clean up the following issues in the header files included in the cpufreq core layers: - Include headers in ascending order, so that we don't add the same header multiple times by mistake. - <asm/...> headers must be included after <linux/...> headers, so that they override whatever they need. - Remove unnecessary header files - Don't include files already included by cpufreq.h or cpufreq_governor.h Signed-off-by: Viresh Kumar --- Changes since V2 - Remove unnecessary header files.
drivers/cpufreq/cpufreq.c | 20 ++-- drivers/cpufreq/cpufreq_conservative.c | 12 drivers/cpufreq/cpufreq_governor.c | 6 -- drivers/cpufreq/cpufreq_governor.h | 5 ++--- drivers/cpufreq/cpufreq_ondemand.c | 12 +--- drivers/cpufreq/cpufreq_performance.c | 3 +-- drivers/cpufreq/cpufreq_powersave.c| 3 +-- drivers/cpufreq/cpufreq_stats.c| 9 + drivers/cpufreq/freq_table.c | 4 +--- include/linux/cpufreq.h| 11 +++ 10 files changed, 16 insertions(+), 69 deletions(-) diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c index 170d344..473f2ad 100644 --- a/drivers/cpufreq/cpufreq.c +++ b/drivers/cpufreq/cpufreq.c @@ -17,24 +17,16 @@ #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt -#include -#include -#include -#include -#include -#include +#include #include -#include -#include -#include -#include #include -#include -#include -#include +#include +#include +#include #include +#include #include - +#include #include /** diff --git a/drivers/cpufreq/cpufreq_conservative.c b/drivers/cpufreq/cpufreq_conservative.c index 0ceb2ef..841e256 100644 --- a/drivers/cpufreq/cpufreq_conservative.c +++ b/drivers/cpufreq/cpufreq_conservative.c @@ -11,19 +11,7 @@ * published by the Free Software Foundation. 
*/ -#include -#include -#include -#include -#include -#include -#include -#include -#include #include -#include -#include - #include "cpufreq_governor.h" /* Conservative governor macros */ diff --git a/drivers/cpufreq/cpufreq_governor.c b/drivers/cpufreq/cpufreq_governor.c index 7409dbd..556064e 100644 --- a/drivers/cpufreq/cpufreq_governor.c +++ b/drivers/cpufreq/cpufreq_governor.c @@ -16,15 +16,9 @@ #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt -#include -#include -#include #include #include -#include #include -#include -#include #include "cpufreq_governor.h" diff --git a/drivers/cpufreq/cpufreq_governor.h b/drivers/cpufreq/cpufreq_governor.h index 0e0dd4c..cf0b7a4 100644 --- a/drivers/cpufreq/cpufreq_governor.h +++ b/drivers/cpufreq/cpufreq_governor.h @@ -18,10 +18,9 @@ #define _CPUFREQ_GOVERNOR_H #include -#include +#include +#include #include -#include -#include /* * The polling frequency depends on the capability of the processor. Default diff --git a/drivers/cpufreq/cpufreq_ondemand.c b/drivers/cpufreq/cpufreq_ondemand.c index a3c5574..27c732e 100644 --- a/drivers/cpufreq/cpufreq_ondemand.c +++ b/drivers/cpufreq/cpufreq_ondemand.c @@ -12,20 +12,10 @@ #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt -#include -#include -#include -#include -#include -#include -#include +#include #include #include -#include #include -#include -#include - #include "cpufreq_governor.h" /* On-demand governor macros */ diff --git a/drivers/cpufreq/cpufreq_performance.c b/drivers/cpufreq/cpufreq_performance.c index 9fef7d6..cf117de 100644 --- a/drivers/cpufreq/cpufreq_performance.c +++ b/drivers/cpufreq/cpufreq_performance.c @@ -12,10 +12,9 @@ #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt -#include -#include #include #include +#include static int cpufreq_governor_performance(struct cpufreq_policy *policy, unsigned int event) diff --git a/drivers/cpufreq/cpufreq_powersave.c b/drivers/cpufreq/cpufreq_powersave.c index 32109a1..e3b874c 100644 --- a/drivers/cpufreq/cpufreq_powersave.c +++ 
b/drivers/cpufreq/cpufreq_powersave.c @@ -12,10 +12,9 @@ #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt -#include -#include #include #include +#include static int cpufreq_governor_powersave(struct cpufreq_policy *policy, unsigned int event) diff --git a/drivers/cpufreq/cpufreq_stats.c b/drivers/cpufreq/cpufreq_stats.c index cb38413..a7143b0 100644 ---
Re: [RFC][PATCH 4/4] tracing/uprobes: Fail to unregister if probe event files are open
Srikar, Can you give your Acked-by to this patch. Thanks! -- Steve On Wed, 2013-07-03 at 23:33 -0400, Steven Rostedt wrote: > plain text document attachment > (0004-tracing-uprobes-Fail-to-unregister-if-probe-event-fi.patch) > From: "Steven Rostedt (Red Hat)" > > When one of the event files is opened, we need to prevent them from > being removed. Modules do with with the module owner set (automated > from the VFS layer). The ftrace buffer instances have a ref count added > to the trace_array when the enabled file is opened (the others are not > that big of a deal, as they only reference the event calls which > still exist when an instance disappears). But kprobes and uprobes > do not have any protection. > > Have the unregister probe fail when the event files are open, in use > are used by perf. > > Signed-off-by: Steven Rostedt > --- > kernel/trace/trace_uprobe.c | 48 > --- > 1 file changed, 36 insertions(+), 12 deletions(-) > > diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c > index d5d0cd3..4916da5 100644 > --- a/kernel/trace/trace_uprobe.c > +++ b/kernel/trace/trace_uprobe.c > @@ -70,7 +70,7 @@ struct trace_uprobe { > (sizeof(struct probe_arg) * (n))) > > static int register_uprobe_event(struct trace_uprobe *tu); > -static void unregister_uprobe_event(struct trace_uprobe *tu); > +static int unregister_uprobe_event(struct trace_uprobe *tu); > > static DEFINE_MUTEX(uprobe_lock); > static LIST_HEAD(uprobe_list); > @@ -164,11 +164,17 @@ static struct trace_uprobe *find_probe_event(const char > *event, const char *grou > } > > /* Unregister a trace_uprobe and probe_event: call with locking uprobe_lock > */ > -static void unregister_trace_uprobe(struct trace_uprobe *tu) > +static int unregister_trace_uprobe(struct trace_uprobe *tu) > { > + int ret; > + > + ret = unregister_uprobe_event(tu); > + if (ret) > + return ret; > + > list_del(>list); > - unregister_uprobe_event(tu); > free_trace_uprobe(tu); > + return 0; > } > > /* Register a 
trace_uprobe and probe_event */ > @@ -181,9 +187,12 @@ static int register_trace_uprobe(struct trace_uprobe *tu) > > /* register as an event */ > old_tp = find_probe_event(tu->call.name, tu->call.class->system); > - if (old_tp) > + if (old_tp) { > /* delete old event */ > - unregister_trace_uprobe(old_tp); > + ret = unregister_trace_uprobe(old_tp); > + if (ret) > + goto end; > + } > > ret = register_uprobe_event(tu); > if (ret) { > @@ -256,6 +265,8 @@ static int create_trace_uprobe(int argc, char **argv) > group = UPROBE_EVENT_SYSTEM; > > if (is_delete) { > + int ret; > + > if (!event) { > pr_info("Delete command needs an event name.\n"); > return -EINVAL; > @@ -269,9 +280,9 @@ static int create_trace_uprobe(int argc, char **argv) > return -ENOENT; > } > /* delete an event */ > - unregister_trace_uprobe(tu); > + ret = unregister_trace_uprobe(tu); > mutex_unlock(_lock); > - return 0; > + return ret; > } > > if (argc < 2) { > @@ -408,16 +419,20 @@ fail_address_parse: > return ret; > } > > -static void cleanup_all_probes(void) > +static int cleanup_all_probes(void) > { > struct trace_uprobe *tu; > + int ret = 0; > > mutex_lock(_lock); > while (!list_empty(_list)) { > tu = list_entry(uprobe_list.next, struct trace_uprobe, list); > - unregister_trace_uprobe(tu); > + ret = unregister_trace_uprobe(tu); > + if (ret) > + break; > } > mutex_unlock(_lock); > + return ret; > } > > /* Probes listing interfaces */ > @@ -462,8 +477,12 @@ static const struct seq_operations probes_seq_op = { > > static int probes_open(struct inode *inode, struct file *file) > { > + int ret = 0; > + > if ((file->f_mode & FMODE_WRITE) && (file->f_flags & O_TRUNC)) > - cleanup_all_probes(); > + ret = cleanup_all_probes(); > + if (ret) > + return ret; > > return seq_open(file, _seq_op); > } > @@ -970,12 +989,17 @@ static int register_uprobe_event(struct trace_uprobe > *tu) > return ret; > } > > -static void unregister_uprobe_event(struct trace_uprobe *tu) > +static int unregister_uprobe_event(struct 
trace_uprobe *tu) > { > + int ret; > + > /* tu->event is unregistered in trace_remove_event_call() */ > - trace_remove_event_call(>call); > + ret = trace_remove_event_call(>call); > + if (ret) > + return ret; > kfree(tu->call.print_fmt); > tu->call.print_fmt = NULL; > + return 0; > } > > /* Make a trace
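The cleanup_all_probes() change in the patch follows a general pattern: a cleanup loop that stops at the first entry that refuses to die and propagates that error instead of silently freeing in-use entries. A minimal userspace sketch, with a hypothetical list type and -16 standing in for -EBUSY:

```c
#include <assert.h>
#include <stddef.h>

struct uprobe { int busy; struct uprobe *next; };

/* Stand-in for unregister_trace_uprobe(): fails while in use. */
int unregister_one(struct uprobe *u)
{
    return u->busy ? -16 /* -EBUSY */ : 0;
}

/* Walk the list, stop at the first failure, leave the rest intact.
 * In the real code the successful branch also does list_del() + free. */
int cleanup_all(struct uprobe **head)
{
    int ret = 0;

    while (*head) {
        ret = unregister_one(*head);
        if (ret)
            break;              /* remaining probes stay registered */
        *head = (*head)->next;
    }
    return ret;
}
```

The caller (probes_open() in the patch) then refuses to truncate the probe list and returns the error to userspace.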
Re: [PATCH] Xen: Fix retry calls into PRIVCMD_MMAPBATCH*.
-- Resend as I haven't seen this hit the lists. Maybe some smtp misconfig. Apologies. Also expanded cc -- When a foreign mapper attempts to map guest frames that are paged out, the mapper receives an ENOENT response and will have to try again while a helper process pages the target frame back in. Gating checks on PRIVCMD_MMAPBATCH* ioctl args were preventing retries of mapping calls. Signed-off-by: Andres Lagar-Cavilla --- drivers/xen/privcmd.c | 32 +++- 1 files changed, 27 insertions(+), 5 deletions(-) diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c index f8e5dd7..44a26c6 100644 --- a/drivers/xen/privcmd.c +++ b/drivers/xen/privcmd.c @@ -43,9 +43,12 @@ MODULE_LICENSE("GPL"); #define PRIV_VMA_LOCKED ((void *)1) -#ifndef HAVE_ARCH_PRIVCMD_MMAP static int privcmd_enforce_singleshot_mapping(struct vm_area_struct *vma); -#endif + +static int privcmd_enforce_singleshot_mapping_granular( + struct vm_area_struct *vma, + unsigned long addr, + unsigned long nr_pages); static long privcmd_ioctl_hypercall(void __user *udata) { @@ -422,9 +425,9 @@ static long privcmd_ioctl_mmap_batch(void __user *udata, int version) vma = find_vma(mm, m.addr); if (!vma || vma->vm_ops != _vm_ops || - (m.addr != vma->vm_start) || - ((m.addr + (nr_pages << PAGE_SHIFT)) != vma->vm_end) || - !privcmd_enforce_singleshot_mapping(vma)) { + (m.addr < vma->vm_start) || + ((m.addr + (nr_pages << PAGE_SHIFT)) > vma->vm_end) || + !privcmd_enforce_singleshot_mapping_granular(vma, m.addr, nr_pages)) { up_write(>mmap_sem); ret = -EINVAL; goto out; @@ -540,11 +543,30 @@ static int privcmd_mmap(struct file *file, struct vm_area_struct *vma) return 0; } +/* For MMAP */ static int privcmd_enforce_singleshot_mapping(struct vm_area_struct *vma) { return !cmpxchg(>vm_private_data, NULL, PRIV_VMA_LOCKED); } +/* For MMAPBATCH*. This allows asserting the singleshot mapping + * on a per pfn/pte basis. Mapping calls that fail with ENOENT + * can be then retried until success. 
*/ +static int enforce_singleshot_mapping_fn(pte_t *pte, struct page *pmd_page, + unsigned long addr, void *data) +{ + return pte_none(*pte) ? 0 : -EBUSY; +} + +static int privcmd_enforce_singleshot_mapping_granular( + struct vm_area_struct *vma, + unsigned long addr, + unsigned long nr_pages) +{ + return apply_to_page_range(vma->vm_mm, addr, nr_pages << PAGE_SHIFT, + enforce_singleshot_mapping_fn, NULL) == 0; +} + const struct file_operations xen_privcmd_fops = { .owner = THIS_MODULE, .unlocked_ioctl = privcmd_ioctl, -- 1.7.1 > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
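The per-pfn single-shot check can be modeled in userspace with an array standing in for the page table. This is purely illustrative; the real code walks ptes with apply_to_page_range(). The point is that a retry after ENOENT succeeds as long as the pages it touches are still unclaimed, instead of the whole VMA being locked by the first attempt.

```c
#include <assert.h>

#define NPAGES   8
#define BUSY_ERR 16  /* stand-in for EBUSY */

int slot[NPAGES];    /* 0 == pte_none(), i.e. not yet mapped */

/* Mirrors enforce_singleshot_mapping_fn() applied across a range:
 * reject if any slot in [start, start+n) is already populated. */
int range_is_unclaimed(int start, int n)
{
    int i;

    for (i = start; i < start + n; i++)
        if (slot[i])
            return -BUSY_ERR;
    return 0;
}

int map_range(int start, int n)
{
    int i, ret = range_is_unclaimed(start, n);

    if (ret)
        return ret;
    for (i = start; i < start + n; i++)
        slot[i] = 1;
    return 0;
}
```

A second call over already-mapped pages is rejected, while a retry covering only the still-empty tail of the VMA goes through, which is what the relaxed start/end checks in the patch permit.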
Re: [PATCH 2/2] f2fs: add sysfs entries to select the gc policy
Hi Jeon, On 07/31/2013 10:33 PM, Namjae Jeon wrote: > From: Namjae Jeon > > Add sysfs entries namely gc_long_idle and gc_short_idle to control the > gc policy. Where long idle corresponds to selecting a cost benefit approach, > while short idle corresponds to selecting a greedy approach to garbage > collection. The selection is mutually exclusive one approach will work at > any point. > > Signed-off-by: Namjae Jeon > Signed-off-by: Pankaj Kumar > --- > Documentation/ABI/testing/sysfs-fs-f2fs | 12 +++ > Documentation/filesystems/f2fs.txt |8 + > fs/f2fs/gc.c| 22 ++-- > fs/f2fs/gc.h|4 +++ > fs/f2fs/super.c | 59 > +-- > 5 files changed, 99 insertions(+), 6 deletions(-) > > diff --git a/Documentation/ABI/testing/sysfs-fs-f2fs > b/Documentation/ABI/testing/sysfs-fs-f2fs > index 5f44095..96b62ea 100644 > --- a/Documentation/ABI/testing/sysfs-fs-f2fs > +++ b/Documentation/ABI/testing/sysfs-fs-f2fs > @@ -19,4 +19,16 @@ Description: >Controls the default sleep time for gc_thread. Time >is in milliseconds. > > +What:/sys/fs/f2fs//gc_long_idle > +Date:July 2013 > +Contact: "Namjae Jeon" > +Description: > + Controls the selection of gc policy. long_idle is used > + to select the cost benefit approach for garbage collection. > > +What:/sys/fs/f2fs//gc_short_idle > +Date:July 2013 > +Contact: "Namjae Jeon" > +Description: > + Controls the selection of gc policy. short_idle is used > + to select the greedy approach for garbage collection. > diff --git a/Documentation/filesystems/f2fs.txt > b/Documentation/filesystems/f2fs.txt > index 2e9e873..06dd5d7 100644 > --- a/Documentation/filesystems/f2fs.txt > +++ b/Documentation/filesystems/f2fs.txt > @@ -158,6 +158,14 @@ Files in /sys/fs/f2fs/ >time for the garbage collection thread. Time is >in milliseconds. > > + gc_long_idle This parameter controls the selection of cost > + benefit approach for garbage collectoin. > Writing > + 1 to this file will select the cost benefit > policy. 
> + > + gc_short_idleThis parameter controls the selection of greedy > + approach for the garbage collection. Writing 1 > + to this file will select the greedy policy. Why introduce two opposite attributes? It'll cause some confusion condition if we double enable/disable them. > + > > > USAGE > > > diff --git a/fs/f2fs/gc.c b/fs/f2fs/gc.c > index 60d4f67..af2d9d7 100644 > --- a/fs/f2fs/gc.c > +++ b/fs/f2fs/gc.c > @@ -106,6 +106,8 @@ int start_gc_thread(struct f2fs_sb_info *sbi) > gc_th->max_sleep_time = DEF_GC_THREAD_MAX_SLEEP_TIME; > gc_th->no_gc_sleep_time = DEF_GC_THREAD_NOGC_SLEEP_TIME; > > + gc_th->long_idle = gc_th->short_idle = 0; > + > sbi->gc_thread = gc_th; > init_waitqueue_head(>gc_thread->gc_wait_queue_head); > sbi->gc_thread->f2fs_gc_task = kthread_run(gc_thread_func, sbi, > @@ -130,9 +132,23 @@ void stop_gc_thread(struct f2fs_sb_info *sbi) > sbi->gc_thread = NULL; > } > > -static int select_gc_type(int gc_type) > +static int select_gc_type(struct f2fs_gc_kthread *gc_th, int gc_type) > { > - return (gc_type == BG_GC) ? GC_CB : GC_GREEDY; > + int gc_mode; > + > + if (gc_th) { > + if (gc_th->long_idle) { > + gc_mode = GC_CB; > + goto out; > + } else if (gc_th->short_idle) { > + gc_mode = GC_GREEDY; > + goto out; > + } > + } > + > + gc_mode = (gc_type == BG_GC) ? 
GC_CB : GC_GREEDY; > +out: > + return gc_mode; > } > > static void select_policy(struct f2fs_sb_info *sbi, int gc_type, > @@ -145,7 +161,7 @@ static void select_policy(struct f2fs_sb_info *sbi, int > gc_type, > p->dirty_segmap = dirty_i->dirty_segmap[type]; > p->ofs_unit = 1; > } else { > - p->gc_mode = select_gc_type(gc_type); > + p->gc_mode = select_gc_type(sbi->gc_thread, gc_type); > p->dirty_segmap = dirty_i->dirty_segmap[DIRTY]; > p->ofs_unit = sbi->segs_per_sec; > } > diff --git a/fs/f2fs/gc.h b/fs/f2fs/gc.h > index f4bf44c..b2faae5 100644 > --- a/fs/f2fs/gc.h > +++ b/fs/f2fs/gc.h > @@ -30,6 +30,10 @@ struct f2fs_gc_kthread { > unsigned int min_sleep_time; > unsigned int max_sleep_time; > unsigned int no_gc_sleep_time; > + > + /*
Re: [PATCH resend] drop_caches: add some documentation and info message
On Wed, 31 Jul 2013 23:11:50 -0400 KOSAKI Motohiro wrote: > >> --- a/fs/drop_caches.c > >> +++ b/fs/drop_caches.c > >> @@ -59,6 +59,8 @@ int drop_caches_sysctl_handler(ctl_table *table, int > >> write, > >>if (ret) > >>return ret; > >>if (write) { > >> + printk(KERN_INFO "%s (%d): dropped kernel caches: %d\n", > >> + current->comm, task_pid_nr(current), sysctl_drop_caches); > >>if (sysctl_drop_caches & 1) > >>iterate_supers(drop_pagecache_sb, NULL); > >>if (sysctl_drop_caches & 2) > > > > How about we do > > > > if (!(sysctl_drop_caches & 4)) > > printk() > > > > so people can turn it off if it's causing problems? > > The best interface depends on the purpose. If you want to detect crazy > application, > we can't assume an application co-operate us. So, I doubt this works. You missed the "!". I'm proposing that setting the new bit 2 will permit people to prevent the new printk if it is causing them problems.
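The knob Andrew is proposing is plain bitmask gating: bits 0 and 1 keep their existing meaning (pagecache, slab), and a new bit 2 (value 4) suppresses the message. A sketch of the intended semantics (illustrative, not the kernel code):

```c
#include <assert.h>

#define DROP_PAGECACHE 1   /* sysctl_drop_caches & 1 */
#define DROP_SLAB      2   /* sysctl_drop_caches & 2 */
#define DROP_QUIET     4   /* proposed: suppress the printk */

/* Nonzero when the informational message should still be printed. */
int should_log(int sysctl_drop_caches)
{
    return !(sysctl_drop_caches & DROP_QUIET);
}
```

So `echo 3 > /proc/sys/vm/drop_caches` would drop both caches and still log, while `echo 7` would drop both caches silently.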
Re: [PATCH 1/2] f2fs: add sysfs support for controlling the gc_thread
Hi Jeon, On 07/31/2013 10:33 PM, Namjae Jeon wrote: > From: Namjae Jeon > > Add sysfs entries to control the timing parameters for > f2fs gc thread. > > Various Sysfs options introduced are: > gc_min_sleep_time: Min Sleep time for GC in ms > gc_max_sleep_time: Max Sleep time for GC in ms > gc_no_gc_sleep_time: Default Sleep time for GC in ms > > Signed-off-by: Namjae Jeon > Signed-off-by: Pankaj Kumar > --- > Documentation/ABI/testing/sysfs-fs-f2fs | 22 ++ > Documentation/filesystems/f2fs.txt | 26 +++ > fs/f2fs/f2fs.h |4 + > fs/f2fs/gc.c| 17 +++-- > fs/f2fs/gc.h| 33 > fs/f2fs/super.c | 124 > +++ > 6 files changed, 206 insertions(+), 20 deletions(-) > create mode 100644 Documentation/ABI/testing/sysfs-fs-f2fs > > diff --git a/Documentation/ABI/testing/sysfs-fs-f2fs > b/Documentation/ABI/testing/sysfs-fs-f2fs > new file mode 100644 > index 000..5f44095 > --- /dev/null > +++ b/Documentation/ABI/testing/sysfs-fs-f2fs > @@ -0,0 +1,22 @@ > +What:/sys/fs/f2fs//gc_max_sleep_time > +Date:July 2013 > +Contact: "Namjae Jeon" > +Description: > + Controls the maximun sleep time for gc_thread. Time > + is in milliseconds. > + > +What:/sys/fs/f2fs//gc_min_sleep_time > +Date:July 2013 > +Contact: "Namjae Jeon" > +Description: > + Controls the minimum sleep time for gc_thread. Time > + is in milliseconds. > + > +What:/sys/fs/f2fs//gc_no_gc_sleep_time > +Date:July 2013 > +Contact: "Namjae Jeon" > +Description: > + Controls the default sleep time for gc_thread. Time > + is in milliseconds. > + > + > diff --git a/Documentation/filesystems/f2fs.txt > b/Documentation/filesystems/f2fs.txt > index 0500c19..2e9e873 100644 > --- a/Documentation/filesystems/f2fs.txt > +++ b/Documentation/filesystems/f2fs.txt > @@ -133,6 +133,32 @@ f2fs. Each file shows the whole f2fs information. > - current memory footprint consumed by f2fs. > > > > +SYSFS ENTRIES > + > + > +Information about mounted f2fs file systems can be found in > +/sys/fs/f2fs. 
Each mounted filesystem will have a directory in > +/sys/fs/f2fs based on its device name (i.e., /sys/fs/f2fs/sda). > +The files in each per-device directory are shown in table below. > + > +Files in /sys/fs/f2fs/ > +(see also Documentation/ABI/testing/sysfs-fs-f2fs) > +.. > + File Content > + > + gc_max_sleep_timeThis tuning parameter controls the maximum > sleep > + time for the garbage collection thread. Time is > + in milliseconds. > + > + gc_min_sleep_timeThis tuning parameter controls the minimum > sleep > + time for the garbage collection thread. Time is > + in milliseconds. > + > + gc_no_gc_sleep_time This tuning parameter controls the default > sleep > + time for the garbage collection thread. Time is > + in milliseconds. > + > + > USAGE > > > > diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h > index 78777cd..63813be 100644 > --- a/fs/f2fs/f2fs.h > +++ b/fs/f2fs/f2fs.h > @@ -430,6 +430,10 @@ struct f2fs_sb_info { > #endif > unsigned int last_victim[2];/* last victim segment # */ > spinlock_t stat_lock; /* lock for stat operations */ > + > + /* For sysfs suppport */ > + struct kobject s_kobj; > + struct completion s_kobj_unregister; What is this completion used for? Or it's an ahead design? I do not find synchronization routines use it. Am I missing something? > }; > > /* > diff --git a/fs/f2fs/gc.c b/fs/f2fs/gc.c > index 35f9b1a..60d4f67 100644 > --- a/fs/f2fs/gc.c > +++ b/fs/f2fs/gc.c > @@ -29,10 +29,11 @@ static struct kmem_cache *winode_slab; > static int gc_thread_func(void *data) > { > struct f2fs_sb_info *sbi = data; > + struct f2fs_gc_kthread *gc_th = sbi->gc_thread; > wait_queue_head_t *wq = >gc_thread->gc_wait_queue_head; > long wait_ms; > > - wait_ms = GC_THREAD_MIN_SLEEP_TIME; > + wait_ms = gc_th->min_sleep_time; > > do { > if (try_to_freeze()) > @@ -45,7 +46,7 @@ static int gc_thread_func(void *data) > break; > > if
Re: [PATCH resend] drop_caches: add some documentation and info message
>> --- a/fs/drop_caches.c >> +++ b/fs/drop_caches.c >> @@ -59,6 +59,8 @@ int drop_caches_sysctl_handler(ctl_table *table, int write, >> if (ret) >> return ret; >> if (write) { >> +printk(KERN_INFO "%s (%d): dropped kernel caches: %d\n", >> + current->comm, task_pid_nr(current), sysctl_drop_caches); >> if (sysctl_drop_caches & 1) >> iterate_supers(drop_pagecache_sb, NULL); >> if (sysctl_drop_caches & 2) > > How about we do > > if (!(sysctl_drop_caches & 4)) > printk() > > so people can turn it off if it's causing problems? The best interface depends on the purpose. If you want to detect crazy application, we can't assume an application co-operate us. So, I doubt this works.
Re: [PATCH 4/9] syslog_ns: make syslog handling per namespace
On 2013/8/1 9:36, Gao feng wrote: > On 07/29/2013 10:31 AM, Rui Xiang wrote: >> This patch makes syslog buf and other fields per >> namespace. >> >> Here use ns->log_buf(log_buf_len, logbuf_lock, >> log_first_seq, logbuf_lock, and so on) fields >> instead of global ones to handle syslog. >> >> Syslog interfaces such as /dev/kmsg, /proc/kmsg, >> and syslog syscall are all containerized for >> container users. >> > > /dev/kmsg is used by the syslog api closelog, openlog, syslog, vsyslog, > this should be per user namespace, but seems in your patch, Yes, /dev/kmsg is per user namespace, and per syslog ns, too. > the syslog message generated through these APIs on host can be exported > to the /dev/kmsg of container, is this what we want? > Ah.. I think your question targets the devkmsg_writev function, right? You remind me that it's really an issue. Printk_emit in devkmsg_writev should not use init_syslog_ns as its syslog_ns but current_user_ns->syslog_ns. In the 1st version, current_syslog_ns was used in vprintk_emit. In this version, the interface vprintk_emit has changed, but this patch misses that. I will fix it. Thanks for your reminder. :)
[PATCH] asymmetric keys: explicitly add the leading zero byte to encoded message
From: Chun-Yi Lee Per the PKCS#1 spec, the EMSA-PKCS1-v1_5 encoded message starts with 0x00 0x01 in its first two bytes. The leading zero byte is suppressed by MPI, so the original code passes a pointer to the _preceding_ byte to RSA_verify(); this is risky because that byte lies outside the EM buffer's scope and may not be zero, and neither RSA_verify() nor mpi_get_buffer() takes care of the leading byte. To avoid the risk, it is better to explicitly add the leading zero byte back to EM before passing it to RSA_verify(). This patch allocates a _EM buffer to capture the result of RSA_I2OSP(), then sets the first byte of EM to zero and copies the remaining bytes from _EM. Cc: Rusty Russell Cc: Josh Boyer Cc: Randy Dunlap Cc: Herbert Xu Cc: "David S. Miller" Cc: David Howells Signed-off-by: Chun-Yi Lee --- crypto/asymmetric_keys/rsa.c | 14 ++ 1 files changed, 10 insertions(+), 4 deletions(-) diff --git a/crypto/asymmetric_keys/rsa.c b/crypto/asymmetric_keys/rsa.c index ca1a4f3..7bc99d2 100644 --- a/crypto/asymmetric_keys/rsa.c +++ b/crypto/asymmetric_keys/rsa.c @@ -303,6 +303,7 @@ static int RSA_verify_signature(const struct public_key *key, /* Variables as per RFC3447 sec 8.2.2 */ const u8 *H = sig->digest; u8 *EM = NULL; + u8 *_EM = NULL; MPI m = NULL; size_t k; @@ -337,14 +338,19 @@ static int RSA_verify_signature(const struct public_key *key, /* (2c) Convert the message representative (m) to an encoded message * (EM) of length k octets. * -* NOTE! The leading zero byte is suppressed by MPI, so we pass a -* pointer to the _preceding_ byte to RSA_verify()! +* NOTE! The leading zero byte is suppressed by MPI, so we add it +* back to EM before input to RSA_verify()! 
*/ - ret = RSA_I2OSP(m, k, &EM); + ret = RSA_I2OSP(m, k, &_EM); if (ret < 0) goto error; - ret = RSA_verify(H, EM - 1, k, sig->digest_size, + EM = kmalloc(k, GFP_KERNEL); + memset(EM, 0, 1); + memcpy(EM + 1, _EM, k-1); + kfree(_EM); + + ret = RSA_verify(H, EM, k, sig->digest_size, RSA_ASN1_templates[sig->pkey_hash_algo].data, RSA_ASN1_templates[sig->pkey_hash_algo].size); -- 1.6.4.2
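For readers who want to see why passing `EM - 1` is fragile, here is a stand-alone sketch (not the kernel code; the dummy digest blob and function names are illustrative) of the RFC 3447 layout and of the leading-zero suppression the commit message describes:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Sketch of the EMSA-PKCS1-v1_5 layout from RFC 3447:
 *
 *   EM = 0x00 || 0x01 || PS (0xFF padding) || 0x00 || T
 *
 * t stands in for the DER digest-info blob of length tlen. */
static unsigned char *emsa_pkcs1_v1_5(const unsigned char *t, size_t tlen,
                                      size_t k)
{
    unsigned char *em;
    size_t ps_len;

    if (k < tlen + 11)
        return NULL;            /* "message too long" per RFC 3447 */
    ps_len = k - tlen - 3;
    em = malloc(k);
    if (!em)
        return NULL;
    em[0] = 0x00;
    em[1] = 0x01;
    memset(em + 2, 0xFF, ps_len);
    em[2 + ps_len] = 0x00;
    memcpy(em + 3 + ps_len, t, tlen);
    return em;
}

/* What an MPI-style conversion hands back: the same value with leading
 * zero octets stripped, so it starts at the 0x01 byte and is only
 * k - 1 octets long. */
static const unsigned char *mpi_strip_leading_zero(const unsigned char *em,
                                                   size_t k, size_t *outlen)
{
    size_t i = 0;

    while (i + 1 < k && em[i] == 0x00)
        i++;
    *outlen = k - i;
    return em + i;
}
```

Since the stripped buffer is only k - 1 octets long, dereferencing one byte before it reads memory outside the allocation, which is exactly the risk the patch removes.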
Re: [Ksummit-2013-discuss] [ARM ATTEND] catching up on exploit mitigations
On Wed, Jul 31, 2013 at 07:47:30PM -0700, Olof Johansson wrote: > > > Hmm, really? Did you reported these bugs? I'm not aware of mainline > > > having any changes related to bug reports on PTEs on ARM. > > > > I wasn't sure if it was a googleism, or happens on mainline, so no. > > As of 3.10, it's not actually hard to run a mainline kernel on the > chromebook, but you have limited functionality (no wifi, no USB3, no > accellerated graphics). Easiest is to do it by booting from SD card. > See instructions at > http://www.chromium.org/chromium-os/u-boot-porting-guide/using-nv-u-boot-on-the-samsung-arm-chromebook. Yeah, that's how I got Fedora running on it (I think the people who did the Fedora spin for the chromebook chose the google kernel for exactly the reasons you mention above). It felt a little sluggish for a native kernel build though, and I'm not setup for cross-building so I stuck with the 3.4 build. > How long would a useful run of trinity take? It blew up in a minute or so of runtime when I tried last week. > I've got an autobuilder/autobooter setup here with a cross-section of > current ARM > hardware (7 platforms and counting) that I mostly do boot testing on, > and I should add a smallish suite of testcases to the same. Trinity > would be a good candidate for that. Feel free to mail me off-list if I can do anything to help out, though hopefully things should build & run 'out of the box'. You shouldn't need any special command line args (though you might get inspiration for ideas from the scripts/test-* harnesses) Dave -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/3] fs: Introduce new flag FALLOC_FL_COLLAPSE_RANGE
On Wed, Jul 31, 2013 at 09:07:52PM -0400, Theodore Ts'o wrote: > On Thu, Aug 01, 2013 at 10:54:47AM +1000, Dave Chinner wrote: > > > It's not just the range that it's operating on, but also the region > > > beyond the range that's been collapsed out. > > > > Yes, that's part of "the range that it is operating over". > > > > > A quick eyeball of the patch didn't seem to show any code that handled > > > this, which is why I asked the question. > > > > Right, but really it's the least of the problems I've noticed - the > > XFS code is fundamentally broken in many ways - once I've finished > > commenting on it, I'll have a quick look to see if the ext4 code has > > the same fundamental flaws > > Well, the fundamental flaw of potential races if the file being > collapsed has been mmap'ed and there is another process making changes > beyond the range that is being collapsed is I suspect one that is > going to be very hard to solve, short of not allowing the collapse > while there are any read/write mmap's for the file in question. This functionality is not introducing any new problems w.r.t. mmap(). In terms of truncating a mmap'd file, that can already occur and the behaviour is well known. The current patches don't do the operations in the correct order to ensure this is handled correctly (i.e. the inode size has to be changed before the page cache invalidation, not after), but otherwise there are no new issues introduced here. The real problem is that it has the same problem as hole punching w.r.t. not being able to serialise page faults against the extent manipulations after the page cache invalidation has occurred. That's what Jan Kara's range locking patches we talked about at LSFMM will address, and so is beyond the scope of this patchset. > And I would think these sorts of VM issues should be handled with some > generic library functions, instead of reimplementing them from scratch > in each file system.
Well, yes, we have one - truncate_pagecache_range(), and it's used correctly in the ext4 patch. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
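As background, the collapse-range semantics under discussion can be modeled in a few lines; the model also makes the mmap hazard visible, since data beyond the collapsed range moves to new offsets while any existing mapping still refers to the old ones (illustrative code, not tied to any real filesystem implementation):

```c
#include <assert.h>
#include <string.h>

/* Toy model of what FALLOC_FL_COLLAPSE_RANGE does to file contents
 * (illustrative only; real implementations manipulate extents, not
 * bytes): the range [offset, offset + len) is removed and everything
 * beyond it shifts down, shrinking the file by len. */
static size_t collapse_range(char *data, size_t size,
                             size_t offset, size_t len)
{
    if (offset + len > size)
        return size;    /* invalid request: treat as a no-op here */
    memmove(data + offset, data + offset + len, size - offset - len);
    return size - len;  /* new file size */
}
```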
[PATCH 3/9] perf hists: Free srcline when freeing hist_entry
From: Namhyung Kim We've been leaking the srcline of hist_entry; it should be freed as well. Signed-off-by: Namhyung Kim --- tools/perf/util/hist.c | 1 + 1 file changed, 1 insertion(+) diff --git a/tools/perf/util/hist.c b/tools/perf/util/hist.c index 46a0d35a05e1..e6c11ccf138c 100644 --- a/tools/perf/util/hist.c +++ b/tools/perf/util/hist.c @@ -530,6 +530,7 @@ void hist_entry__free(struct hist_entry *he) { free(he->branch_info); free(he->mem_info); + free(he->srcline); free(he); } -- 1.7.11.7
[PATCH 1/9] perf sort: Fix a memory leak on srcline
From: Namhyung Kim In the hist_entry__srcline_snprintf(), path and self->srcline point to the same memory region, but it is doubly allocated. Signed-off-by: Namhyung Kim --- tools/perf/util/sort.c | 5 + 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c index 5f118a089519..5ba29fc679c5 100644 --- a/tools/perf/util/sort.c +++ b/tools/perf/util/sort.c @@ -269,10 +269,7 @@ static int hist_entry__srcline_snprintf(struct hist_entry *self, char *bf, if (!fp) goto out_ip; - if (getline(&path, &line_len, fp) < 0 || !line_len) - goto out_ip; - self->srcline = strdup(path); - if (self->srcline == NULL) + if (getline(&self->srcline, &line_len, fp) < 0 || !line_len) goto out_ip; nl = strchr(self->srcline, '\n'); -- 1.7.11.7
[PATCH 0/9] perf tools: Enhance and correct srcline behavior
Hello, This patchset tries to fix and enhance current srcline behavior. Firstly it doesn't actually sort by srcline info but by ip. I suspect it was because of a performance reason to run external addr2line utility. It showed the srcline info after hist entries were collapsed. Thanks to Roberto, we now have internal implementation of addr2line using libbfd so can sort/compare by srcline of entries. Second problem for me was it sometimes printed "??:0" and sometimes printed a raw ip value for unknown srcline info. I changed to print the former consistently. While at it, I found some bugs/leaks in srcline handling. Patch 1-3 are fixes for those and can be merged separately. You can get this series on my 'perf/srcline-v1' branch in my tree at: git://git.kernel.org/pub/scm/linux/kernel/git/namhyung/linux-perf.git Any comments are welcome, thanks. Namhyung Namhyung Kim (8): perf sort: Fix a memory leak on srcline perf annotate: Reuse path from the result of addr2line perf hists: Free srcline when freeing hist_entry perf tools: Factor out get/free_srcline() perf tools: Do not try to call addr2line for non-binary files perf tools: Pass dso instead of dso_name to get_srcline() perf tools: Save failed result of get_srcline() perf tools: Fix srcline sort key behavior Roberto Vitillo (1): perf tools: Implement addr2line directly using libbfd tools/perf/Makefile| 1 + tools/perf/config/Makefile | 4 + tools/perf/util/annotate.c | 33 +- tools/perf/util/dso.c | 1 + tools/perf/util/dso.h | 2 + tools/perf/util/hist.c | 1 + tools/perf/util/sort.c | 58 -- tools/perf/util/srcline.c | 265 + tools/perf/util/util.h | 8 ++ 9 files changed, 308 insertions(+), 65 deletions(-) create mode 100644 tools/perf/util/srcline.c -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 7/9] perf tools: Save failed result of get_srcline()
From: Namhyung Kim Some DSOs lack srcline info, so there's no point in retrying them. Just save the failure status and skip them. Signed-off-by: Namhyung Kim --- tools/perf/util/dso.c | 1 + tools/perf/util/dso.h | 1 + tools/perf/util/srcline.c | 10 -- 3 files changed, 10 insertions(+), 2 deletions(-) diff --git a/tools/perf/util/dso.c b/tools/perf/util/dso.c index c4374f07603c..cc959d2fa5da 100644 --- a/tools/perf/util/dso.c +++ b/tools/perf/util/dso.c @@ -421,6 +421,7 @@ struct dso *dso__new(const char *name) dso->loaded = 0; dso->sorted_by_name = 0; dso->has_build_id = 0; + dso->has_srcline = 1; dso->kernel = DSO_TYPE_USER; dso->needs_swap = DSO_SWAP__UNSET; INIT_LIST_HEAD(&dso->node); diff --git a/tools/perf/util/dso.h b/tools/perf/util/dso.h index 94c2f1a8b116..2a1ae2a8d39c 100644 --- a/tools/perf/util/dso.h +++ b/tools/perf/util/dso.h @@ -79,6 +79,7 @@ struct dso { enum dso_binary_type data_type; u8 adjust_symbols:1; u8 has_build_id:1; + u8 has_srcline:1; u8 hit:1; u8 annotate_warned:1; u8 sname_alloc:1; diff --git a/tools/perf/util/srcline.c b/tools/perf/util/srcline.c index c736d9428cf2..dcff10bed7da 100644 --- a/tools/perf/util/srcline.c +++ b/tools/perf/util/srcline.c @@ -58,10 +58,13 @@ char *get_srcline(struct dso *dso, unsigned long addr) { char *file; unsigned line; - char *srcline = SRCLINE_UNKNOWN; + char *srcline; char *dso_name = dso->long_name; size_t size; + if (!dso->has_srcline) + return SRCLINE_UNKNOWN; + if (dso_name[0] == '[') goto out; @@ -81,8 +84,11 @@ char *get_srcline(struct dso *dso, unsigned long addr) srcline = SRCLINE_UNKNOWN; free(file); -out: return srcline; + +out: + dso->has_srcline = 0; + return SRCLINE_UNKNOWN; } void free_srcline(char *srcline) -- 1.7.11.7
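The pattern in this patch is a classic negative cache. A stand-alone sketch of the idea (the toy resolver and names are illustrative, not perf's actual code):

```c
#include <assert.h>
#include <string.h>

#define SRCLINE_UNKNOWN "??:0"

/* Toy version of the negative cache: once srcline resolution fails for
 * a DSO, remember that in a flag and short-circuit later lookups. */
struct toy_dso {
    const char *name;
    unsigned int has_srcline:1;
};

static const char *toy_get_srcline(struct toy_dso *dso)
{
    if (!dso->has_srcline)
        return SRCLINE_UNKNOWN;        /* cached failure: no retry */

    /* mimic the heuristics: kernel maps and perf's own temp files
     * never resolve */
    if (dso->name[0] == '[' || !strncmp(dso->name, "/tmp/perf-", 10)) {
        dso->has_srcline = 0;          /* remember the failure */
        return SRCLINE_UNKNOWN;
    }
    return "file.c:42";                /* pretend addr2line succeeded */
}
```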
[PATCH 8/9] perf tools: Implement addr2line directly using libbfd
From: Roberto Vitillo When the srcline sort key is used, the external addr2line utility needs to be run for each hist entry to get the srcline info. This can consume quite some time if one has a huge perf.data file. So rather than executing the external utility, implement it internally and just call it. We can do it since we've linked with libbfd already. Signed-off-by: Roberto Agostino Vitillo [namhy...@kernel.org: Use a2l_data struct instead of static globals] Signed-off-by: Namhyung Kim --- tools/perf/config/Makefile | 4 ++ tools/perf/util/srcline.c | 167 + 2 files changed, 171 insertions(+) diff --git a/tools/perf/config/Makefile b/tools/perf/config/Makefile index 214e17e97e5c..128d580ca397 100644 --- a/tools/perf/config/Makefile +++ b/tools/perf/config/Makefile @@ -396,6 +396,10 @@ else endif endif +ifndef ($(filter -lbfd,$(EXTLIBS)),) + CFLAGS += -DLIBBFD_SUPPORT +endif + ifndef NO_STRLCPY ifeq ($(call try-cc,$(SOURCE_STRLCPY),,-DHAVE_STRLCPY),y) CFLAGS += -DHAVE_STRLCPY diff --git a/tools/perf/util/srcline.c b/tools/perf/util/srcline.c index dcff10bed7da..10983a92bb5e 100644 --- a/tools/perf/util/srcline.c +++ b/tools/perf/util/srcline.c @@ -8,6 +8,172 @@ #include "util/util.h" #include "util/debug.h" +#ifdef LIBBFD_SUPPORT + +/* + * Implement addr2line using libbfd. 
+ */ +#define PACKAGE "perf" +#include <bfd.h> + +struct a2l_data { + const char *input; + unsigned long addr; + + bool found; + const char *filename; + const char *funcname; + unsigned line; + + bfd *abfd; + asymbol **syms; +}; + +static int bfd_error(const char *string) +{ + const char *errmsg; + + errmsg = bfd_errmsg(bfd_get_error()); + fflush(stdout); + + if (string) + pr_debug("%s: %s\n", string, errmsg); + else + pr_debug("%s\n", errmsg); + + return -1; +} + +static int slurp_symtab(bfd *abfd, struct a2l_data *a2l) +{ + long storage; + long symcount; + asymbol **syms; + bfd_boolean dynamic = FALSE; + + if ((bfd_get_file_flags(abfd) & HAS_SYMS) == 0) + return bfd_error(bfd_get_filename(abfd)); + + storage = bfd_get_symtab_upper_bound(abfd); + if (storage == 0L) { + storage = bfd_get_dynamic_symtab_upper_bound(abfd); + dynamic = TRUE; + } + if (storage < 0L) + return bfd_error(bfd_get_filename(abfd)); + + syms = malloc(storage); + if (dynamic) + symcount = bfd_canonicalize_dynamic_symtab(abfd, syms); + else + symcount = bfd_canonicalize_symtab(abfd, syms); + + if (symcount < 0) { + free(syms); + return bfd_error(bfd_get_filename(abfd)); + } + + a2l->syms = syms; + return 0; +} + +static void find_address_in_section(bfd *abfd, asection *section, void *data) +{ + bfd_vma pc, vma; + bfd_size_type size; + struct a2l_data *a2l = data; + + if (a2l->found) + return; + + if ((bfd_get_section_flags(abfd, section) & SEC_ALLOC) == 0) + return; + + pc = a2l->addr; + vma = bfd_get_section_vma(abfd, section); + size = bfd_get_section_size(section); + + if (pc < vma || pc >= vma + size) + return; + + a2l->found = bfd_find_nearest_line(abfd, section, a2l->syms, pc - vma, + &a2l->filename, &a2l->funcname, + &a2l->line); +} + +static struct a2l_data *addr2line_init(const char *path) +{ + bfd *abfd; + struct a2l_data *a2l = NULL; + + abfd = bfd_openr(path, NULL); + if (abfd == NULL) + return NULL; + + if (!bfd_check_format(abfd, bfd_object)) + goto out; + + a2l = zalloc(sizeof(*a2l)); + if (a2l == NULL) 
+ goto out; + + a2l->abfd = abfd; + a2l->input = strdup(path); + if (a2l->input == NULL) + goto out; + + if (slurp_symtab(abfd, a2l)) + goto out; + + return a2l; + +out: + if (a2l) { + free((void *)a2l->input); + free(a2l); + } + bfd_close(abfd); + return NULL; +} + +static void addr2line_cleanup(struct a2l_data *a2l) +{ + if (a2l->abfd) + bfd_close(a2l->abfd); + free((void *)a2l->input); + free(a2l->syms); + free(a2l); +} + +static int addr2line(const char *dso_name, unsigned long addr, +char **file, unsigned int *line) +{ + int ret = 0; + struct a2l_data *a2l; + + a2l = addr2line_init(dso_name); + if (a2l == NULL) { + pr_warning("addr2line_init failed for %s\n", dso_name); + return 0; + } + + a2l->addr = addr; +
[PATCH 2/9] perf annotate: Reuse path from the result of addr2line
From: Namhyung Kim In the symbol__get_source_line(), path and src_line->path will have the same value, but they are allocated separately, which leaks one of them. Just reuse path as src_line->path. Signed-off-by: Namhyung Kim --- tools/perf/util/annotate.c | 6 +- 1 file changed, 1 insertion(+), 5 deletions(-) diff --git a/tools/perf/util/annotate.c b/tools/perf/util/annotate.c index d102716c43a1..8a5babc71d53 100644 --- a/tools/perf/util/annotate.c +++ b/tools/perf/util/annotate.c @@ -1091,11 +1091,7 @@ static int symbol__get_source_line(struct symbol *sym, struct map *map, if (getline(&path, &line_len, fp) < 0 || !line_len) goto next_close; - src_line->path = malloc(sizeof(char) * line_len + 1); - if (!src_line->path) - goto next_close; - - strcpy(src_line->path, path); + src_line->path = path; insert_source_line(&tmp_root, src_line); next_close: -- 1.7.11.7
[PATCH 4/9] perf tools: Factor out get/free_srcline()
From: Namhyung Kim Currently external addr2line tool is used for srcline sort key and annotate with srcline info. Separate the common code to prepare upcoming enhancements. Signed-off-by: Namhyung Kim --- tools/perf/Makefile| 1 + tools/perf/util/annotate.c | 20 ++- tools/perf/util/hist.c | 2 +- tools/perf/util/sort.c | 17 ++ tools/perf/util/srcline.c | 83 ++ tools/perf/util/util.h | 6 6 files changed, 97 insertions(+), 32 deletions(-) create mode 100644 tools/perf/util/srcline.c diff --git a/tools/perf/Makefile b/tools/perf/Makefile index bfd12d02a304..e9e496e01914 100644 --- a/tools/perf/Makefile +++ b/tools/perf/Makefile @@ -360,6 +360,7 @@ LIB_OBJS += $(OUTPUT)util/rblist.o LIB_OBJS += $(OUTPUT)util/intlist.o LIB_OBJS += $(OUTPUT)util/vdso.o LIB_OBJS += $(OUTPUT)util/stat.o +LIB_OBJS += $(OUTPUT)util/srcline.o LIB_OBJS += $(OUTPUT)ui/setup.o LIB_OBJS += $(OUTPUT)ui/helpline.o diff --git a/tools/perf/util/annotate.c b/tools/perf/util/annotate.c index 8a5babc71d53..c8a0d5762b1d 100644 --- a/tools/perf/util/annotate.c +++ b/tools/perf/util/annotate.c @@ -1018,7 +1018,7 @@ static void symbol__free_source_line(struct symbol *sym, int len) (sizeof(src_line->p) * (src_line->nr_pcnt - 1)); for (i = 0; i < len; i++) { - free(src_line->path); + free_srcline(src_line->path); src_line = (void *)src_line + sizeof_src_line; } @@ -1035,7 +1035,6 @@ static int symbol__get_source_line(struct symbol *sym, struct map *map, u64 start; int i, k; int evidx = evsel->idx; - char cmd[PATH_MAX * 2]; struct source_line *src_line; struct annotation *notes = symbol__annotation(sym); struct sym_hist *h = annotation__histogram(notes, evidx); @@ -1063,10 +1062,7 @@ static int symbol__get_source_line(struct symbol *sym, struct map *map, start = map__rip_2objdump(map, sym->start); for (i = 0; i < len; i++) { - char *path = NULL; - size_t line_len; u64 offset; - FILE *fp; double percent_max = 0.0; src_line->nr_pcnt = nr_pcnt; @@ -1083,19 +1079,9 @@ static int symbol__get_source_line(struct symbol 
*sym, struct map *map, goto next; offset = start + i; - sprintf(cmd, "addr2line -e %s %016" PRIx64, filename, offset); - fp = popen(cmd, "r"); - if (!fp) - goto next; - - if (getline(, _len, fp) < 0 || !line_len) - goto next_close; - - src_line->path = path; + src_line->path = get_srcline(filename, offset); insert_source_line(_root, src_line); - next_close: - pclose(fp); next: src_line = (void *)src_line + sizeof_src_line; } @@ -1136,7 +1122,7 @@ static void print_summary(struct rb_root *root, const char *filename) path = src_line->path; color = get_percent_color(percent_max); - color_fprintf(stdout, color, " %s", path); + color_fprintf(stdout, color, " %s\n", path); node = rb_next(node); } diff --git a/tools/perf/util/hist.c b/tools/perf/util/hist.c index e6c11ccf138c..fa695ce1662a 100644 --- a/tools/perf/util/hist.c +++ b/tools/perf/util/hist.c @@ -530,7 +530,7 @@ void hist_entry__free(struct hist_entry *he) { free(he->branch_info); free(he->mem_info); - free(he->srcline); + free_srcline(he->srcline); free(he); } diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c index 5ba29fc679c5..0d633a059df4 100644 --- a/tools/perf/util/sort.c +++ b/tools/perf/util/sort.c @@ -251,8 +251,7 @@ static int hist_entry__srcline_snprintf(struct hist_entry *self, char *bf, unsigned int width __maybe_unused) { FILE *fp = NULL; - char cmd[PATH_MAX + 2], *path = self->srcline, *nl; - size_t line_len; + char *path = self->srcline; if (path != NULL) goto out_path; @@ -263,19 +262,9 @@ static int hist_entry__srcline_snprintf(struct hist_entry *self, char *bf, if (!strncmp(self->ms.map->dso->long_name, "/tmp/perf-", 10)) goto out_ip; - snprintf(cmd, sizeof(cmd), "addr2line -e %s %016" PRIx64, -self->ms.map->dso->long_name, self->ip); - fp = popen(cmd, "r"); - if (!fp) - goto out_ip; - - if (getline(>srcline, _len, fp) < 0 || !line_len) - goto out_ip; + path = get_srcline(self->ms.map->dso->long_name, self->ip); + self->srcline = path; - nl = strchr(self->srcline, '\n'); - if (nl 
!= NULL) - *nl = '\0'; -
[PATCH 9/9] perf tools: Fix srcline sort key behavior
From: Namhyung Kim Currently the srcline sort key compares ip rather than srcline info. I guess this was due to a performance reason to run external addr2line utility. Now we have implemented the functionality inside, use the srcline info when comparing hist entries. Also constantly print "??:0" string for unknown srcline rather than printing ip. Signed-off-by: Namhyung Kim --- tools/perf/util/sort.c | 41 - 1 file changed, 20 insertions(+), 21 deletions(-) diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c index 58cfe69d3a6c..082480579861 100644 --- a/tools/perf/util/sort.c +++ b/tools/perf/util/sort.c @@ -243,33 +243,32 @@ struct sort_entry sort_sym = { static int64_t sort__srcline_cmp(struct hist_entry *left, struct hist_entry *right) { - return (int64_t)(right->ip - left->ip); + if (!left->srcline) { + if (!left->ms.map) + left->srcline = SRCLINE_UNKNOWN; + else { + struct map *map = left->ms.map; + left->srcline = get_srcline(map->dso, + map__rip_2objdump(map, left->ip)); + } + } + if (!right->srcline) { + if (!right->ms.map) + right->srcline = SRCLINE_UNKNOWN; + else { + struct map *map = right->ms.map; + right->srcline = get_srcline(map->dso, + map__rip_2objdump(map, right->ip)); + } + } + return strcmp(left->srcline, right->srcline); } static int hist_entry__srcline_snprintf(struct hist_entry *self, char *bf, size_t size, unsigned int width __maybe_unused) { - FILE *fp = NULL; - char *path = self->srcline; - - if (path != NULL) - goto out_path; - - if (!self->ms.map) - goto out_ip; - - path = get_srcline(self->ms.map->dso, self->ip); - self->srcline = path; - -out_path: - if (fp) - pclose(fp); - return repsep_snprintf(bf, size, "%s", path); -out_ip: - if (fp) - pclose(fp); - return repsep_snprintf(bf, size, "%-#*llx", BITS_PER_LONG / 4, self->ip); + return repsep_snprintf(bf, size, "%s", self->srcline); } struct sort_entry sort_srcline = { -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a 
message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 5/9] perf tools: Do not try to call addr2line for non-binary files
From: Namhyung Kim No need to call addr2line on non-binary files since they don't have such information. Signed-off-by: Namhyung Kim --- tools/perf/util/sort.c| 3 --- tools/perf/util/srcline.c | 11 +-- 2 files changed, 9 insertions(+), 5 deletions(-) diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c index 0d633a059df4..113aa92b1191 100644 --- a/tools/perf/util/sort.c +++ b/tools/perf/util/sort.c @@ -259,9 +259,6 @@ static int hist_entry__srcline_snprintf(struct hist_entry *self, char *bf, if (!self->ms.map) goto out_ip; - if (!strncmp(self->ms.map->dso->long_name, "/tmp/perf-", 10)) - goto out_ip; - path = get_srcline(self->ms.map->dso->long_name, self->ip); self->srcline = path; diff --git a/tools/perf/util/srcline.c b/tools/perf/util/srcline.c index 7e92cca6f502..777f91880cdb 100644 --- a/tools/perf/util/srcline.c +++ b/tools/perf/util/srcline.c @@ -57,11 +57,17 @@ char *get_srcline(const char *dso_name, unsigned long addr) { char *file; unsigned line; - char *srcline; + char *srcline = SRCLINE_UNKNOWN; size_t size; + if (dso_name[0] == '[') + goto out; + + if (!strncmp(dso_name, "/tmp/perf-", 10)) + goto out; + if (!addr2line(dso_name, addr, &file, &line)) - return SRCLINE_UNKNOWN; + goto out; /* just calculate actual length */ size = snprintf(NULL, 0, "%s:%u", file, line) + 1; @@ -73,6 +79,7 @@ char *get_srcline(const char *dso_name, unsigned long addr) srcline = SRCLINE_UNKNOWN; free(file); +out: return srcline; } -- 1.7.11.7
Re: [patch 3/3] mm: page_alloc: fair zone allocator policy
Hi Hannes, On Fri, Jul 19, 2013 at 04:55:25PM -0400, Johannes Weiner wrote: > Each zone that holds userspace pages of one workload must be aged at a > speed proportional to the zone size. Otherwise, the time an > individual page gets to stay in memory depends on the zone it happened > to be allocated in. Asymmetry in the zone aging creates rather > unpredictable aging behavior and results in the wrong pages being > reclaimed, activated etc. > > But exactly this happens right now because of the way the page > allocator and kswapd interact. The page allocator uses per-node lists > of all zones in the system, ordered by preference, when allocating a > new page. When the first iteration does not yield any results, kswapd > is woken up and the allocator retries. Due to the way kswapd reclaims > zones below the high watermark while a zone can be allocated from when > it is above the low watermark, the allocator may keep kswapd running > while kswapd reclaim ensures that the page allocator can keep > allocating from the first zone in the zonelist for extended periods of > time. Meanwhile the other zones rarely see new allocations and thus > get aged much slower in comparison. > > The result is that the occasional page placed in lower zones gets > relatively more time in memory, even get promoted to the active list > after its peers have long been evicted. Meanwhile, the bulk of the > working set may be thrashing on the preferred zone even though there > may be significant amounts of memory available in the lower zones. > > Even the most basic test -- repeatedly reading a file slightly bigger > than memory -- shows how broken the zone aging is. 
In this scenario, > no single page should be able to stay in memory long enough to get > referenced twice and activated, but activation happens in spades: > > $ grep active_file /proc/zoneinfo > nr_inactive_file 0 > nr_active_file 0 > nr_inactive_file 0 > nr_active_file 8 > nr_inactive_file 1582 > nr_active_file 11994 > $ cat data data data data >/dev/null > $ grep active_file /proc/zoneinfo > nr_inactive_file 0 > nr_active_file 70 > nr_inactive_file 258753 > nr_active_file 443214 > nr_inactive_file 149793 > nr_active_file 12021 > > Fix this with a very simple round robin allocator. Each zone is > allowed a batch of allocations that is proportional to the zone's > size, after which it is treated as full. The batch counters are reset > when all zones have been tried and the allocator enters the slowpath > and kicks off kswapd reclaim: > > $ grep active_file /proc/zoneinfo > nr_inactive_file 0 > nr_active_file 0 > nr_inactive_file 174 > nr_active_file 4865 > nr_inactive_file 53 > nr_active_file 860 > $ cat data data data data >/dev/null > $ grep active_file /proc/zoneinfo > nr_inactive_file 0 > nr_active_file 0 > nr_inactive_file 22 > nr_active_file 4988 > nr_inactive_file 190969 > nr_active_file 937 First of all, I want to express my appreciation for your great work! It's amazing, and I saw Zlatko prove that it improves real workloads. Thanks, Zlatko, too! So, I don't want to prevent merging, but I think we should at least discuss some issues. The concern I have is that it could aggravate low-memory pinning problems like mlock. Admittedly, I don't have a workload that pins lots of pages, but such workloads are why we introduced lowmem_reserve_ratio, as you know well, so we should cover this issue at least. The other concern of mine is the added overhead in the fast path. Sometimes we are really reluctant to add even a simple "if" condition to the fastpath, but you are adding an atomic op whenever a page is allocated, and entering the slowpath whenever all of the given zones' batch counts are zero. 
Yes, it's not really a slow path, because it can return to normal status without calling the significantly slow functions, since prepare_slowpath just resets the batch counts. I think it's a tradeoff, and I am biased toward your approach, because even if we lose a little performance to the fastpath overhead, fair aging would recover the loss. But who knows? Someone may have a concern, so we should mention such problems. > > Signed-off-by: Johannes Weiner > --- > include/linux/mmzone.h | 1 + > mm/page_alloc.c| 39 +-- > 2 files changed, 30 insertions(+), 10 deletions(-) > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > index af4a3b7..0c41d59 100644 > --- a/include/linux/mmzone.h > +++ b/include/linux/mmzone.h > @@ -352,6 +352,7 @@ struct zone { >* free areas of different sizes >*/ > spinlock_t lock; > + atomic_t alloc_batch; > int all_unreclaimable; /* All pages pinned */ > #if defined CONFIG_COMPACTION || defined CONFIG_CMA > /* Set to true when the PG_migrate_skip bits should be cleared */ > diff --git a/mm/page_alloc.c b/mm/page_alloc.c >
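To make the discussion concrete, here is a toy user-space model of the batch scheme being debated (illustrative only; the real patch uses per-zone atomic counters and resets them in prepare_slowpath): each zone gets a batch proportional to its size, and the fastpath walks zones in preference order.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the fair allocator policy: once every zone's batch is
 * spent, the "slowpath" resets them all, so allocations spread over
 * the zones proportionally instead of hammering the first zone in the
 * zonelist. */
struct toy_zone {
    int size;       /* stands in for the zone's managed pages */
    int batch;      /* remaining fastpath allocations */
    int allocated;  /* statistics for inspection */
};

static void reset_batches(struct toy_zone *z, size_t n)
{
    for (size_t i = 0; i < n; i++)
        z[i].batch = z[i].size;   /* proportional to zone size */
}

static struct toy_zone *toy_alloc_page(struct toy_zone *z, size_t n)
{
    for (;;) {
        /* fastpath: walk zones in preference order */
        for (size_t i = 0; i < n; i++) {
            if (z[i].batch > 0) {
                z[i].batch--;
                z[i].allocated++;
                return &z[i];
            }
        }
        reset_batches(z, n);      /* "slowpath": all batches spent */
    }
}
```

With two zones of size 3 and 1, eight allocations land 6:2, matching the 3:1 size ratio rather than all going to the preferred zone.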
[PATCH 6/9] perf tools: Pass dso instead of dso_name to get_srcline()
From: Namhyung Kim This is a preparation of next change. No functional changes are intended. Signed-off-by: Namhyung Kim --- tools/perf/util/annotate.c | 11 --- tools/perf/util/dso.h | 1 + tools/perf/util/sort.c | 2 +- tools/perf/util/srcline.c | 4 +++- tools/perf/util/util.h | 4 +++- 5 files changed, 12 insertions(+), 10 deletions(-) diff --git a/tools/perf/util/annotate.c b/tools/perf/util/annotate.c index c8a0d5762b1d..0137f028c0d6 100644 --- a/tools/perf/util/annotate.c +++ b/tools/perf/util/annotate.c @@ -1029,8 +1029,7 @@ static void symbol__free_source_line(struct symbol *sym, int len) /* Get the filename:line for the colored entries */ static int symbol__get_source_line(struct symbol *sym, struct map *map, struct perf_evsel *evsel, - struct rb_root *root, int len, - const char *filename) + struct rb_root *root, int len) { u64 start; int i, k; @@ -1079,7 +1078,7 @@ static int symbol__get_source_line(struct symbol *sym, struct map *map, goto next; offset = start + i; - src_line->path = get_srcline(filename, offset); + src_line->path = get_srcline(map->dso, offset); insert_source_line(_root, src_line); next: @@ -1286,7 +1285,6 @@ int symbol__tty_annotate(struct symbol *sym, struct map *map, bool full_paths, int min_pcnt, int max_lines) { struct dso *dso = map->dso; - const char *filename = dso->long_name; struct rb_root source_line = RB_ROOT; u64 len; @@ -1296,9 +1294,8 @@ int symbol__tty_annotate(struct symbol *sym, struct map *map, len = symbol__size(sym); if (print_lines) { - symbol__get_source_line(sym, map, evsel, _line, - len, filename); - print_summary(_line, filename); + symbol__get_source_line(sym, map, evsel, _line, len); + print_summary(_line, dso->long_name); } symbol__annotate_printf(sym, map, evsel, full_paths, diff --git a/tools/perf/util/dso.h b/tools/perf/util/dso.h index d51aaf272c68..94c2f1a8b116 100644 --- a/tools/perf/util/dso.h +++ b/tools/perf/util/dso.h @@ -5,6 +5,7 @@ #include #include "types.h" #include "map.h" +#include "build-id.h" 
enum dso_binary_type { DSO_BINARY_TYPE__KALLSYMS = 0, diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c index 113aa92b1191..58cfe69d3a6c 100644 --- a/tools/perf/util/sort.c +++ b/tools/perf/util/sort.c @@ -259,7 +259,7 @@ static int hist_entry__srcline_snprintf(struct hist_entry *self, char *bf, if (!self->ms.map) goto out_ip; - path = get_srcline(self->ms.map->dso->long_name, self->ip); + path = get_srcline(self->ms.map->dso, self->ip); self->srcline = path; out_path: diff --git a/tools/perf/util/srcline.c b/tools/perf/util/srcline.c index 777f91880cdb..c736d9428cf2 100644 --- a/tools/perf/util/srcline.c +++ b/tools/perf/util/srcline.c @@ -4,6 +4,7 @@ #include +#include "util/dso.h" #include "util/util.h" #include "util/debug.h" @@ -53,11 +54,12 @@ out: return ret; } -char *get_srcline(const char *dso_name, unsigned long addr) +char *get_srcline(struct dso *dso, unsigned long addr) { char *file; unsigned line; char *srcline = SRCLINE_UNKNOWN; + char *dso_name = dso->long_name; size_t size; if (dso_name[0] == '[') diff --git a/tools/perf/util/util.h b/tools/perf/util/util.h index 9a0bbe45f642..20dd24d8f338 100644 --- a/tools/perf/util/util.h +++ b/tools/perf/util/util.h @@ -282,7 +282,9 @@ void get_term_dimensions(struct winsize *ws); #define SRCLINE_UNKNOWN ((char *) "??:0") -char *get_srcline(const char *dso_name, unsigned long addr); +struct dso; + +char *get_srcline(struct dso *dso, unsigned long addr); void free_srcline(char *srcline); #endif /* GIT_COMPAT_UTIL_H */ -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC][PATCH 3/4] tracing/kprobes: Fail to unregister if probe event files are open
Here's the new change log, but the same patch. Does this sound ok to you guys? -- Steve

From 40c32592668b727cbfcf7b1c0567f581bd62a5e4 Mon Sep 17 00:00:00 2001
From: "Steven Rostedt (Red Hat)"
Date: Wed, 3 Jul 2013 23:33:50 -0400
Subject: [PATCH] tracing/kprobes: Fail to unregister if probe event files are in use

When a probe is being removed, it cleans up the event files that correspond
to the probe. But there is a race between writing to one of these files and
deleting the probe. This is especially true for the "enable" file.

  CPU 0                                 CPU 1
  -----                                 -----
  fd = open("enable",O_WRONLY);

                                        probes_open()
                                        release_all_trace_probes()
                                        unregister_trace_probe()
                                        if (trace_probe_is_enabled(tp))
                                                return -EBUSY

  write(fd, "1", 1)
   __ftrace_set_clr_event()
   call->class->reg()
    (kprobe_register)
     enable_trace_probe(tp)

                                        __unregister_trace_probe(tp);
                                        list_del(&tp->list)
                                        unregister_probe_event(tp) <-- fails!
                                        free_trace_probe(tp)

  write(fd, "0", 1)
   __ftrace_set_clr_event()
   call->class->unreg
    (kprobe_register)
     disable_trace_probe(tp) <-- BOOM!

A test program was written that used two threads to simulate the above
scenario adding a nanosleep() interval to change the timings and after
several thousand runs, it was able to trigger this bug and crash:

BUG: unable to handle kernel paging request at 000500f9
IP: [] probes_open+0x3b/0xa7
PGD 7808a067 PUD 0
Oops: [#1] PREEMPT SMP
Dumping ftrace buffer: -
Modules linked in: ipt_MASQUERADE sunrpc ip6t_REJECT nf_conntrack_ipv6
CPU: 1 PID: 2070 Comm: test-kprobe-rem Not tainted 3.11.0-rc3-test+ #47
Hardware name: To Be Filled By O.E.M.
To Be Filled By O.E.M./To be filled by O.E.M., BIOS SDBLI944.86P 05/08/2007 task: 880077756440 ti: 880076e52000 task.ti: 880076e52000 RIP: 0010:[] [] probes_open+0x3b/0xa7 RSP: 0018:880076e53c38 EFLAGS: 00010203 RAX: 00050001 RBX: 88007844f440 RCX: 0003 RDX: 0003 RSI: 0003 RDI: 880076e52000 RBP: 880076e53c58 R08: 880076e53bd8 R09: R10: 880077756440 R11: 0006 R12: 810dee35 R13: 880079250418 R14: R15: 88007844f450 FS: 7f87a276f700() GS:88007d48() knlGS: CS: 0010 DS: ES: CR0: 8005003b CR2: 000500f9 CR3: 77262000 CR4: 07e0 Stack: 880076e53c58 81219ea0 88007844f440 810dee35 880076e53ca8 81130f78 8800772986c0 8800796f93a0 81d1b5d8 880076e53e04 88007844f440 Call Trace: [] ? security_file_open+0x2c/0x30 [] ? unregister_trace_probe+0x4b/0x4b [] do_dentry_open+0x162/0x226 [] finish_open+0x46/0x54 [] do_last+0x7f6/0x996 [] ? inode_permission+0x42/0x44 [] path_openat+0x232/0x496 [] do_filp_open+0x3a/0x8a [] ? __alloc_fd+0x168/0x17a [] do_sys_open+0x70/0x102 [] ? trace_hardirqs_on_caller+0x160/0x197 [] SyS_open+0x1e/0x20 [] system_call_fastpath+0x16/0x1b Code: e5 41 54 53 48 89 f3 48 83 ec 10 48 23 56 78 48 39 c2 75 6c 31 f6 48 c7 RIP [] probes_open+0x3b/0xa7 RSP CR2: 000500f9 ---[ end trace 35f17d68fc569897 ]--- The unregister_trace_probe() must be done first, and if it fails it must fail the removal of the kprobe. Several changes have already been made by Oleg Nesterov and Masami Hiramatsu to allow moving the unregister_probe_event() before the removal of the probe and exit the function if it fails. This prevents the tp structure from being used after it is freed. 
Link: http://lkml.kernel.org/r/20130704034038.819592...@goodmis.org Acked-by: Masami Hiramatsu Signed-off-by: Steven Rostedt --- kernel/trace/trace_kprobe.c | 21 +++-- 1 file changed, 15 insertions(+), 6 deletions(-) diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c index 3811487..243f683 100644 --- a/kernel/trace/trace_kprobe.c +++ b/kernel/trace/trace_kprobe.c @@ -95,7 +95,7 @@ static __kprobes bool trace_probe_is_on_module(struct trace_probe *tp) } static int register_probe_event(struct trace_probe *tp); -static void unregister_probe_event(struct trace_probe *tp); +static int unregister_probe_event(struct trace_probe *tp); static DEFINE_MUTEX(probe_lock); static LIST_HEAD(probe_list); @@ -351,9 +351,12 @@ static int unregister_trace_probe(struct trace_probe *tp) if (trace_probe_is_enabled(tp)) return -EBUSY; + /* Will fail if probe is being used by ftrace or perf */ + if
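The ordering the patch enforces can be modelled in a few lines of plain userspace C. This is a toy simulation with made-up types, not the real trace_kprobe code: unregister_probe_event() may fail while an event file is open, and in that case the probe must not be freed, so a later file operation still finds a live object instead of freed memory.

```c
#include <assert.h>
#include <stdlib.h>

struct probe_sim {
	int enabled;
	int file_open;	/* stands in for an open "enable" event file */
};

static int unregister_probe_event_sim(struct probe_sim *p)
{
	return p->file_open ? -16 /* -EBUSY */ : 0;
}

static int unregister_trace_probe_sim(struct probe_sim **slot)
{
	struct probe_sim *p = *slot;
	int ret;

	if (p->enabled)
		return -16;			/* -EBUSY */
	/* Will fail if the probe is being used by an open event file */
	ret = unregister_probe_event_sim(p);
	if (ret)
		return ret;			/* probe stays fully alive */
	*slot = NULL;
	free(p);	/* freed only after unregister succeeded */
	return 0;
}
```

The key point mirrored from the changelog: the failure path returns before `free()`, so the "write(fd, "0", 1)" side of the race never dereferences a freed probe.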
Re: [PATCH 3/3] ext4: Implement FALLOC_FL_COLLAPSE_RANGE
On Wed, Jul 31, 2013 at 11:42:26PM +0900, Namjae Jeon wrote: > From: Namjae Jeon > > New fallocate flag FALLOC_FL_COLLAPSE_RANGE implementation for ext4. > + > + punch_start = offset >> EXT4_BLOCK_SIZE_BITS(sb); > + punch_stop = (offset + len) >> EXT4_BLOCK_SIZE_BITS(sb); > + > + rounding = max_t(uint, 1 << EXT4_BLOCK_SIZE_BITS(sb), PAGE_CACHE_SIZE); > + ioffset = offset & ~(rounding - 1); > + > + /* Write out all dirty pages */ > + ret = filemap_write_and_wait_range(inode->i_mapping, ioffset, -1); > + if (ret) > + return ret; > + > + /* Take mutex lock */ > + mutex_lock(&inode->i_mutex); > + > + truncate_pagecache_range(inode, ioffset, -1); Ted, that's invalidating the page cache from the start of the collapse range to the end of the file. So the ext4 code is doing this bit correctly. Why isn't this in the XFS patches? Clearly the need for this was understood, and, well, this code is obviously copied from the XFS hole punching code, i.e. from xfs_free_file_space(). > + /* Wait for existing dio to complete */ > + ext4_inode_block_unlocked_dio(inode); > + inode_dio_wait(inode); That should be done before invalidating the pagecache. > + credits = ext4_writepage_trans_blocks(inode); > + handle = ext4_journal_start(inode, EXT4_HT_TRUNCATE, credits); > + if (IS_ERR(handle)) { > + ret = PTR_ERR(handle); > + goto out_dio; > + } > + > + down_write(&EXT4_I(inode)->i_data_sem); > + > + ext4_discard_preallocations(inode); > + > + ret = ext4_es_remove_extent(inode, punch_start, > + EXT_MAX_BLOCKS - punch_start - 1); > + if (ret) > + goto journal_stop; > + > + ret = ext4_ext_remove_space(inode, punch_start, punch_stop - 1); > + if (ret) > + goto journal_stop; So, this code punches out the existing space in the file so that the extent shifting is moving extents into a hole. Why is this in the ext4 code, but not the XFS code? Cheers, Dave. 
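The block and page arithmetic in the quoted hunk can be illustrated in isolation. This is a standalone sketch with assumed values (4K file-system blocks and 4K pages), not ext4 code: punch_start/punch_stop are file-system block numbers, and ioffset is the collapse offset rounded down so whole pages can be written back and dropped from the page cache.

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK_SIZE_BITS 12u	/* assumption: 4K file-system blocks */
#define PAGE_CACHE_SIZE 4096u	/* assumption: 4K pages */

static void collapse_prepare(uint64_t offset, uint64_t len,
			     uint64_t *punch_start, uint64_t *punch_stop,
			     uint64_t *ioffset)
{
	uint64_t blocksz = 1u << BLOCK_SIZE_BITS;
	uint64_t rounding = blocksz > PAGE_CACHE_SIZE ? blocksz
						      : PAGE_CACHE_SIZE;

	*punch_start = offset >> BLOCK_SIZE_BITS;	   /* first removed block */
	*punch_stop = (offset + len) >> BLOCK_SIZE_BITS;   /* first kept block */
	*ioffset = offset & ~(rounding - 1);		   /* aligned writeback start */
}
```

For example, collapsing 4K at file offset 8K removes block 2 and invalidates the cache from offset 8192 onward.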
-- Dave Chinner da...@fromorbit.com
Re: [PATCH v14 0/6] LSM: Multiple concurrent LSMs
On Thu, Jul 25, 2013 at 11:52 PM, Casey Schaufler wrote: > Subject: [PATCH v14 0/6] LSM: Multiple concurrent LSMs > > Version 14 of this patchset is based on v3.10. > It required significant change from version 13 due to changes > in the audit code. It came out cleaner, especially in the changes > to NetLabel. This version supports all existing LSMs running > together at the same time. The combinations tested most completely > are: > > apparmor,tomoyo,smack,yama - Ubuntu > apparmor,selinux,smack,yama - Fedora > Does this change the way one would develop a new LSM module? I presume it does not > I have been unable to figure out how to configure SELinux on > Ubuntu and TOMOYO on Fedora. That's the only reason the list > does not include all five LSMs at once. Combining LSMs that > use networking is tricky, but can be done. There are changes > coming from AppArmor that might make it even trickier, but > that's a problem for the future. > > > Change the infrastructure for Linux Security Modules (LSM)s from a > single vector of hook handlers to a list based method for handling > multiple concurrent modules. All combinations of existing LSMs > are supported. > > The "security=" boot option takes a comma separated list of LSMs, > registering them in the order presented. The LSM hooks will be > executed in the order registered. Hooks that return errors are > not short circuited. All hooks are called even if one of the LSM > hooks fails. The result returned will be that of the last LSM > hook that failed. > This is an important design trade-off. From my perspective I think you might want to revisit this, today it sounds like effective security == all hooks process and allow the operation. In this world a lack of proper policy/setting can make hooks fail. I've not yet looked at the code, but you might want to revisit this. > All behavior from security/capability.c has been moved into > the hook handling. 
The security/commoncap functions used > to get called from the LSM specific code. The handling of the > capability functions has been moved out of the LSMs and into the > hook handling. > > A level of indirection has been introduced in the handling of > security blobs. LSMs no longer access ->security fields directly, > instead they use an abstraction provided by lsm_[gs]et field > functions. > > The notion that "the security context" can be represented as a > single u32 "secid" does not scale to the case where multiple LSMs > want to provide "the security context". The XFRM and secmark > facilities appear unlikely to ever allow for more than the existing > 32 bit values. The NetLabel scheme might possibly be used to > represent more than one labeling scheme (CIPSO does allow for > multiple tags) although there is no plan to do so at this time. > The SO_PEERSEC scheme is capable of providing information from > multiple LSMs. Auditing can deal with multiple secids. > > The NetLabel, XFRM and secmark facilities are restricted to use > by one LSM at a time. The SO_PEERSEC facility can provide information > from multiple LSMs, but existing user space tools don't understand > that. The default behavior is to assign each of these facilities > to the first registered LSM that uses them. They can be configured > for use by any of the LSMs that provide hooks for them. SO_PEERSEC > can be configured to provide information from all of the LSMs that > provide hooks. > > The /proc/*/attr interfaces are given to one LSM. This can be > done by setting CONFIG_SECURITY_PRESENT. Additional interfaces > have been created in /proc/*/attr so that each LSM has its own > named interfaces. The name of the presenting LSM can be read from > /sys/kernel/security/present. The list of LSMs being used can be > read from /sys/kernel/security/lsm. > > A "security context" may now contain information processed by > more than one LSM. 
The proper form of a security context identifies > the information it contains by LSM: > > smack='Pop'selinux='system_u:object_r:etc_r:s0' > > A security context without the LSM identifying lsm='' gets > passed through to all of the LSMs that use a security context. This > maintains compatibility in the case where there is only one LSM > using the security context. > > Signed-off-by: Casey Schaufler Balbir Singh
Re: [Ksummit-2013-discuss] [ARM ATTEND] catching up on exploit mitigations
On Wed, Jul 31, 2013 at 7:24 AM, Dave Jones wrote: > On Wed, Jul 31, 2013 at 10:40:12AM +0100, Russell King - ARM Linux wrote: > > On Tue, Jul 30, 2013 at 08:04:44PM -0400, Dave Jones wrote: > > > To use ARM as an example, the bugs I've seen have mostly been in arch > > > specific code that does things like page-table manipulation. The chromebook bugs > > > I was hitting for eg were various kinds of PTE corruption warnings. > > > > Hmm, really? Did you report these bugs? I'm not aware of mainline > > having any changes related to bug reports on PTEs on ARM. > > I wasn't sure if it was a googleism, or happens on mainline, so no. We'd definitely be interested in hearing about what you've found. The 3.4 kernel has a lot of out-of-tree code for exynos; we've reduced it quite a bit on the upcoming 3.8 one, but it needs a bit more baking. As of 3.10, it's not actually hard to run a mainline kernel on the chromebook, but you have limited functionality (no wifi, no USB3, no accelerated graphics). Easiest is to do it by booting from an SD card. See instructions at http://www.chromium.org/chromium-os/u-boot-porting-guide/using-nv-u-boot-on-the-samsung-arm-chromebook. How long would a useful run of trinity take? I've got an autobuilder/autobooter setup here with a cross-section of current ARM hardware (7 platforms and counting) that I mostly do boot testing on, and I should add a smallish suite of testcases to the same. Trinity would be a good candidate for that. -Olof
[PATCH RFC 2/2] qspinlock x86: Enable x86 to use queue spinlock
This patch makes the necessary changes at the x86 architecture specific layer to enable the presence of the CONFIG_QUEUE_SPINLOCK kernel option which can be used to replace the ticket spinlock by the queue spinlock. The turning on of ARCH_QUEUE_SPINLOCK config option does not mean that the ticket spinlock will be replaced. It only means that the architecture supports the replacement. Actual replacement will only happen if the QUEUE_SPINLOCK config option is also set. Signed-off-by: Waiman Long --- arch/x86/Kconfig |3 +++ arch/x86/include/asm/spinlock.h |2 ++ arch/x86/include/asm/spinlock_types.h |4 arch/x86/kernel/paravirt-spinlocks.c | 10 ++ 4 files changed, 19 insertions(+), 0 deletions(-) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index b32ebf9..8531e47 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -2344,6 +2344,9 @@ config X86_DMA_REMAP bool depends on STA2X11 +config ARCH_QUEUE_SPINLOCK + def_bool y + source "net/Kconfig" source "drivers/Kconfig" diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h index 33692ea..0ea805d 100644 --- a/arch/x86/include/asm/spinlock.h +++ b/arch/x86/include/asm/spinlock.h @@ -34,6 +34,7 @@ # define UNLOCK_LOCK_PREFIX #endif +#ifndef CONFIG_QUEUE_SPINLOCK /* * Ticket locks are conceptually two parts, one indicating the current head of * the queue, and the other indicating the current tail. 
The lock is acquired @@ -130,6 +131,7 @@ static __always_inline void arch_spin_lock_flags(arch_spinlock_t *lock, } #endif /* CONFIG_PARAVIRT_SPINLOCKS */ +#endif /* !CONFIG_QUEUE_SPINLOCK */ static inline void arch_spin_unlock_wait(arch_spinlock_t *lock) { diff --git a/arch/x86/include/asm/spinlock_types.h b/arch/x86/include/asm/spinlock_types.h index ad0ad07..8a7721e 100644 --- a/arch/x86/include/asm/spinlock_types.h +++ b/arch/x86/include/asm/spinlock_types.h @@ -7,6 +7,9 @@ #include +#ifdef CONFIG_QUEUE_SPINLOCK +# include +#else #if (CONFIG_NR_CPUS < 256) typedef u8 __ticket_t; typedef u16 __ticketpair_t; @@ -27,6 +30,7 @@ typedef struct arch_spinlock { } arch_spinlock_t; #define __ARCH_SPIN_LOCK_UNLOCKED { { 0 } } +#endif #include diff --git a/arch/x86/kernel/paravirt-spinlocks.c b/arch/x86/kernel/paravirt-spinlocks.c index 676b8c7..75569cb 100644 --- a/arch/x86/kernel/paravirt-spinlocks.c +++ b/arch/x86/kernel/paravirt-spinlocks.c @@ -15,6 +15,15 @@ default_spin_lock_flags(arch_spinlock_t *lock, unsigned long flags) struct pv_lock_ops pv_lock_ops = { #ifdef CONFIG_SMP +# ifdef CONFIG_QUEUE_SPINLOCK + .spin_is_locked = queue_spin_is_locked, + .spin_is_contended = queue_spin_is_contended, + + .spin_lock = queue_spin_lock, + .spin_lock_flags = default_spin_lock_flags, + .spin_trylock = queue_spin_trylock, + .spin_unlock = queue_spin_unlock, +# else .spin_is_locked = __ticket_spin_is_locked, .spin_is_contended = __ticket_spin_is_contended, @@ -22,6 +31,7 @@ struct pv_lock_ops pv_lock_ops = { .spin_lock_flags = default_spin_lock_flags, .spin_trylock = __ticket_spin_trylock, .spin_unlock = __ticket_spin_unlock, +# endif #endif }; EXPORT_SYMBOL(pv_lock_ops); -- 1.7.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH RFC 1/2] qspinlock: Introducing a 4-byte queue spinlock implementation
This patch introduces a new queue spinlock implementation that can serve as an alternative to the default ticket spinlock. Compared with the ticket spinlock, this queue spinlock should be just as fair. It is faster both in single-thread and in high-contention situations. Only in light to moderate contention, where the average queue depth is around 1-2, will this queue spinlock be potentially a bit slower due to the higher slowpath overhead. This queue spinlock is especially suited to NUMA machines with a large number of cores, as the chance of spinlock contention is much higher in those machines. The idea behind this spinlock implementation is the fact that spinlocks are acquired with preemption disabled. In other words, the process will not be migrated to another CPU while it is trying to get a spinlock. Ignoring interrupt handling, a CPU can only be contending on one spinlock at any one time. Of course, an interrupt handler can try to acquire one spinlock while the interrupted user process is in the process of getting another spinlock. By allocating a set of per-cpu queue nodes and using them to form a waiting queue, we can encode the queue node address into a much smaller 16-bit size. Together with the 1-byte lock bit, this queue spinlock implementation will only need 4 bytes to hold all the information that it needs. The current queue node address encoding is as follows:

  Bits 0-1 : queue node index in the per-cpu array (4 entries)
  Bits 2-16: cpu number + 1 (max cpus = 16383)

By also using the unused byte, we could support more cpus and queue node entries in the array at the expense of more code and probably some performance overhead. In the extremely unlikely case that all the queue node entries are used up, the current code will fall back to busy spinning without waiting in a queue, with a warning message. With a minor modification to the lock fastpath, we could also make a less fair version of the queue spinlock that allows lock stealing. 
This can potentially provide better performance for some workloads.

This patch allows the optional replacement of the architecture-specific implementation of the ticket spinlock by this generic version of the queue spinlock. Two new config parameters are introduced:

1. QUEUE_SPINLOCK - a selectable option that enables the building and replacement of the architecture-specific spinlock by the queue spinlock.
2. ARCH_QUEUE_SPINLOCK - has to be defined in arch/$(ARCH)/Kconfig to enable the QUEUE_SPINLOCK option. This option, by itself, will not enable the queue spinlock feature.

For single-thread performance (no contention), a 256K lock/unlock loop was run on a 2.93GHz Westmere x86-64 CPU. The following table shows the average time (in ns) for a single lock/unlock sequence (including the looping and timing overhead):

  Lock Type        Time (ns)
  ---------        ---------
  Ticket spinlock    12.1
  Queue spinlock     10.9

So the queue spinlock is slightly faster. The AIM7 benchmark was run on an 8-socket 80-core DL980 with Westmere x86-64 CPUs to evaluate the performance impact of this patch on the 3.10.1 kernel. 
The jobs per minute (JPM) results for the 1100-2000 user range are shown below:

+------------+--------+--------+--------+--------+--------+--------+
|   Kernel   | 3.10.1 | 3.10.1 |%Change | 3.10.1 | 3.10.1 |%Change |
|            |        | qlock  |        |        | qlock  |        |
|     HT     |  off   |  off   |        |   on   |   on   |        |
+------------+--------+--------+--------+--------+--------+--------+
|alltests    | 403590 | 397622 |  -1.5% | 411479 | 411428 |   0.0% |
|custom      | 304581 | 307165 |  +0.9% | 230335 | 254798 | +10.6% |
|dbase       | 896252 | 900597 |  +0.5% |1134659 |1137730 |  +0.3% |
|disk        | 213679 | 270597 | +26.6% | 203974 | 249998 | +22.6% |
|five_sec    | 144012 | 157286 |  +9.2% | 132599 | 150812 | +13.7% |
|fserver     | 472741 | 467213 |  -1.2% | 270601 | 285737 |  +5.6% |
|high_systime| 131106 | 139015 |  +6.0% | 120294 | 125802 |  +4.6% |
|new_dbase   | 934871 | 936860 |  +0.2% |1177640 |1181679 |  +0.3% |
|new_fserver | 452544 | 444967 |  -1.7% | 262368 | 275827 |  +5.1% |
|shared      | 366750 | 369901 |  +0.9% | 403645 | 415840 |  +3.0% |
|short       |1061663 |3003942 |+183.0% |1040392 |2834437 |+172.4% |
+------------+--------+--------+--------+--------+--------+--------+

There are cases where the performance drops 1-2%, but the majority of them are performance gains and some of them are pretty significant. For the disk workload, the spinlock bottleneck is the standalone mb_cache_spinlock in fs/mbcache.c. For the short workload, the spinlock bottleneck is the d_lock in the dentry structure. The following table shows the %time used up by the locks as reported by the perf command at 1000 users for the disk workload & 1500 users for the short workload with HT off. -+--+--+--+ Lock | ticket | qlock | qlock |
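The 16-bit queue-node code described in the changelog can be sketched as a pair of helpers. This is my reading of the description, not the actual kernel encoding: the low two bits carry the index into the 4-entry per-cpu node array, the bits above carry cpu + 1 so that a code of 0 can mean "queue empty"; together with a 1-byte lock byte this fits the 4-byte lock word.

```c
#include <assert.h>
#include <stdint.h>

static uint16_t encode_tail(unsigned int cpu, unsigned int idx)
{
	/* cpu + 1 so that code 0 is never a valid node reference */
	return (uint16_t)(((cpu + 1) << 2) | (idx & 3));
}

static void decode_tail(uint16_t code, unsigned int *cpu, unsigned int *idx)
{
	*idx = code & 3;
	*cpu = (code >> 2) - 1;
}
```

With this layout, cpu 16382 and node index 3 encode to 0xffff, the largest representable tail, which matches the "max cpus = 16383" figure in the changelog.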
Re: [PATCH V2 1/9] perf tools: add test for reading object code
Hi Peter, On Wed, Jul 31, 2013 at 11:24 PM, Peter Zijlstra wrote: > On Wed, Jul 31, 2013 at 11:17:32AM -0300, Arnaldo Carvalho de Melo wrote: >> Em Wed, Jul 31, 2013 at 12:13:50AM +0300, Adrian Hunter escreveu: >> > Using the information in mmap events, perf tools can read object >> > code associated with sampled addresses. A test is added that >> > compares bytes read by perf with the same bytes read using >> > objdump. >> >> So this parses objdump output, and we also already have the annotation >> logic that does that too, have you thought about having common routines >> for these two cases? >> > Or better yet, stop using objdump like this and start using libbfd > directly. The only reason we did horrible things like parsing objdump > output is because nobody knew how the underlying stuff actually worked > and we wanted to have something quick. I have a similar patch for getting srcline info (using addr2line) based on Roberto Vitillo's patch. I'll send the series soon and then look at the objdump too. -- Thanks, Namhyung Kim
[PATCH] mmc: omap_hsmmc: Fix sleep too long in ISR context.
We found a problem when removing a working SD card: the irqaction of omap_hsmmc could spin for up to 3.6s, which caused our watchdog to fire. In omap_hsmmc_reset_controller_fsm(), the code watches for a 0->1 transition and uses loops_per_jiffy as the timeout. The code is: > while ((!(OMAP_HSMMC_READ(host->base, SYSCTL) & bit)) > && (i++ < limit)) > cpu_relax(); But loops_per_jiffy is calibrated against a bare loop: > while (i++ < limit) > cpu_relax(); The extra work in the loop body makes each iteration take longer, so the total wait grows far beyond the intended timeout. Because this code runs in ISR context, time_before()/time_after() cannot be used. Instead, divide the timeout into 1us steps and use udelay(1) between polls. This does an additional udelay(1) per iteration, but from my testing it looks good. Reported-by: Yuzheng Ma Tested-by: Yuzheng Ma Signed-off-by: Jianpeng Ma --- drivers/mmc/host/omap_hsmmc.c | 20 +--- 1 file changed, 13 insertions(+), 7 deletions(-) diff --git a/drivers/mmc/host/omap_hsmmc.c b/drivers/mmc/host/omap_hsmmc.c index 1865321..96daca1 100644 --- a/drivers/mmc/host/omap_hsmmc.c +++ b/drivers/mmc/host/omap_hsmmc.c @@ -977,6 +977,8 @@ static inline void omap_hsmmc_reset_controller_fsm(struct omap_hsmmc_host *host, unsigned long limit = (loops_per_jiffy * msecs_to_jiffies(MMC_TIMEOUT_MS)); + /* Divide the timeout into 1us units so we can use udelay(1) */ + i = limit / (MMC_TIMEOUT_MS * 1000); OMAP_HSMMC_WRITE(host->base, SYSCTL, OMAP_HSMMC_READ(host->base, SYSCTL) | bit); @@ -985,15 +987,19 @@ static inline void omap_hsmmc_reset_controller_fsm(struct omap_hsmmc_host *host, * Monitor a 0->1 transition first */ if (mmc_slot(host).features & HSMMC_HAS_UPDATED_RESET) { - while ((!(OMAP_HSMMC_READ(host->base, SYSCTL) & bit)) - && (i++ < limit)) - cpu_relax(); + while (i--) { + if ((OMAP_HSMMC_READ(host->base, SYSCTL) & bit)) + break; + udelay(1); + } } - i = 0; - while ((OMAP_HSMMC_READ(host->base, SYSCTL) & bit) && - (i++ < limit)) - cpu_relax(); + i = limit / (MMC_TIMEOUT_MS * 1000); + while (i--) { + if (!(OMAP_HSMMC_READ(host->base, SYSCTL) & bit)) + break; + udelay(1); + } if 
(OMAP_HSMMC_READ(host->base, SYSCTL) & bit) dev_err(mmc_dev(host->mmc), -- 1.8.1.2 Thanks! Jianpeng Ma
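The general shape of the fix can be shown outside the driver. This is a userspace sketch with a fake register and a stub delay, not the omap_hsmmc code: instead of counting loops_per_jiffy iterations of cpu_relax(), a count that is only accurate for an empty loop body, the wait is bounded in 1us ticks so the worst case stays close to MMC_TIMEOUT_MS no matter what the loop body costs.

```c
#include <assert.h>

#define MMC_TIMEOUT_MS 20

static unsigned int us_delayed;		/* total stubbed delay, for inspection */

static void udelay_stub(unsigned int us)
{
	us_delayed += us;		/* stands in for the kernel's udelay() */
}

/* Poll until (*reg & bit) matches want (0 or 1), or ~MMC_TIMEOUT_MS passes. */
static int poll_bit(const volatile unsigned int *reg, unsigned int bit, int want)
{
	unsigned int i = MMC_TIMEOUT_MS * 1000;	/* 1 us per iteration */

	while (i--) {
		if (!!(*reg & bit) == want)
			return 0;
		udelay_stub(1);
	}
	return -1;			/* reset timed out */
}
```

The same helper covers both the 0->1 and the 1->0 wait in the patch, by passing `want` as 1 or 0 respectively.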
Re: [PATCH 4/9] dma: edma: Find missed events and issue them
On 07/31/2013 04:18 AM, Sekhar Nori wrote: > On Wednesday 31 July 2013 10:19 AM, Joel Fernandes wrote: >> Hi Sekhar, >> >> On 07/30/2013 02:05 AM, Sekhar Nori wrote: >>> On Monday 29 July 2013 06:59 PM, Joel Fernandes wrote: In an effort to move to using scatter-gather lists of any size with EDMA as discussed at [1] instead of placing limitations on the driver, we work through the limitations of the EDMAC hardware to find missed events and issue them. The sequence of events that require this are: For the scenario where MAX slots for an EDMA channel is 3: SG1 -> SG2 -> SG3 -> SG4 -> SG5 -> SG6 -> Null The above SG list will have to be DMA'd in 2 sets: (1) SG1 -> SG2 -> SG3 -> Null (2) SG4 -> SG5 -> SG6 -> Null After (1) is successfully transferred, the events from the MMC controller do not stop coming and are missed by the time we have set up the transfer for (2). So here, we catch the missed events as an error condition and issue them manually. >>> >>> Are you sure there won't be any effect of these missed events on the >>> peripheral side? For example, won't McASP get into an underrun condition >>> when it encounters a null PaRAM set? Even UART has to transmit at a >> >> But it will not encounter a null PaRAM set because McASP uses contiguous >> buffers for transfer which are not scattered across physical memory. >> This can be accomplished with an SG of size 1. For such SGs, this patch >> series leaves it linked to Dummy and does not link to the Null set. The Null set is >> only used for SG lists that are > MAX_NR_SG in size, such as those >> created for example by MMC and Crypto. >> >>> particular baud so I guess it cannot wait like the way MMC/SD can. >> >> Existing drivers have to wait anyway if they hit the MAX SG limit today. If >> they don't want to wait, they would have allocated a contiguous block of >> memory and DMA'd that in one stretch so they don't lose any events, and in >> such cases we are not linking to Null. 
> > As long as the DMA driver can advertise its MAX SG limit, peripherals can > always work around that by limiting the number of sync events they > generate so that none of the events get missed. With this > series, I am worried that the EDMA driver is advertising that it can handle > any length SG list while not taking care of missing any events while > doing so. This will break the assumptions that driver writers make. This is already being done by some other DMA engine drivers ;). We can advertise more than we can handle at a time; that's the basis of this whole idea. I understand what you're saying, but events are not something that have to be serviced immediately; they can be queued etc. and the actual transfer from the DMA controller can be delayed. As long as we don't miss the event we are fine, which my series takes care of. So far I have tested this series on the following modules in various configurations and have seen no issues: - Crypto AES - MMC/SD - SPI (128x160 display) >>> Also, won't this lead to under-utilization of the peripheral bandwidth? >>> Meaning, MMC/SD is ready with data but cannot transfer because the DMA >>> is waiting to be set up. >> >> But it is waiting anyway even today. Currently, based on MAX segs, the MMC >> driver/subsystem will make an SG list of size max_segs. Between these >> sessions of creating such smaller SG lists, if for some reason the MMC >> controller is sending events, these will be lost anyway. > > But the MMC/SD driver knows how many events it should generate if it > knows the MAX SG limit. So there should not be any missed events in the > current code. And I am not claiming that your solution is making matters > worse. But it's not making it much better as well. This is not true for crypto; the events are not deasserted and crypto continues to send events. This is what led to the "don't trigger in Null" patch where I'm setting the missed flag to avoid recursion. 
>> This can be used only for buffers that are contiguous in memory, not
>> those that are scattered across memory.
>
> I was hinting at using the linking facility of EDMA to achieve this.
> Each PaRAM set has full 32-bit source and destination pointers so I see
> no reason why the non-contiguous case cannot be handled.
>
> Let's say you need to transfer SG[0..6] on channel C. Now, PaRAM sets are
> typically 4 times the number of channels. In this case we use one DMA
> PaRAM set and two Link PaRAM sets per channel. P0 is the DMA PaRAM set
> and P1 and P2 are the Link sets.
>
> Initial setup:
>
> SG0 -> SG1 -> SG2 -> SG3 -> SG4 -> SG5 -> SG6 -> NULL
>  ^      ^      ^
>  |      |      |
>  P0 ->  P1 ->  P2 -> NULL
>
> P[0..2].TCINTEN = 1, so get an interrupt after each SG element
> completion. On each completion interrupt, hardware automatically copies
> the linked PaRAM set into the DMA PaRAM set so after SG0 is transferred
> out, the state of hardware is:
>
> SG1 -> SG2 -> SG3 -> SG3 -> SG6 -> NULL
>  ^
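The link-reload behavior Sekhar describes can be modelled in a few lines. This is a toy userspace simulation with made-up types, not the edma driver: P0 is the channel's DMA PaRAM slot, P1/P2 are link slots, and on each transfer-complete the channel controller copies the set P0 links to into P0 itself, so software only has to keep refilling the link slots with upcoming SG entries.

```c
#include <assert.h>
#include <stddef.h>

struct param_set {
	int sg;			/* SG element this set transfers; -1 = done */
	struct param_set *link;	/* next set; NULL terminates the channel */
};

/* What the EDMA CC does in hardware when the current transfer completes. */
static void edma_complete(struct param_set *p0)
{
	if (p0->link)
		*p0 = *p0->link;	/* reload P0 from its link set */
	else
		p0->sg = -1;		/* NULL link: channel stops */
}
```

Walking P0 through three completions shows it consuming SG0, SG1, SG2 and then stopping, which is the window in which software reprograms the link slots with the next SG entries.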
Re: [PATCH v2] dma: mmp_pdma: fix a memory alloc error
2013/7/29 Vinod Koul : > On Thu, Jul 25, 2013 at 10:22:50AM +0800, Xiang Wang wrote: > > Btw its generally a good idea to remove the parts not required for further > discussion > >> Would you please explain a little more about why we should use >> GFP_NOWAIT here? This memory is not dedicated for DMA controller. Do >> we have special reasons to use GFP_NOWAIT? Thanks! > There are a few reasons > - the dma requests can be triggered from atomic context > - so for the above you can argue to use GFP_ATOMIC, and the guideline for dma > drivers [1] was discussed briefly > > ~Vinod > [1]: http://lkml.indiana.edu/hypermail/linux/kernel/0911.1/02316.html > > -- Hi Vinod, Thanks for your explanation. But I want to double-confirm with you: the memory allocation in my patch is in the probe function; do we still need to use GFP_NOWAIT there? Thanks! -- Regards, Xiang
Re: [PATCH -v2] sched,x86: optimize switch_mm for multi-threaded workloads
On Wed, Jul 31, 2013 at 7:14 PM, Rik van Riel wrote: > > Is this better? Not that I really care which version gets applied :) I'm ok with this. That said, I'm assuming this goes in through Ingo, since it is both x86 and scheduler-related. I can take it directly, but see no real reason to. Ingo? Talking about x86-specific - I'm assuming other architectures have similar patterns? Linus
Re: [RFC/RFT PATCH 4/5] ARM: mm: change max*pfn to include the physical offset of memory
On Wednesday 31 July 2013 06:56 AM, Russell King - ARM Linux wrote: > On Fri, Jul 12, 2013 at 05:48:13PM -0400, Santosh Shilimkar wrote: >> diff --git a/arch/arm/include/asm/dma-mapping.h >> b/arch/arm/include/asm/dma-mapping.h >> index 5b579b9..b2d5937 100644 >> --- a/arch/arm/include/asm/dma-mapping.h >> +++ b/arch/arm/include/asm/dma-mapping.h >> @@ -47,12 +47,12 @@ static inline int dma_set_mask(struct device *dev, u64 >> mask) >> #ifndef __arch_pfn_to_dma >> static inline dma_addr_t pfn_to_dma(struct device *dev, unsigned long pfn) >> { >> -return (dma_addr_t)__pfn_to_bus(pfn); >> +return (dma_addr_t)__pfn_to_bus(pfn + PHYS_PFN_OFFSET); >> } >> >> static inline unsigned long dma_to_pfn(struct device *dev, dma_addr_t addr) >> { >> -return __bus_to_pfn(addr); >> +return __bus_to_pfn(addr) - PHYS_PFN_OFFSET; >> } >> >> static inline void *dma_to_virt(struct device *dev, dma_addr_t addr) >> @@ -64,15 +64,16 @@ static inline dma_addr_t virt_to_dma(struct device *dev, >> void *addr) >> { >> return (dma_addr_t)__virt_to_bus((unsigned long)(addr)); >> } >> + >> #else >> static inline dma_addr_t pfn_to_dma(struct device *dev, unsigned long pfn) >> { >> -return __arch_pfn_to_dma(dev, pfn); >> +return __arch_pfn_to_dma(dev, pfn + PHYS_PFN_OFFSET); >> } >> >> static inline unsigned long dma_to_pfn(struct device *dev, dma_addr_t addr) >> { >> -return __arch_dma_to_pfn(dev, addr); >> +return __arch_dma_to_pfn(dev, addr) - PHYS_PFN_OFFSET; >> } >> >> static inline void *dma_to_virt(struct device *dev, dma_addr_t addr) >> @@ -86,6 +87,13 @@ static inline dma_addr_t virt_to_dma(struct device *dev, >> void *addr) >> } >> #endif > > Note that I don't think the above are correct - the 'pfn' argument to the > above functions already includes the PFN offset of physical memory - > they're all physical_address >> PAGE_SHIFT values. > Right. Updated patch pushed into the patch system. 
(patch 7805/1) Regards, Santosh
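Russell's objection is pure arithmetic: the pfn argument is already an absolute physical frame number, so adding PHYS_PFN_OFFSET counts the RAM base twice. A toy model, with a made-up PHYS_OFFSET value:

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT      12
/* Illustrative start-of-RAM address; real boards differ. */
#define PHYS_OFFSET     0x80000000UL
#define PHYS_PFN_OFFSET (PHYS_OFFSET >> PAGE_SHIFT)

/* With an identity-mapped bus a correct conversion just shifts:
 * the pfn argument is already an absolute physical frame number. */
static uint64_t pfn_to_dma_ok(unsigned long pfn)
{
    return (uint64_t)pfn << PAGE_SHIFT;
}

/* The questioned version adds PHYS_PFN_OFFSET again, landing one
 * whole RAM-base beyond the intended address. */
static uint64_t pfn_to_dma_double_offset(unsigned long pfn)
{
    return (uint64_t)(pfn + PHYS_PFN_OFFSET) << PAGE_SHIFT;
}
```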
[PATCH -v2] sched,x86: optimize switch_mm for multi-threaded workloads
On Wed, 31 Jul 2013 17:41:39 -0700 Linus Torvalds wrote: > However, as Rik points out, activate_mm() is different in that we > shouldn't have any preexisting MMU state anyway. And besides, that > should never trigger the "prev == next" case. > > But it does look a bit messy, and even your comment is a bit > misleading (it might make somebody think that all of switch_mm() is > protected from interrupts) Is this better? Not that I really care which version gets applied :) ---8<--- Subject: sched,x86: optimize switch_mm for multi-threaded workloads Dick Fowles, Don Zickus and Joe Mario have been working on improvements to perf, and noticed heavy cache line contention on the mm_cpumask, running linpack on a 60 core / 120 thread system. The cause turned out to be unnecessary atomic accesses to the mm_cpumask. When in lazy TLB mode, the CPU is only removed from the mm_cpumask if there is a TLB flush event. Most of the time, no such TLB flush happens, and the kernel skips the TLB reload. It can also skip the atomic memory set & test. Here is a summary of Joe's test results: * The __schedule function dropped from 24% of all program cycles down to 5.5%. * The cacheline contention/hotness for accesses to that bitmask went from being the 1st/2nd hottest - down to the 84th hottest (0.3% of all shared misses which is now quite cold) * The average load latency for the bit-test-n-set instruction in __schedule dropped from 10k-15k cycles down to an average of 600 cycles. * The linpack program results improved from 133 GFlops to 144 GFlops. Peak GFlops rose from 133 to 153. 
Reported-by: Don Zickus Reported-by: Joe Mario Tested-by: Joe Mario Signed-off-by: Rik van Riel Signed-off-by: Rik van Riel --- arch/x86/include/asm/mmu_context.h | 8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h index cdbf367..3ac6089 100644 --- a/arch/x86/include/asm/mmu_context.h +++ b/arch/x86/include/asm/mmu_context.h @@ -59,7 +59,13 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next, this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK); BUG_ON(this_cpu_read(cpu_tlbstate.active_mm) != next); - if (!cpumask_test_and_set_cpu(cpu, mm_cpumask(next))) { + if (!cpumask_test_cpu(cpu, mm_cpumask(next))) { + /* On established mms, the mm_cpumask is only changed +* from irq context, from ptep_clear_flush while in +* lazy tlb mode, and here. Irqs are blocked during +* schedule, protecting us from simultaneous changes. +*/ + cpumask_set_cpu(cpu, mm_cpumask(next)); /* We were in lazy tlb mode and leave_mm disabled * tlb flush IPI delivery. We must reload CR3 * to make sure to use no freed page tables. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
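The core trick of the patch, an ordinary read before falling back to the atomic read-modify-write, can be sketched in userspace with C11 atomics. A single word stands in for the cpumask and the names are illustrative, not the kernel API:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* One atomic word stands in for mm_cpumask. */
typedef _Atomic unsigned long mask_t;

/* cpumask_test_and_set_cpu() analogue: always a locked RMW, so every
 * call dirties the shared cacheline even when the bit is already set. */
static bool test_and_set_cpu(mask_t *mask, int cpu)
{
    unsigned long bit = 1UL << cpu;
    return atomic_fetch_or(mask, bit) & bit;  /* old value of the bit */
}

/* The patched form: a plain read first.  In the common case the bit
 * is already set and the line stays in shared state; the expensive
 * RMW runs at most once per CPU per mm.  Setting is idempotent, so
 * this stays safe even if two setters race; in the kernel, clearing
 * is additionally serialized by having interrupts blocked. */
static void set_cpu_lazy(mask_t *mask, int cpu, int *rmw_count)
{
    unsigned long bit = 1UL << cpu;

    if (!(atomic_load(mask) & bit)) {
        atomic_fetch_or(mask, bit);
        (*rmw_count)++;            /* count the expensive operations */
    }
}
```

Repeated "context switches" onto the same mm then perform exactly one atomic write instead of one per switch, which is where the cacheline ping-pong disappears.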
Re: [PATCH] sched,x86: optimize switch_mm for multi-threaded workloads
On Wed, Jul 31, 2013 at 6:58 PM, Rik van Riel wrote: > > Would you like me to document the things we found in the comment, > and resend a patch, or is the patch good as-is? Please re-send with commentary.. Linus
Re: [RFC][PATCH 3/4] tracing/kprobes: Fail to unregister if probe event files are open
On Wed, 2013-07-31 at 18:42 -0400, Steven Rostedt wrote: > > > This should be fine. Either event_remove() path takes event_mutex > > first and then ->write() fails, or ftrace_event_enable_disable() > > actually disables this even successfully. > > Actually I meant while in unregister_trace_probe(), it gets by the > trace_probe_is_enabled() part first, then the write succeeds (as the > event_mutex isn't taken till unregister_probe_event()). The the > unregister_probe_event fails, but the tp was freed. The event files > still reference the tp and this is where a crash can happen without this > patch set. And it's not just theoretical. I worked on a program to try to trigger this bug, and succeeded :-) (I've never been so happy being able to crash my kernel) [ 128.999772] BUG: unable to handle kernel paging request at 000500f9 [ 129.15] IP: [] probes_open+0x3b/0xa7 [ 129.15] PGD 7808a067 PUD 0 [ 129.15] Oops: [#1] PREEMPT SMP [ 129.15] Dumping ftrace buffer: [ 129.15] - [ 129.15] Modules linked in: ipt_MASQUERADE sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables uinput snd_hda_codec_idt kvm_intel snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm kvm snd_page_alloc snd_timer shpchp snd microcode i2c_i801 soundcore pata_acpi firewire_ohci firewire_core crc_itu_t ata_generic i915 drm_kms_helper drm i2c_algo_bit i2c_core video [ 129.15] CPU: 1 PID: 2070 Comm: test-kprobe-rem Not tainted 3.11.0-rc3-test+ #47 [ 129.15] Hardware name: To Be Filled By O.E.M. 
To Be Filled By O.E.M./To be filled by O.E.M., BIOS SDBLI944.86P 05/08/2007 [ 129.15] task: 880077756440 ti: 880076e52000 task.ti: 880076e52000 [ 129.15] RIP: 0010:[] [] probes_open+0x3b/0xa7 [ 129.15] RSP: 0018:880076e53c38 EFLAGS: 00010203 [ 129.15] RAX: 00050001 RBX: 88007844f440 RCX: 0003 [ 129.15] RDX: 0003 RSI: 0003 RDI: 880076e52000 [ 129.15] RBP: 880076e53c58 R08: 880076e53bd8 R09: [ 129.15] R10: 880077756440 R11: 0006 R12: 810dee35 [ 129.15] R13: 880079250418 R14: R15: 88007844f450 [ 129.15] FS: 7f87a276f700() GS:88007d48() knlGS: [ 129.15] CS: 0010 DS: ES: CR0: 8005003b [ 129.15] CR2: 000500f9 CR3: 77262000 CR4: 07e0 [ 129.15] Stack: [ 129.15] 880076e53c58 81219ea0 88007844f440 810dee35 [ 129.15] 880076e53ca8 81130f78 8800772986c0 8800796f93a0 [ 129.15] 81d1b5d8 880076e53e04 88007844f440 [ 129.15] Call Trace: [ 129.15] [] ? security_file_open+0x2c/0x30 [ 129.15] [] ? unregister_trace_probe+0x4b/0x4b [ 129.15] [] do_dentry_open+0x162/0x226 [ 129.15] [] finish_open+0x46/0x54 [ 129.15] [] do_last+0x7f6/0x996 [ 129.15] [] ? inode_permission+0x42/0x44 [ 129.15] [] path_openat+0x232/0x496 [ 129.15] [] do_filp_open+0x3a/0x8a [ 129.15] [] ? __alloc_fd+0x168/0x17a [ 129.15] [] do_sys_open+0x70/0x102 [ 129.15] [] ? trace_hardirqs_on_caller+0x160/0x197 [ 129.15] [] SyS_open+0x1e/0x20 [ 129.15] [] system_call_fastpath+0x16/0x1b [ 129.15] Code: e5 41 54 53 48 89 f3 48 83 ec 10 48 23 56 78 48 39 c2 75 6c 31 f6 48 c7 c7 d0 e6 a4 81 e8 3a 91 43 00 48 8b 05 f2 f8 96 00 eb 0c 80 f8 00 00 00 03 75 31 48 8b 00 48 3d 60 e7 a4 81 75 ec eb [ 129.15] RIP [] probes_open+0x3b/0xa7 [ 129.15] RSP [ 129.15] CR2: 000500f9 [ 130.21] ---[ end trace 35f17d68fc569897 ]--- Attached is the program I used. It took lots of tweaks and doesn't always trigger the bug on the first run. It can take several runs to trigger. Each bug I've seen has been rather random. Twice it crashed due to memory errors, once it just screwed up the kprobes, but the system still ran fine. 
I had another kprobes error bug: when I went to do more tracing, it crashed. What my program does is create two threads (well, one thread and the main thread). They place themselves onto CPU 0 and 1. Then the following occurs: CPU 0 CPU 1 - - create sigprocmask probe loop: loop: fd = open(events/kprobes/sigprocmask/enable) pthread_barrier_wait() nanosleep(interval) pthread_barrier_wait() write(fd, "1", 1); write(fd, "0", 1); truncate kprobe_events // deletes all probes pthread_barrier_wait(); pthread_barrier_wait(); write(fd, "0", 1); // just in case create sigprocmask probe pthread_barrier_wait()
[PATCH] mfd: max8997: cast void pointer to data of max8997_pmic_dt_match
Casting (void *) data of max8997_pmic_dt_match is necessary, because variable 'data' of struct 'of_device_id' is defined as 'const void *data'. Thus, a pointer should be used instead of a value. Signed-off-by: Jingoo Han --- drivers/mfd/max8997.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/mfd/max8997.c b/drivers/mfd/max8997.c index 1471405..e6d668a 100644 --- a/drivers/mfd/max8997.c +++ b/drivers/mfd/max8997.c @@ -51,7 +51,7 @@ static struct mfd_cell max8997_devs[] = { #ifdef CONFIG_OF static struct of_device_id max8997_pmic_dt_match[] = { - { .compatible = "maxim,max8997-pmic", .data = TYPE_MAX8997 }, + { .compatible = "maxim,max8997-pmic", .data = (void *)TYPE_MAX8997 }, {}, }; #endif -- 1.7.10.4
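The idiom being fixed here, carrying a small integer in a 'const void *' match-data field, round-trips through a pointer cast. A self-contained sketch; the kernel casts back via kernel_ulong_t, and uintptr_t is the portable stand-in:

```c
#include <assert.h>
#include <stdint.h>

/* The payload the match table wants to carry (values illustrative). */
enum chip_type { TYPE_MAX8997 = 1, TYPE_MAX8966 = 2 };

/* Shape of struct of_device_id for this sketch: the payload field is
 * an opaque const void *, even when it logically holds an integer. */
struct match_entry {
    const char *compatible;
    const void *data;
};

static const struct match_entry match_table[] = {
    /* The cast the patch adds: an integer smuggled into the pointer. */
    { .compatible = "maxim,max8997-pmic", .data = (void *)TYPE_MAX8997 },
    { 0 },
};

/* Recovering it: cast back through uintptr_t; never dereference. */
static enum chip_type entry_type(const struct match_entry *e)
{
    return (enum chip_type)(uintptr_t)e->data;
}
```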
Re: [patch 1/3] raid5: offload stripe handle to workqueue
On Wed, Jul 31, 2013 at 06:33:32AM -0400, Tejun Heo wrote: > Hello, > > On Wed, Jul 31, 2013 at 09:24:34AM +0800, Shaohua Li wrote: > > stripe is the work unit actually. As I said, if I queue a work for each > > stripe, > > just queue_work() will make the system blast because of the pwq->pool->lock > > contention. dispatching one work has another side effect that I can't add > > block > > Hmmm I see. I'm not familiar with the code base and could be > missing something but how does the custom stripe dispatch queue > synchronize? Doesn't that make use of a lock anyway? If so, how > would scheduling separate work items be worse? It does have a lock, but when a stripe is queued to handle, no lock is required. So the workqueue lock would be highly contended. > Also, can you please > elaborate the block plug part? Basically I do: blk_start_plug() handle_stripe() //may dispatch request blk_end_plug() If I only handle one stripe per block plug, the plug is useless, so I need to handle several stripes. Thanks, Shaohua
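Shaohua's point, that plugging only pays off when several stripes are handled per plug, can be seen with a toy dispatch count. This is a cost model, not the block-layer API:

```c
#include <assert.h>

/* Toy cost model of block plugging: everything queued between
 * blk_start_plug() and blk_end_plug() is flushed as one dispatch,
 * so handling a single stripe per plug buys nothing. */
static int dispatches_for(int stripes, int stripes_per_plug)
{
    int dispatches = 0;

    for (int done = 0; done < stripes; done += stripes_per_plug) {
        /* blk_start_plug(); */
        /* ...handle stripes_per_plug stripes, each may queue I/O... */
        /* blk_end_plug();  -> one flush for the whole batch */
        dispatches++;
    }
    return dispatches;
}
```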
Re: [PATCH 7/9] ARM: edma: Don't clear EMR of channel in edma_stop
On 07/31/2013 04:35 AM, Sekhar Nori wrote: > On Wednesday 31 July 2013 10:35 AM, Joel Fernandes wrote: >> On 07/30/2013 03:29 AM, Sekhar Nori wrote: >>> On Monday 29 July 2013 06:59 PM, Joel Fernandes wrote: We certainly don't want error conditions to be cleared anywhere >>> >>> 'anywhere' is a really loaded term. >>> as this will make us 'forget' about missed events. We depend on knowing which events were missed in order to be able to reissue them. >>> This fixes a race condition where the EMR was being cleared by the transfer completion interrupt handler. Basically, what was happening was: Missed event | | V SG1-SG2-SG3-Null \ \__TC Interrupt (Almost same time as ARM is executing TC interrupt handler, an event got missed and also forgotten by clearing the EMR). >>> >>> Sorry, but I dont see how edma_stop() is coming into picture in the race >>> you describe? >> >> In edma_callback function, for the case of DMA_COMPLETE (Transfer >> completion interrupt), edma_stop() is called when all sets have been >> processed. This had the effect of clearing the EMR. > > Ah, thanks. I was missing the fact that the race comes into picture only > when using the DMA engine driver. I guess that should be mentioned > somewhere since it is not immediately obvious. > > The patch looks good to me. So if you respin just this one with some > updated explanation based on what you wrote below, I will take it. Sure I'll do that. Also the trigger_channel patch, will you be taking that one too? 
I can send these 2 in a series as they touch arch/arm/common/edma.c Thanks, -Joel > > Thanks, > Sekhar > > -- > To unsubscribe from this list: send the line "unsubscribe linux-omap" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html >
[PATCH] mfd: lpc_ich: Staticize struct 'lpc_chipset_info'
'lpc_chipset_info' is used only in this file. Fix the following sparse warning: drivers/mfd/lpc_ich.c:216:21: warning: symbol 'lpc_chipset_info' was not declared. Should it be static? Signed-off-by: Jingoo Han --- drivers/mfd/lpc_ich.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/mfd/lpc_ich.c b/drivers/mfd/lpc_ich.c index 2403332..9483bc8 100644 --- a/drivers/mfd/lpc_ich.c +++ b/drivers/mfd/lpc_ich.c @@ -213,7 +213,7 @@ enum lpc_chipsets { LPC_COLETO, /* Coleto Creek */ }; -struct lpc_ich_info lpc_chipset_info[] = { +static struct lpc_ich_info lpc_chipset_info[] = { [LPC_ICH] = { .name = "ICH", .iTCO_version = 1, -- 1.7.10.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] sched,x86: optimize switch_mm for multi-threaded workloads
On 07/31/2013 08:41 PM, Linus Torvalds wrote: On Wed, Jul 31, 2013 at 4:14 PM, Paul Turner wrote: We attached the following explanatory comment to our version of the patch: /* * In the common case (two user threads sharing mm * switching) the bit will be set; avoid doing a write * (via atomic test & set) unless we have to. This is * safe, because no other CPU ever writes to our bit * in the mask, and interrupts are off (so we can't * take a TLB IPI here.) If we don't do this, then * switching threads will pingpong the cpumask * cacheline. */ So as mentioned, the "interrupts will be off" is actually dubious. It's true for the context switch case, but not for the activate_mm(). However, as Rik points out, activate_mm() is different in that we shouldn't have any preexisting MMU state anyway. And besides, that should never trigger the "prev == next" case. But it does look a bit messy, and even your comment is a bit misleading (it might make somebody think that all of switch_mm() is protected from interrupts). Anyway, I'm perfectly ok with the patch itself, but I just wanted to make sure people had thought about these things. Would you like me to document the things we found in the comment, and resend a patch, or is the patch good as-is? -- All rights reversed
Re: [patch]workqueue: copy attr with all fields.
On Wed, Jul 31, 2013 at 06:27:52AM -0400, Tejun Heo wrote: > Hello, > > On Wed, Jul 31, 2013 at 09:39:29AM +0800, Shaohua Li wrote: > > Hmm, I didn't agree it's more confusing to change copy_workqueue_attrs(), > > the > > We're talking past each other. I'm not saying copy_workqueue_attrs() > shouldn't copy no_numa. I'm saying get_unbound_pool() should clear > no_numa so that pools don't have random no_numa settings. > > > name of the function suggests it is a 'copy'. And clearing no_numa in > > apply_workqueue_attrs() after copy_workqueue_attrs() looks like a hack to > > me. > > Why would apply_workqueue_attrs() modify no_numa when it *is* dealing > with an actual workqueue attr. I've been talking about > get_unbound_pool() not apply_workqueue_attrs() the whole time. > > > But it depends on you, feel free to fix it by yourself. > > Please update the patch to add no_numa clearing to get_unbound_pool() > and explain what's going on. Something like this? Subject: workqueue: copy workqueue_attrs with all fields $echo '0' > /sys/bus/workqueue/devices/xxx/numa $cat /sys/bus/workqueue/devices/xxx/numa I got 1. It should be 0, the reason is copy_workqueue_attrs() called in apply_workqueue_attrs() doesn't copy no_numa field. Signed-off-by: Shaohua Li diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 0b72e81..daf3f03 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -3411,6 +3411,7 @@ static void copy_workqueue_attrs(struct workqueue_attrs *to, { to->nice = from->nice; cpumask_copy(to->cpumask, from->cpumask); + to->no_numa = from->no_numa; } /* hash value of the content of @attr */ @@ -3578,6 +3579,12 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs) lockdep_set_subclass(>lock, 1); /* see put_pwq() */ copy_workqueue_attrs(pool->attrs, attrs); + /* +* no_numa isn't a worker_pool attribute, always clear it. See 'struct +* workqueue_attrs' comments for detail. 
+*/ + pool->attrs->no_numa = 0; + /* if cpumask is contained inside a NUMA node, we belong to that node */ if (wq_numa_enabled) { for_each_node(node) { -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
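The underlying bug pattern is a hand-rolled field-by-field copy that silently drops a newly added field. It is easy to demonstrate with a simplified stand-in for workqueue_attrs:

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for struct workqueue_attrs. */
struct attrs {
    int nice;
    unsigned long cpumask;  /* one word instead of a cpumask_t */
    bool no_numa;
};

/* The buggy copy: whoever added no_numa forgot to update it,
 * so the destination keeps whatever stale value it had. */
static void copy_attrs_buggy(struct attrs *to, const struct attrs *from)
{
    to->nice = from->nice;
    to->cpumask = from->cpumask;
    /* no_numa silently keeps its old value */
}

/* The fixed copy mirrors the patch: every field, explicitly. */
static void copy_attrs(struct attrs *to, const struct attrs *from)
{
    to->nice = from->nice;
    to->cpumask = from->cpumask;
    to->no_numa = from->no_numa;
}
```

This is exactly the sysfs symptom in the changelog: the user writes 0, the incomplete copy drops it, and a later read still reports 1.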
[PATCH] mfd: twl4030-power: Staticize local functions
twl4030_power_configure_scripts(), twl4030_power_configure_resources(), twl4030_power_probe() are used only in this file. Fix the following sparse warnings: drivers/mfd/twl4030-power.c:496:5: warning: symbol 'twl4030_power_configure_scripts' was not declared. Should it be static? drivers/mfd/twl4030-power.c:512:5: warning: symbol 'twl4030_power_configure_resources' was not declared. Should it be static? drivers/mfd/twl4030-power.c:556:5: warning: symbol 'twl4030_power_probe' was not declared. Should it be static? Signed-off-by: Jingoo Han --- drivers/mfd/twl4030-power.c |6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/mfd/twl4030-power.c b/drivers/mfd/twl4030-power.c index a5fd3c7..adce79d 100644 --- a/drivers/mfd/twl4030-power.c +++ b/drivers/mfd/twl4030-power.c @@ -493,7 +493,7 @@ int twl4030_remove_script(u8 flags) return err; } -int twl4030_power_configure_scripts(struct twl4030_power_data *pdata) +static int twl4030_power_configure_scripts(struct twl4030_power_data *pdata) { int err; int i; @@ -509,7 +509,7 @@ int twl4030_power_configure_scripts(struct twl4030_power_data *pdata) return 0; } -int twl4030_power_configure_resources(struct twl4030_power_data *pdata) +static int twl4030_power_configure_resources(struct twl4030_power_data *pdata) { struct twl4030_resconfig *resconfig = pdata->resource_config; int err; @@ -553,7 +553,7 @@ static bool twl4030_power_use_poweroff(struct twl4030_power_data *pdata, return false; } -int twl4030_power_probe(struct platform_device *pdev) +static int twl4030_power_probe(struct platform_device *pdev) { struct twl4030_power_data *pdata = pdev->dev.platform_data; struct device_node *node = pdev->dev.of_node; -- 1.7.10.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v2 3/7] cgroup: restructure the failure path in cgroup_write_event_control()
It uses a single label and checks the validity of each pointer. This is err-prone, and actually we had a bug because one of the check was insufficient. Use multi lables as we do in other places. v2: - drop initializations of local variables. Signed-off-by: Li Zefan --- kernel/cgroup.c | 39 ++- 1 file changed, 18 insertions(+), 21 deletions(-) diff --git a/kernel/cgroup.c b/kernel/cgroup.c index fb11569..ae905a5 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -3934,11 +3934,11 @@ static void cgroup_event_ptable_queue_proc(struct file *file, static int cgroup_write_event_control(struct cgroup *cgrp, struct cftype *cft, const char *buffer) { - struct cgroup_event *event = NULL; + struct cgroup_event *event; struct cgroup *cgrp_cfile; unsigned int efd, cfd; - struct file *efile = NULL; - struct file *cfile = NULL; + struct file *efile; + struct file *cfile; char *endp; int ret; @@ -3964,31 +3964,31 @@ static int cgroup_write_event_control(struct cgroup *cgrp, struct cftype *cft, efile = eventfd_fget(efd); if (IS_ERR(efile)) { ret = PTR_ERR(efile); - goto fail; + goto out_kfree; } event->eventfd = eventfd_ctx_fileget(efile); if (IS_ERR(event->eventfd)) { ret = PTR_ERR(event->eventfd); - goto fail; + goto out_put_efile; } cfile = fget(cfd); if (!cfile) { ret = -EBADF; - goto fail; + goto out_put_eventfd; } /* the process need read permission on control file */ /* AV: shouldn't we check that it's been opened for read instead? 
*/ ret = inode_permission(file_inode(cfile), MAY_READ); if (ret < 0) - goto fail; + goto out_put_cfile; event->cft = __file_cft(cfile); if (IS_ERR(event->cft)) { ret = PTR_ERR(event->cft); - goto fail; + goto out_put_cfile; } /* @@ -3998,18 +3998,18 @@ static int cgroup_write_event_control(struct cgroup *cgrp, struct cftype *cft, cgrp_cfile = __d_cgrp(cfile->f_dentry->d_parent); if (cgrp_cfile != cgrp) { ret = -EINVAL; - goto fail; + goto out_put_cfile; } if (!event->cft->register_event || !event->cft->unregister_event) { ret = -EINVAL; - goto fail; + goto out_put_cfile; } ret = event->cft->register_event(cgrp, event->cft, event->eventfd, buffer); if (ret) - goto fail; + goto out_put_cfile; efile->f_op->poll(efile, >pt); @@ -4029,16 +4029,13 @@ static int cgroup_write_event_control(struct cgroup *cgrp, struct cftype *cft, return 0; -fail: - if (cfile) - fput(cfile); - - if (event && event->eventfd && !IS_ERR(event->eventfd)) - eventfd_ctx_put(event->eventfd); - - if (!IS_ERR_OR_NULL(efile)) - fput(efile); - +out_put_cfile: + fput(cfile); +out_put_eventfd: + eventfd_ctx_put(event->eventfd); +out_put_efile: + fput(efile); +out_kfree: kfree(event); return ret; -- 1.8.0.2 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
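The multi-label unwind the patch switches to is a standard kernel idiom: one label per acquired resource, fallen through in reverse order. A miniature version with heap allocations in place of the fds:

```c
#include <assert.h>
#include <stdlib.h>

/* Acquire three resources; on failure at any step, unwind only what
 * was already acquired.  Each label undoes exactly one acquisition,
 * so the error path needs no NULL/IS_ERR guesswork of the kind the
 * old single-label version got wrong. */
static int setup(size_t n, char **a_out, char **b_out, char **c_out)
{
    char *a, *b, *c;
    int ret = -1;

    a = malloc(n);
    if (!a)
        goto out;
    b = malloc(n);
    if (!b)
        goto out_free_a;
    c = malloc(n);
    if (!c)
        goto out_free_b;

    *a_out = a;
    *b_out = b;
    *c_out = c;
    return 0;

out_free_b:
    free(b);
out_free_a:
    free(a);
out:
    return ret;
}
```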
[PATCH v2 4/7] cgroup: rename cgroup_pidlist->mutex
It's a rw_semaphore not a mutex. Signed-off-by: Li Zefan --- kernel/cgroup.c | 22 +++--- 1 file changed, 11 insertions(+), 11 deletions(-) diff --git a/kernel/cgroup.c b/kernel/cgroup.c index ae905a5..6662811 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -3436,7 +3436,7 @@ struct cgroup_pidlist { /* pointer to the cgroup we belong to, for list removal purposes */ struct cgroup *owner; /* protects the other fields */ - struct rw_semaphore mutex; + struct rw_semaphore rwsem; }; /* @@ -3509,7 +3509,7 @@ static struct cgroup_pidlist *cgroup_pidlist_find(struct cgroup *cgrp, struct pid_namespace *ns = task_active_pid_ns(current); /* -* We can't drop the pidlist_mutex before taking the l->mutex in case +* We can't drop the pidlist_mutex before taking the l->rwsem in case * the last ref-holder is trying to remove l from the list at the same * time. Holding the pidlist_mutex precludes somebody taking whichever * list we find out from under us - compare release_pid_array(). @@ -3518,7 +3518,7 @@ static struct cgroup_pidlist *cgroup_pidlist_find(struct cgroup *cgrp, list_for_each_entry(l, >pidlists, links) { if (l->key.type == type && l->key.ns == ns) { /* make sure l doesn't vanish out from under us */ - down_write(>mutex); + down_write(>rwsem); mutex_unlock(>pidlist_mutex); return l; } @@ -3529,8 +3529,8 @@ static struct cgroup_pidlist *cgroup_pidlist_find(struct cgroup *cgrp, mutex_unlock(>pidlist_mutex); return l; } - init_rwsem(>mutex); - down_write(>mutex); + init_rwsem(>rwsem); + down_write(>rwsem); l->key.type = type; l->key.ns = get_pid_ns(ns); l->owner = cgrp; @@ -3591,7 +3591,7 @@ static int pidlist_array_load(struct cgroup *cgrp, enum cgroup_filetype type, l->list = array; l->length = length; l->use_count++; - up_write(>mutex); + up_write(>rwsem); *lp = l; return 0; } @@ -3669,7 +3669,7 @@ static void *cgroup_pidlist_start(struct seq_file *s, loff_t *pos) int index = 0, pid = *pos; int *iter; - down_read(>mutex); + down_read(>rwsem); if (pid) { int end = 
l->length; @@ -3696,7 +3696,7 @@ static void *cgroup_pidlist_start(struct seq_file *s, loff_t *pos) static void cgroup_pidlist_stop(struct seq_file *s, void *v) { struct cgroup_pidlist *l = s->private; - up_read(>mutex); + up_read(>rwsem); } static void *cgroup_pidlist_next(struct seq_file *s, void *v, loff_t *pos) @@ -3742,7 +3742,7 @@ static void cgroup_release_pid_array(struct cgroup_pidlist *l) * pidlist_mutex, we have to take pidlist_mutex first. */ mutex_lock(>owner->pidlist_mutex); - down_write(>mutex); + down_write(>rwsem); BUG_ON(!l->use_count); if (!--l->use_count) { /* we're the last user if refcount is 0; remove and free */ @@ -3750,12 +3750,12 @@ static void cgroup_release_pid_array(struct cgroup_pidlist *l) mutex_unlock(>owner->pidlist_mutex); pidlist_free(l->list); put_pid_ns(l->key.ns); - up_write(>mutex); + up_write(>rwsem); kfree(l); return; } mutex_unlock(>owner->pidlist_mutex); - up_write(>mutex); + up_write(>rwsem); } static int cgroup_pidlist_release(struct inode *inode, struct file *file) -- 1.8.0.2 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] crypto: caam - Remove unused functions from Job Ring
On Wed, Jul 31, 2013 at 03:48:56PM +0530, Ruchika Gupta wrote: > Signed-off-by: Ruchika Gupta Patch applied. -- Email: Herbert Xu Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [PATCH] hw_random: mxc-rnga: Check the return value from clk_prepare_enable()
On Sun, Jul 21, 2013 at 02:41:38PM -0300, Fabio Estevam wrote: > From: Fabio Estevam > > clk_prepare_enable() may fail, so let's check its return value and propagate > it > in the case of error. > > Signed-off-by: Fabio Estevam Patch applied. Thanks! -- Email: Herbert Xu Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [PATCH] crypto: caam - RNG instantiation by directly programming DECO
On Thu, Jul 04, 2013 at 11:26:03AM +0530, Ruchika Gupta wrote: > Remove the dependency of RNG instantiation on Job Ring. Now > RNG instantiation for devices with RNG version > 4 is done > by directly programming DECO 0. > > Signed-off-by: Ruchika Gupta Patch applied. Thanks a lot! -- Email: Herbert Xu Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: crypto: ux500: Fix logging, make arrays const, neatening
On Mon, Jul 22, 2013 at 04:19:35PM -0700, Joe Perches wrote: > On Tue, 2013-07-23 at 00:35 +0200, Linus Walleij wrote: > > Have you tested this on hardware or is it compile-tested only? > > Hi Linus. > > Compile tested only. Not tested on real devices. -ENOHARDWARE. Patch applied. Thanks! -- Email: Herbert Xu Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
RE: [PATCH] block: support embedded device command line partition
> From: Karel Zak [mailto:k...@redhat.com] > Sent: Wednesday, July 31, 2013 9:25 PM > To: Caizhiyong > Cc: Andrew Morton; linux-kernel@vger.kernel.org; Wanglin (Albert); Quyaxin > Subject: Re: [PATCH] block: support embedded device command line partition > > On Sat, Jul 27, 2013 at 01:56:24PM +, Caizhiyong wrote: > > +static int parse_partitions(struct parsed_partitions *state, > > + struct cmdline_parts *parts) > > +{ > > + int slot; > > + uint64_t from = 0; > > + uint64_t disk_size; > > + char buf[BDEVNAME_SIZE]; > > + struct cmdline_subpart *subpart; > > + > > + bdevname(state->bdev, buf); > > + > > + while (parts && strncmp(buf, parts->name, BDEVNAME_SIZE)) > > + parts = parts->next_parts; > > + > > + if (!parts) > > + return 0; > > + > > + disk_size = (uint64_t) get_capacity(state->bdev->bd_disk) << 9; > > + > > + for (slot = 1, subpart = parts->subpart; > > +subpart && slot < state->limit; > > +subpart = subpart->next_subpart, slot++) { > > + > > + unsigned label_max; > > + struct partition_meta_info *info; > > + > > + if (subpart->from == OFFSET_CONTINUOUS) > > + subpart->from = from; > > + else > > + from = subpart->from; > > + > > + if (from >= disk_size) > > + break; > > + > > + if (subpart->size > (disk_size - from)) > > + subpart->size = disk_size - from; > > + > > + from += subpart->size; > > + > > + put_partition(state, slot, le64_to_cpu(subpart->from >> 9), > > + le64_to_cpu(subpart->size >> 9)); > > Why le64_to_cpu()? > > I guess that memparse() does not return explicitly LE and it also > seems that your code on another places uses ->size and ->from > without any extra care. Right. I will remove. > > Karel > > -- > Karel Zak > http://karelzak.blogspot.com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/2] Convert PowerPC macro spin_event_timeout() to architecture independent macro
On 07/31/13 17:20, Timur Tabi wrote: > On 07/31/2013 07:16 PM, Stephen Boyd wrote: >> cpu_relax() is usually just a compiler barrier or an instruction hint to >> the cpu that it should cool down because we're spinning in a tight loop. >> It certainly shouldn't be calling into the scheduler. > Ah yes, I remember now. So it does seem that if we can fix the problem > of non-incrementing 'jiffies', then this macro can be used in interrupts. That's encouraging. It looks like you introduced it to use in interrupt context but then it got shot down[1]? I lost track in all the versions. > > Of course, that assumes that spinning in interrupt context is a good > idea to begin with. Maybe we shouldn't be encouraging it? I read through the v5 discussion and it seems I'm about to walk through some tall grass on the way to Cerulean City. Andrew Morton, I choose you! Use your mind-power move to convince everyone that having a macro for spinning on a register in interrupt context is a good thing. At least it will be more obvious. > FYI, you might want to look at the code reviews for spin_event_timeout() on the linuxppc-dev mailing list, back in March 2009. >> Sure. Any pointers? Otherwise I'll go digging around the archives. > https://lists.ozlabs.org/pipermail/linuxppc-dev/2009-March/thread.html > [1] https://lists.ozlabs.org/pipermail/linuxppc-dev/2009-May/072521.html -- Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm/hotplug: remove unnecessary BUG_ON in __offline_pages()
On 2013/8/1 0:55, Dave Hansen wrote: > On 07/29/2013 11:49 PM, Xishi Qiu wrote: >> I think we can remove "BUG_ON(start_pfn >= end_pfn)" in __offline_pages(), >> because in memory_block_action() "nr_pages = PAGES_PER_SECTION * sections_per_block" >> is always greater than 0. > ... >> --- a/mm/memory_hotplug.c >> +++ b/mm/memory_hotplug.c >> @@ -1472,7 +1472,6 @@ static int __ref __offline_pages(unsigned long >> start_pfn, >> struct zone *zone; >> struct memory_notify arg; >> >> -BUG_ON(start_pfn >= end_pfn); >> /* at least, alignment against pageblock is necessary */ >> if (!IS_ALIGNED(start_pfn, pageblock_nr_pages)) >> return -EINVAL; > > I think you're saying that you don't see a way to hit this BUG_ON() in > practice. That does appear to be true, unless sections_per_block ended > up 0 or negative. The odds of getting in to this code if > 'sections_per_block' was bogus are pretty small. > Yes, I find there is no way to hit this BUG_ON() in v3.11, and "sections_per_block" seems to always be greater than 0. > Or, is this a theoretical thing that folks might run in to when adding > new features or developing? It's in a cold path and the cost of the > check is minuscule. The original author (cc'd) also saw a need to put > this in probably because he actually ran in to this. > In v2.6.32, if info->length == 0, this call path could hit the BUG_ON():

acpi_memory_disable_device()
  remove_memory(info->start_addr, info->length)
    offline_pages()

A later patch from Fujitsu renamed this function, and the BUG_ON() is now unnecessary. Thanks, Xishi Qiu > In any case, it looks fairly safe to me: > > Reviewed-by: Dave Hansen
Re: [PATCH 4/9] syslog_ns: make syslog handling per namespace
On 07/29/2013 10:31 AM, Rui Xiang wrote: > This patch makes the syslog buf and other fields per > namespace. > > Here we use the ns->log_buf fields (log_buf_len, logbuf_lock, > log_first_seq, and so on) > instead of the global ones to handle syslog. > > Syslog interfaces such as /dev/kmsg, /proc/kmsg, > and the syslog syscall are all containerized for > container users. > /dev/kmsg is used by the syslog APIs closelog(), openlog(), syslog(), and vsyslog(), so it should be per user namespace. But it seems that with your patch, syslog messages generated through these APIs on the host can be exported to the /dev/kmsg of a container. Is this what we want? Thanks
Re: [PATCH 2/2] chipidea: Use devm_request_irq()
On Wed, Jul 31, 2013 at 10:15:12AM -0400, Tejun Heo wrote: > Hello, > > On Wed, Jul 31, 2013 at 09:55:26PM +0800, Peter Chen wrote: > > I think the main point is we should allocate managed resources which are used > > in the interrupt handler before devm_request_irq(), and all resources used > > in the interrupt > > handler should be managed. > > > > If we use non-managed resources in the interrupt handler but a managed > > interrupt > > handler, things will still go wrong if there is an odd (unexpected) > > interrupt after > > we finish deactivation at removal. > > In general, applying devm partially isn't a good idea. It's very easy > to get into trouble thanks to release order dependencies which aren't > immediately noticeable, and there have been actual bugs caused by that. > The strategies which seem to work are either: > > * Convert everything over to devm by wrapping deactivation in a devres > callback too. As long as your init sequence is sane (i.e. the irq > shouldn't be requested before the things used by the irq are ready). > > * Allocate all resources using devres but shut down the execution > engine in remove_one(). Again, as all releases are controlled > by devres, you won't have to worry about messing up the release > sequencing. > Thanks, Tejun. So, Alex and Fabio, this patch may not be suitable currently, since many resources on both the EHCI and device sides are non-managed. -- Best Regards, Peter Chen
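Tejun's point about release ordering can be illustrated with a toy model. This is not the real devres API (that lives in drivers/base/devres.c), just a sketch of its contract: release callbacks run in reverse order of registration, so as long as the irq is requested last, everything the handler depends on is still alive when the irq's release runs.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model of the devres contract: registered release callbacks run
 * in LIFO order on device teardown. All names here are hypothetical.
 */
#define MAX_RES 8

struct toy_devres {
	void (*release)(void *data);
	void *data;
};

static struct toy_devres toy_stack[MAX_RES];
static int toy_top;

static void toy_devres_add(void (*release)(void *), void *data)
{
	toy_stack[toy_top].release = release;
	toy_stack[toy_top].data = data;
	toy_top++;
}

/* Mimics device teardown: run releases newest-first. */
static void toy_device_release_all(void)
{
	while (toy_top > 0) {
		toy_top--;
		toy_stack[toy_top].release(toy_stack[toy_top].data);
	}
}

/* Record the order the releases fire in, for demonstration. */
static char toy_order[MAX_RES + 1];
static int toy_n;

static void rel_dma_buf(void *data) { (void)data; toy_order[toy_n++] = 'd'; }
static void rel_irq(void *data)     { (void)data; toy_order[toy_n++] = 'i'; }
```

Mixing managed and non-managed resources breaks exactly this guarantee: a manually freed buffer can disappear while the managed irq, released later by devres, can still fire a handler that touches it.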
Re: [PATCH] PCI: update device mps when doing pci hotplug
>> Yes, I think that would be safe. If the switch is set to a larger MPS >> than the hot-added device supports, I don't think we can safely use >> the device. > > I opened a bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=60671 > for this problem. Please correct any mistakes in my summary and > reference it in your changelog. OK, thanks! > > Bjorn > > -- Thanks! Yijing
Re: [PATCH] PCI: update device mps when doing pci hotplug
> Yes, I think that would be safe. If the switch is set to a larger MPS > than the hot-added device supports, I don't think we can safely use > the device. OK, I will refresh my patch; after testing on my machine, I will send it out. Thanks! Yijing. > > Bjorn > > -- Thanks! Yijing
Re: [PATCH 2/3] xfs: Implement FALLOC_FL_COLLAPSE_RANGE
On Wed, Jul 31, 2013 at 11:42:14PM +0900, Namjae Jeon wrote: > From: Namjae Jeon > > New fallocate flag FALLOC_FL_COLLAPSE_RANGE implementation for XFS. A good start, but there's lots of work needed to make this safe for general use. > Signed-off-by: Namjae Jeon > Signed-off-by: Ashish Sangwan > --- > fs/xfs/xfs_bmap.c | 92 > + > fs/xfs/xfs_bmap.h |3 ++ > fs/xfs/xfs_file.c | 26 -- > fs/xfs/xfs_iops.c | 35 +++ > fs/xfs/xfs_vnodeops.c | 62 + > fs/xfs/xfs_vnodeops.h |2 ++ > 6 files changed, 217 insertions(+), 3 deletions(-) > > diff --git a/fs/xfs/xfs_bmap.c b/fs/xfs/xfs_bmap.c > index 05c698c..ae677b1 100644 > --- a/fs/xfs/xfs_bmap.c > +++ b/fs/xfs/xfs_bmap.c > @@ -6145,3 +6145,95 @@ next_block: > > return error; > } > + > +/* > + * xfs_update_logical() > + * Updates starting logical block of extents by subtracting > + * shift from them. At max XFS_LINEAR_EXTS number extents > + * will be updated and *current_ext is the extent number which > + * is currently being updated. Single space after the "*" in the comments. Also, format them out to 80 columns. > + */ > +int > +xfs_update_logical( > + xfs_trans_t *tp, > + struct xfs_inode*ip, > + int *done, > + xfs_fileoff_t start_fsb, > + xfs_fileoff_t shift, > + xfs_extnum_t*current_ext) This belongs in xfs_bmap_btree.c, named something like xfs_bmbt_shift_rec_offsets(). Also, not typedefs for structures, please. (i.e. struct xfs_trans). > +{ > + xfs_btree_cur_t *cur; struct xfs_btree_cur*cur; And for all the other structures, too. > + xfs_bmbt_rec_host_t *gotp; > + xfs_mount_t *mp; > + xfs_ifork_t *ifp; > + xfs_extnum_tnexts = 0; > + xfs_fileoff_t startoff; > + int error = 0; > + int i; > + > + ifp = XFS_IFORK_PTR(ip, XFS_DATA_FORK); > + mp = ip->i_mount; Do the assigment in the declaration on mp. struct xfs_mount*mp = ip->i_mount; > + > + if (!(ifp->if_flags & XFS_IFEXTENTS) && > + (error = xfs_iread_extents(tp, ip, XFS_DATA_FORK))) > + return error; H - that rings alarm bells. 
> + > + if (!*current_ext) { > + gotp = xfs_iext_bno_to_ext(ifp, start_fsb, current_ext); > + /* > + * gotp can be null in 2 cases: 1) if there are no extents > + * or 2) start_fsb lies in a hole beyond which there are > + * no extents. Either way, we are done. > + */ > + if (!gotp) { > + *done = 1; > + return 0; > + } > + } So, you do a lookup on an extent index, which returns a incore extent record. > + > + if (ifp->if_flags & XFS_IFBROOT) > + cur = xfs_bmbt_init_cursor(mp, tp, ip, XFS_DATA_FORK); > + else > + cur = NULL; > + > + while (nexts++ < XFS_LINEAR_EXTS && What has XFS_LINEAR_EXTS got to do with how many extents can be moved in a single transaction at a time? > +*current_ext < XFS_IFORK_NEXTENTS(ip, XFS_DATA_FORK)) { > + gotp = xfs_iext_get_ext(ifp, (*current_ext)++); > + startoff = xfs_bmbt_get_startoff(gotp); > + if (cur) { > + if ((error = xfs_bmbt_lookup_eq(cur, > + startoff, > + xfs_bmbt_get_startblock(gotp), > + xfs_bmbt_get_blockcount(gotp), > + ))) Please don't put assignments inside if() statements where possible. i.e. error = xfs_bmbt_lookup_eq(cur, if (error) Is the normal way of doing this. > + goto del_cursor; > + XFS_WANT_CORRUPTED_GOTO(i == 1, del_cursor); > + } > + startoff -= shift; > + xfs_bmbt_set_startoff(gotp, startoff); So, we've looked up and extent, and reduced it's offset by shift. > + if (cur) { > + error = xfs_bmbt_update(cur, startoff, > + xfs_bmbt_get_startblock(gotp), > + xfs_bmbt_get_blockcount(gotp), > + xfs_bmbt_get_state(gotp)); > + if (error) > + goto del_cursor; And then we update the btree record in place. Ok, major alarm bells are going off at this point. Is the record still in the right place? Where did you validate that the shift downwards didn't overlap the previous extent in the tree? What happens if the start or end of
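For context, the interface under review is driven from userspace roughly as follows. A hedged sketch: the flag value is defined here for headers that predate it, offset and len must be multiples of the filesystem block size, and the call fails with EOPNOTSUPP on filesystems without collapse support:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

#ifndef FALLOC_FL_COLLAPSE_RANGE
#define FALLOC_FL_COLLAPSE_RANGE 0x08   /* as later merged upstream */
#endif

/*
 * Remove `len` bytes starting at `offset` from the file, shifting the
 * rest of the file down so its size shrinks by `len`. Both arguments
 * must be filesystem-block aligned.
 */
static int collapse_range(int fd, off_t offset, off_t len)
{
	return fallocate(fd, FALLOC_FL_COLLAPSE_RANGE, offset, len);
}
```

The review comments above are about exactly what this call implies on the bmap side: every extent past the hole must have its startoff shifted down without colliding with its predecessor.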
Re: i915 backlight
On 08/01/2013 12:36 AM, Borislav Petkov wrote: > On Wed, Jul 31, 2013 at 06:22:52PM +0200, Borislav Petkov wrote: >> Dudes, >> >> has anyone already reported this (happens on Linus of today + >> tip/master): > > Oh, one more thing: I can't control the backlight anymore on this x230 > with the Fn-Fx keys and this is most probably related to that recent > backlight revert story. If there is a new approach I can test, please > let me know. Can you please run acpi_listen and then press the Fn-Fx key, to see if the events are correctly sent out? > > Btw, it worked before on this machine with "acpi_backlight=vendor" on > the cmdline. From the bug page: https://bugzilla.kernel.org/show_bug.cgi?id=51231#c80 I got the impression that both the acpi_video interface and the vendor interface thinkpad_screen are broken. So the fact that adding this cmdline works here suggests that either thinkpad_screen works, or the thinkpad vendor driver doesn't get loaded or doesn't create that interface for some reason. Alternatively, if the intel_backlight interface works (highly likely), you can use xorg.conf to specify that backlight interface for X:

Section "Device"
        Option     "Backlight"  "intel_backlight"
        Identifier "Card0"
        Driver     "intel"
        BusID      "PCI:0:2:0"
EndSection

Thanks, Aaron
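If the intel_backlight interface does work, it can also be driven directly through sysfs. A sketch under stated assumptions: the paths below are the usual i915 sysfs nodes but may differ per machine, writing requires root, and the percentage helper is ours, not a kernel or X API:

```c
#include <assert.h>
#include <stdio.h>

/* Map a percentage onto the raw brightness scale, rounding down. */
static long percent_of_max(long max, long pct)
{
	return max * pct / 100;
}

/*
 * Typical use (as root) against the usual i915 sysfs nodes:
 *   /sys/class/backlight/intel_backlight/max_brightness  (read)
 *   /sys/class/backlight/intel_backlight/brightness      (write)
 * Returns 0 on success, -1 on any I/O failure.
 */
static int set_backlight_percent(const char *dir, long pct)
{
	char path[256];
	long max;
	FILE *f;

	snprintf(path, sizeof(path), "%s/max_brightness", dir);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%ld", &max) != 1) {
		fclose(f);
		return -1;
	}
	fclose(f);

	snprintf(path, sizeof(path), "%s/brightness", dir);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%ld\n", percent_of_max(max, pct));
	return fclose(f) == 0 ? 0 : -1;
}
```

Scaling against max_brightness matters because the raw range varies between machines; 50% on one panel may be a few hundred counts and several thousand on another.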